2. How does this work?
General architecture
Advice
Tools
This lecture does not cover the
technical details of the GPU; it
gives only the overview needed to
understand current technologies
and standards.
3. [Diagram: CPU side (application, 3D API (DirectX/OpenGL), driver) connected over the bus (commands, vertices, shaders, data…) to the GPU side (vertex processing, fragment processing, memory holding textures and the framebuffer, display)]
4. [Diagram: vertex units (vertex processing) and pixel units (fragment processing), both attached to memory/textures]
As we can see, previous architectures followed a „fixed” vertex/fragment chain:
all data was first processed in the „vertex units” and then moved on to the
fragment units.
5. SISD – Single Instruction Single Data
The standard way: one instruction is executed
on a single data item.
SIMD – Single Instruction Multiple Data
One instruction is executed on several data items at once,
for example on one 4D vector (128 bits)
MIMD – Multiple Instructions Multiple Data
Parallel processing!
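The difference between SISD and SIMD can be illustrated with a toy model that only counts "instructions". This is a sketch, not real vector hardware: the function names are made up for illustration, and Python itself executes everything as scalar code.

```python
# Toy model: count the "instructions" needed to add two 4D vectors.
# Real SIMD happens in hardware; this only mimics the bookkeeping.

def add_sisd(a, b):
    """SISD: one instruction per component, so four adds for a 4D vector."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)  # one scalar add
        instructions += 1
    return result, instructions


def add_simd(a, b):
    """SIMD: a single instruction is modelled as covering all four lanes."""
    result = [x + y for x, y in zip(a, b)]
    return result, 1


if __name__ == "__main__":
    print(add_sisd([1, 2, 3, 4], [10, 20, 30, 40]))
    print(add_simd([1, 2, 3, 4], [10, 20, 30, 40]))
```

Both functions return the same vector; only the modelled instruction count differs.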
6. [Diagram: unit usage under fixed and dynamic task division. Fixed task division: an effect that uses a lot of vertex processing leaves fragment units unused, and an effect that uses a lot of fragment processing leaves vertex units unused. Dynamic task division: the units are split between vertex and fragment work as needed in both cases.]
The numbers of vertex units and fragment units used to be fixed: we had N vertex processors and M
fragment processors. Now we have a unified architecture, which means that we have K units
that can process both vertices and fragments; there is no difference between them.
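A rough way to see why a unified pool of K units helps is an idealized timing model. The sketch below is an assumption-laden illustration: the work amounts and unit counts are hypothetical, and scheduling overhead and pipeline dependencies are ignored.

```python
def frame_time_fixed(vertex_work, fragment_work, n_vertex_units, m_fragment_units):
    """Fixed division: each stage can use only its own units,
    so the slower stage determines the frame time."""
    return max(vertex_work / n_vertex_units, fragment_work / m_fragment_units)


def frame_time_unified(vertex_work, fragment_work, k_units):
    """Unified architecture: any unit can take any work item,
    so the total work is spread over all K units (idealized)."""
    return (vertex_work + fragment_work) / k_units


if __name__ == "__main__":
    # Hypothetical chip: N = 4 vertex units, M = 12 fragment units, or K = 16 unified.
    # Vertex-heavy effect: 120 units of vertex work, 40 of fragment work.
    print(frame_time_fixed(120, 40, 4, 12), frame_time_unified(120, 40, 16))
    # Fragment-heavy effect: 16 units of vertex work, 144 of fragment work.
    print(frame_time_fixed(16, 144, 4, 12), frame_time_unified(16, 144, 16))
```

In this model the fixed split is only as fast as its most overloaded stage, while the unified pool takes the same time for both effects because the total work is equal.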
7. [Diagram: controller, stream processors, shared memory]
As we can see, there are no
vertex/fragment units; instead there
are stream processors that can handle
both vertices and fragments, and even
more.
8. Scalars… not Vectors!
A stream processor (SP) handles only one data item per
instruction.
But we have a lot of SPs!
SPs give far greater flexibility.
GPGPU
SIMT – Single Instruction Multiple Threads
9. New architecture - NV
DX11, OpenCL
Multithreaded rendering
Rendering commands can be called from different threads
3 000 000 000 transistors!
End of 2009? End of winter 2010? Never?
Double-precision calculations cost twice as much as float,
not ten times as much as before!
Debugging: the GPU can be debugged directly from Visual Studio
10. [Diagram: Vertex Shader, Geometry Shader, Fragment Shader; Unified Shader; CUDA, OpenCL, DirectX Compute, ATI Stream]
11. General-purpose computing on graphics
processing units
Kernels – code that will be executed on the
GPU
Not only graphics but also:
Physics
▪ Fluids
▪ Collisions
▪ N-body simulations…
Financial
Speech/Pattern recognition
Phenomena modelling – weather…
Neural nets
AI
12. Use as few as possible:
Calculations
Huge textures (use mipmaps instead)
Interpolators
Data
Rendering state changes
Dynamic Vertex Buffers
Textures… use texture atlases maybe
Texture fetches
Use more:
Batching
Triangle strips
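One common way to cut rendering state changes is to sort draw calls by state before submitting them. The sketch below only counts state switches: the (shader, texture) pairs are hypothetical, and sorting like this is assumed to be safe only when draw order does not matter (for example, opaque geometry).

```python
def count_state_changes(draw_calls):
    """Count how often the (shader, texture) state differs from the previous call.
    The very first call counts as one change (setting the initial state)."""
    changes, current = 0, None
    for state in draw_calls:
        if state != current:
            changes += 1
            current = state
    return changes


# Hypothetical draw calls, each described by its (shader, texture) state.
draw_calls = [
    ("lit", "brick"), ("unlit", "sky"), ("lit", "brick"),
    ("lit", "wood"), ("unlit", "sky"), ("lit", "wood"),
]

if __name__ == "__main__":
    print(count_state_changes(draw_calls))          # submission order
    print(count_state_changes(sorted(draw_calls)))  # grouped by state
```

A texture atlas pushes the same idea further: if several textures live in one atlas, draw calls that differ only in texture no longer need a state change at all.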
13. Use Maths
Uniform sphere:
p = sqrt(Rx^2 + Ry^2 + (Rz + 1)^2) =
sqrt(Rx^2 + Ry^2 + Rz^2 + 2Rz + 1);
R vector is normalized so: Rx^2 + Ry^2 + Rz^2 = 1
p = sqrt(2 * (Rz + 1)) = 1.414*sqrt(Rz + 1)
Calculate this before it
is sent to the GPU!
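The simplification of p can be checked numerically. This is a minimal sketch assuming R is already normalized; the function names are illustrative only.

```python
import math

SQRT2 = math.sqrt(2.0)  # about 1.414, computed once instead of per evaluation


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)


def p_full(r):
    """p = sqrt(Rx^2 + Ry^2 + (Rz + 1)^2), evaluated as written."""
    rx, ry, rz = r
    return math.sqrt(rx * rx + ry * ry + (rz + 1.0) ** 2)


def p_simplified(r):
    """p = sqrt(2) * sqrt(Rz + 1); valid only when |R| = 1."""
    return SQRT2 * math.sqrt(r[2] + 1.0)


if __name__ == "__main__":
    r = normalize((1.0, 2.0, 2.0))
    print(p_full(r), p_simplified(r))
```

The simplified form needs one multiply, one add and one sqrt, compared with three multiplies, three adds and one sqrt for the full form.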
Reduce calculation on uniform vars!
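// g_OverbrightColor * 3.0 is constant for the whole draw call,
// yet this shader evaluates the scaling for every fragment.
half4 main(float2 diffuse : TEXCOORD0,
           uniform sampler2D diffuseTex,
           uniform half4 g_OverbrightColor) {
    return tex2D(diffuseTex, diffuse) * g_OverbrightColor * 3.0;
}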
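The shader above multiplies a uniform by a constant for every fragment. A sketch of the CPU-side alternative follows, in which the product is computed once and uploaded as the uniform. The helper names are hypothetical, and the `shade` function only stands in for the remaining per-fragment work.

```python
def premultiply_uniform(color, scale):
    """Done once on the CPU, before the value is uploaded as a uniform."""
    return tuple(c * scale for c in color)


def shade(texel, premultiplied_color):
    """Stand-in for the per-fragment work that remains: one multiply per channel."""
    return tuple(t * c for t, c in zip(texel, premultiplied_color))


if __name__ == "__main__":
    overbright = premultiply_uniform((0.5, 0.25, 1.0, 1.0), 3.0)  # once per draw call
    print(shade((1.0, 0.5, 0.0, 1.0), overbright))                # once per fragment
```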
Normalize
dot(normalize(N), normalize(L)) uses two sqrts!
but:
(N/|N|) dot (L/|L|) = (N dot L) / (|N| * |L|) = (N dot L) / sqrt((N dot N) *
(L dot L)) = (N dot L) * rsq((N dot N) * (L dot L))
Now we have only one sqrt; three dot products are much cheaper than a sqrt
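The identity can be verified on the CPU. This is a minimal sketch with illustrative function names; `1 / sqrt(x)` stands in for the GPU's `rsq` instruction.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def n_dot_l_two_sqrts(n, l):
    """dot(normalize(N), normalize(L)): one sqrt per normalize call."""
    return dot(normalize(n), normalize(l))


def n_dot_l_one_rsq(n, l):
    """(N dot L) * rsq((N dot N) * (L dot L)): three dot products, one sqrt."""
    return dot(n, l) * (1.0 / math.sqrt(dot(n, n) * dot(l, l)))


if __name__ == "__main__":
    n, l = (0.0, 0.0, 2.0), (0.0, 3.0, 3.0)
    print(n_dot_l_two_sqrts(n, l), n_dot_l_one_rsq(n, l))
```

Both functions agree up to floating-point rounding; neither handles zero-length vectors.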
14. Texture lookups:
~ 10 : 1 (ALU:Sampler)
Normalization cube map
A single „dot” is not worth a texture lookup…
But calculating a normal distribution… YES!
Early Z-Test
Depth-only rendering first, then the full scene (a
second pass)
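The benefit of a depth-only pass can be sketched with a toy model that counts fragment-shader runs per pixel. The assumptions are that fragments arrive in submission order, a smaller depth value is nearer, and the second pass uses a less-or-equal depth test. The pixel data is hypothetical.

```python
def shaded_without_prepass(pixels):
    """Single pass: every fragment that beats the current depth gets shaded,
    even if a nearer fragment overwrites it later."""
    count = 0
    for fragments in pixels:
        depth = float("inf")
        for z in fragments:
            if z < depth:
                depth = z
                count += 1
    return count


def shaded_with_prepass(pixels):
    """Pass 1 writes depth only (no shading); pass 2 shades only the
    fragments that match the stored nearest depth."""
    count = 0
    for fragments in pixels:
        nearest = min(fragments)                             # pass 1
        count += sum(1 for z in fragments if z <= nearest)   # pass 2
    return count


# Each inner list holds the depths of the fragments covering one pixel.
pixels = [[0.9, 0.5, 0.2], [0.3, 0.7]]

if __name__ == "__main__":
    print(shaded_without_prepass(pixels), shaded_with_prepass(pixels))
```

The saving grows with overdraw and with the cost of the fragment shader; the price is submitting the geometry twice.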
Reduce the number of attributes: „pack” them where possible.
float4 myData is better than:
▪ float3 myDataOne;
▪ float1 myDataTwo;
But do not pack in interpolators
Use as few scalars as possible
When vectors are packed, no optimizations can be performed
What do you really need?
Normal, binormal, tangent… no! You need only two of them!
Binormal = Normal cross Tangent
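Reconstructing the binormal from the other two vectors is a single cross product. The sketch below assumes that the normal and tangent are orthonormal and that no handedness sign is needed; some tangent-space conventions multiply the result by +1 or -1.

```python
def cross(a, b):
    """Cross product of two 3D vectors."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by,
            az * bx - ax * bz,
            ax * by - ay * bx)


def binormal(normal, tangent):
    """Binormal = Normal cross Tangent, so only two vectors need to be stored."""
    return cross(normal, tangent)


if __name__ == "__main__":
    print(binormal((0.0, 0.0, 1.0), (1.0, 0.0, 0.0)))
```

This trades one interpolated attribute for a few multiply-adds in the shader.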
16. PerfKit
• For DirectX mostly
• Little support for OpenGL (via glExpert)
PIX for Windows
• Shows everything! But only for Windows and DirectX…
AMD GPU Perf
• Similar to PIX, but for OpenGL… $800 ;(
17. GLIntercept
• OpenGL
• free
• logs every OpenGL call
• lets you edit shaders in real time
• although it is fairly simple, it is a
powerful debugging aid…
18. GPU ShaderAnalyzer
• free, from AMD!
• GLSL/HLSL
• shows the number of asm instructions
• ALU, TEX instructions, etc.
• bottlenecks