4. GPU COMPUTE
WHY WE NEED COMPUTE IN GAMES
Games can be CPU limited on some configurations
‒ Many traditional CPU tasks can be offloaded to the GPU
Some rendering algorithms are faster when implemented using Compute
‒ Post-processing techniques
Some tasks are a natural fit for Compute
‒ Non-graphics Data Parallel programming
‒ Physics
5. DX11 GRAPHICS PIPELINE
WHY WE NEED DirectCompute
Too many unnecessary stages for compute
Old DX9 “GPGPU” programming
‒ Render a full screen quad
‒ Use the Pixel Shader for compute
[Diagram: the full DX11 graphics pipeline (Input Assembly, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader with Stream Out, Rasterizer, Pixel Shader, Depth Test, Output Merger) and its resource bindings (Textures, Buffers, Constants, Render Targets, Depth/Stencil, UAVs)]
6. DirectCompute PIPELINE
ONLY WHAT’S NECESSARY
API designed for compute programming
‒ Interoperability with DirectX
Pipeline is much simpler
‒ No need to render triangles
‒ Bind input and output then call Dispatch()
[Diagram: SRVs → Compute Shader → UAVs]
7. DirectCompute FEATURES
MORE REASONS TO USE DirectCompute
Structured Buffers
‒ Map better to general-purpose data structures
Append/Consume buffers
‒ Fast for non-order-dependent I/O
Atomics, Barriers
‒ Synchronization for finer-grained thread control
Thread Group Shared Memory
‒ Fast on-chip local memory shared between threads in a group
‒ Great for intermediate results
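As a quick illustration, here is a minimal HLSL sketch (the particle layout and all names are hypothetical) that touches each of these features:

struct Particle { float3 pos; float3 vel; };

StructuredBuffer<Particle> g_ParticlesIn : register( t0 );    // structured buffer: general-purpose data
AppendStructuredBuffer<Particle> g_AliveOut : register( u0 ); // append buffer: order-independent output

groupshared uint gs_AliveCount; // thread group shared memory: fast on-chip storage

[numthreads(64, 1, 1)]
void CSMain( uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex )
{
    if (gidx == 0)
        gs_AliveCount = 0;
    GroupMemoryBarrierWithGroupSync(); // barrier: every thread sees the cleared counter

    Particle p = g_ParticlesIn[dtid.x];
    if (p.pos.y > 0.0f)                   // hypothetical “still alive” test
    {
        InterlockedAdd(gs_AliveCount, 1); // atomic: safe concurrent update
        g_AliveOut.Append(p);             // output order doesn’t matter
    }
}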
8. DISPATCH
THREAD GROUP ORGANIZATION
Threads are organized into Thread Groups
Each thread is executing an instance of the compute shader code
ID3D11DeviceContext::Dispatch( Dx, Dy, Dz )
‒ Example: Dispatch(4,3,2) => 24 thread groups
[Diagram: the 24 thread groups arranged as a 4 × 3 × 2 grid, with group IDs from (0,0,0) to (3,2,1)]
9. THREAD GROUP
ORGANIZATION OF THREADS WITHIN A THREAD GROUP
Threads in a group are organized in an array
‒ For full screen rendering you can think of a thread group as a tile on the screen, and the thread as a pixel
The compute shader HLSL code declares the thread layout
‒ [numthreads(Tx, Ty, Tz)]
‒ Example: [numthreads(8, 4, 2)] => 64 threads per thread group
System values provide information about the current thread
‒ SV_GroupID = thread group address, SV_GroupThreadID = thread address relative to the thread group
[Diagram: the 64 threads of one group arranged as an 8 × 4 × 2 grid, with thread IDs from (0,0,0) to (7,3,1)]
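The relationship between these system values is fixed by the Dispatch() and [numthreads] declarations; a minimal sketch (the debug buffer is hypothetical), assuming the Dispatch(4,3,2) example from the previous slide:

RWStructuredBuffer<uint3> g_DebugIDs : register( u0 ); // hypothetical debug output

[numthreads(8, 4, 2)]
void CSMain( uint3 gid  : SV_GroupID,          // thread group address, (0,0,0) to (3,2,1)
             uint3 gtid : SV_GroupThreadID,    // thread address in the group, (0,0,0) to (7,3,1)
             uint3 dtid : SV_DispatchThreadID, // global address: gid * uint3(8,4,2) + gtid
             uint  gidx : SV_GroupIndex )      // flattened gtid: z * (8*4) + y * 8 + x
{
    // Have the first thread of each of the 24 groups record its global ID.
    uint groupFlat = gid.z * (4 * 3) + gid.y * 4 + gid.x; // assumes Dispatch(4,3,2)
    if (gidx == 0)
        g_DebugIDs[groupFlat] = dtid;
}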
10. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
pd3dDevice->CreateBuffer( &Desc, NULL, &pCBCSPerFrame );
pd3dDevice->CreateBuffer( &bufferDesc, NULL, &pRWBuffer );
11. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
pd3dDevice->CreateUnorderedAccessView( pRWBuffer, &UAVDesc, &pUAV );
12. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
CompileShaderFromFile( L"myComputeShader.hlsl", "CSMain", "cs_5_0", &pBlob );
pd3dDevice->CreateComputeShader( pBlob, . . . , &pComputeShader );
13. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
Bind the resources
pd3dContext->CSSetConstantBuffers( 0, 1, &pCBCSPerFrame );
ID3D11UnorderedAccessView* ppUAV[1] = { pUAV };
pd3dContext->CSSetUnorderedAccessViews( 0, 1, ppUAV, NULL );
14. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
Bind the resources
Set the shader
pd3dContext->CSSetShader( pComputeShader, NULL, 0 );
15. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
Bind the resources
Set the shader
Go!
pd3dContext->Dispatch(numOfGroupsX, numOfGroupsY, 1);
16. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE
numthreads
‒ Defines the number of threads in a thread group
‒ Put this right before the definition of the compute shader
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
17. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
Thread Group Shared Memory
‒ Variables that are stored in memory shared between threads in a group (up to 32KB per group)
‒ Use the groupshared modifier in front of the variable declaration
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
18. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
System Values that can be used in a Compute Shader
‒ SV_GroupID, SV_GroupThreadID, SV_DispatchThreadID, SV_GroupIndex
‒ The system values tell the compute shader what group and what thread it’s currently working on.
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
19. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
Synchronization
‒ Barriers
‒ GroupMemoryBarrier(), GroupMemoryBarrierWithGroupSync()
‒ GroupMemoryBarrier() waits for the group’s shared memory accesses to complete; the WithGroupSync variant also blocks execution until every thread in the group has reached the barrier
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
20. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
Synchronization
‒ Atomics
‒ “Interlocked” versions of Add, Min, Max, Or, And, Xor, CompareStore, and Exchange
‒ Guarantee exclusive access to shared memory
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
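As a concrete use, a per-group histogram sketch (the buffer names and the 256-bin layout are assumptions) shows barriers and atomics working together:

Buffer<uint> g_Values : register( t0 );       // hypothetical input values
RWBuffer<uint> g_Histogram : register( u0 );  // hypothetical 256-bin result

groupshared uint gs_Bins[256];

[numthreads(256, 1, 1)]
void CSHistogram( uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex )
{
    gs_Bins[gidx] = 0;                 // each thread clears one bin
    GroupMemoryBarrierWithGroupSync(); // all bins cleared before counting starts

    InterlockedAdd(gs_Bins[g_Values[dtid.x] & 255], 1); // atomic add into TGSM
    GroupMemoryBarrierWithGroupSync(); // all counts done before the export

    InterlockedAdd(g_Histogram[gidx], gs_Bins[gidx]);   // merge group result into the UAV
}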
22. GCN REFRESHER
COMPUTE UNIT
Basic GPU building block
‒ The R9 290X has 44 Compute Units
Four 16-wide SIMDs that can execute instructions from multiple threads
‒ Each SIMD has 64KB of vector general purpose register (VGPR) memory
16 Texture Fetch (load/store) Units
‒ 16KB L1 cache
One Scalar Unit
‒ 8KB of scalar GPR memory
Local Data Share
‒ 64KB of shared memory
[Diagram: a GCN Compute Unit: Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
23. WAVEFRONTS
HOW THREADS EXECUTE ON COMPUTE UNITS
A wavefront consists of 4 batches of threads
‒ Execution is time-sliced to hide latency
Since there are 16 ALUs per SIMD, the size of a wavefront is 4 x 16 = 64 threads
Each SIMD can keep track of 10 wavefronts
‒ 40 wavefronts per CU = 2560 threads
‒ For the R9 290X, that’s a total of 44 x 2560 = 112,640 threads in flight!
[Diagram: time (in clocks) on one SIMD: Batches 1 through 4 are time-sliced, each cycling between Runnable and Stall until Done, so a stall in one batch is hidden by running another]
24. GCN AND DirectCompute
HOW DirectCompute MAPS TO THE GPU
DirectCompute thread groups correspond to groups of threads running on a Compute Unit
‒ Choose a thread group size that works well with the hardware
Thread Group Shared Memory is stored in the Compute Unit’s Local Data Share (LDS)
‒ 4 SIMDs (64 ALUs) are all sharing the same LDS
‒ Barriers and atomics synchronize access to this storage
25. OPTIMIZE FOR THE HARDWARE
IMPORTANT THINGS TO CONSIDER WHEN DESIGNING A COMPUTE SHADER
Thread Group Size
‒ Since there are 64 threads per wavefront, make the thread group size a multiple of 64
‒ If [numthreads(x,y,z)] then (x * y * z) should be a multiple of 64
‒ If the thread group size is not a multiple of 64, the SIMDs will still run 64 threads per wavefront
‒ Whether they’re all used or not!
‒ Start with 64 threads per group and then see if higher multiples improve performance
LDS memory access is faster than off-chip memory
‒ Use thread group shared memory for storing intermediate results instead of writing out to the UAV
‒ Thread group shared memory can also be used to eliminate unnecessary fetches
‒ For example, discrete convolutions sample many of the same locations when the kernel moves to an adjacent pixel
‒ TGSM is limited to 32KB, so pack the data to save space (see the packing sketch below)
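One way to pack: a sketch (the array size is hypothetical) that stores two half-precision values per uint with the f32tof16/f16tof32 intrinsics, halving the TGSM footprint:

groupshared uint gs_Packed[256]; // 256 float2 values in half the space

[numthreads(256, 1, 1)]
void CSPackExample( uint gidx : SV_GroupIndex )
{
    float2 value = float2(gidx, gidx * 2); // stand-in for a real intermediate result

    // Pack two 16-bit floats into one 32-bit uint.
    gs_Packed[gidx] = f32tof16(value.x) | (f32tof16(value.y) << 16);
    GroupMemoryBarrierWithGroupSync();

    // Unpack when the value is needed again (costs precision, saves space).
    uint p = gs_Packed[gidx];
    float2 restored = float2(f16tof32(p & 0xFFFF), f16tof32(p >> 16));
}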
26. OPTIMIZE FOR THE HARDWARE
THREAD GROUP SHARED MEMORY ACCESS
Avoid Bank Conflicts
‒ Thread Group Shared Memory is stored in LDS which is organized in 32 banks
‒ If one thread accesses a location and another thread accesses location + (n x 32), a bank conflict will occur
‒ The hardware can resolve this, but at the cost of performance
[Table: LDS addresses map to banks modulo 32: addresses 0 through 31 map to banks 0 through 31, then address 32 wraps back to bank 0, address 33 to bank 1, and so on]
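A common fix is to pad shared arrays so that strided accesses land in different banks; a sketch of a 32 x 32 tile transpose (the resource names are hypothetical):

groupshared float gs_Tile[32 * 33]; // 33, not 32: one padding element per row

Texture2D<float> g_Input : register( t0 );    // hypothetical
RWTexture2D<float> g_Output : register( u0 ); // hypothetical

[numthreads(32, 32, 1)]
void CSTranspose( uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID )
{
    // Row-major write: consecutive threads hit consecutive banks.
    gs_Tile[gtid.y * 33 + gtid.x] = g_Input[gid.xy * 32 + gtid.xy];
    GroupMemoryBarrierWithGroupSync();

    // Column-major read: the stride is 33, so consecutive threads still land
    // in different banks. With a stride of 32 they would all share one bank.
    g_Output[gid.yx * 32 + gtid.xy] = gs_Tile[gtid.x * 33 + gtid.y];
}

The one-element pad costs 128 bytes of TGSM per tile but removes a 32-way conflict on the column read.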
27. OPTIMIZE FOR THE HARDWARE
GPR USAGE
Local variables are stored in general purpose registers (GPRs)
‒ There’s a limit of 64KB of vector register storage per SIMD
‒ The register memory has to be shared between wavefronts, so using too many GPRs can reduce the number of wavefronts in flight
‒ Maximizing parallelism is critical to compute shader performance (and shader performance in general)
‒ The more wavefronts that can run, the more parallelism you can get
‒ Using fewer local variables can help, but look at the DirectX assembly code to get a better idea of how many registers are being used
GCN VGPR count:  <=24  28  32  36  40  48  64  84  <=128  >128
Max waves/SIMD:   10    9   8   7   6   5   4   3    2      1
30. SEPARABLE FILTER
TWO PASS DISCRETE CONVOLUTIONS
Many useful filters are separable
‒ Typical example is Gaussian filter for a high quality image blur – used often in post-processing
Separable Algorithm
‒ Run the first pass in the horizontal direction and store the results in an intermediate buffer
‒ Run the second pass on the intermediate buffer in the vertical direction
[Diagram: Source RT → Horizontal Pass → Intermediate RT → Vertical Pass → Destination RT]
31. SEPARABLE FILTER
COMPUTE SHADER OPTIMIZATION
Use Thread Group Shared Memory to cache previously sampled values
The naïve approach is to load a row of texels into TGSM at once, then calculate each pixel value
[Diagram: 128 threads load 128 texels, but only 128 - (Kernel Radius * 2) threads compute results; the threads covering the kernel radius at each end of the row are redundant]
32. SEPARABLE FILTER
BETTER OPTIMIZATION
Make some of the threads load more texels
‒ We only want as many threads as there are pixels
‒ However, we need more texels than pixels since the kernel extends beyond the pixels
‒ Just have some of the threads load more than one texel
Process multiple lines per thread group
‒ Keeps the thread group size a multiple of 64
[Diagram: 64 threads load 256 texels (Kernel Radius * 4 of the threads load one extra texel each), then the same 64 threads compute 256 results]
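Putting it together, a sketch of the horizontal pass (the radius, line length, and names are hypothetical; it processes one 64-pixel line per group rather than four, and uses uniform weights for brevity):

#define KERNEL_RADIUS 8
#define LINE_LENGTH   64 // pixels computed per thread group

Texture2D<float4> g_Source : register( t0 ); // hypothetical
RWTexture2D<float4> g_Dest : register( u0 ); // hypothetical
groupshared float4 gs_Line[LINE_LENGTH + 2 * KERNEL_RADIUS];

[numthreads(LINE_LENGTH, 1, 1)]
void CSHorizontalBlur( uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID )
{
    int2 pixel = int2(gid.x * LINE_LENGTH + gtid.x, gid.y);

    // Every thread loads its own texel; the threads nearest each end also load
    // one apron texel each (out-of-bounds loads return 0 in D3D11).
    gs_Line[gtid.x + KERNEL_RADIUS] = g_Source[pixel];
    if (gtid.x < KERNEL_RADIUS)
        gs_Line[gtid.x] = g_Source[pixel - int2(KERNEL_RADIUS, 0)];
    if (gtid.x >= LINE_LENGTH - KERNEL_RADIUS)
        gs_Line[gtid.x + 2 * KERNEL_RADIUS] = g_Source[pixel + int2(KERNEL_RADIUS, 0)];
    GroupMemoryBarrierWithGroupSync();

    // All 64 threads produce a result from the cached line, with no redundancy.
    float4 sum = 0;
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        sum += gs_Line[gtid.x + KERNEL_RADIUS + i];
    g_Dest[pixel] = sum / (2 * KERNEL_RADIUS + 1);
}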
34. TILED LIGHTING
COMPUTE SHADER LIGHT CULLING
Break up the screen into tiles
Create a list of lights for each tile
Only render with lights touching the tile
[Diagram: three lights (1, 2, 3) overlap a tiled screen; the tiles build per-tile light lists such as [1], [1,2,3], and [2,3]]
35. TILED LIGHTING
TILED FRUSTUMS
Divide the screen into tiles
Fit an asymmetric frustum around each tile
[Diagram: side view of the view volume divided into asymmetric per-tile frustums for Tile0 through Tile3]
36. TILED LIGHTING
CREATE FRONT AND BACK PLANES
Use z buffer from depth pre-pass as input
Find min and max depth per tile
Use this frustum for intersection testing
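A common way to do the min/max reduction is with integer atomics on asuint depth values, since for non-negative floats the IEEE bit pattern preserves ordering; a sketch (names hypothetical):

Texture2D<float> g_DepthBuffer : register( t0 ); // depth pre-pass result

groupshared uint gs_ZMin; // asuint-encoded depth bounds for the tile
groupshared uint gs_ZMax;

[numthreads(16, 16, 1)]
void CSTileDepthBounds( uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex )
{
    if (gidx == 0) { gs_ZMin = 0x7F7FFFFF; gs_ZMax = 0; } // FLT_MAX and 0.0
    GroupMemoryBarrierWithGroupSync();

    // Each of the 256 threads folds one pixel of the 16x16 tile into the bounds.
    uint z = asuint( g_DepthBuffer[dtid.xy] );
    InterlockedMin(gs_ZMin, z);
    InterlockedMax(gs_ZMax, z);
    GroupMemoryBarrierWithGroupSync();

    float tileZMin = asfloat(gs_ZMin); // front plane for this tile's frustum
    float tileZMax = asfloat(gs_ZMax); // back plane
    // … build the frustum planes and test lights (next slides)
}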
37. TILED LIGHTING
CREATE PER TILE LIGHT LIST (PART 1)
Test each light against the frustum
[Diagram: the scene’s light list (Light0 through Light10, each with a Position and a Radius) is tested against the tile frustum]
38. TILED LIGHTING
CREATE PER TILE LIGHT LIST (PART 2)
For each light that intersects the frustum
‒ Write the index of the light in the per-tile list
[Diagram: Light1 and Light4 intersect the tile frustum, so the per-tile list stores a count followed by light indices 1 and 4; the remaining entries stay empty]
39. TILED LIGHTING
COMPUTE SHADER
Implemented using a single compute shader
A thread group is executed per tile
‒ e.g. [numthreads(16,16,1)] for 16x16 tile size
Build frustum
Calculate Z extent
‒ Each thread calculates Z extent in parallel
256 lights are culled in parallel (for 16x16 tile size)
Indices of intersecting lights are written to thread group shared memory (TGSM)
‒ groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];
Then export to global light index list
‒ RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );
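Pulled together, a sketch of the per-tile culling and export (the intersection helper, light layout, and constants are hypothetical stand-ins for the sample’s real code):

#define MAX_NUM_LIGHTS_PER_TILE 64 // hypothetical cap

cbuffer cbCull : register( b0 )    // hypothetical constants
{
    uint g_NumLights;
    uint g_NumTilesX;
};

StructuredBuffer<float4> g_Lights : register( t0 ); // xyz = position, w = radius
RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );

groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];
groupshared uint ldsLightCount;

// Hypothetical helper: sphere vs. the tile frustum built on the previous slides.
bool SphereIntersectsTileFrustum( float4 light, uint2 tile ) { /* … */ return true; }

[numthreads(16, 16, 1)]
void CSCullLights( uint3 gid : SV_GroupID, uint gidx : SV_GroupIndex )
{
    if (gidx == 0) ldsLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // 256 threads per tile: each tests one light in parallel.
    if (gidx < g_NumLights && SphereIntersectsTileFrustum(g_Lights[gidx], gid.xy))
    {
        uint slot;
        InterlockedAdd(ldsLightCount, 1, slot); // reserve a slot in the per-tile list
        if (slot < MAX_NUM_LIGHTS_PER_TILE)
            ldsLightIdx[slot] = gidx;
    }
    GroupMemoryBarrierWithGroupSync();

    // Export the list to the global per-tile light index buffer.
    uint tileOffset = (gid.y * g_NumTilesX + gid.x) * MAX_NUM_LIGHTS_PER_TILE;
    if (gidx < min(ldsLightCount, MAX_NUM_LIGHTS_PER_TILE))
        g_PerTileLightIndexBufferOut[tileOffset + gidx] = ldsLightIdx[gidx];
}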
41. TressFX PHYSICS
PHYSICALLY BASED HAIR SIMULATION
Responds to natural gravity
Global constraints to maintain various hair styles
Collision detection with head and body
Supports wind and other forces
Artist-friendly
‒ Programmatic control over attributes
‒ Can tweak for different results
The entire simulation is done on Compute Shaders
43. TressFX PHYSICS
DATA FLOW
[Diagram: raw vertex data in CPU memory is loaded into a Hair Vertex Data UAV on the GPU, then flows through four compute shaders in sequence (Integration and Global Shapes, Local Shape Constraints, Length Constraints and Wind, Collision and Tangents) before the Hair Render Vertex Shader consumes it]
Vertex data is only loaded once during the life of the program; after that it stays on the GPU!
44. TressFX PHYSICS
BENEFITS OF USING DirectCompute TO SIMULATE HAIR
Interoperability with Direct3D
‒ UAVs can be shared with rendering so data can stay on the GPU
Massively Parallel
‒ Thousands of hair vertices can be animated in parallel
‒ Each thread in the thread group represents a vertex (or strand)
Calculations done on data stored locally on chip
‒ Vertex data is loaded into Thread Group Shared Memory before calculations begin
//------------------------------
// Copy data into shared memory
//------------------------------
if ( localVertexIndex < numVerticesInTheStrand )
{
    currentPos = sharedPos[indexForSharedMem] = g_HairVertexPositions[globalVertexIndex];
    initialPos = g_InitialHairPositions[globalVertexIndex];
    initialPos.xyz = mul(float4( initialPos.xyz, 1 ), g_ModelTransformForHead).xyz;
}
GroupMemoryBarrierWithGroupSync();
45. DirectCompute in Games
OTHER USES
HDAO+
‒ New version of our High Definition Ambient Occlusion
‒ Uses TGSM for storing samples
Bokeh Depth of Field
‒ Used in the Frostbite 3 engine for rendering Bokeh shapes where “hotspots” are located
Global Illumination
‒ Geomerics
Scene Management
Voxels
‒ Sparse Voxel Trees
‒ Octrees, Quadtrees
Particle Systems
46. Summary of DirectCompute in Games
WHEN TO USE DirectCompute
Parallelism
‒ Some algorithms can be highly parallelized
Repetitive Fetches
‒ Caching values in TGSM can be a big win
Results used in the graphics pipeline
‒ DirectCompute can reduce transfers between GPU and CPU
Existing algorithms already optimized for DirectCompute
‒ Separable Filters
‒ Tiled Lighting
‒ Hair Physics
‒ HDAO+
‒ We have samples you can use!
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
47. QUESTIONS?
Ask now, or send questions to:
bill.bilodeau@amd.com