4. GPU COMPUTE
WHY WE NEED COMPUTE IN GAMES
Games can be CPU limited on some configurations
‒ Many traditional CPU tasks can be offloaded to the GPU
Some rendering algorithms are faster when implemented using Compute
‒ Post-processing techniques
Some tasks are a natural fit for Compute
‒ Non-graphics Data Parallel programming
‒ Physics
5. DX11 GRAPHICS PIPELINE
WHY WE NEED DirectCompute
Too many unnecessary stages for compute
Old DX9 “GPGPU” programming
‒ Render a full screen quad
‒ Use the Pixel Shader for compute
[Diagram: the full DX11 graphics pipeline (Input Assembly, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader with Stream Out, Rasterizer, Pixel Shader, Depth Test, Output Merger) and its resource bindings (Textures, Buffers, Constants, Render Targets, Depth/Stencil, UAVs)]
6. DirectCompute PIPELINE
ONLY WHAT’S NECESSARY
API designed for compute programming
‒ Interoperability with DirectX
Pipeline is much simpler
‒ No need to render triangles
‒ Bind input and output then call Dispatch()
[Diagram: SRVs → Compute Shader → UAVs]
7. DirectCompute FEATURES
MORE REASONS TO USE DirectCompute
Structured Buffers
‒ Map better to general-purpose data structures
Append/Consume buffers
‒ Fast for non-order-dependent I/O
Atomics, Barriers
‒ Synchronization for finer-grained thread control
Thread Group Shared Memory
‒ Fast on-chip local memory shared between threads in a group
‒ Great for intermediate results
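As a quick illustration, here is a minimal HLSL sketch (the particle layout and all names are hypothetical) that touches each of these features:

struct Particle { float3 pos; float3 vel; };

StructuredBuffer<Particle> g_ParticlesIn : register( t0 );    // structured buffer: general-purpose data
AppendStructuredBuffer<Particle> g_AliveOut : register( u0 ); // append buffer: order-independent output

groupshared uint gs_AliveCount; // thread group shared memory: fast on-chip storage

[numthreads(64, 1, 1)]
void CSMain( uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex )
{
    if (gidx == 0)
        gs_AliveCount = 0;
    GroupMemoryBarrierWithGroupSync(); // barrier: every thread sees the cleared counter

    Particle p = g_ParticlesIn[dtid.x];
    if (p.pos.y > 0.0f)                   // hypothetical “still alive” test
    {
        InterlockedAdd(gs_AliveCount, 1); // atomic: safe concurrent update
        g_AliveOut.Append(p);             // output order doesn’t matter
    }
}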
8. DISPATCH
THREAD GROUP ORGANIZATION
Threads are organized into Thread Groups
Each thread is executing an instance of the compute shader code
ID3D11DeviceContext::Dispatch( Dx, Dy, Dz )
‒ Example: Dispatch(4,3,2) => 24 thread groups
[Diagram: the 24 thread groups arranged as a 4 × 3 × 2 grid, with group IDs from (0,0,0) to (3,2,1)]
9. THREAD GROUP
ORGANIZATION OF THREADS WITHIN A THREAD GROUP
Threads in a group are organized in an array
‒ For full screen rendering you can think of a thread group as a tile on the screen, and the thread as a pixel
The compute shader HLSL code declares the thread layout
‒ [numthreads(Tx, Ty, Tz)]
‒ Example: [numthreads(8, 4, 2)] => 64 threads per thread group
System values provide information about the current thread
‒ SV_GroupID = thread group address, SV_GroupThreadID = thread address relative to the thread group
[Diagram: the 64 threads of one group arranged as an 8 × 4 × 2 grid, with thread IDs from (0,0,0) to (7,3,1)]
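The relationship between these system values is fixed by the Dispatch() and [numthreads] declarations; a minimal sketch (the debug buffer is hypothetical), assuming the Dispatch(4,3,2) example from the previous slide:

RWStructuredBuffer<uint3> g_DebugIDs : register( u0 ); // hypothetical debug output

[numthreads(8, 4, 2)]
void CSMain( uint3 gid  : SV_GroupID,          // thread group address, (0,0,0) to (3,2,1)
             uint3 gtid : SV_GroupThreadID,    // thread address in the group, (0,0,0) to (7,3,1)
             uint3 dtid : SV_DispatchThreadID, // global address: gid * uint3(8,4,2) + gtid
             uint  gidx : SV_GroupIndex )      // flattened gtid: z * (8*4) + y * 8 + x
{
    // Have the first thread of each of the 24 groups record its global ID.
    uint groupFlat = gid.z * (4 * 3) + gid.y * 4 + gid.x; // assumes Dispatch(4,3,2)
    if (gidx == 0)
        g_DebugIDs[groupFlat] = dtid;
}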
10. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
pd3dDevice->CreateBuffer( &Desc, NULL, &pCBCSPerFrame );
pd3dDevice->CreateBuffer( &bufferDesc, NULL, &pRWBuffer );
11. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
pd3dDevice->CreateUnorderedAccessView( pRWBuffer, &UAVDesc, &pUAV );
12. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
CompileShaderFromFile( L"myComputeShader.hlsl", "CSMain", "cs_5_0", &pBlob );
pd3dDevice->CreateComputeShader( pBlob, . . . , &pComputeShader );
13. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
Bind the resources
pd3dContext->CSSetConstantBuffers( 0, 1, &pCBCSPerFrame );
ID3D11UnorderedAccessView* ppUAV[1] = { pUAV };
pd3dContext->CSSetUnorderedAccessViews( 0, 1, ppUAV, NULL );
14. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
Bind the resources
Set the shader
pd3dContext->CSSetShader( pComputeShader, NULL, 0 );
15. DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION
Create the resources
Create the views
Compile the compute shader
Bind the resources
Set the shader
Go!
pd3dContext->Dispatch(numOfGroupsX, numOfGroupsY, 1);
16. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE
numthreads
‒ Defines the number of threads in a thread group
‒ Put this right before the definition of the compute shader
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
17. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
Thread Group Shared Memory
‒ Variables that are stored in memory shared between threads in a group (up to 32KB per group)
‒ Use the groupshared modifier in front of the variable declaration
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
18. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
System Values that can be used in a Compute Shader
‒ SV_GroupID, SV_GroupThreadID, SV_DispatchThreadID, SV_GroupIndex
‒ The system values tell the compute shader what group and what thread it’s currently working on.
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
19. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
Synchronization
‒ Barriers
‒ GroupMemoryBarrier(), GroupMemoryBarrierWithGroupSync()
‒ GroupMemoryBarrier() waits for the group’s shared memory accesses to complete; the WithGroupSync variant also blocks execution until every thread in the group has reached the barrier
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
20. COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
Synchronization
‒ Atomics
‒ “Interlocked” versions of Add, Min, Max, Or, And, Xor, CompareStore, and Exchange
‒ Guarantee exclusive access to shared memory
groupshared float4 sharedPos[THREAD_GROUP_SIZE];
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
…
sharedPos[GIndex] = gThreadData[myIndex];
GroupMemoryBarrierWithGroupSync();
InterlockedExchange(myUAV[loc], newVal, oldVal);
}
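As a concrete use, a per-group histogram sketch (the buffer names and the 256-bin layout are assumptions) shows barriers and atomics working together:

Buffer<uint> g_Values : register( t0 );       // hypothetical input values
RWBuffer<uint> g_Histogram : register( u0 );  // hypothetical 256-bin result

groupshared uint gs_Bins[256];

[numthreads(256, 1, 1)]
void CSHistogram( uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex )
{
    gs_Bins[gidx] = 0;                 // each thread clears one bin
    GroupMemoryBarrierWithGroupSync(); // all bins cleared before counting starts

    InterlockedAdd(gs_Bins[g_Values[dtid.x] & 255], 1); // atomic add into TGSM
    GroupMemoryBarrierWithGroupSync(); // all counts done before the export

    InterlockedAdd(g_Histogram[gidx], gs_Bins[gidx]);   // merge group result into the UAV
}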
22. GCN REFRESHER
COMPUTE UNIT
Basic GPU building block
‒ The R9 290X has 44 Compute Units
Four 16-wide SIMDs that can execute instructions from multiple threads
‒ Each SIMD has 64KB of vector general purpose register (VGPR) memory
16 Texture Fetch (load/store) Units
‒ 16KB L1 cache
One Scalar Unit
‒ 8KB of scalar GPR memory
Local Data Share
‒ 64KB of shared memory
[Diagram: a GCN Compute Unit: Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
23. WAVEFRONTS
HOW THREADS EXECUTE ON COMPUTE UNITS
A wavefront consists of 4 batches of threads
‒ Execution is time-sliced to hide latency
Since there are 16 ALUs per SIMD, the size of a wavefront is 4 x 16 = 64 threads
Each SIMD can keep track of 10 wavefronts
‒ 40 wavefronts per CU = 2560 threads
‒ For the R9 290X, that’s a total of 44 x 2560 = 112,640 threads in flight!
[Diagram: time (in clocks) on one SIMD: Batches 1 through 4 are time-sliced, each cycling between Runnable and Stall until Done, so a stall in one batch is hidden by running another]
24. GCN AND DirectCompute
HOW DirectCompute MAPS TO THE GPU
DirectCompute thread groups correspond to groups of threads running on a Compute Unit
‒ Choose a thread group size that works well with the hardware
Thread Group Shared Memory is stored in the Compute Unit’s Local Data Share (LDS)
‒ 4 SIMDs (64 ALUs) are all sharing the same LDS
‒ Barriers and atomics synchronize access to this storage
25. OPTIMIZE FOR THE HARDWARE
IMPORTANT THINGS TO CONSIDER WHEN DESIGNING A COMPUTE SHADER
Thread Group Size
‒ Since there are 64 threads per wavefront, make the thread group size a multiple of 64
‒ If [numthreads(x,y,z)] then (x * y * z) should be a multiple of 64
‒ If the thread group size is not a multiple of 64, the SIMDs will still run 64 threads per wavefront
‒ Whether they’re all used or not!
‒ Start with 64 threads per group and then see if higher multiples improve performance
LDS memory access is faster than off-chip memory
‒ Use thread group shared memory for storing intermediate results instead of writing out to the UAV
‒ Thread group shared memory can also be used to eliminate unnecessary fetches
‒ For example, discrete convolutions sample many of the same locations when the kernel moves to an adjacent pixel
‒ TGSM is limited to 32KB, so pack the data to save space (see the packing sketch below)
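One way to pack: a sketch (the array size is hypothetical) that stores two half-precision values per uint with the f32tof16/f16tof32 intrinsics, halving the TGSM footprint:

groupshared uint gs_Packed[256]; // 256 float2 values in half the space

[numthreads(256, 1, 1)]
void CSPackExample( uint gidx : SV_GroupIndex )
{
    float2 value = float2(gidx, gidx * 2); // stand-in for a real intermediate result

    // Pack two 16-bit floats into one 32-bit uint.
    gs_Packed[gidx] = f32tof16(value.x) | (f32tof16(value.y) << 16);
    GroupMemoryBarrierWithGroupSync();

    // Unpack when the value is needed again (costs precision, saves space).
    uint p = gs_Packed[gidx];
    float2 restored = float2(f16tof32(p & 0xFFFF), f16tof32(p >> 16));
}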
26. OPTIMIZE FOR THE HARDWARE
THREAD GROUP SHARED MEMORY ACCESS
Avoid Bank Conflicts
‒ Thread Group Shared Memory is stored in LDS which is organized in 32 banks
‒ If one thread accesses a location and another thread accesses location + (n x 32), a bank conflict will occur
‒ The hardware can resolve this, but at the cost of performance
[Table: LDS addresses map to banks modulo 32: addresses 0 through 31 map to banks 0 through 31, then address 32 wraps back to bank 0, address 33 to bank 1, and so on]
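A common fix is to pad shared arrays so that strided accesses land in different banks; a sketch of a 32 x 32 tile transpose (the resource names are hypothetical):

groupshared float gs_Tile[32 * 33]; // 33, not 32: one padding element per row

Texture2D<float> g_Input : register( t0 );    // hypothetical
RWTexture2D<float> g_Output : register( u0 ); // hypothetical

[numthreads(32, 32, 1)]
void CSTranspose( uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID )
{
    // Row-major write: consecutive threads hit consecutive banks.
    gs_Tile[gtid.y * 33 + gtid.x] = g_Input[gid.xy * 32 + gtid.xy];
    GroupMemoryBarrierWithGroupSync();

    // Column-major read: the stride is 33, so consecutive threads still land
    // in different banks. With a stride of 32 they would all share one bank.
    g_Output[gid.yx * 32 + gtid.xy] = gs_Tile[gtid.x * 33 + gtid.y];
}

The one-element pad costs 128 bytes of TGSM per tile but removes a 32-way conflict on the column read.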
27. OPTIMIZE FOR THE HARDWARE
GPR USAGE
Local variables are stored in general purpose registers (GPRs)
‒ There’s a limit of 64KB of vector register storage per SIMD
‒ The register memory has to be shared between wavefronts, so using too many GPRs can reduce the number of wavefronts in flight
‒ Maximizing parallelism is critical to compute shader performance (and shader performance in general)
‒ The more wavefronts that can run, the more parallelism you can get
‒ Using fewer local variables can help, but look at the DirectX assembly code to get a better idea of how many registers are being used
GCN VGPR count:  <=24  28  32  36  40  48  64  84  <=128  >128
Max waves/SIMD:   10    9   8   7   6   5   4   3    2      1
30. SEPARABLE FILTER
TWO PASS DISCRETE CONVOLUTIONS
Many useful filters are separable
‒ Typical example is Gaussian filter for a high quality image blur – used often in post-processing
Separable Algorithm
‒ Run the first pass in the horizontal direction and store the results in an intermediate buffer
‒ Run the second pass on the intermediate buffer in the vertical direction
[Diagram: Source RT → Horizontal Pass → Intermediate RT → Vertical Pass → Destination RT]
31. SEPARABLE FILTER
COMPUTE SHADER OPTIMIZATION
Use Thread Group Shared Memory to cache previously sampled values
The naïve approach is to load a row of texels into TGSM at once, then calculate each pixel value
[Diagram: 128 threads load 128 texels, but only 128 - (Kernel Radius * 2) threads compute results; the threads covering the kernel radius at each end of the row are redundant]
32. SEPARABLE FILTER
BETTER OPTIMIZATION
Make some of the threads load more texels
‒ We only want as many threads as there are pixels
‒ However, we need more texels than pixels since the kernel extends beyond the pixels
‒ Just have some of the threads load more than one texel
Process multiple lines per thread group
‒ Keeps the thread group size a multiple of 64
[Diagram: 64 threads load 256 texels (Kernel Radius * 4 of the threads load one extra texel each), then the same 64 threads compute 256 results]
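Putting it together, a sketch of the horizontal pass (the radius, line length, and names are hypothetical; it processes one 64-pixel line per group rather than four, and uses uniform weights for brevity):

#define KERNEL_RADIUS 8
#define LINE_LENGTH   64 // pixels computed per thread group

Texture2D<float4> g_Source : register( t0 ); // hypothetical
RWTexture2D<float4> g_Dest : register( u0 ); // hypothetical
groupshared float4 gs_Line[LINE_LENGTH + 2 * KERNEL_RADIUS];

[numthreads(LINE_LENGTH, 1, 1)]
void CSHorizontalBlur( uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID )
{
    int2 pixel = int2(gid.x * LINE_LENGTH + gtid.x, gid.y);

    // Every thread loads its own texel; the threads nearest each end also load
    // one apron texel each (out-of-bounds loads return 0 in D3D11).
    gs_Line[gtid.x + KERNEL_RADIUS] = g_Source[pixel];
    if (gtid.x < KERNEL_RADIUS)
        gs_Line[gtid.x] = g_Source[pixel - int2(KERNEL_RADIUS, 0)];
    if (gtid.x >= LINE_LENGTH - KERNEL_RADIUS)
        gs_Line[gtid.x + 2 * KERNEL_RADIUS] = g_Source[pixel + int2(KERNEL_RADIUS, 0)];
    GroupMemoryBarrierWithGroupSync();

    // All 64 threads produce a result from the cached line, with no redundancy.
    float4 sum = 0;
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        sum += gs_Line[gtid.x + KERNEL_RADIUS + i];
    g_Dest[pixel] = sum / (2 * KERNEL_RADIUS + 1);
}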
34. TILED LIGHTING
COMPUTE SHADER LIGHT CULLING
Break up the screen into tiles
Create a list of lights for each tile
Only render with lights touching the tile
[Diagram: three lights (1, 2, 3) overlap a tiled screen; the tiles build per-tile light lists such as [1], [1,2,3], and [2,3]]
35. TILED LIGHTING
TILED FRUSTUMS
Divide the screen into tiles
Fit an asymmetric frustum around each tile
[Diagram: side view of the view volume divided into asymmetric per-tile frustums for Tile0 through Tile3]
36. TILED LIGHTING
CREATE FRONT AND BACK PLANES
Use z buffer from depth pre-pass as input
Find min and max depth per tile
Use this frustum for intersection testing
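A common way to do the min/max reduction is with integer atomics on asuint depth values, since for non-negative floats the IEEE bit pattern preserves ordering; a sketch (names hypothetical):

Texture2D<float> g_DepthBuffer : register( t0 ); // depth pre-pass result

groupshared uint gs_ZMin; // asuint-encoded depth bounds for the tile
groupshared uint gs_ZMax;

[numthreads(16, 16, 1)]
void CSTileDepthBounds( uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex )
{
    if (gidx == 0) { gs_ZMin = 0x7F7FFFFF; gs_ZMax = 0; } // FLT_MAX and 0.0
    GroupMemoryBarrierWithGroupSync();

    // Each of the 256 threads folds one pixel of the 16x16 tile into the bounds.
    uint z = asuint( g_DepthBuffer[dtid.xy] );
    InterlockedMin(gs_ZMin, z);
    InterlockedMax(gs_ZMax, z);
    GroupMemoryBarrierWithGroupSync();

    float tileZMin = asfloat(gs_ZMin); // front plane for this tile's frustum
    float tileZMax = asfloat(gs_ZMax); // back plane
    // … build the frustum planes and test lights (next slides)
}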
37. TILED LIGHTING
CREATE PER TILE LIGHT LIST (PART 1)
Test each light against the frustum
[Diagram: the scene’s light list (Light0 through Light10, each with a Position and a Radius) is tested against the tile frustum]
38. TILED LIGHTING
CREATE PER TILE LIGHT LIST (PART 2)
For each light that intersects the frustum
‒ Write the index of the light in the per-tile list
[Diagram: Light1 and Light4 intersect the tile frustum, so the per-tile list stores a count followed by light indices 1 and 4; the remaining entries stay empty]
39. TILED LIGHTING
COMPUTE SHADER
Implemented using a single compute shader
A thread group is executed per tile
‒ e.g. [numthreads(16,16,1)] for 16x16 tile size
Build frustum
Calculate Z extent
‒ Each thread calculates Z extent in parallel
256 lights are culled in parallel (for 16x16 tile size)
Indices of intersecting lights are written to thread group shared memory (TGSM)
‒ groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];
Then export to global light index list
‒ RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );
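Pulled together, a sketch of the per-tile culling and export (the intersection helper, light layout, and constants are hypothetical stand-ins for the sample’s real code):

#define MAX_NUM_LIGHTS_PER_TILE 64 // hypothetical cap

cbuffer cbCull : register( b0 )    // hypothetical constants
{
    uint g_NumLights;
    uint g_NumTilesX;
};

StructuredBuffer<float4> g_Lights : register( t0 ); // xyz = position, w = radius
RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );

groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];
groupshared uint ldsLightCount;

// Hypothetical helper: sphere vs. the tile frustum built on the previous slides.
bool SphereIntersectsTileFrustum( float4 light, uint2 tile ) { /* … */ return true; }

[numthreads(16, 16, 1)]
void CSCullLights( uint3 gid : SV_GroupID, uint gidx : SV_GroupIndex )
{
    if (gidx == 0) ldsLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // 256 threads per tile: each tests one light in parallel.
    if (gidx < g_NumLights && SphereIntersectsTileFrustum(g_Lights[gidx], gid.xy))
    {
        uint slot;
        InterlockedAdd(ldsLightCount, 1, slot); // reserve a slot in the per-tile list
        if (slot < MAX_NUM_LIGHTS_PER_TILE)
            ldsLightIdx[slot] = gidx;
    }
    GroupMemoryBarrierWithGroupSync();

    // Export the list to the global per-tile light index buffer.
    uint tileOffset = (gid.y * g_NumTilesX + gid.x) * MAX_NUM_LIGHTS_PER_TILE;
    if (gidx < min(ldsLightCount, MAX_NUM_LIGHTS_PER_TILE))
        g_PerTileLightIndexBufferOut[tileOffset + gidx] = ldsLightIdx[gidx];
}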
41. TressFX PHYSICS
PHYSICALLY BASED HAIR SIMULATION
Responds to natural gravity
Global constraints to maintain various hair styles
Collision detection with head and body
Supports wind and other forces
Artist-friendly
‒ Programmatic control over attributes
‒ Can tweak for different results
The entire simulation is done on Compute Shaders
43. TressFX PHYSICS
DATA FLOW
[Diagram: raw vertex data in CPU memory is loaded into a Hair Vertex Data UAV on the GPU, then flows through four compute shaders in sequence (Integration and Global Shapes, Local Shape Constraints, Length Constraints and Wind, Collision and Tangents) before the Hair Render Vertex Shader consumes it]
Vertex data is only loaded once during the life of the program; after that it stays on the GPU!
44. TressFX PHYSICS
BENEFITS OF USING DirectCompute TO SIMULATE HAIR
Interoperability with Direct3D
‒ UAVs can be shared with rendering so data can stay on the GPU
Massively Parallel
‒ Thousands of hair vertices can be animated in parallel
‒ Each thread in the thread group represents a vertex (or strand)
Calculations done on data stored locally on chip
‒ Vertex data is loaded into Thread Group Shared Memory before calculations begin
//------------------------------
// Copy data into shared memory
//------------------------------
if ( localVertexIndex < numVerticesInTheStrand )
{
    currentPos = sharedPos[indexForSharedMem] = g_HairVertexPositions[globalVertexIndex];
    initialPos = g_InitialHairPositions[globalVertexIndex];
    initialPos.xyz = mul(float4( initialPos.xyz, 1 ), g_ModelTransformForHead).xyz;
}
GroupMemoryBarrierWithGroupSync();
45. DirectCompute in Games
OTHER USES
HDAO+
‒ New version of our High Definition Ambient Occlusion
‒ Uses TGSM for storing samples
Bokeh Depth of Field
‒ Used in the Frostbite 3 engine for rendering Bokeh shapes where “hotspots” are located
Global Illumination
‒ Geomerics
Scene Management
Voxels
‒ Sparse Voxel Trees
‒ Octrees, Quadtrees
Particle Systems
46. Summary of DirectCompute in Games
WHEN TO USE DirectCompute
Parallelism
‒ Some algorithms can be highly parallelized
Repetitive Fetches
‒ Caching values in TGSM can be a big win
Results used in the graphics pipeline
‒ DirectCompute can reduce transfers between GPU and CPU
Existing algorithms already optimized for DirectCompute
‒ Separable Filters
‒ Tiled Lighting
‒ Hair Physics
‒ HDAO+
‒ We have samples you can use!
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
47. QUESTIONS?
Ask now, or send questions to:
bill.bilodeau@amd.com