DirectCompute in Gaming
BILL BILODEAU
DEVELOPER TECHNOLOGY ENGINEER, AMD
AGENDA
TOPICS COVERED IN THIS TALK

 Introduction to DirectCompute

 GCN Overview
 DirectCompute Programming
 Optimization Techniques
 Examples of Compute in Games
‒ Separable Filter
‒ Tiled Lighting
‒ TressFX Physics

2 | DirectCompute in Gaming| NOVEMBER 15 2013 | AMD DEVELOPER SUMMIT
INTRODUCTION TO DirectCompute
GPU COMPUTE
WHY WE NEED COMPUTE IN GAMES

 Games can be CPU limited on some configurations
‒ Many traditional CPU tasks can be offloaded to the GPU

 Some rendering algorithms are faster when implemented using Compute
‒ Post-processing techniques

 Some tasks are a natural fit for Compute
‒ Non-graphics Data Parallel programming
‒ Physics

DX11 GRAPHICS PIPELINE
WHY WE NEED DirectCompute

 Too many unnecessary stages for compute

 Old DX9 “GPGPU” programming
‒ Render full screen quad
‒ Use Pixel Shader for compute

[Diagram: DX11 graphics pipeline. Stages: Input Assembly, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out, Rasterizer, Pixel Shader, Output Merger (Depth Test). Resources: Textures, Buffers, Constants, UAVs, Render Targets, Depthstencil.]
DirectCompute PIPELINE
ONLY WHAT’S NECESSARY

 API designed for compute programming
‒ Interoperability with DirectX

 Pipeline is much simpler
‒ No need to render triangles
‒ Bind input and output then call Dispatch()

[Diagram: SRVs -> Compute Shader -> UAVs]
DirectCompute FEATURES
MORE REASONS TO USE DirectCompute

 Structured Buffers
‒ Maps better to general purpose data structures

 Append/Consume
‒ Fast for order-independent I/O

 Atomics, Barriers
‒ Synchronization for finer-grained thread control

 Thread Group Shared Memory
‒ Fast on-chip local memory shared between threads in a group
‒ Great for intermediate results

DISPATCH
THREAD GROUP ORGANIZATION

 Threads are organized into Thread Groups

 Each thread is executing an instance of the compute shader code
 ID3D11DeviceContext::Dispatch( Dx, Dy, Dz )
‒ Example: Dispatch(4,3,2) => 24 thread groups

0,0,0 1,0,0 2,0,0 3,0,0
0,0,1 1,0,1 2,0,1 3,0,1
0,1,0 1,1,0 2,1,0 3,1,0
0,1,1 1,1,1 2,1,1 3,1,1
0,2,0 1,2,0 2,2,0 3,2,0
0,2,1 1,2,1 2,2,1 3,2,1
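The group counts passed to Dispatch() are usually derived from the size of the data. The following is a minimal CPU-side sketch of that arithmetic, not code from the talk: the helper names (divRoundUp, groupsForImage, totalGroups) are hypothetical, and it assumes one thread per pixel with a fixed tile size per thread group.

```cpp
#include <cassert>
#include <cstdint>

// Thread group counts along each axis, i.e. the three arguments to Dispatch().
struct GroupCounts {
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// Integer division that rounds up, so a partially covered edge still gets a group.
inline std::uint32_t divRoundUp(std::uint32_t value, std::uint32_t divisor) {
    assert(divisor > 0);
    return (value + divisor - 1) / divisor;
}

// Hypothetical helper: group counts needed to cover a width x height image
// when each thread group handles a tileWidth x tileHeight block of pixels.
inline GroupCounts groupsForImage(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t tileWidth, std::uint32_t tileHeight) {
    return GroupCounts{divRoundUp(width, tileWidth), divRoundUp(height, tileHeight), 1};
}

// Total number of thread groups launched by Dispatch(x, y, z).
inline std::uint32_t totalGroups(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return x * y * z;
}
```

For a 1920x1080 target with 16x16 tiles this gives 120 x 68 x 1 groups. The last row of groups extends past the bottom of the image, so a shader written this way would need a bounds check before writing.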
THREAD GROUP
ORGANIZATION OF THREADS WITHIN A THREAD GROUP

 Threads in a group are organized in an array
‒ For full screen rendering you can think of a thread group as a tile on the screen, and the thread as a pixel

 The compute shader HLSL code declares the thread layout
‒ [numthreads(Tx, Ty, Tz)]
‒ Example: [numthreads(8, 4, 2)] => 64 threads per thread group

 System values provide information about the current thread
‒ SV_GroupID = thread group address, SV_GroupThreadID = thread address relative to thread group

0,0,0 1,0,0 2,0,0 3,0,0 4,0,0 5,0,0 6,0,0 7,0,0
0,0,1 1,0,1 2,0,1 3,0,1 4,0,1 5,0,1 6,0,1 7,0,1
0,1,0 1,1,0 2,1,0 3,1,0 4,1,0 5,1,0 6,1,0 7,1,0
0,1,1 1,1,1 2,1,1 3,1,1 4,1,1 5,1,1 6,1,1 7,1,1
0,2,0 1,2,0 2,2,0 3,2,0 4,2,0 5,2,0 6,2,0 7,2,0
0,2,1 1,2,1 2,2,1 3,2,1 4,2,1 5,2,1 6,2,1 7,2,1
0,3,0 1,3,0 2,3,0 3,3,0 4,3,0 5,3,0 6,3,0 7,3,0
0,3,1 1,3,1 2,3,1 3,3,1 4,3,1 5,3,1 6,3,1 7,3,1
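The relationship between the system values can be written out on the CPU. This is an illustrative sketch only: the UInt3 type and function names are hypothetical, and the formulas are the ones documented for SV_DispatchThreadID and SV_GroupIndex in HLSL.

```cpp
#include <cassert>
#include <cstdint>

struct UInt3 {
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// SV_DispatchThreadID: SV_GroupID * numthreads + SV_GroupThreadID, per component.
inline UInt3 dispatchThreadId(UInt3 groupId, UInt3 groupThreadId, UInt3 numThreads) {
    assert(groupThreadId.x < numThreads.x && groupThreadId.y < numThreads.y &&
           groupThreadId.z < numThreads.z);
    return UInt3{groupId.x * numThreads.x + groupThreadId.x,
                 groupId.y * numThreads.y + groupThreadId.y,
                 groupId.z * numThreads.z + groupThreadId.z};
}

// SV_GroupIndex: the thread's position within its group, flattened to one number.
inline std::uint32_t groupIndex(UInt3 groupThreadId, UInt3 numThreads) {
    return groupThreadId.z * numThreads.x * numThreads.y +
           groupThreadId.y * numThreads.x + groupThreadId.x;
}
```

With [numthreads(8, 4, 2)], the last thread of a group (7,3,1) has a group index of 63, which matches the 64 threads per group in the example above.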
DirectCompute SETUP
C++ CODE TO SET UP COMPUTE SHADER EXECUTION

 Create the resources
 pd3dDevice->CreateBuffer(&Desc, NULL, &pCBCSPerFrame);
 pd3dDevice->CreateBuffer(&bufferDesc, NULL, &pRWBuffer);

 Create the views
 pd3dDevice->CreateUnorderedAccessView(pRWBuffer, &UAVDesc, &pUAV);

 Compile the compute shader
 CompileShaderFromFile( L"myComputeShader.hlsl", "CSMain", "cs_5_0", &pBlob );
 pd3dDevice->CreateComputeShader( pBlob, . . . , &pComputeShader );

 Bind the resources
 pd3dContext->CSSetConstantBuffers(0, 1, &pCBCSPerFrame);
 ID3D11UnorderedAccessView* ppUAV[1] = {pUAV};
 pd3dContext->CSSetUnorderedAccessViews(0, 1, ppUAV, NULL);

 Set the shader
 pd3dContext->CSSetShader(pComputeShader, NULL, 0);

 Go!

 pd3dContext->Dispatch(numOfGroupsX, numOfGroupsY, 1);

COMPUTE SHADER CODE
IMPORTANT ADDITIONS TO HLSL FOR COMPUTE

 numthreads
‒ Defines the number of threads in a thread group
‒ Put this right before the definition of the compute shader

groupshared float4 sharedPos[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void MyComputeShader(uint GIndex : SV_GroupIndex)
{
    …
    sharedPos[GIndex] = gThreadData[myIndex];         // stage data in thread group shared memory
    GroupMemoryBarrierWithGroupSync();                // wait until every thread in the group has written
    InterlockedExchange(myUAV[loc], newVal, oldVal);  // atomic swap; oldVal receives the previous value
}

 Thread Group Shared Memory
‒ Variables that are stored in memory that is shared between threads in a group (up to 32K bytes per group)
‒ Use the groupshared modifier in front of the variable declaration

 System Values that can be used in a Compute Shader
‒ SV_GroupID , SV_GroupThreadID , SV_DispatchThreadID , SV_GroupIndex
‒ The system values tell the compute shader what group and what thread it’s currently working on.

 Synchronization
‒ Barriers
‒ GroupMemoryBarrier(), GroupMemoryBarrierWithGroupSync()
‒ The WithGroupSync variant blocks execution until all threads in the group reach the barrier
‒ Atomics
‒ “Interlocked” versions of Add, Min, Max, Or, And, Xor, CompareStore, and Exchange
‒ Guarantee exclusive access to shared memory
DirectCompute AND GCN
GCN REFRESHER
COMPUTE UNIT

 Basic GPU building block
‒ R9 290x has 44 Compute Units

 (4) 16-wide SIMDs that can execute instructions from multiple threads
‒ Each SIMD has 64KB of vector general purpose register (VGPR) memory

 16 Texture Fetch (load/store) Units
‒ 16KB L1 cache

 One Scalar Unit
‒ 8KB of Scalar GPR memory

 Local Data Share
‒ 64K shared memory

[Diagram: GCN Compute Unit. Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
WAVEFRONTS
HOW THREADS EXECUTE ON COMPUTE UNITS

 A wavefront consists of 4 batches of threads
‒ Execution is time-sliced to hide latency

 Since there are 16 ALUs per SIMD, the size of a wavefront is 4 x 16 = 64 threads

 Each SIMD can keep track of 10 wavefronts
‒ 40 wavefronts per CU = 2560 threads
‒ For the R9 290x, that’s a total of 44 x 2560 = 112,640 threads in flight!

[Diagram: timeline (clocks) of Batch 1 to Batch 4, each alternating between Runnable and Stall until Done]
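The thread counts on this slide follow directly from the hardware figures quoted in the talk. A small sketch of the arithmetic, with a hypothetical GcnConfig struct holding those figures:

```cpp
#include <cassert>
#include <cstdint>

// Hardware figures as quoted on the slides (hypothetical struct for illustration).
struct GcnConfig {
    std::uint32_t alusPerSimd;          // 16-wide SIMD
    std::uint32_t batchesPerWavefront;  // 4 batches per wavefront
    std::uint32_t simdsPerCu;           // 4 SIMDs per Compute Unit
    std::uint32_t wavefrontsPerSimd;    // up to 10 wavefronts tracked per SIMD
    std::uint32_t computeUnits;         // 44 on the R9 290X
};

const GcnConfig kR9_290X = {16, 4, 4, 10, 44};

inline std::uint32_t wavefrontSize(const GcnConfig& c) {
    return c.alusPerSimd * c.batchesPerWavefront;
}

inline std::uint32_t threadsPerCu(const GcnConfig& c) {
    return wavefrontSize(c) * c.wavefrontsPerSimd * c.simdsPerCu;
}

inline std::uint32_t threadsInFlight(const GcnConfig& c) {
    assert(c.computeUnits > 0);  // reject an uninitialized config
    return threadsPerCu(c) * c.computeUnits;
}
```

These are upper bounds. The number of wavefronts actually resident also depends on register and shared memory usage.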
GCN AND DirectCompute
HOW DirectCompute MAPS TO THE GPU

 DirectCompute thread groups correspond to groups of threads running on a Compute Unit
‒ Choose a thread group size that works well with the hardware

 Thread Group Shared Memory is stored in the Compute Unit’s Local Data Share (LDS)
‒ 4 SIMDs (64 ALUs) are all sharing the same LDS
‒ Barriers and atomics synchronize access to this storage

OPTIMIZE FOR THE HARDWARE
IMPORTANT THINGS TO CONSIDER WHEN DESIGNING A COMPUTE SHADER

 Thread Group Size
‒ Since there are 64 threads per wavefront, make the thread group size multiples of 64
‒ If [numthreads(x,y,z)] then (x * y * z) should be a multiple of 64
‒ If the thread group size is not a multiple of 64, the SIMDs will still run 64 threads per wavefront
‒ Whether they’re all used or not!
‒ Start with 64 threads per group and then see if higher multiples improve performance.
 LDS memory access is faster than off-chip memory
‒ Use thread group shared memory for storing intermediate results instead of writing out to the UAV
‒ Thread group shared memory can also be used to eliminate unnecessary fetches
‒ For example, discrete convolutions sample many of the same locations when the kernel moves to an adjacent pixel
‒ TGSM is limited to 32K – pack the data to save space
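
The thread group size guidance above can be checked with a little arithmetic. This sketch is an illustration rather than anything from the talk. It assumes the 64-thread wavefront described earlier, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

const std::uint32_t kWavefrontSize = 64;  // threads per wavefront on GCN, per the slides

// Threads in one group for [numthreads(x, y, z)].
inline std::uint32_t threadsPerGroup(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return x * y * z;
}

// Wavefronts needed to run one thread group; partial wavefronts still occupy a full one.
inline std::uint32_t wavefrontsPerGroup(std::uint32_t threads) {
    assert(threads > 0);
    return (threads + kWavefrontSize - 1) / kWavefrontSize;
}

// SIMD lanes that are allocated to the group but do no useful work.
inline std::uint32_t idleLanes(std::uint32_t threads) {
    return wavefrontsPerGroup(threads) * kWavefrontSize - threads;
}
```

For example, [numthreads(10, 10, 1)] asks for 100 threads, which occupy 2 wavefronts and leave 28 lanes idle, while [numthreads(16, 16, 1)] fills 4 wavefronts exactly.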

OPTIMIZE FOR THE HARDWARE
THREAD GROUP SHARED MEMORY ACCESS

 Avoid Bank Conflicts
‒ Thread Group Shared Memory is stored in LDS which is organized in 32 banks
‒ If one thread accesses a location and another thread accesses location + (n x 32), a bank conflict will occur
‒ The hardware can resolve this, but at the cost of performance

BANK     0  1  2  3  4  …  31   0   1   2   3
ADDRESS  0  1  2  3  4  …  31  32  33  34  35
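The address-to-bank mapping in the table can be modeled on the CPU to see which access patterns collide. This is a simplified sketch under stated assumptions: addresses are counted in bank-width units as in the table, every thread accesses a distinct address (thread index times a stride), and the function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

const std::uint32_t kNumBanks = 32;  // LDS bank count, per the slides

// Bank that serves a given address (address counted in bank-width units).
inline std::uint32_t bankOf(std::uint32_t address) {
    return address % kNumBanks;
}

// Worst-case number of threads hitting the same bank when thread t accesses
// address t * stride. A result of 1 means the access is conflict free.
inline std::uint32_t maxThreadsPerBank(std::uint32_t numThreads, std::uint32_t stride) {
    assert(stride > 0);  // all addresses are distinct
    std::array<std::uint32_t, kNumBanks> hits{};
    for (std::uint32_t t = 0; t < numThreads; ++t) {
        ++hits[bankOf(t * stride)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model a stride of 32 sends all 32 threads to bank 0, while a stride of 33 spreads them over all banks. That is why padding a 32-wide shared array to 33 columns is a common way to avoid conflicts on column accesses.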
OPTIMIZE FOR THE HARDWARE
GPR USAGE

 Local variables are stored in general purpose registers (GPRs)
‒ There’s a limit of 64K bytes of storage for vector registers per SIMD
‒ The register memory has to be shared between wavefronts, so too many GPRs can reduce the number of wavefronts
‒ Maximizing parallelism is critical to compute shader performance (and shader performance in general)
‒ The more wavefronts that can run, the more parallelism you can get

‒ Using fewer local variables can help, but look at the DirectX assembly code to get a better idea of how many registers are being used.

GCN VGPR Count    <=24   28   32   36   40   48   64   84   <=128   >128
Max Waves/SIMD      10    9    8    7    6    5    4    3       2      1
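
The table above is consistent with a simple budget calculation: 64 KB of vector registers per SIMD, shared by 64 threads per wavefront, gives 256 registers per thread to divide among the resident wavefronts. The sketch below reproduces the table under two assumptions that are not stated on the slide: registers are 32 bits wide, and they are allocated in blocks of 4. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

const std::uint32_t kVgprBytesPerSimd = 64 * 1024;  // from the slides
const std::uint32_t kWavefrontSize = 64;            // from the slides
const std::uint32_t kMaxWavesPerSimd = 10;          // from the slides
const std::uint32_t kBytesPerRegister = 4;          // assumption: 32-bit registers
const std::uint32_t kAllocGranularity = 4;          // assumption: allocated in blocks of 4

// Estimated maximum wavefronts per SIMD for a shader using vgprsPerThread registers.
inline std::uint32_t maxWavesPerSimd(std::uint32_t vgprsPerThread) {
    const std::uint32_t budget =
        kVgprBytesPerSimd / (kWavefrontSize * kBytesPerRegister);  // 256 per thread
    assert(vgprsPerThread > 0 && vgprsPerThread <= budget);
    const std::uint32_t allocated =
        (vgprsPerThread + kAllocGranularity - 1) / kAllocGranularity * kAllocGranularity;
    return std::min(kMaxWavesPerSimd, budget / allocated);
}
```

Under these assumptions, trimming a shader from 28 registers to 24 raises the limit from 9 to 10 wavefronts, while going from 84 to 85 drops it from 3 to 2.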
DirectCompute Techniques for Games
SEPARABLE FILTER
SLEEPING DOGS

Separable Filter compute shader used for ambient occlusion
SEPARABLE FILTER
TWO PASS DISCRETE CONVOLUTIONS

 Many useful filters are separable
‒ Typical example is Gaussian filter for a high quality image blur – used often in post-processing

 Separable Algorithm
‒ Run the first pass in the horizontal direction and store the results in an intermediate buffer
‒ Run the second pass on the intermediate buffer in the vertical direction

[Diagram: Source RT -> Horizontal Pass -> Intermediate RT -> Vertical Pass -> Destination RT]
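The two-pass structure can be sketched on the CPU. This is a reference-style illustration, not the compute shader from the talk: the Image type, the clamp-to-edge border handling, and the example kernel are all assumptions made for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Image {
    int width;
    int height;
    std::vector<float> pixels;  // row-major, width * height values

    // Clamp-to-edge addressing (an assumption; the slides do not specify borders).
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// One 1D convolution pass: (dx, dy) = (1, 0) is horizontal, (0, 1) is vertical.
inline Image convolve1D(const Image& src, const std::vector<float>& kernel, int dx, int dy) {
    assert(kernel.size() % 2 == 1);
    const int radius = static_cast<int>(kernel.size() / 2);
    Image dst{src.width, src.height, std::vector<float>(src.pixels.size())};
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                sum += kernel[k + radius] * src.at(x + k * dx, y + k * dy);
            }
            dst.pixels[static_cast<std::size_t>(y) * src.width + x] = sum;
        }
    }
    return dst;
}

// Horizontal pass into an intermediate image, then vertical pass on that result.
inline Image separableFilter(const Image& source, const std::vector<float>& kernel) {
    const Image intermediate = convolve1D(source, kernel, 1, 0);
    return convolve1D(intermediate, kernel, 0, 1);
}

// Example inputs for illustration: a small blur kernel and a single bright pixel.
const std::vector<float> kExampleKernel = {0.25f, 0.5f, 0.25f};

inline Image impulseImage(int size) {
    Image img{size, size, std::vector<float>(static_cast<std::size_t>(size) * size, 0.0f)};
    img.pixels[static_cast<std::size_t>(size / 2) * size + size / 2] = 1.0f;
    return img;
}
```

Filtering the single bright pixel yields the outer product of the 1D kernel with itself, which is the property that makes a filter separable. For a kernel of width N, each output pixel costs 2N samples over the two passes instead of N x N.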
SEPARABLE FILTER
COMPUTE SHADER OPTIMIZATION

 Use Thread Group Shared Memory to cache previously sampled values

 Naïve approach is to load a row of texels into TGSM at once, then calculate each pixel value

[Diagram: 128 threads load 128 texels; 128 – ( Kernel Radius * 2 ) threads compute results; the rest are redundant compute threads]
SEPARABLE FILTER
BETTER OPTIMIZATION

 Make some of the threads load more texels
‒ We only want as many threads as there are pixels
‒ However we need more texels than pixels since the kernel extends beyond the pixels
‒ Just have some of the threads load more texels

 Process multiple lines per thread group
‒ Keeps the thread group size a multiple of 64

[Diagram: 64 threads load 256 texels; Kernel Radius * 4 threads load 1 extra texel each; 64 threads compute 256 results]
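The idea of letting a few threads load the extra border texels can be simulated on the CPU. This sketch is deliberately simpler than the layout on the slide: it models a single line with one texel per thread, whereas the slide's configuration processes multiple lines with several texels per thread. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the load phase for one line. The shared cache needs one slot per
// output pixel plus kernelRadius extra texels on each side. Returns, for each
// cache slot, how many threads loaded it.
inline std::vector<int> simulateCooperativeLoad(int numThreads, int kernelRadius) {
    const int extra = 2 * kernelRadius;
    assert(numThreads > 0 && kernelRadius >= 0 && extra <= numThreads);
    std::vector<int> loads(static_cast<std::size_t>(numThreads + extra), 0);
    for (int t = 0; t < numThreads; ++t) {
        ++loads[t];  // every thread loads one texel
        if (t < extra) {
            ++loads[numThreads + t];  // the first 2 * radius threads load one more
        }
    }
    return loads;
}

// True when every cache slot was filled by exactly one thread.
inline bool everyTexelLoadedOnce(const std::vector<int>& loads) {
    for (int count : loads) {
        if (count != 1) {
            return false;
        }
    }
    return true;
}
```

With 64 threads and a kernel radius of 8, the cache holds 80 texels and every thread stays busy during the compute phase, since there are exactly as many threads as output pixels.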
TILED LIGHTING
FROSTBITE ENGINES, DIRT SHOWDOWN

TILED LIGHTING
COMPUTE SHADER LIGHT CULLING

 Break up the screen into tiles
 Create a list of lights for each tile
 Only render with lights touching the tile

[Diagram: screen tiles overlapped by lights 1, 2 and 3, with per-tile light lists such as [1], [1,2,3] and [2,3]]
TILED LIGHTING
TILED FRUSTUMS

 Divide screen into tiles
 Fit asymmetric frustum around each tile

[Diagram: screen divided into Tile0, Tile1, Tile2, Tile3]
TILED LIGHTING
CREATE FRONT AND BACK PLANES

 Use z buffer from depth pre-pass as input

 Find min and max depth per tile
 Use this frustum for intersection testing
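
Finding the depth bounds of a tile is a min/max reduction over its pixels. The following CPU sketch shows the computation for one tile. The DepthRange type, function names, and the ramp test buffer are hypothetical. In a compute shader, one common approach is for each thread to read one depth value and combine them with atomic min/max operations on group shared variables.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DepthRange {
    float minZ;
    float maxZ;
};

// Min and max depth over the pixels of tile (tileX, tileY). Tiles on the right
// and bottom edges are clipped to the image.
inline DepthRange tileDepthRange(const std::vector<float>& depth, int width, int height,
                                 int tileX, int tileY, int tileSize) {
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int x0 = tileX * tileSize;
    const int y0 = tileY * tileSize;
    assert(x0 < width && y0 < height);
    const int x1 = std::min(x0 + tileSize, width);
    const int y1 = std::min(y0 + tileSize, height);

    const float first = depth[static_cast<std::size_t>(y0) * width + x0];
    DepthRange range{first, first};
    for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
            const float z = depth[static_cast<std::size_t>(y) * width + x];
            range.minZ = std::min(range.minZ, z);
            range.maxZ = std::max(range.maxZ, z);
        }
    }
    return range;
}

// Example depth buffer for illustration: depth(x, y) = y * width + x.
inline std::vector<float> rampDepth(int width, int height) {
    std::vector<float> depth(static_cast<std::size_t>(width) * height);
    for (std::size_t i = 0; i < depth.size(); ++i) {
        depth[i] = static_cast<float>(i);
    }
    return depth;
}
```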

TILED LIGHTING
CREATE PER TILE LIGHT LIST (PART 1)

 Test each light against the frustum

[Diagram: light list, Light0 to Light10, each storing Position and Radius]
TILED LIGHTING
CREATE PER TILE LIGHT LIST (PART 2)

 For each light that intersects the frustum
‒ Write the index of the light in the per-tile list

[Diagram: of Light0 to Light10 (each storing Position and Radius), lights 1 and 4 intersect the tile. Per-tile list: Index0 = Count, Index1 = 1, Index2 = 4, Index3 = Empty, Index4 = Empty, …]
TILED LIGHTING
COMPUTE SHADER

 Implemented using a single compute shader

 A thread group is executed per tile
‒ e.g. [numthreads(16,16,1)] for 16x16 tile size

 Build frustum
 Calculate Z extent
‒ Each thread calculates Z extent in parallel

 256 lights are culled in parallel (for 16x16 tile size)
 Indices of intersecting lights are written to thread group shared memory (TGSM)
‒ groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];

 Then export to global light index list
‒ RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );
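
The per-tile list construction can be sketched on the CPU as a serial loop. This is a simplified illustration, not the shader from the talk: the tile volume is approximated by an axis-aligned box rather than a frustum, the Light and TileBounds types and the example data are hypothetical, and the loop over lights stands in for the threads that cull lights in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Light {
    float x, y, z;  // position
    float radius;
};

// Assumption: an axis-aligned box standing in for the tile frustum,
// with minZ and maxZ taken from the tile's depth bounds.
struct TileBounds {
    float minX, maxX;
    float minY, maxY;
    float minZ, maxZ;
};

// Sphere versus box test: compare the squared distance from the light center to
// the closest point on the box with the squared radius.
inline bool sphereTouchesBox(const Light& light, const TileBounds& b) {
    const float dx = light.x - std::min(std::max(light.x, b.minX), b.maxX);
    const float dy = light.y - std::min(std::max(light.y, b.minY), b.maxY);
    const float dz = light.z - std::min(std::max(light.z, b.minZ), b.maxZ);
    return dx * dx + dy * dy + dz * dz <= light.radius * light.radius;
}

// Builds the list in the layout shown on the slides: slot 0 holds the count,
// followed by the indices of the intersecting lights.
inline std::vector<std::uint32_t> buildTileLightList(const std::vector<Light>& lights,
                                                     const TileBounds& tile,
                                                     std::size_t maxLightsPerTile) {
    std::vector<std::uint32_t> list(1, 0);
    for (std::size_t i = 0; i < lights.size() && list[0] < maxLightsPerTile; ++i) {
        if (sphereTouchesBox(lights[i], tile)) {
            list.push_back(static_cast<std::uint32_t>(i));
            ++list[0];
        }
    }
    return list;
}

// Example data for illustration: only lights 1 and 4 reach the tile.
inline TileBounds exampleTile() {
    return TileBounds{0.0f, 16.0f, 0.0f, 16.0f, 1.0f, 10.0f};
}

inline std::vector<Light> exampleLights() {
    return {
        {100.0f, 8.0f, 5.0f, 10.0f},   // too far to the right
        {8.0f, 8.0f, 5.0f, 2.0f},      // inside the tile
        {8.0f, 8.0f, 50.0f, 5.0f},     // behind the tile's max depth
        {-20.0f, -20.0f, 5.0f, 5.0f},  // off to the corner
        {18.0f, 8.0f, 5.0f, 4.0f},     // just outside, but its radius reaches in
    };
}
```

In the compute shader version, each thread would test a different light and append to the shared list with an atomic increment, so the order of indices in the list would not be deterministic as it is here.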

TressFX PHYSICS
TOMB RAIDER

TressFX PHYSICS
PHYSICALLY BASED HAIR SIMULATION

 Responds to natural gravity

 Global constraints to maintain various hair styles
 Collision detection with head and body
 Supports wind and other forces
 Artist-friendly
‒ Programmatic control over attributes
‒ Can tweak for different results

 The entire simulation runs in compute shaders

TressFX PHYSICS
TOMB RAIDER VIDEO

TressFX PHYSICS
DATA FLOW
[Diagram: data flow. CPU: Raw Vertex Data (CPU Memory); vertex data is only loaded once during the life of the program. GPU: Hair Vertex Data UAV, processed in turn by the Integration and Global Shapes CS, the Local Shapes Constraint CS, the Length Constraints and Wind CS, and the Collision and Tangents CS, then read by the Hair Render Vertex Shader. Vertex data stays on the GPU!]
TressFX PHYSICS
BENEFITS OF USING DirectCompute TO SIMULATE HAIR

 Interoperability with Direct3D
‒ UAVs can be shared with rendering so data can stay on the GPU

 Massively Parallel
‒ Thousands of hair vertices that can be animated in parallel
‒ Each thread in the thread group represents a vertex (or strand)

 Calculations done on data stored locally on chip
‒ Vertex data is loaded into Thread Group Shared Memory before calculations begin
//------------------------------
// Copy data into shared memory
//------------------------------
if (localVertexIndex < numVerticesInTheStrand)
{
    currentPos = sharedPos[indexForSharedMem] = g_HairVertexPositions[globalVertexIndex];
    initialPos = g_InitialHairPositions[globalVertexIndex];
    initialPos.xyz = mul(float4(initialPos.xyz, 1), g_ModelTransformForHead).xyz;
}
GroupMemoryBarrierWithGroupSync();

DirectCompute in Games
OTHER USES

 HDAO+
‒ New version of our High Definition Ambient Occlusion
‒ Uses TGSM for storing samples

 Bokeh Depth of Field
‒ Used in the Frostbite 3 engine for rendering Bokeh shapes where “hotspots” are located.

 Global Illumination
‒ Geomerics

 Scene Management

 Voxels
‒ Sparse Voxel Trees
‒ Octrees, Quadtrees

 Particle Systems

Summary of DirectCompute in Games
WHEN TO USE DirectCompute

 Parallelism
‒ Some algorithms can be highly parallelized

 Repetitive Fetches
‒ Caching values in TGSM can be a big win

 Results used in the graphics pipeline
‒ DirectCompute can reduce transfers between GPU and CPU

 Existing algorithms already optimized for DirectCompute
‒ Separable Filters
‒ Tiled Lighting
‒ Hair Physics
‒ HDAO+
‒ We have samples you can use!

http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

QUESTIONS?

 Ask now, or send questions to:
bill.bilodeau@amd.com

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names
are for informational purposes only and may be trademarks of their respective owners.
48 | DirectCompute in Gaming| NOVEMBER 15 2013 | AMD DEVELOPER SUMMIT

Weitere ähnliche Inhalte

Was ist angesagt?

PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningAMD Developer Central
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...AMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...AMD Developer Central
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMDHSA Foundation
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorAMD Developer Central
 
SE-4128, DRM: From software secrets to hardware protection, by Rod Schultz
SE-4128, DRM: From software secrets to hardware protection, by Rod SchultzSE-4128, DRM: From software secrets to hardware protection, by Rod Schultz
SE-4128, DRM: From software secrets to hardware protection, by Rod SchultzAMD Developer Central
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...AMD Developer Central
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoAMD Developer Central
 
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...AMD Developer Central
 

Was ist angesagt? (20)

PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
Cuda
CudaCuda
Cuda
 
SE-4128, DRM: From software secrets to hardware protection, by Rod Schultz
SE-4128, DRM: From software secrets to hardware protection, by Rod SchultzSE-4128, DRM: From software secrets to hardware protection, by Rod Schultz
SE-4128, DRM: From software secrets to hardware protection, by Rod Schultz
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU...
 

GS-4108, Direct Compute in Gaming, by Bill Bilodeau

  • 1. DirectCompute in Gaming. Bill Bilodeau, Developer Technology Engineer, AMD
  • 2. AGENDA: TOPICS COVERED IN THIS TALK
      - Introduction to DirectCompute
      - GCN Overview
      - DirectCompute Programming
      - Optimization Techniques
      - Examples of Compute in Games
        ‒ Separable Filter
        ‒ Tiled Lighting
        ‒ TressFX Physics
      2 | DirectCompute in Gaming | NOVEMBER 15 2013 | AMD DEVELOPER SUMMIT
  • 4. GPU COMPUTE: WHY WE NEED COMPUTE IN GAMES
      - Games can be CPU-limited on some configurations
        ‒ Many traditional CPU tasks can be offloaded to the GPU
      - Some rendering algorithms are faster when implemented using Compute
        ‒ Post-processing techniques
      - Some tasks are a natural fit for Compute
        ‒ Non-graphics data-parallel programming
        ‒ Physics
  • 5. DX11 GRAPHICS PIPELINE: WHY WE NEED DirectCompute
      - Too many unnecessary stages for compute
      - Old DX9 "GPGPU" programming
        ‒ Render full screen quad
        ‒ Use Pixel Shader for compute
      Diagram labels, stages: Input Assembly, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out, Rasterizer, Pixel Shader, Depth Test, Output Merger.
      Diagram labels, resources: Textures, Buffers, Constants, UAVs, Render Targets, Depthstencil.
  • 6. DirectCompute PIPELINE: ONLY WHAT'S NECESSARY
      - API designed for compute programming
        ‒ Interoperability with DirectX
      - Pipeline is much simpler
        ‒ No need to render triangles
        ‒ Bind input and output, then call Dispatch()
      Diagram labels: SRVs, Compute Shader, UAVs.
  • 7. DirectCompute FEATURES: MORE REASONS TO USE DirectCompute
      - Structured Buffers
        ‒ Map better to general-purpose data structures
      - Append/Consume
        ‒ Fast for non-order-dependent I/O
      - Atomics, Barriers
        ‒ Synchronization for finer-grained thread control
      - Thread Group Shared Memory
        ‒ Fast on-chip local memory shared between threads in a group
        ‒ Great for intermediate results
  • 8. DISPATCH: THREAD GROUP ORGANIZATION
      - Threads are organized into thread groups
      - Each thread executes an instance of the compute shader code
      - ID3D11DeviceContext::Dispatch(Dx, Dy, Dz)
        ‒ Example: Dispatch(4, 3, 2) => 24 thread groups
      Diagram: a 4 x 3 x 2 grid of thread groups, with IDs from (0,0,0) to (3,2,1).
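The group count on this slide is just the product of the three Dispatch arguments. A minimal CPU-side sketch of that arithmetic follows. The helper names `threadGroupCount` and `enumerateGroupIds` are hypothetical and not part of Direct3D; the only API fact assumed is the Direct3D 11 limit of 65535 groups per dimension.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-dimension limit in Direct3D 11
// (D3D11_CS_DISPATCH_MAX_THREAD_GROUPS_PER_DIMENSION).
constexpr std::uint32_t kMaxGroupsPerDimension = 65535;

using GroupId = std::array<std::uint32_t, 3>;

// Hypothetical helper: number of thread groups launched by Dispatch(dx, dy, dz).
inline std::uint64_t threadGroupCount(std::uint32_t dx, std::uint32_t dy, std::uint32_t dz)
{
    assert(dx <= kMaxGroupsPerDimension &&
           dy <= kMaxGroupsPerDimension &&
           dz <= kMaxGroupsPerDimension);
    return static_cast<std::uint64_t>(dx) * dy * dz;
}

// Hypothetical helper: lists every group ID (the value a shader would see as
// SV_GroupID). The loop order is for illustration only; the GPU does not
// promise any particular execution order between groups.
inline std::vector<GroupId> enumerateGroupIds(std::uint32_t dx, std::uint32_t dy, std::uint32_t dz)
{
    std::vector<GroupId> ids;
    ids.reserve(static_cast<std::size_t>(threadGroupCount(dx, dy, dz)));
    for (std::uint32_t z = 0; z < dz; ++z)
        for (std::uint32_t y = 0; y < dy; ++y)
            for (std::uint32_t x = 0; x < dx; ++x)
                ids.push_back(GroupId{x, y, z});
    return ids;
}
```

For the slide's example, `threadGroupCount(4, 3, 2)` gives 24, and the enumerated IDs run from (0,0,0) to (3,2,1), matching the grid in the diagram.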
  • 9. THREAD GROUP: ORGANIZATION OF THREADS WITHIN A THREAD GROUP
      - Threads in a group are organized in an array
        ‒ For full-screen rendering, you can think of a thread group as a tile on the screen and a thread as a pixel
      - The compute shader HLSL code declares the thread layout
        ‒ [numthreads(Tx, Ty, Tz)]
        ‒ Example: [numthreads(8, 4, 2)] => 64 threads per thread group
      - System values provide information about the current thread
        ‒ SV_GroupID = thread group address
        ‒ SV_GroupThreadID = thread address relative to the thread group
      Diagram: an 8 x 4 x 2 array of threads, with IDs from (0,0,0) to (7,3,1).
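Besides the two system values named on the slide, HLSL also provides SV_DispatchThreadID (the global thread address) and SV_GroupIndex (the flattened index within the group). Both are derived from the group and thread addresses. The sketch below reproduces that derivation on the CPU; the `UInt3` struct and the function names are hypothetical stand-ins, not Direct3D types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for HLSL's uint3; hypothetical, for illustration only.
struct UInt3 { std::uint32_t x, y, z; };

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID, per component.
inline UInt3 dispatchThreadId(UInt3 numThreads, UInt3 groupId, UInt3 groupThreadId)
{
    assert(groupThreadId.x < numThreads.x &&
           groupThreadId.y < numThreads.y &&
           groupThreadId.z < numThreads.z);
    return UInt3{groupId.x * numThreads.x + groupThreadId.x,
                 groupId.y * numThreads.y + groupThreadId.y,
                 groupId.z * numThreads.z + groupThreadId.z};
}

// SV_GroupIndex = z * Tx * Ty + y * Tx + x, with (x, y, z) = SV_GroupThreadID.
inline std::uint32_t groupIndex(UInt3 numThreads, UInt3 groupThreadId)
{
    assert(groupThreadId.x < numThreads.x &&
           groupThreadId.y < numThreads.y &&
           groupThreadId.z < numThreads.z);
    return groupThreadId.z * numThreads.x * numThreads.y +
           groupThreadId.y * numThreads.x +
           groupThreadId.x;
}
```

With the slide's [numthreads(8, 4, 2)] layout, the last thread (7,3,1) has group index 63, which is consistent with 64 threads per group.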
  • 10. DirectCompute SETUP: C++ CODE TO SET UP COMPUTE SHADER EXECUTION
     Create the resources
        pd3dDevice->CreateBuffer(&Desc, NULL, &pCBCSPerFrame);
        pd3dDevice->CreateBuffer(&bufferDesc, NULL, &pRWBuffer);
  • 11. DirectCompute SETUP: C++ CODE TO SET UP COMPUTE SHADER EXECUTION
     Create the resources
     Create the views
        pd3dDevice->CreateUnorderedAccessView(pRWBuffer, &UAVDesc, &pUAV);
  • 12. DirectCompute SETUP: C++ CODE TO SET UP COMPUTE SHADER EXECUTION
     Create the resources
     Create the views
     Compile the compute shader
        CompileShaderFromFile( L"myComputeShader.hlsl", "CSMain", "cs_5_0", &pBlob );
        pd3dDevice->CreateComputeShader( pBlob, . . . , &pComputeShader );
  • 13. DirectCompute SETUP: C++ CODE TO SET UP COMPUTE SHADER EXECUTION
     Create the resources
     Create the views
     Compile the compute shader
     Bind the resources
        pd3dContext->CSSetConstantBuffers(0, 1, &pCBCSPerFrame);
        ID3D11UnorderedAccessView* ppUAV[1] = {pUAV};
        pd3dContext->CSSetUnorderedAccessViews(0, 1, ppUAV, NULL);
  • 14. DirectCompute SETUP: C++ CODE TO SET UP COMPUTE SHADER EXECUTION
     Create the resources
     Create the views
     Compile the compute shader
     Bind the resources
     Set the shader
        pd3dContext->CSSetShader(pComputeShader, NULL, 0);
  • 15. DirectCompute SETUP: C++ CODE TO SET UP COMPUTE SHADER EXECUTION
     Create the resources
     Create the views
     Compile the compute shader
     Bind the resources
     Set the shader
     Go!
        pd3dContext->Dispatch(numOfGroupsX, numOfGroupsY, 1);
  • 16. COMPUTE SHADER CODE: IMPORTANT ADDITIONS TO HLSL FOR COMPUTE
     numthreads
      ‒ Defines the number of threads in a thread group
      ‒ Put this right before the definition of the compute shader
        groupshared float4 sharedPos[THREAD_GROUP_SIZE];
        [numthreads(THREAD_GROUP_SIZE, 1, 1)]
        void MyComputeShader(uint GIndex : SV_GroupIndex)
        {
            …
            sharedPos[GIndex] = gThreadData[myIndex];
            GroupMemoryBarrierWithGroupSync();
            InterlockedExchange(myUAV[loc], newVal, oldVal);
        }
  • 17. COMPUTE SHADER CODE: IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
     Thread Group Shared Memory
      ‒ Variables stored in memory that is shared between threads in a group (up to 32K bytes per group)
      ‒ Use the groupshared modifier in front of the variable declaration
     (Same code sample as slide 16; see the sharedPos declaration.)
  • 18. COMPUTE SHADER CODE: IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
     System Values that can be used in a Compute Shader
      ‒ SV_GroupID, SV_GroupThreadID, SV_DispatchThreadID, SV_GroupIndex
      ‒ The system values tell the compute shader which group and which thread it is currently working on
     (Same code sample as slide 16; see the SV_GroupIndex parameter.)
  • 19. COMPUTE SHADER CODE: IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
     Synchronization
      ‒ Barriers
      ‒ GroupMemoryBarrier(), GroupMemoryBarrierWithGroupSync()
      ‒ Block execution until threads are synchronized: GroupMemoryBarrier() waits for group shared memory accesses to complete, and the WithGroupSync variant also waits until every thread in the group has reached the call
     (Same code sample as slide 16; see the barrier after the write to sharedPos.)
  • 20. COMPUTE SHADER CODE: IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)
     Synchronization
      ‒ Atomics
      ‒ “Interlocked” versions of Add, Min, Max, Or, And, Xor, CompareStore, and Exchange
      ‒ Guarantee exclusive access to shared memory
     (Same code sample as slide 16; see the InterlockedExchange call.)
  • 22. GCN REFRESHER: COMPUTE UNIT
     Basic GPU building block
      ‒ R9 290X has 44 Compute Units
     (4) 16-wide SIMDs that can execute instructions from multiple threads
      ‒ Each SIMD has 64KB of vector general purpose register (VGPR) memory
     16 Texture Fetch (load/store) Units
      ‒ 16KB L1 cache
     One Scalar Unit
      ‒ 8KB of Scalar GPR memory
     Local Data Share
      ‒ 64KB shared memory
     [Diagram: GCN Compute Unit with Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
  • 23. WAVEFRONTS: HOW THREADS EXECUTE ON COMPUTE UNITS
     A wavefront consists of 4 batches of threads
      ‒ Execution is time-sliced to hide latency
     Since there are 16 ALUs per SIMD, the size of a wavefront is 4 x 16 = 64 threads
     Each SIMD can keep track of 10 wavefronts
      ‒ 40 wavefronts per CU = 2560 threads
      ‒ For the R9 290X, that’s a total of 44 x 2560 = 112,640 threads in flight!
     [Diagram: timeline in clocks showing Batches 1 to 4 alternating between Runnable and Stall until Done]
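The thread counts quoted on these two slides follow from a few multiplications, sketched below. The constants are taken directly from the slides; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted on the GCN slides.
constexpr std::uint32_t kLanesPerSimd        = 16;  // ALUs per SIMD
constexpr std::uint32_t kBatchesPerWavefront = 4;   // time-sliced batches
constexpr std::uint32_t kSimdsPerCU          = 4;
constexpr std::uint32_t kMaxWavesPerSimd     = 10;

// 4 batches x 16 lanes = 64 threads per wavefront.
inline std::uint32_t WavefrontSize() {
    return kLanesPerSimd * kBatchesPerWavefront;
}

// 4 SIMDs x 10 wavefronts x 64 threads = 2560 threads per compute unit.
inline std::uint32_t ThreadsPerCU() {
    return kSimdsPerCU * kMaxWavesPerSimd * WavefrontSize();
}

// Upper bound on resident threads for a GPU with `computeUnits` CUs.
inline std::uint32_t ThreadsInFlight(std::uint32_t computeUnits) {
    assert(computeUnits > 0);
    return computeUnits * ThreadsPerCU();
}
```

ThreadsInFlight(44) reproduces the R9 290X figure from the slide. It is an upper bound: the register limits discussed later in the talk can lower the number of wavefronts each SIMD actually holds.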
  • 24. GCN AND DirectCompute: HOW DirectCompute MAPS TO THE GPU
     DirectCompute thread groups correspond to groups of threads running on a Compute Unit
      ‒ Choose a thread group size that works well with the hardware
     Thread Group Shared Memory is stored in the Compute Unit’s Local Data Share (LDS)
      ‒ 4 SIMDs (64 ALUs) all share the same LDS
      ‒ Barriers and atomics synchronize access to this storage
  • 25. OPTIMIZE FOR THE HARDWARE: IMPORTANT THINGS TO CONSIDER WHEN DESIGNING A COMPUTE SHADER
     Thread Group Size
      ‒ Since there are 64 threads per wavefront, make the thread group size a multiple of 64
      ‒ If [numthreads(x,y,z)] then (x * y * z) should be a multiple of 64
      ‒ If the thread group size is not a multiple of 64, the SIMDs will still run 64 threads per wavefront, whether they’re all used or not!
      ‒ Start with 64 threads per group and then see if higher multiples improve performance
     LDS memory access is faster than off-chip memory
      ‒ Use thread group shared memory (TGSM) for storing intermediate results instead of writing out to the UAV
      ‒ Thread group shared memory can also be used to eliminate unnecessary fetches
      ‒ For example, discrete convolutions sample many of the same locations when the kernel moves to an adjacent pixel
      ‒ TGSM is limited to 32KB per thread group, so pack the data to save space
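The cost of a thread group size that is not a multiple of 64 can be estimated with a short calculation. This sketch only models the slide's statement that every wavefront occupies 64 lanes whether or not they are used; `WavefrontsPerGroup` and `LaneUtilization` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWavefrontSize = 64;

// Wavefronts needed to run one thread group of `groupThreads` threads.
inline std::uint32_t WavefrontsPerGroup(std::uint32_t groupThreads) {
    assert(groupThreads > 0);
    return (groupThreads + kWavefrontSize - 1) / kWavefrontSize;
}

// Fraction of allocated SIMD lanes that carry a real thread (1.0 = no waste).
inline double LaneUtilization(std::uint32_t groupThreads) {
    return static_cast<double>(groupThreads) /
           (static_cast<double>(WavefrontsPerGroup(groupThreads)) * kWavefrontSize);
}
```

Under this model a 96-thread group takes two wavefronts and uses 75% of their lanes, while a 65-thread group also takes two wavefronts but leaves almost half of them idle.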
  • 26. OPTIMIZE FOR THE HARDWARE: THREAD GROUP SHARED MEMORY ACCESS
     Avoid Bank Conflicts
      ‒ Thread Group Shared Memory is stored in LDS, which is organized in 32 banks
      ‒ If one thread accesses a location and another thread accesses location + (n x 32), a bank conflict will occur
      ‒ The hardware can resolve this, but at the cost of performance
        BANK     0  1  2  3  4  …  31   0   1   2   3
        ADDRESS  0  1  2  3  4  …  31  32  33  34  35
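The bank mapping in the table above can be modeled in a few lines. This is a simplified sketch: addresses are counted in bank-width elements as in the slide's table, only strided access (thread i reads address i * stride) is modeled, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kNumBanks = 32;

// Bank that serves a given address (address counted in bank-width elements).
inline std::uint32_t BankOf(std::uint32_t address) {
    return address % kNumBanks;
}

// Largest number of threads that land on the same bank when thread i
// accesses address i * stride. 1 means conflict-free.
inline std::uint32_t WorstBankLoad(std::uint32_t threadCount, std::uint32_t stride) {
    assert(stride > 0);
    std::array<std::uint32_t, kNumBanks> hits{};
    for (std::uint32_t i = 0; i < threadCount; ++i) {
        ++hits[BankOf(i * stride)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model a stride of 1 is conflict-free and a stride of 32 sends every thread to bank 0. A stride of 33 is conflict-free again, which is why padding a groupshared array by one element per row is a commonly used way to break up conflicts.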
  • 27. OPTIMIZE FOR THE HARDWARE: GPR USAGE
     Local variables are stored in general purpose registers (GPRs)
      ‒ There’s a limit of 64K bytes of storage for vector registers per SIMD
      ‒ The register memory has to be shared between wavefronts, so using too many GPRs can reduce the number of wavefronts
      ‒ Maximizing parallelism is critical to compute shader performance (and shader performance in general)
      ‒ The more wavefronts that can run, the more parallelism you can get
      ‒ Using fewer local variables can help, but look at the DirectX assembly code to get a better idea of how many registers are being used
        GCN VGPR Count   <=24  28  32  36  40  48  64  84  <=128  >128
        Max Waves/SIMD     10   9   8   7   6   5   4   3      2     1
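The occupancy table can be turned into a small lookup, sketched below. The thresholds and wave counts are copied from the slide; treating a register count that falls between two columns as belonging to the next higher column is an assumption, and `MaxWavesPerSimd` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Maximum wavefronts per SIMD for a shader that uses `vgprCount` vector
// registers, following the table on the slide.
inline std::uint32_t MaxWavesPerSimd(std::uint32_t vgprCount) {
    struct Row { std::uint32_t maxVgprs; std::uint32_t waves; };
    static const Row kTable[] = {
        {24, 10}, {28, 9}, {32, 8}, {36, 7}, {40, 6},
        {48, 5},  {64, 4}, {84, 3}, {128, 2},
    };
    assert(vgprCount > 0);
    for (const Row& row : kTable) {
        if (vgprCount <= row.maxVgprs) {
            return row.waves;
        }
    }
    return 1;  // more than 128 VGPRs
}
```

The lookup makes the cliffs visible: going from 64 to 65 registers drops the SIMD from 4 resident wavefronts to 3, so trimming a single register near a threshold can matter more than trimming several elsewhere.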
  • 29. SEPARABLE FILTER: SLEEPING DOGS
     Separable Filter compute shader used for ambient occlusion
  • 30. SEPARABLE FILTER: TWO PASS DISCRETE CONVOLUTIONS
     Many useful filters are separable
      ‒ A typical example is a Gaussian filter for a high quality image blur, used often in post-processing
     Separable Algorithm
      ‒ Run the first pass in the horizontal direction and store the results in an intermediate buffer
      ‒ Run the second pass on the intermediate buffer in the vertical direction
     [Diagram: Source RT → Horizontal Pass → Intermediate RT → Vertical Pass → Destination RT]
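The two-pass algorithm can be illustrated with a CPU reference, which is also handy for validating a compute shader's output. This is a minimal sketch, not the talk's implementation: the `Image` alias, the function names, and the clamp-to-edge border handling are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<float>>;  // indexed as [row][column]

// One 1D convolution pass along rows (horizontal) or columns (vertical).
// Samples outside the image are clamped to the nearest edge texel.
Image FilterPass(const Image& src, const std::vector<float>& kernel, bool horizontal) {
    assert(kernel.size() % 2 == 1);  // kernel must have a center tap
    const int radius = static_cast<int>(kernel.size() / 2);
    const int height = static_cast<int>(src.size());
    const int width = height > 0 ? static_cast<int>(src[0].size()) : 0;
    Image dst(height, std::vector<float>(width, 0.0f));
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = horizontal ? std::min(std::max(x + k, 0), width - 1) : x;
                const int sy = horizontal ? y : std::min(std::max(y + k, 0), height - 1);
                sum += kernel[k + radius] * src[sy][sx];
            }
            dst[y][x] = sum;
        }
    }
    return dst;
}

// Horizontal pass into an intermediate image, then vertical pass on it.
Image SeparableFilter(const Image& src, const std::vector<float>& kernel) {
    const Image intermediate = FilterPass(src, kernel, true);
    return FilterPass(intermediate, kernel, false);
}
```

For a kernel of radius R, the two passes read 2 x (2R + 1) samples per pixel instead of the (2R + 1)^2 a direct 2D convolution would need, which is the reason for splitting the filter.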
  • 31. SEPARABLE FILTER: COMPUTE SHADER OPTIMIZATION
     Use Thread Group Shared Memory to cache previously sampled values
     Naïve approach is to load a row of texels into TGSM at once, then calculate each pixel value
     [Diagram: 128 threads load 128 texels; 128 - (Kernel Radius * 2) threads compute results; the remaining threads are redundant compute threads]
  • 32. SEPARABLE FILTER: BETTER OPTIMIZATION
     Make some of the threads load more texels
      ‒ We only want as many threads as there are pixels
      ‒ However, we need more texels than pixels since the kernel extends beyond the pixels
      ‒ Just have some of the threads load more texels
     Process multiple lines per thread group
      ‒ Keeps the thread group size a multiple of 64
     [Diagram: 64 threads load 256 texels; Kernel Radius * 4 threads load 1 extra texel each; 64 threads compute 256 results]
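The difference between the two layouts can be put into numbers using only the figures in the two diagrams. The sketch below is a bookkeeping aid under that assumption; `GroupCost`, `NaiveLayout`, `ImprovedLayout`, and `ResultsPerThread` are hypothetical names, and the kernel radius is the only free parameter.

```cpp
#include <cassert>
#include <cstdint>

struct GroupCost {
    std::uint32_t threads;     // threads in the thread group
    std::uint32_t texelLoads;  // texels fetched into TGSM
    std::uint32_t results;     // filtered pixels written
};

// Naive layout: one thread per loaded texel; threads within the kernel
// radius of either end load a texel but produce no result.
inline GroupCost NaiveLayout(std::uint32_t groupThreads, std::uint32_t kernelRadius) {
    assert(groupThreads > 2 * kernelRadius);
    return { groupThreads, groupThreads, groupThreads - 2 * kernelRadius };
}

// Improved layout from the second diagram: 64 threads load 256 texels,
// Kernel Radius * 4 of them load one extra texel, and all 64 threads
// together compute 256 results.
inline GroupCost ImprovedLayout(std::uint32_t kernelRadius) {
    return { 64, 256 + 4 * kernelRadius, 256 };
}

inline double ResultsPerThread(const GroupCost& cost) {
    return static_cast<double>(cost.results) / cost.threads;
}
```

With a kernel radius of 8, the naive 128-thread group yields 112 results (0.875 per thread), while the improved 64-thread group yields 256 results (4 per thread) and no thread sits idle during the compute phase.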
  • 33. TILED LIGHTING: FROSTBITE ENGINES, DIRT SHOWDOWN
  • 34. TILED LIGHTING: COMPUTE SHADER LIGHT CULLING
     Break up the screen into tiles
     Create a list of lights for each tile
     Only render with lights touching the tile
     [Diagram: three lights numbered 1, 2, 3 over the screen tiles, with per-tile light lists [1], [1,2,3], and [2,3]]
  • 35. TILED LIGHTING: TILED FRUSTUMS
     Divide screen into tiles
     Fit asymmetric frustum around each tile
     [Diagram: Tile0, Tile1, Tile2, Tile3]
  • 36. TILED LIGHTING: CREATE FRONT AND BACK PLANES
     Use z buffer from depth pre-pass as input
     Find min and max depth per tile
     Use this frustum for intersection testing
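The per-tile depth bounds can be sketched on the CPU as a reference for the compute shader version. This is an illustration only: the row-major depth layout, the `DepthRange` struct, and the function name are assumptions, and a GPU implementation would compute the same bounds in parallel within each thread group.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

struct DepthRange { float minZ; float maxZ; };

// Min and max depth for every tileSize x tileSize tile of a row-major
// depth buffer. Tiles are returned row-major; edge tiles may be partial.
std::vector<DepthRange> TileDepthRanges(const std::vector<float>& depth,
                                        int width, int height, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<DepthRange> ranges(
        static_cast<std::size_t>(tilesX) * tilesY,
        DepthRange{ std::numeric_limits<float>::max(),
                    std::numeric_limits<float>::lowest() });
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            DepthRange& r = ranges[(y / tileSize) * tilesX + (x / tileSize)];
            const float z = depth[y * width + x];
            r.minZ = std::min(r.minZ, z);
            r.maxZ = std::max(r.maxZ, z);
        }
    }
    return ranges;
}
```

The resulting minZ and maxZ per tile supply the front and back planes that close off the tile frustum before lights are tested against it.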
  • 37. TILED LIGHTING: CREATE PER TILE LIGHT LIST (PART 1)
     Test each light against the frustum
     [Diagram: light list Light0, Light1, Light2, Light3, Light4, …, Light10, each with Position and Radius]
  • 38. TILED LIGHTING: CREATE PER TILE LIGHT LIST (PART 2)
     For each light that intersects the frustum
      ‒ Write the index of the light in the per-tile list
     [Diagram: per-tile list with Index0 = Count, Index1 = 1, Index2 = 4, Index3 = Empty, Index4 = Empty, …; entries 1 and 4 reference Light1 and Light4 in the light list Light0 … Light10 (Position, Radius)]
  • 39. TILED LIGHTING: COMPUTE SHADER
     Implemented using a single compute shader
     A thread group is executed per tile
      ‒ e.g. [numthreads(16,16,1)] for 16x16 tile size
     Build frustum
     Calculate Z extent
      ‒ Each thread calculates Z extent in parallel
     256 lights are culled in parallel (for 16x16 tile size)
     Indices of intersecting lights are written to thread group shared memory (TGSM)
      ‒ groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];
     Then export to global light index list
      ‒ RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );
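The per-tile culling step can be sketched serially on the CPU. This is a hedged illustration rather than the shader from the talk: the types and function names are hypothetical, the frustum is assumed to be given as four inward-facing side planes plus the tile's min and max view-space depth with +z pointing into the screen, and lights are treated as spheres.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 normal; float d; };  // inside: dot(normal, p) + d >= 0
struct Light { Vec3 position; float radius; };

inline float SignedDistance(const Plane& plane, const Vec3& p) {
    return plane.normal.x * p.x + plane.normal.y * p.y + plane.normal.z * p.z + plane.d;
}

// Conservative sphere-vs-frustum test: rejects a light only when its sphere
// lies entirely outside the depth range or entirely behind a side plane.
bool LightTouchesTile(const Light& light, const std::array<Plane, 4>& sides,
                      float minZ, float maxZ) {
    assert(light.radius >= 0.0f && minZ <= maxZ);
    if (light.position.z + light.radius < minZ) return false;
    if (light.position.z - light.radius > maxZ) return false;
    for (const Plane& plane : sides) {
        if (SignedDistance(plane, light.position) < -light.radius) return false;
    }
    return true;
}

// Indices of all lights that touch the tile, in light-list order.
std::vector<std::uint32_t> BuildTileLightList(const std::vector<Light>& lights,
                                              const std::array<Plane, 4>& sides,
                                              float minZ, float maxZ) {
    std::vector<std::uint32_t> indices;
    for (std::uint32_t i = 0; i < lights.size(); ++i) {
        if (LightTouchesTile(lights[i], sides, minZ, maxZ)) {
            indices.push_back(i);
        }
    }
    return indices;
}
```

In the compute shader described on the slide, each thread of the 16x16 group would run the test for a different light, so the loop in BuildTileLightList corresponds to 256 tests in parallel, with the indices appended to the groupshared list. The plane test is conservative: a sphere near a frustum corner can pass even though it misses the tile, which costs some shading work but never drops a light.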
  • 40. TressFX PHYSICS: TOMB RAIDER
  • 41. TressFX PHYSICS: PHYSICALLY BASED HAIR SIMULATION
     Responds to natural gravity
     Global constraints to maintain various hair styles
     Collision detection with head and body
     Supports wind and other forces
     Artist-friendly
      ‒ Programmatic control over attributes
      ‒ Can tweak for different results
     The entire simulation is done on Compute Shaders
  • 42. TressFX PHYSICS: TOMB RAIDER VIDEO
  • 43. TressFX PHYSICS: DATA FLOW
     CPU: Raw Vertex Data (CPU Memory)
      ‒ Vertex data is only loaded once during the life of the program
     GPU: Hair Vertex Data UAV, processed by
      ‒ Integration and Global Shapes CS
      ‒ Local Shapes Constraint CS
      ‒ Length Constraints and Wind CS
      ‒ Collision and Tangents CS
      ‒ Hair Render Vertex Shader
     Vertex data stays on the GPU!
  • 44. TressFX PHYSICS: BENEFITS OF USING DirectCompute TO SIMULATE HAIR
     Interoperability with Direct3D
      ‒ UAVs can be shared with rendering so data can stay on the GPU
     Massively Parallel
      ‒ Thousands of hair vertices that can be animated in parallel
      ‒ Each thread in the thread group represents a vertex (or strand)
     Calculations done on data stored locally on chip
      ‒ Vertex data is loaded into Thread Group Shared Memory before calculations begin
        //------------------------------
        // Copy data into shared memory
        //------------------------------
        if (localVertexIndex < numVerticesInTheStrand)
        {
            currentPos = sharedPos[indexForSharedMem] = g_HairVertexPositions[globalVertexIndex];
            initialPos = g_InitialHairPositions[globalVertexIndex];
            initialPos.xyz = mul(float4(initialPos.xyz, 1), g_ModelTransformForHead).xyz;
        }
        GroupMemoryBarrierWithGroupSync();
  • 45. DirectCompute in Games: OTHER USES
     HDAO+
      ‒ New version of our High Definition Ambient Occlusion
      ‒ Uses TGSM for storing samples
     Bokeh Depth of Field
      ‒ Used in the Frostbite 3 engine for rendering Bokeh shapes where “hotspots” are located
     Global Illumination
      ‒ Geomerics
     Scene Management
     Voxels
      ‒ Sparse Voxel Trees
      ‒ Octrees, Quadtrees
     Particle Systems
  • 46. Summary of DirectCompute in Games: WHEN TO USE DirectCompute
     Parallelism
      ‒ Some algorithms can be highly parallelized
     Repetitive Fetches
      ‒ Caching values in TGSM can be a big win
     Results used in the graphics pipeline
      ‒ DirectCompute can reduce transfers between GPU and CPU
     Existing algorithms already optimized for DirectCompute
      ‒ Separable Filters
      ‒ Tiled Lighting
      ‒ Hair Physics
      ‒ HDAO+
      ‒ We have samples you can use! http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
  • 47. QUESTIONS?
     Ask now, or send questions to: bill.bilodeau@amd.com
  • 48. DISCLAIMER & ATTRIBUTION
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    ATTRIBUTION: © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.