Future Directions for Compute-for-Graphics
Andrew Lauritzen
Senior Rendering Engineer, SEED
@AndrewLauritzen
Who am I?
• Senior Rendering Engineer at SEED – Electronic Arts
• @SEED, https://www.ea.com/SEED
• Previously
• Graphics Software Engineer at Intel (DX12, Vulkan, OpenCL 1.0, GPU architecture, R&D)
• Developer at RapidMind (“GPGPU”, Cell SPU, x86, ray tracing)
• Notable research
• Deferred shading (Beyond Programmable Shading 2010 and 2012)
• Variance shadow maps, sample distribution shadow maps, etc.
Open Problems in Real-Time Rendering – SIGGRAPH 2017
Renderer Complexity is Increasing Rapidly
• Why?
• Demand for continued increases in quality on wide variety of hardware
• GPU fixed function not scaling as quickly as FLOPS
• Balance of GPU resources is rarely optimal for a given pass, game, etc.
• Power efficiency becoming a limiting factor in hardware/software design
• Opportunity to render “smarter”
• Less brute force use of computation and memory bandwidth
• Algorithms tailored specifically for renderer needs and content
From “GPU-Driven Rendering Pipelines” by Ulrich Haar and Sebastian Aaltonen, SIGGRAPH 2015
1600x1800
• Clear (IDs and Depth)
• Partial Z-Pass
• G-Buffer Laydown
• Resolve AA Depth
• G-Buffer Decals
• HBAO + Shadows
• Tiled Lighting + SSS
• Emissive
• Sky
• Transparency
• Velocity Vectors
3200x1800
• CB Resolve + Temporal AA
• Motion Blur
• Foreground Transparency
• Gaussian Pyramid
• Final Post-Processing
• Silhouette Outlines
3840x2160
• Display Mapping + Resample
BF1 : PS4™Pro
From “4K Checkerboard in Battlefield 1 and Mass Effect Andromeda” by Graham Wihlidal, GDC 2017
Dynamic resolution scaling… ish
GPU Compute Ecosystem
• Complex algorithms and data structures are impractical to express
• End up micro-optimizing the simple algorithms into local optima
• Significant language and hardware issues impede progress
• Writing portable, composable and efficient GPU code is often impossible
• GPU compute for graphics has been mostly stagnant
• Most of the problems I will discuss were known ~7 years ago! [BPS 2010]
• CUDA and OpenCL have progressed somewhat, but not practical for game use
Case Study: Lighting and Shading
Lighting and Shading
• Deferred shading gives us more control over scheduling shading work
• … but still a chunk of it lives in the G-buffer pass
• Per pixel state is large enough to inhibit multi-layer G-buffers
• Want better control over shading rates, scheduling, etc.
• Visibility Buffers [Burns and Hunt 2013]
• Follow-up work by Tomasz Stachowiak on GCN [Stachowiak 2015]
• Store minimal outputs from rasterization, defer attribute interpolation
• Possibly even defer full vertex shading
• Use the rasterizer just to intersect primary rays, rest in compute
Visibility Buffer Shading
• We want to run different shaders based on material/instance ID
• Similar need in deferred shading or ray tracing
• Sounds like dynamic dispatch via function pointer or virtual
• Nice and coherent in screen space
struct Material {
    Texture2D<float4> albedo;
    float roughness;
    ...
    virtual float4 shade(...) const;
};
Übershader Dispatch
struct Material
{
    int id;
    Texture2D<float4> albedo;
    float roughness;
    ...
};

float4 shade(Material* material)
{
    switch (material->id)
    {
    case 0: return shaderA(material); // Entire function gets inlined
    case 1: return shaderB(material); // Entire function gets inlined
    case 2: return shaderC(material); // Entire function gets inlined
    ...
    }
}
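Since material IDs are coherent in screen space, one common alternative to the giant switch is to bin pixels by material ID first and issue one specialized dispatch per bin, so no single shader has to statically reach every material. A CPU-side sketch of the binning step in Python (the flat pixel list and two-material example are made up for illustration):

```python
from collections import defaultdict

def bin_pixels_by_material(material_ids):
    """Group linear pixel indices by material ID so each specialized
    shader can be dispatched over a coherent, divergence-free list."""
    bins = defaultdict(list)
    for pixel, mat in enumerate(material_ids):
        bins[mat].append(pixel)
    return dict(bins)

# e.g. a 1D strip of 8 pixels touching two materials:
bins = bin_pixels_by_material([0, 0, 1, 0, 1, 1, 0, 0])
# -> {0: [0, 1, 3, 6, 7], 1: [2, 4, 5]}
# One indirect dispatch per key; each shader sees only its own material.
```

On a GPU this step is itself a compute pass (classify, then prefix-sum into per-material lists), but the idea is the same: trade a divergent dynamic dispatch for several coherent static ones.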
Shader Resource Allocation
• Why are we forced to write code like this?
• GPUs have been fully capable of dynamic jumps/branches for some time now
• Why not just define a calling convention/ABI and be done with it?
• Static resource allocation
• GPUs need to know how many resources a shader uses before launch
• Can only launch a warp if all resources are available
• Will tie up those resources for entire duration of the shader
• Function calls either get inlined, or all potential targets must be known
• “Occupancy” is a key metric for getting good GPU performance
• GPUs rely on switching between warps to hide latency (memory, instruction)
Shader Resource Allocation Example
return shaderA(material);
[Diagram: register file (40) and shared memory (12); warps 0–4 all fit] => “Occupancy 5”
Shader Resource Allocation Example
switch (material->id)
{
    case 0: return shaderA(material);
    case 1: return shaderB(material);
}
[Diagram: register file (40) and shared memory (12); only warps 0–1 fit] => “Occupancy 2”
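The two example slides reduce to simple arithmetic: occupancy is the minimum over each resource of (budget // per-warp usage), and allocation is for the statically reachable worst case. A sketch in Python, with made-up per-warp costs chosen to reproduce the slide numbers (8 register units for shaderA alone, 20 for the combined switch):

```python
def occupancy(reg_budget, smem_budget, regs_per_warp, smem_per_warp):
    """Resident warps per SM/CU: the worst-constrained resource wins.
    Because allocation covers the statically reachable worst case,
    adding one heavyweight case to a switch lowers occupancy for
    every path through the shader."""
    limits = [reg_budget // regs_per_warp]
    if smem_per_warp > 0:
        limits.append(smem_budget // smem_per_warp)
    return min(limits)

# shaderA alone (illustrative costs):   40 // 8  -> "Occupancy 5"
# switch over shaderA/shaderB:          40 // 20 -> "Occupancy 2"
occ_a = occupancy(40, 12, 8, 2)
occ_switch = occupancy(40, 12, 20, 2)
```

Fewer resident warps means fewer candidates to switch to when one warp stalls on memory, which is exactly why the übershader pattern hurts.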
Improving Occupancy via Software?
• Per IHV…
• Per architecture…
• Per SKU…
• Per driver/compiler…
From “A Deferred Material Rendering System” by Tomasz Stachowiak, 2015
Improving Occupancy via Hardware
• Dynamic resource allocation/freeing inside warps?
• Some potential here, but can get complicated quickly for the scheduler
• Compile everything for good occupancy
• Let caches handle the dynamic variation via spill to L1$?
• GPU caches are typically small relative to register files
• Generally a poor idea to rely on them to help manage dynamic working set
• … but improvements are starting to emerge!
From “Inside Volta” by Olivier Giroux and Luke Durant, GTC 2017
Visibility Buffer Takeaways
• Hardware needs improvements
• More dynamic resource allocation and scheduling
• Less sensitive to “worst case” statically reachable code
• NVIDIA Volta appears to be a step in the right direction
• Enables software improvements
• Dynamic dispatch
• Separate compilation
• Composable and reusable shader code
Case Study: Hierarchical Structures
Hierarchical Structures
• Hierarchical acceleration structures are a good tool for “smarter” rendering
• Mipmaps, quadtrees, octrees, bounding volume/interval hierarchies, …
• Constructing these structures is inherently nested parallel
• Data propagated bottom-up, top-down or both
• Conventionally created on the CPU and consumed by the GPU
• Several reasons we want to construct these structures on the GPU as well
• Latency: input data may be generated by rendering
• FLOPS are significantly higher on the GPU on some platforms
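Bottom-up construction is the same pattern as mipmap generation: each level reduces a 2x2 block of children into one parent. A minimal CPU reference in Python (max as the reduction operator; on a GPU this would be one dispatch, or one in-group pass, per level):

```python
def build_pyramid(base, reduce2x2=max):
    """Build bottom-up quadtree levels from a square 2^n x 2^n grid.
    levels[0] is the base; each next level halves both dimensions by
    reducing every 2x2 block of children into one parent cell."""
    levels = [base]
    while len(levels[-1]) > 1:
        src = levels[-1]
        n = len(src) // 2
        dst = [[reduce2x2(src[2*y][2*x],     src[2*y][2*x + 1],
                          src[2*y + 1][2*x], src[2*y + 1][2*x + 1])
                for x in range(n)] for y in range(n)]
        levels.append(dst)
    return levels  # levels[-1] is the 1x1 root

pyr = build_pyramid([[1,  2,  3,  4],
                     [5,  6,  7,  8],
                     [9, 10, 11, 12],
                     [13, 14, 15, 16]])
# pyr[1] == [[6, 8], [14, 16]], pyr[2] == [[16]]
```

Swap `max` for min/max pairs and you have a hierarchical Z or light-culling structure; the construction skeleton is identical.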
From “Tiled Light Trees” by Yuriy O’Donnell and Matthäus Chajdas, I3D 2017
Quadtree GPU Dispatch
[Diagram: a 128x128 root tile refined over three successive dispatches into 32x32 and then 16x16 tiles, with intermediate state stored to and reloaded from memory between each dispatch]
GPU Dispatch Inefficiency
From AMD’s Radeon GPU Profiler
Idle GPU: bad!
“Opportunity” for async compute! 
#define BLOCK_SIZE 16
groupshared TileData tiles[BLOCK_SIZE][BLOCK_SIZE];
groupshared float4 output[BLOCK_SIZE][BLOCK_SIZE];

float4 TileFlat(uint2 tid, TileData initialData)
{
    bool alive = true;
    ...
    for (int level = MAX_LEVELS; level > 0; --level)
    {
        if (alive && all(tid.xy < (1 << level)))
        {
            ...
            if (isBaseCase(tileData))
            {
                output[tid] = baseCase(tileData);
                alive = false;
            }
            ...
        }
        GroupMemoryBarrierWithGroupSync();
    }
    return output[tid];
}
Quadtree GPU Block Reduction
• Baked into kernel
• Optimal sizes are highly GPU dependent
• May not align with optimal sizes for data
• Potential occupancy issues
• Some algorithms need a full TileData stack
#define BLOCK_SIZE 16
groupshared TileData tiles[BLOCK_SIZE][BLOCK_SIZE];
groupshared float4 output[BLOCK_SIZE][BLOCK_SIZE];

float4 TileFlat(uint2 tid, TileData initialData)
{
    bool alive = true;
    ...
    for (int level = MAX_LEVELS; level > 0; --level)
    {
        if (alive && all(tid.xy < (1 << level)))
        {
            ...
            if (isBaseCase(tileData))
            {
                output[tid] = baseCase(tileData);
                alive = false;
            }
            ...
        }
        GroupMemoryBarrierWithGroupSync();
    }
    return output[tid];
}
Quadtree GPU Block Reduction
• Lots of GPU threads doing useless work
• Can’t terminate early
• Can’t re-pack threads
#define BLOCK_SIZE 16
groupshared TileData tiles[BLOCK_SIZE][BLOCK_SIZE];
groupshared float4 output[BLOCK_SIZE][BLOCK_SIZE];

float4 TileFlat(uint2 tid, TileData initialData)
{
    bool alive = true;
    ...
    for (int level = MAX_LEVELS; level > 0; --level)
    {
        if (alive && all(tid.xy < (1 << level)))
        {
            ...
            if (isBaseCase(tileData))
            {
                output[tid] = baseCase(tileData);
                alive = false;
            }
            ...
        }
        GroupMemoryBarrierWithGroupSync();
    }
    return output[tid];
}
Quadtree GPU Block Reduction
• Base and subdivide cases can’t do thread sync
• Can’t (efficiently) use shared memory
• Breaks nesting abstraction
Improving Performance: Writing Fast Compute Shaders
• Remove barriers (numthreads = hardware warp size)
• Remove smem and atomics; replace with register swizzling
• i.e. bypass the warp width abstraction: write code directly for GPU warps
• /** DO NOT CHANGE!! **/
From “Optimizing the Graphics Pipeline with Compute” by Graham Wihlidal, GDC 2016
Writing Fast and Portable Compute Shaders
• … is just not possible right now in the compute languages we have
• SIMD widths vary, sometimes even between kernels on the same hardware
• Reliance on brittle compiler “optimization” steps to not fall off the fast path
• Warp synchronous programming is not spec compliant [Perelygin 2017]
• No legal way to “wait” for another thread outside your group [Foley 2013]
• Performance delta can be 10x or more
• Too large to accept compatible “fallback”
• Is the SIMD width abstraction doing more harm than good?
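The portability problem in one picture: a warp-synchronous reduction is usually written against one hardware width. A Python model of a shuffle-down tree reduction, parameterized by width, showing that the algorithm itself is width-agnostic even though shipped shader code rarely is (the shuffle semantics here are a simplification: out-of-range lanes contribute nothing):

```python
def warp_reduce_sum(lanes, width):
    """Model of a shuffle-down tree reduction across one SIMD group.
    'width' stands in for the hardware SIMD width (32 on NVIDIA warps,
    64 on GCN waves); the loop structure is identical, only the
    constant baked into real shader code differs."""
    assert len(lanes) == width and width & (width - 1) == 0
    vals = list(lanes)
    offset = width // 2
    while offset > 0:
        # Each lane adds the value 'offset' lanes above it (a shfl_down);
        # lanes with no partner are left out of the final total's path.
        vals = [vals[i] + (vals[i + offset] if i + offset < width else 0)
                for i in range(width)]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the full sum

total32 = warp_reduce_sum(list(range(32)), 32)  # 496
total64 = warp_reduce_sum(list(range(64)), 64)  # 2016
```

The pain is that compute languages give no portable way to write this once: the width is neither a queryable compile-time constant nor a legal thing to synchronize around, so each variant gets hand-specialized per architecture.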
More Explicit SIMD Model?
// No “numthreads”: this is called once per warp always
void kernel(float *data)
{
    // Operations and memory at this scope are scalar, as you’d expect
    int x = 0;
    float4 array[128];

    // Does not have to match hardware SIMD size: compiler will add a loop in the kernel as needed
    parallel_foreach(int i: 0 .. N)
    {
        // This gets compiled to SIMD – this is like a regular shader in this scope
        float d = data[i] + data[i + N];

        // All the usual GPU per-lane control flow stuff works normally
        if (d > 0.0f)
            data[i] = d;

        // Parallel for’s can be nested; various options on which scope/axis to compile to SIMD
        parallel_foreach(int j: 0 .. M) { ... }
    }

    // Another parallel/SIMD loop with different dimensions; no explicit sync needed
    parallel_foreach(int i: 0 .. X) { ... }
}
void TileRecurse(int level, TileData tileData)
{
    if (level == 0 || isBaseCase(tileData))
        return baseCase(tileData);
    else
    {
        // Subdivide and recurse
        TileData subtileData[4];
        computeSubtileData(level, tileData, subtileData);
        --level;
        spawn TileRecurse(level, subtileData[0]);
        spawn TileRecurse(level, subtileData[1]);
        spawn TileRecurse(level, subtileData[2]);
        spawn TileRecurse(level, subtileData[3]);
    }
}
Quadtree Recursion on the GPU?
error X3500: 'TileRecurse': recursive functions not allowed in cs_5_1 *
* exception: CUDA dynamic parallelism
CUDA Dynamic Parallelism
From “How Tesla K20 Speeds Quicksort, a Familiar Comp-Sci Code”, Blog Post, Link
CUDA Dynamic Parallelism
• CUDA does support recursion including fork and join!
• CUDA also supports separate compilation
• Performance issues remain
• Mostly syntactic sugar for spilling, indirect dispatch, reload
• Not really targeted at games and real-time rendering
• Definitely an improvement on the semantic front
• Get similar capabilities into HLSL and implementations improve over time?
• Chicken and egg problem with game performance (ex. geometry shaders)
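The "syntactic sugar for spilling, indirect dispatch, reload" point can be made concrete: device-side recursion is roughly equivalent to flattening the call tree into a work queue, where each spawn spills a launch record and successive dispatch waves reload and drain them. A CPU sketch in Python (TileData reduced to a (level, extent) pair; `is_base_case` is a stand-in predicate, not from the slides):

```python
from collections import deque

def process_quadtree(root_extent, max_levels, is_base_case):
    """Flattened version of TileRecurse: each 'spawn' becomes a push
    of a launch record; each pop models a later dispatch reloading
    work written by an earlier one."""
    queue = deque([(max_levels, root_extent)])  # spilled launch records
    leaves = 0
    while queue:
        level, extent = queue.popleft()         # 'reload' the record
        if level == 0 or is_base_case(extent):
            leaves += 1                         # baseCase(tileData)
        else:
            child = extent // 2
            for _ in range(4):                  # spawn TileRecurse(...) x4
                queue.append((level - 1, child))
    return leaves

# Full subdivision of a 64x64 root down 3 levels -> 4^3 = 64 leaf tiles
n = process_quadtree(64, 3, lambda extent: False)
```

The semantics are pleasant; the cost is that every hop through the queue is a round trip through memory, which is why performance issues remain even where the feature exists.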
CUDA Cooperative Groups
From “CUDA 9 and Beyond” by Mark Harris, GTC 2017
Hierarchical Structures Takeaways
• Nested parallelism and trees are key components of many algorithms
• We need an efficient, portable way to express these
• Current compute languages are not particularly suitable
• Only way to write efficient code is to defeat the abstractions
• We need new languages and language features
• Map/reduce? Fork/join?
• CUDA cooperative groups?
• ISPC [Pharr 2012]? BSGP [TOG 2008]? Halide [Ragan-Kelley 2013]?
• Good opportunity for academia and industry collaboration!
Code Compatibility and Reuse
Compute Language Basics
• C-like, unless there’s a compelling reason to deviate
• Including all the regular C types… 8/16/32/64-bit
• Buffers
• Structured buffers
• “Byte address buffers” (really dword + two extra 0 bits for fun!)
• Constant buffers, texture buffers, typed buffers, …
• REAL pointers please!
• We’ve already taken on all of the “negative” aspects with root buffers (DX12)
• Now give us the positives
Compute Language Basics
• Resource binding and shader interface mess
• Even just between DX12 and Vulkan this is a giant headache
• Get rid of “signature” and “layout” glue
• Replace with regular structures and positional binding (i.e. “function calls”)
• Pointers for buffers, bindless for textures
• DX12 global heap is an acceptable solution for bindless
• Must be able to store references to textures in our own data structures
• Ideal would be to standardize a descriptor size (or even format!)
Compute Language Basics
• Avoid/remove weird features that end up as legacy pain
• UAV counters
• Be very careful with shader “extensions”
• Mature JIT compilation setup
• DXIL/SPIR-V make this both better and worse…
• Shader compiler/driver cannot be allowed to emit compilation errors (!!)
• We cannot have random errors on end user machines
Compute Language Essentials
• CPU/GPU interoperation and synchronization
• Low latency, bi-directional submission, signaling and atomics
• … in user space
• Shared data structures
• Consistent structure layouts and alignment between CPU/GPU
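The layout point is subtle in practice: HLSL constant buffers pack members into 16-byte rows and forbid a vector from straddling a row boundary, while C compilers use natural alignment, so the "same" struct can disagree between CPU and GPU. A Python sketch contrasting the two rules for a hypothetical struct { float2 a; float3 b; } (this is a deliberate simplification of the real packing rules, which have more cases):

```python
def cbuffer_offsets(sizes):
    """Simplified HLSL constant-buffer packing: members are laid out
    sequentially, but a member may not straddle a 16-byte row."""
    offsets, cursor = [], 0
    for size in sizes:
        starts_row = cursor % 16 == 0
        straddles = cursor // 16 != (cursor + size - 1) // 16
        if straddles and not starts_row:
            cursor = (cursor + 15) // 16 * 16   # bump to the next row
        offsets.append(cursor)
        cursor += size
    return offsets

def c_natural_offsets(sizes, aligns):
    """Plain C struct layout: each member aligned to its own alignment."""
    offsets, cursor = [], 0
    for size, align in zip(sizes, aligns):
        cursor = (cursor + align - 1) // align * align
        offsets.append(cursor)
        cursor += size
    return offsets

# struct { float2 a; float3 b; }  (member sizes in bytes)
hlsl = cbuffer_offsets([8, 12])            # b pushed to the next 16-byte row
cpu  = c_natural_offsets([8, 12], [8, 4])  # b packed right after a
```

Until layout rules are consistent (or at least mechanically shared), every CPU/GPU struct needs hand-maintained padding, which is exactly the kind of glue a shared-data-structure story should eliminate.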
Call to Action
Compute for Graphics
Call to Action
• Hardware vendors
• More dynamic resource allocation and scheduling
• Less sensitive to “worst case” statically reachable code
• Dynamic warp sorting and repacking
• Unified compression and format conversions on all memory paths
• Operating system vendors
• GPUs as first class citizens (like another “core” in the system)
• Fast, user-space dispatch and signaling
• Shared virtual memory
Call to Action
• Standards bodies
• Be willing to make breaking changes to compute execution models
• Forgo some low level performance (optionally) for better algorithmic expression
• Resist being bullied by hardware vendors 
• Academics and language designers
• Innovate on new efficient, composable, and robust parallel languages
• Work with industry on requirements and pain points
• Game developers
• Let people know that the status quo is not fine
• Help the people who are working to improve it!
Conclusion
• The industry recently delivered explicit graphics APIs
• Data-parallel intermediate languages (DXIL, SPIR-V)
• Next problem to solve is compute
• Want to take real-time graphics further
• Ex. 8k @ 90Hz stereo VR with high quality lighting, shading and anti-aliasing
• Need to render “smarter”
• Requires industry-wide cooperation!
Acknowledgements
• Aaron Lefohn (@aaronlefohn)
• Alex Fry (@TheFryster)
• Colin Barré Brisebois (@ZigguratVertigo)
• Dave Oldcorn
• Graham Wihlidal (@gwihlidal)
• Jasper Bekkers (@JasperBekkers)
• Jefferson Montgomery (@jdmo3)
• Johan Andersson (@repi)
• Matt Pharr (@mattpharr)
• Natalya Tatarchuk (@mirror2mask)
• Neil Henning (@sheredom)
• Tim Foley (@TangentVector)
• Timothy Lottes (@TimothyLottes)
• Tomasz Stachowiak (@h3r2tic)
• Yuriy O'Donnell (@YuriyODonnell)
References
• Brodman, James et al., Writing Scalable SIMD Programs with ISPC, WPMVP 2014
• Burns, Christopher and Hunt, Warren, The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading, JCGT 2013
• Foley, Tim, A Digression on Divergence, Blog Post, 2013
• Giroux, Olivier and Durant, Luke, Inside Volta, GTC 2017
• Haar, Ulrich and Aaltonen, Sebastian, GPU-Driven Rendering Pipelines, Advances in Real-Time Rendering 2015
• Harris, Mark, CUDA 9 and Beyond, GTC 2017
• Hou, Qiming et al., BSGP: Bulk-Synchronous GPU Programming, TOG 2008
• Jones, Stephen, Introduction to Dynamic Parallelism, GTC 2012
• O’Donnell, Yuriy and Chajdas, Matthäus, Tiled Light Trees, I3D 2017
• Perelygin, Kyrylo and Lin, Yuan, Cooperative Groups, GTC 2017
• Pharr, Matt and Mark, William, ISPC: A SPMD Compiler for High-Performance SIMD Programming, InPar 2012
• Ragan-Kelley, Jonathan et al., Halide: A Language and Compiler for Optimizing Parallelism, Locality and Recomputation in Image Processing Pipelines, PLDI 2013
• Stachowiak, Tomasz, A Deferred Material Rendering System, Blog Post, 2015
• Wihlidal, Graham, 4K Checkerboard in Battlefield 1 and Mass Effect Andromeda, GDC 2017
• Wihlidal, Graham, Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Stochastic Screen-Space Reflections
Stochastic Screen-Space ReflectionsStochastic Screen-Space Reflections
Stochastic Screen-Space Reflections
 
Frostbite on Mobile
Frostbite on MobileFrostbite on Mobile
Frostbite on Mobile
 
Rendering Battlefield 4 with Mantle
Rendering Battlefield 4 with MantleRendering Battlefield 4 with Mantle
Rendering Battlefield 4 with Mantle
 
Mantle for Developers
Mantle for DevelopersMantle for Developers
Mantle for Developers
 
Battlefield 4 + Frostbite + Mantle
Battlefield 4 + Frostbite + MantleBattlefield 4 + Frostbite + Mantle
Battlefield 4 + Frostbite + Mantle
 

Kürzlich hochgeladen

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Future Directions for Compute-for-Graphics

  • 1. Future Directions for Compute-for-Graphics Andrew Lauritzen Senior Rendering Engineer, SEED @AndrewLauritzen
  • 2. Who am I? • Senior Rendering Engineer at SEED – Electronic Arts • @SEED, https://www.ea.com/SEED • Previously • Graphics Software Engineer at Intel (DX12, Vulkan, OpenCL 1.0, GPU architecture, R&D) • Developer at RapidMind (“GPGPU”, Cell SPU, x86, ray tracing) • Notable research • Deferred shading (Beyond Programmable Shading 2010 and 2012) • Variance shadow maps, sample distribution shadow maps, etc. Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 3. Renderer Complexity is Increasing Rapidly • Why? • Demand for continued increases in quality on wide variety of hardware • GPU fixed function not scaling as quickly as FLOPS • Balance of GPU resources is rarely optimal for a given pass, game, etc. • Power efficiency becoming a limiting factor in hardware/software design • Opportunity to render “smarter” • Less brute force use of computation and memory bandwidth • Algorithms tailored specifically for renderer needs and content Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 4. From “GPU-Driven Rendering Pipelines” by Ulrich Haar and Sebastian Aaltonen, SIGGRAPH 2015
  • 5. 1600x1800 • Clear (IDs and Depth) • Partial Z-Pass • G-Buffer Laydown • Resolve AA Depth • G-Buffer Decals • HBAO + Shadows • Tiled Lighting + SSS • Emissive • Sky • Transparency • Velocity Vectors 3200x1800 • CB Resolve + Temporal AA • Motion Blur • Foreground Transparency • Gaussian Pyramid • Final Post-Processing • Silhouette Outlines 3840x2160 • Display Mapping + Resample BF1 : PS4™Pro From “4K Checkerboard in Battlefield 1 and Mass Effect Andromeda” by Graham Wihlidal, GDC 2017 Dynamic resolution scaling ish… ish…
  • 6. GPU Compute Ecosystem • Complex algorithms and data structures are impractical to express • End up micro-optimizing the simple algorithms into local optima • Significant language and hardware issues impede progress • Writing portable, composable and efficient GPU code is often impossible • GPU compute for graphics has been mostly stagnant • Most of the problems I will discuss were known ~7 years ago! [BPS 2010] • CUDA and OpenCL have progressed somewhat, but not practical for game use Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 7. Case Study: Lighting and Shading
  • 8. Lighting and Shading • Deferred shading gives us more control over scheduling shading work • … but still a chunk of it lives in the G-buffer pass • Per pixel state is large enough to inhibit multi-layer G-buffers • Want better control over shading rates, scheduling, etc. • Visibility Buffers [Burns and Hunt 2013] • Follow-up work by Tomasz Stachowiak on GCN [Stachowiak 2015] • Store minimal outputs from rasterization, defer attribute interpolation • Possibly even defer full vertex shading • Use the rasterizer just to intersect primary rays, rest in compute Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 9.
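Concretely, the per-pixel payload of a visibility buffer can be as small as one 32-bit value packing an instance/draw ID together with a triangle ID. A minimal sketch of that encoding; the 12/20 bit split here is an illustrative assumption, not a scheme from the talk (engines pick their own bit budget):

```python
# Pack a visibility sample into 32 bits: high bits identify the instance/draw,
# low bits identify the triangle within it. The 12/20 split is illustrative.
INSTANCE_BITS, TRIANGLE_BITS = 12, 20

def pack_visibility(instance_id, triangle_id):
    assert 0 <= instance_id < (1 << INSTANCE_BITS)
    assert 0 <= triangle_id < (1 << TRIANGLE_BITS)
    return (instance_id << TRIANGLE_BITS) | triangle_id

def unpack_visibility(sample):
    # Recover (instance_id, triangle_id) in the deferred shading pass.
    return sample >> TRIANGLE_BITS, sample & ((1 << TRIANGLE_BITS) - 1)
```

The deferred pass uses the unpacked IDs to fetch vertex data and interpolate attributes itself, which is what makes the rasterization output so slim.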
  • 10. Visibility Buffer Shading • We want to run different shaders based on material/instance ID • Similar need in deferred shading or ray tracing • Sounds like dynamic dispatch via function pointer or virtual • Nice and coherent in screen space

        struct Material {
            Texture2D<float4> albedo;
            float roughness;
            ...
            virtual float4 shade(...) const;
        };

  Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 11. Übershader Dispatch

        struct Material {
            int id;
            Texture2D<float4> albedo;
            float roughness;
            ...
        };

        float4 shade(Material* material) {
            switch (material->id) {
                case 0: return shaderA(material); // Entire function gets inlined
                case 1: return shaderB(material); // Entire function gets inlined
                case 2: return shaderC(material); // Entire function gets inlined
                ...
            };
        }

  Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 12. Shader Resource Allocation • Why are we forced to write code like this? • GPUs have been fully capable of dynamic jumps/branches for some time now • Why not just define a calling convention/ABI and be done with it? • Static resource allocation • GPUs need to know how many resources a shader uses before launch • Can only launch a warp if all resources are available • Will tie up those resources for the entire duration of the shader • Function calls either get inlined, or all potential targets must be known • “Occupancy” is a key metric for getting good GPU performance • GPUs rely on switching between warps to hide latency (memory, instruction) Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 13. Shader Resource Allocation Example return shaderA(material); Registers (40) Shared Memory (12) Warp 0 => “Occupancy 5” Warp 1 Warp 2 Warp 3 Warp 4 Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 14. Shader Resource Allocation Example switch (material->id) { case 0: return shaderA(material); case 1: return shaderB(material); }; Registers (40) Shared Memory (12) => “Occupancy 2” Warp 0 Warp 1 Open Problems in Real-Time Rendering – SIGGRAPH 2017
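The arithmetic behind these two occupancy examples can be written down directly. A toy Python model; the pool sizes (200 register units, 64 shared-memory units) are abstract numbers chosen to match the slide diagrams, not real hardware values:

```python
# Toy occupancy model: resident warps are limited by the worst-case static
# demand on each resource pool. Pool sizes are abstract units from the slides.
def occupancy(regs_per_warp, smem_per_warp, reg_pool=200, smem_pool=64):
    return min(reg_pool // regs_per_warp, smem_pool // smem_per_warp)

# Single material shader: 40 registers, 12 shared memory -> 5 warps fit.
solo = occupancy(40, 12)

# Übershader: static allocation follows the *worst* branch through the code,
# even for warps that never take it (worst-case numbers assumed here).
uber = occupancy(max(40, 95), max(12, 30))
```

The key point the model captures: adding an expensive case to the switch lowers occupancy for every material, whether or not that case ever executes.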
  • 15. Improving Occupancy via Software? • Per IHV… • Per architecture… • Per SKU… • Per driver/compiler… From “A deferred material rendering system” by Tomasz Stachowiak, 2017
  • 16. Improving Occupancy via Hardware • Dynamic resource allocation/freeing inside warps? • Some potential here, but can get complicated quickly for the scheduler • Compile everything for good occupancy • Let caches handle the dynamic variation via spill to L1$? • GPU caches are typically small relative to register files • Generally a poor idea to rely on them to help manage dynamic working set • … but improvements are starting to emerge! Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 17. From “Inside Volta” by Olivier Giroux and Luke Durant, GTC 2017
  • 18. From “Inside Volta” by Olivier Giroux and Luke Durant, GTC 2017
  • 19. Visibility Buffer Takeaways • Hardware needs improvements • More dynamic resource allocation and scheduling • Less sensitive to “worst case” statically reachable code • NVIDIA Volta appears to be a step in the right direction • Enables software improvements • Dynamic dispatch • Separate compilation • Composable and reusable shader code Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 21. Hierarchical Structures • Hierarchical acceleration structures are a good tool for “smarter” rendering • Mipmaps, quadtrees, octrees, bounding volume/interval hierarchies, … • Constructing these structures is inherently nested parallel • Data propagated bottom-up, top-down or both • Conventionally created on the CPU and consumed by the GPU • Several reasons we want to construct these structures on the GPU as well • Latency: input data may be generated by rendering • FLOPS are significantly higher on the GPU on some platforms Open Problems in Real-Time Rendering – SIGGRAPH 2017
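As a concrete instance of the bottom-up pattern: building a min-depth pyramid (hi-Z style), where each parent takes the minimum of its four children. A pure-Python sketch of the data flow; a GPU builder would run one dispatch per level:

```python
# Bottom-up quadtree/mip build: each parent stores the min of its 4 children.
# 'base' is a 2^n x 2^n grid of depth values (list of lists).
def build_min_pyramid(base):
    levels = [base]
    while len(levels[-1]) > 1:
        src = levels[-1]
        n = len(src) // 2
        # One "pass": reduce every 2x2 block of the previous level.
        levels.append([[min(src[2*y][2*x],     src[2*y][2*x + 1],
                            src[2*y + 1][2*x], src[2*y + 1][2*x + 1])
                        for x in range(n)] for y in range(n)])
    return levels
```

The serial while-loop is exactly where the GPU version pays: each level is a dependent dispatch over a shrinking grid.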
  • 22.
  • 23. From “Tiled Light Trees” by Yuriy O’Donnell and Matthäus Chajdas, I3D 2017
  • 24. Quadtree GPU Dispatch 128x128 32x32 32x32 16x16 32x32 Dispatch Dispatch Dispatch Store/reload state Store/reload state 16x16 16x16 16x16 16x16 16x16 16x16 16x16 32x32 16x16 16x16 16x16 16x16 16x16 16x16 16x16 16x16 Open Problems in Real-Time Rendering – SIGGRAPH 2017
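The dispatch chain on this slide can be modeled as a loop over work queues: each launch consumes one queue of equal-sized tiles and writes surviving subtiles to a buffer that the next launch reloads. A Python simulation of that control flow (tile coordinates and the split predicate are made up for illustration):

```python
# Simulation of the multi-dispatch quadtree refinement pattern: each GPU
# dispatch processes one queue of same-sized tiles and appends subdivided
# subtiles to a buffer ("store state") that the next dispatch reloads.
def refine(tiles, needs_split):
    finished, passes = [], 0
    while tiles:
        passes += 1                           # one simulated dispatch
        next_tiles = []                       # the "store/reload state" buffer
        for x, y, size in tiles:
            if size > 1 and needs_split(x, y, size):
                half = size // 2
                next_tiles += [(x + dx, y + dy, half)
                               for dy in (0, half) for dx in (0, half)]
            else:
                finished.append((x, y, size))  # leaf tile: emit final work
        tiles = next_tiles
    return finished, passes
```

Refining a 4x4 root where only the origin tile keeps subdividing takes three dependent dispatches; that serialization is exactly what shows up as idle gaps in the profiler capture on the following slide.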
  • 25. GPU Dispatch Inefficiency From AMD’s Radeon GPU Profiler Open Problems in Real-Time Rendering – SIGGRAPH 2017 Idle GPU: bad! “Opportunity” for async compute!  Idle GPU: bad!
  • 26. Quadtree GPU Block Reduction • Baked into kernel • Optimal sizes are highly GPU dependent • May not align with optimal sizes for data • Potential occupancy issues • Some algorithms need a full TileData stack

        #define BLOCK_SIZE 16
        groupshared TileData tiles[BLOCK_SIZE][BLOCK_SIZE];
        groupshared float4 output[BLOCK_SIZE][BLOCK_SIZE];

        float4 TileFlat(uint2 tid, TileData initialData) {
            bool alive = true;
            ...
            for (int level = MAX_LEVELS; level > 0; --level) {
                if (alive && all(tid.xy < (1 << level))) {
                    ...
                    if (isBaseCase(tileData)) {
                        output[tid] = baseCase(tileData);
                        alive = false;
                    }
                    ...
                }
                GroupMemoryBarrierWithGroupSync();
            }
            return output[tid];
        }

  Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 27. Quadtree GPU Block Reduction (same kernel as slide 26) • Lots of GPU threads doing useless work • Can’t terminate early • Can’t re-pack threads Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 28. Quadtree GPU Block Reduction (same kernel as slide 26) • Base and subdivide cases can’t do thread sync • Can’t (efficiently) use shared memory • Breaks nesting abstraction Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 30. Writing Fast Compute Shaders • Remove barriers (numthreads = hardware warp size) • Remove smem and atomics • Replace with register swizzling • i.e. bypass the warp width abstraction • Write code directly for GPU warps /** DO NOT CHANGE!! **/ From “Optimizing the Graphics Pipeline with Compute” by Graham Wihlidal, GDC 2016
  • 31. Writing Fast and Portable Compute Shaders • … is just not possible right now in the compute languages we have • SIMD widths vary, sometimes even between kernels on the same hardware • Reliance on brittle compiler “optimization” steps to not fall off the fast path • Warp synchronous programming is not spec compliant [Perelygin 2017] • No legal way to “wait” for another thread outside your group [Foley 2013] • Performance delta can be 10x or more • Too large to accept compatible “fallback” • Is the SIMD width abstraction doing more harm than good? Open Problems in Real-Time Rendering – SIGGRAPH 2017
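To make the width-portability problem concrete, here is the shuffle-style tree reduction that warp-synchronous code uses in place of barriers, modeled in Python. The same routine must produce the right answer whether the hardware width is 32 or 64, which is exactly what ad-hoc warp-synchronous shader code cannot portably guarantee:

```python
# Model of a shuffle-down tree reduction over one SIMD unit. At each step,
# lane i adds the value from lane i+offset (like shfl_down); after
# log2(width) steps, lane 0 holds the full sum. Width is a parameter here;
# on real hardware it is a GPU property the source language does not expose.
def warp_reduce_sum(lanes):
    width = len(lanes)                 # must be a power of two
    offset = width // 2
    while offset:
        lanes = [v + (lanes[i + offset] if i + offset < width else 0)
                 for i, v in enumerate(lanes)]
        offset //= 2
    return lanes[0]                    # only lane 0's result is meaningful
```

The Python version is trivially width-agnostic because `width` is just data; compiled shader code has to bake the width in, which is where portability breaks.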
  • 32. More Explicit SIMD Model?

        // No “numthreads”: this is called once per warp always
        void kernel(float *data) {
            // Operations and memory at this scope are scalar, as you’d expect
            int x = 0;
            float4 array[128];

            // Does not have to match hardware SIMD size: compiler will add a loop in the kernel as needed
            parallel_foreach(int i: 0 .. N) {
                // This gets compiled to SIMD – this is like a regular shader in this scope
                float d = data[i] + data[i + N];

                // All the usual GPU per-lane control flow stuff works normally
                if (d > 0.0f)
                    data[i] = d;

                // Parallel for’s can be nested; various options on which scope/axis to compile to SIMD
                parallel_foreach(int j: 0 .. M) {
                    ...
                }
            }

            // Another parallel/SIMD loop with different dimensions; no explicit sync needed
            parallel_foreach(int i: 0 .. X) {
                ...
            }
        }

  Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 33. Quadtree Recursion on the GPU?

        void TileRecurse(int level, TileData tileData) {
            if (level == 0 || isBaseCase(tileData))
                return baseCase(tileData);
            else {
                // Subdivide and recurse
                TileData subtileData[4];
                computeSubtileData(level, tileData, subtileData);
                --level;
                spawn TileRecurse(level, subtileData[0]);
                spawn TileRecurse(level, subtileData[1]);
                spawn TileRecurse(level, subtileData[2]);
                spawn TileRecurse(level, subtileData[3]);
            }
        }

  error X3500: 'TileRecurse': recursive functions not allowed in cs_5_1
  * exception: CUDA dynamic parallelism
  Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 34. CUDA Dynamic Parallelism From “How Tesla K20 Speeds Quicksort, a Familiar Comp-Sci Code”, Blog Post, Link
  • 35. CUDA Dynamic Parallelism • CUDA does support recursion, including fork and join! • CUDA also supports separate compilation • Performance issues remain • Mostly syntactic sugar for spilling, indirect dispatch, reload • Not really targeted at games and real-time rendering • Definitely an improvement on the semantic front • Get similar capabilities into HLSL and let implementations improve over time? • Chicken-and-egg problem with game performance (ex. geometry shaders) Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 36. CUDA Cooperative Groups From “CUDA 9 and Beyond” by Mark Harris, GTC 2017
  • 37. Hierarchical Structures Takeaways • Nested parallelism and trees are key components of many algorithms • We need an efficient, portable way to express these • Current compute languages are not particularly suitable • Only way to write efficient code is to defeat the abstractions • We need new languages and language features • Map/reduce? Fork/join? • CUDA cooperative groups? • ISPC [Pharr 2012]? BSGP [TOG 2008]? Halide [Ragan-Kelly 2013]? • Good opportunity for academia and industry collaboration! Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 39. Compute Language Basics • C-like, unless there’s a compelling reason to deviate • Including all the regular C types… 8/16/32/64-bit • Buffers • Structured buffers • “Byte address buffers” (really dword + two extra 0 bits for fun!) • Constant buffers, texture buffers, typed buffers, … • REAL pointers please! • We’ve already taken on all of the “negative” aspects with root buffers (DX12) • Now give us the positives Open Problems in Real-Time Rendering – SIGGRAPH 2017
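The “two extra 0 bits” quip refers to the alignment rule: a byte address buffer load takes a byte offset, but that offset must be a multiple of 4, so the hardware is effectively indexing dwords. A small model of the addressing:

```python
# "Byte address" buffers in disguise: addresses are byte offsets, but loads
# must be 4-byte aligned, so addr >> 2 is the real dword index.
def load_dword(dwords, byte_addr):
    assert byte_addr & 3 == 0, "byte-address loads must be dword-aligned"
    return dwords[byte_addr >> 2]
```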
  • 40. Compute Language Basics • Resource binding and shader interface mess • Even just between DX12 and Vulkan this is a giant headache • Get rid of “signature” and “layout” glue • Replace with regular structures and positional binding (i.e. “function calls”) • Pointers for buffers, bindless for textures • DX12 global heap is an acceptable solution for bindless • Must be able to store references to textures in our own data structures • Ideal would be to standardize a descriptor size (or even format!) Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 41. Compute Language Basics • Avoid/remove weird features that end up as legacy pain • UAV counters • Be very careful with shader “extensions” • Mature JIT compilation setup • DXIL/SPIR-V make this both better and worse… • Shader compiler/driver cannot be allowed to emit compilation errors (!!) • We cannot have random errors on end user machines Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 42. Compute Language Essentials • CPU/GPU interoperation and synchronization • Low latency, bi-directional submission, signaling and atomics • … in user space • Shared data structures • Consistent structure layouts and alignment between CPU/GPU Open Problems in Real-Time Rendering – SIGGRAPH 2017
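A concrete instance of the layout problem: C packs a struct like `{ float intensity; float3 dir; }` into 16 bytes, while HLSL constant-buffer packing forbids a vector from straddling a 16-byte register boundary, pushing the float3 to offset 16. Reading one layout with the other's offsets silently corrupts data. A sketch using Python's `struct` module to show the two layouts:

```python
import struct

# C/C++ layout: tightly packed, the float3 starts right after the float.
c_layout = "<f3f"
# HLSL cbuffer layout: a float3 may not straddle a 16-byte register
# boundary, so 12 bytes of padding push it to offset 16.
hlsl_layout = "<f12x3f"

c_dir_offset = struct.calcsize("<f")        # offset of dir in the C layout
hlsl_dir_offset = struct.calcsize("<f12x")  # offset of dir in the cbuffer
```

This is why "consistent structure layouts and alignment between CPU/GPU" is on the essentials list: today every engine hand-maintains these offsets on both sides.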
  • 44. Compute for Graphics Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 45. Call to Action • Hardware vendors • More dynamic resource allocation and scheduling • Less sensitive to “worst case” statically reachable code • Dynamic warp sorting and repacking • Unified compression and format conversions on all memory paths • Operating system vendors • GPUs as first class citizens (like another “core” in the system) • Fast, user-space dispatch and signaling • Shared virtual memory Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 46. Call to Action • Standards bodies • Be willing to make breaking changes to compute execution models • Forgo some low level performance (optionally) for better algorithmic expression • Resist being bullied by hardware vendors  • Academics and language designers • Innovate on new efficient, composable, and robust parallel languages • Work with industry on requirements and pain points • Game developers • Let people know that the status quo is not fine • Help the people who are working to improve it! Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 47. Conclusion • Recently industry did explicit graphics APIs • Data-parallel intermediate languages (DXIL, SPIR-V) • Next problem to solve is compute • Want to take real-time graphics further • Ex. 8k @ 90Hz stereo VR with high quality lighting, shading and anti-aliasing • Need to render “smarter” • Requires industry-wide cooperation! Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 48. Acknowledgements • Aaron Lefohn (@aaronlefohn) • Alex Fry (@TheFryster) • Colin Barré Brisebois (@ZigguratVertigo) • Dave Oldcorn • Graham Wihlidal (@gwihlidal) • Jasper Bekkers (@JasperBekkers) • Jefferson Montgomery (@jdmo3) • Johan Andersson (@repi) • Matt Pharr (@mattpharr) • Natalya Tatarchuk (@mirror2mask) • Neil Henning (@sheredom) • Tim Foley (@TangentVector) • Timothy Lottes (@TimothyLottes) • Tomasz Stachowiak (@h3r2tic) • Yuriy O'Donnell (@YuriyODonnell) Open Problems in Real-Time Rendering – SIGGRAPH 2017
  • 49. References • Brodman, James et al. Writing Scalable SIMD Programs with ISPC, WPMVP 2014, Link Web Site • Burns, Christopher and Hunt, Warren, The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading, JCGT 2013, Link • Foley, Tim, A Digression on Divergence, Blog Post in 2013, Link • Giroux, Olivier and Durant, Luke, Inside Volta, GTC 2017, Link • Haar, Ulrich and Aaltonen, Sebastian, GPU-Driven Rendering Pipelines, Advances in Real-Time Rendering 2015, Link • Harris, Mark, CUDA 9 and Beyond, GTC 2017, Link • Hou, Qiming et al. BSGP: Bulk-Synchronous GPU Programming, TOG 2008, Link • Jones, Stephen, Introduction to Dynamic Parallelism, GTC 2012, Link • O’Donnell, Yuriy and Chajdas, Matthäus, Tiled Light Trees, I3D 2017, Link • Perelygin, Kyrylo and Lin, Yuan, Cooperative Groups, GTC 2017, Link • Pharr, Matt and Mark, William, ISPC: A SPMD Compiler for High-Performance SIMD Programming, InPar 2012, Link • Ragan-Kelly, Jonathan et al. Halide: A Language and Compiler for Optimizing Parallelism, Locality and Recomputation in Image Processing Pipelines, PLDI 2013, Link Web Site • Stachowiak, Tomasz, A Deferred Material Rendering System, Blog Post 2015, Link • Wihlidal, Graham, 4K Checkerboard in Battlefield 1 and Mass Effect Andromeda, GDC 2017, Link • Wihlidal, Graham, Optimizing the Graphics Pipeline with Compute, GDC 2016, Link Open Problems in Real-Time Rendering – SIGGRAPH 2017

Editor's notes

  1. I just recently joined the SEED team at Electronic Arts. Before that I worked at Intel for 8 years. During my time there I was involved pretty heavily with the DirectX 12 and Vulkan specifications and prototypes, and before that with OpenCL’s initial specification. I worked in the early “GPGPU” days at RapidMind on a variety of parallel processors. It’s worth noting that I’m not a compiler expert, but I have experience with parallel languages and GPU architecture and design.
  2. To motivate this talk, let me point out a trend that most people here won’t find very surprising: the complexity of modern real-time renderers has been increasing rapidly over the past few years. There’s a number of reasons why this trend has accelerated recently, but fundamentally while GPUs continue to get faster, simply scaling the same brute force algorithms up produces diminishing returns in terms of image quality. Happily, GPUs are also becoming increasingly programmable, and thus we’ve seen a sharp increase in “smarter” renderers.
  3. For instance, Ubisoft gave a talk a couple years ago wherein they described the geometry pipeline in Assassin's Creed Unity. To improve the performance of rendering massive amounts of geometry, they implemented a variety of hierarchical clustering and culling steps in compute shaders to greatly cut down on the amount of geometry sent to the GPU rasterizer. This allowed them to greatly increase the amount of detail in their large, open world games.
  4. At the other end of the pipeline, shading is getting increasingly decoupled from the conventional GPU rendering pipeline. In a presentation earlier this year, Graham described how the Frostbite engine uses checkerboarding to add more perceived resolution by being more particular about the frequencies at which they evaluate various terms during rendering. In reality it’s actually even more decoupled than this image shows, as the internal render resolutions are also driven by an adaptive scaling algorithm, and temporal anti-aliasing is itself an additional decoupling in time. Of course all of these things significantly increase the complexity of the renderer, and that trend will continue.
  5. Unfortunately, we are rapidly approaching a complexity wall past which it becomes impractical for us to express the algorithms and data structures we need to advance the state of the art. Thus instead of pursuing fundamentally more efficient algorithms, we often fall back on low level optimization of the simpler algorithms at the expense of the scaling we really desire. At the root of the problem is a variety of language and hardware issues that I’ll describe in this talk. It’s worth noting that none of the problems I’ll talk about are particularly new. Unfortunately, GPU compute in graphics APIs has seen very few changes since its initial introduction. While CUDA and OpenCL have made a little progress, they are not targeted at games.
  6. Let’s go through a couple case studies to support these claims.
  7. Deferred shading is a great tool that lets us decouple a good chunk of the shading cost so that we can schedule it in a more efficient manner. That said, the per-pixel cost of storing surface data makes things like multi-layer G-buffers fairly expensive. We also want even more control over how we position and shade samples in the future, so a light representation of the rays to shade would be ideal. So why not defer even more of the shading work? Visibility buffers do that by performing the minimal amount of rasterization up front and deferring attribute interpolation and surface shading to the later pass.
  8. Rendering a visibility buffer is pretty straightforward. We just need a unique mapping from a given pixel to the nearest triangle covering that pixel. The easiest way to do this is to render depth and primitive IDs to a slim visibility buffer. We can also encode a material ID if convenient, or just have a mapping from primitive or instance ID to material. On some platforms it also makes sense to render barycentric coordinates, but these can be computed in the deferred pass as well.
  9. Now that we have a visibility buffer, we have a lot of flexibility in how to shade it. In most cases we want some sort of dynamic dispatch of different shaders based on the material or instance ID in the buffer. People who are familiar with conventional deferred shading will recognize this, but it comes up in ray tracing and a variety of other situations too. Here’s one natural expression of what we’re trying to do here. Now low level GPU folks are probably losing their minds, but don’t get tied up in details or implementation baggage of these notions in C++. The point is that we need a clean, composable way to do dynamic dispatch on GPUs.
  10. Instead of that, here’s the code you actually have to write to do something like this on GPUs today. This is a classic “ubershader”.
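In C-like terms, the ubershader pattern collapses every material into one function selected by a switch. A tiny sketch, with illustrative material IDs and shading math:

```cpp
#include <cassert>
#include <cstdint>

// Ubershader-style dispatch: all material paths live in one function, even
// though any given pixel executes only one of them.
float ubershade(uint32_t materialId, float nDotL) {
    switch (materialId) {
        case 0:  return nDotL;  // "diffuse" path
        case 1:  return 1.0f;   // "emissive" path
        default: return 0.0f;   // unknown material
    }
}
```

The catch is that every path contributes to the shader's static resource requirements whether it runs or not, which is where the occupancy discussion below comes in.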
11. It’s worth taking a second to consider why we have to write ubershaders instead of just using function calls. GPUs have been able to do dynamic dispatch for quite a while, so why not just define the appropriate calling convention and support it in the language? The problem is that GPU hardware is still very reliant on static resource allocation. Before launching a new warp, the scheduler must ensure that all resources that it might potentially need – such as registers and shared memory – are available. Once the warp is in flight, all of those resources are tied up for the duration regardless of the actual dynamic use of those resources. This is where we get the notion of “occupancy”. Occupancy is a measure of how many warps are live on the GPU at a given time. While occupancy does not equal performance, GPUs heavily rely on switching between live warps to hide various latencies, and thus lower occupancy will generally lead to lower performance. Let’s look at a quick example.
12. Here we have a single material shader that needs some number of registers, and some amount of shared memory. When we launch a wave onto the GPU, those resources are effectively tied up. We can continue launching waves until we no longer have enough registers to launch a 6th one. Thus we can say that the “occupancy” of this shader is 5, which just means that 5 is the largest number of waves of this particular shader that we can ever fit on a single execution unit.
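The arithmetic behind that "5" can be sketched as follows; the register file and shared memory sizes here are made-up illustrative numbers, not any real GPU:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Waves that fit on one execution unit: the minimum over every statically
// allocated resource, capped by the hardware wave limit.
uint32_t occupancy(uint32_t regFileSize, uint32_t regsPerWave,
                   uint32_t sharedMemSize, uint32_t sharedPerWave,
                   uint32_t maxWaves) {
    uint32_t byRegs   = regFileSize / regsPerWave;
    uint32_t byShared = sharedPerWave ? sharedMemSize / sharedPerWave : maxWaves;
    return std::min({byRegs, byShared, maxWaves});
}
```

With a 1024-register file and 200 registers per wave, occupancy is 5; bump the worst-case path to 256 registers and it drops to 4, even if that path never executes.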
  13. Now consider adding another material to our ubershader. The “shaderB” path being present in this shader at all – even if it is never called – increases the amount of static resources that our shader requires. This in turn reduces the number of warps we can fit on the machine. Unfortunately occupancy is not only determined by the worst case code path through the shader, but also the worst case use of any individual GPU resource.
  14. Mixing code with different occupancy hurts performance… so let’s not mix occupancy! Tom notes that there aren’t actually that many different possible occupancies, so why not just bin our invocations appropriately and do one dispatch per bin? This is actually a potentially viable solution on a console, but unfortunately the occupancy of a given shader depends on the IHV… and architecture… potentially the SKU… and definitely the given driver version and associated shader compiler. Thus it’s impractical to do this in a portable way. There’s also some additional overhead from the binning pass that we’d like to avoid.
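The binning idea itself is simple to express. Here is a CPU-side sketch in which the per-material occupancy table stands in for what, in practice, varies by IHV, architecture and driver version:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Group shading invocations (pixels) by the occupancy of the shader their
// material uses; each bin would then become one dispatch.
std::map<uint32_t, std::vector<uint32_t>>
binByOccupancy(const std::vector<uint32_t>& pixelMaterial,
               const std::vector<uint32_t>& materialOccupancy) {
    std::map<uint32_t, std::vector<uint32_t>> bins;
    for (uint32_t px = 0; px < pixelMaterial.size(); ++px)
        bins[materialOccupancy[pixelMaterial[px]]].push_back(px);
    return bins;
}
```

The sketch makes the portability problem obvious: the `materialOccupancy` table has no portable source of truth, so the binning key changes under your feet with every driver update.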
  15. Going forward, we really need some improvements from hardware in this area. There’s a variety of ways to accomplish this. One option is to allow shaders to more dynamically allocate and free registers and other resources while running. There’s definitely potential here, but it can get complicated for the warp scheduler to ensure good utilization without deadlocks. Another option is to forcibly compile all of our shaders with limited resource counts that ensure good occupancy and rely on caches to handle the variation. The problem with this is that GPU caches are typically very small relative to register files, and thus relying on them to help manage the dynamic working set of a shader is generally a poor plan. That said, things are starting to improve on this front.
  16. In NVIDIA’s recently announced Volta architecture, they have merged their shared memory and L1$. The new block of memory is now about half the size of the register file, which is starting to get interesting in terms of handling those rare but expensive shader paths.
17. They also demonstrated that, increasingly, you only need to use shared memory explicitly if you are doing heavily banked gather/scatter or atomics. This is a benefit both in terms of occupancy and in terms of some other issues with the compute model that we’ll talk about shortly.
  18. The takeaway here is that we need some hardware improvements in this area. Volta appears to be a step in the right direction, but it is by no means the only way to improve things. If we’re going to write increasingly complex code we need tools like dynamic dispatch and separate compilation, and we can’t continue to be crippled by the worst case reachable path through a shader or function call.
  19. Next, let’s look at another case study that comes up commonly in rendering: hierarchical data structures and algorithms.
20. If we want to do “smarter” rendering, hierarchical acceleration structures are a great tool to have. They already show up all over the place in graphics: mipmaps, octrees and BVHs are all examples. While using these structures is fairly straightforward, constructing them is inherently nested parallel. Data needs to be propagated up the tree, down the tree or sometimes both in sequence. These structures are conventionally constructed on the CPU and then consumed by the GPU. Going forward there are several reasons that we want to construct them on the GPU as well. The most important is latency: often we generate data on the GPU and want to construct some sort of acceleration structure on the fly. Reading the data back to the CPU and trying to cover the associated latency with other GPU work is not always possible. Another consideration is that on current consoles the GPU has the lion’s share of the FLOPS budget, so certain algorithms might not run reasonably on the CPU at all.
  21. To give a simple example, we’ve known for a little while that tiled GPU light culling algorithms are fairly inefficient. They duplicate a lot of calculations and generally fail to take advantage of the large amounts of coherency present in the distribution of lights on the screen. At SIGGRAPH 2012 I showed that a relatively modest CPU could outperform a high end GPU at light culling by using a simple quad tree.
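A sketch of the core quadtree idea: each light is pushed down to the level whose node size roughly matches its screen-space extent, so a large light is tested once near the root instead of in every tile it touches. The size thresholds here are illustrative, not taken from the SIGGRAPH 2012 implementation:

```cpp
#include <cassert>
#include <cstdint>

// Pick the quadtree level for a light given its projected screen-space
// radius (in normalized [0,1] screen units). Level 0 is the root covering
// the whole screen; each level halves the node size.
uint32_t levelForLight(float screenRadius, uint32_t maxLevel) {
    uint32_t level = 0;
    float nodeSize = 1.0f;
    while (level < maxLevel && nodeSize * 0.5f >= screenRadius) {
        nodeSize *= 0.5f;
        ++level;
    }
    return level;
}
```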
  22. Earlier this year Yuriy and Matthäus presented “Tiled Light Trees”, which create a per-tile bounding interval hierarchy of lights to further accelerate intersection tests. As with the previous example, the data structure is created on the CPU and consumed by the GPU.
  23. Let’s look at how we might express these algorithms via GPU compute. A key component of tree algorithms is that the width of parallelism changes as we travel up and down the tree. This is true whether we’re doing a more bottom up reduction style algorithm or a top-down recursion as shown here. In either case, to change the number of threads running on the GPU we currently only have one option: issuing a new dispatch. Between dispatches we actually need to save and reload any live state from main memory, which carries some overhead. Some algorithms need to return back up the tree as well, creating even more state that we need to save and restore. Furthermore since we don’t know the dynamic control flow path of the reduction on the CPU, we either have to dispatch the “worst case” number of threads and terminate them early, or do an indirect dispatch on the GPU. The latter generally requires an additional analysis pass between each dispatch which introduces even more overhead.
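The per-level dispatch structure looks something like this from the CPU side; `dispatch` and `barrier` are commented stand-ins for the real API calls, and the function just records the size of each launch:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A quadtree has 4^n nodes at level n.
uint32_t nodesAtLevel(uint32_t level) { return 1u << (2u * level); }

std::vector<uint32_t> buildTreeDispatches(uint32_t levels) {
    std::vector<uint32_t> dispatchSizes;
    for (uint32_t level = 0; level < levels; ++level) {
        // dispatch(nodesAtLevel(level)); // one thread group per node
        // barrier(); // full drain: every node in this level is forced to
        //            // "depend" on every node in the previous level
        dispatchSizes.push_back(nodesAtLevel(level));
    }
    return dispatchSizes;
}
```

Live state has to round-trip through main memory between each of those launches, which is exactly the overhead described above.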
  24. These problems manifest as poor GPU performance. Dispatches higher in the quadtree do not fill the machine at all as there simply isn’t enough work. More seriously, between dispatches the machine has to drain and spin back up which results in lots of wasted cycles. Recently I’ve learned to put a positive spin on this and call these gaps “opportunities for async compute”. Indeed concurrent compute work can help, but doesn’t excuse the underlying inefficiency. Ultimately the synchronization here is too coarse-grained for the real dependencies. We’re effectively forced to specify that every node in one level depends on every node in the previous level, which is obviously over-constrained.
25. Unfortunately, dispatch is only one part of the problem. Let’s take a quick look at the kernel code itself that we execute for each node. To iterate within a block we have to write code that looks something like this. Since we need to pass data from one iteration to the next via shared memory, we need a barrier in the loop. This in turn requires that we track threads that have hit the base case manually, since we are not allowed to diverge at all around a barrier. There are a number of problems with this style of code. First, our block size is baked into the kernel, which means we need to recompile if we want to vary it. That’s particularly annoying since the optimal block sizes are highly dependent on the specific GPU, and may not line up nicely with the optimal data structure granularity. Along with any shared memory use also come occupancy concerns, which are even worse in algorithms that need a full stack.
  26. Being forced to only have barriers in converged code means that we’re going to waste a lot of cycles doing useless work. All threads – whether they are needed by the current level or not – have to execute the loop, branch and control logic just to stay in sync with the rest of the group. Even if they will no longer do any useful work, we can’t terminate or otherwise repack threads. Thus they just continue to waste time and hardware waves doing useless work until the entire thread group is complete.
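The waste is easy to see in a sequential emulation of this pattern: threads that hit the base case flip a done flag but keep spinning through the loop so the group stays converged around the (commented) barrier. The per-thread work counts are arbitrary:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulate one thread group where each "thread" needs work[t] loop
// iterations. Returns how many useless trips finished threads make.
int traverseGroup(std::vector<int> work) {
    std::vector<bool> done(work.size(), false);
    int wastedIterations = 0;
    bool anyActive = true;
    while (anyActive) {
        anyActive = false;
        for (std::size_t t = 0; t < work.size(); ++t) {
            if (done[t]) { ++wastedIterations; continue; } // idles, can't exit
            if (--work[t] <= 0) done[t] = true;            // hit the base case
            else anyActive = true;
        }
        // barrier(); // every thread must reach this; diverged exits are illegal
    }
    return wastedIterations;
}
```

With one thread needing a single iteration and its neighbor needing three, the finished thread burns two useless trips through the loop, and it only gets worse at scale.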
  27. Using shared memory and synchronization also breaks nesting. The base and subdivide cases are unable to make any real use of shared memory or barriers themselves since they aren’t at the outer-most scope. Similarly this code can’t easily be called as part of another algorithm.
  28. So how do we avoid all of this inefficiency in practice?
29. Here’s an example compute shader written in a fairly portable style. Now what if we actually want to make this run fast? Luckily, Graham showed us how to do this at GDC last year, but you’ll find the same sort of thing in optimized compute code. Setting numthreads to the hardware warp size ensures that branching on threadId is only wasting SIMD lanes, not entire hardware warps, occupancy, etc. By removing shared memory use we no longer need barriers. Since we’re only running one warp per group, we can use warp operations and swizzling instead to share data. This is actually a relatively simple example since the kernel itself is agnostic to the group size. In most cases, changing the thread group size to match the hardware instead of the data also implies adding loops and bounds checks everywhere as well. What we’re really doing here is bypassing the compute shader execution model abstraction and instead writing code directly to the warps of the GPU. This is all well and good when we’re writing code for a specific console, but what if we want to write fast compute shaders in a more portable way?
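The warp-synchronous reduction pattern can be emulated on the CPU like this; `kWarpSize` is a hardware-dependent assumption, and the rotating neighbor read mimics what a shuffle/swizzle op does across lanes, with no shared memory or barriers:

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32; // assumption: varies by vendor and architecture

int warpReduceSum(std::array<int, kWarpSize> lanes) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        auto prev = lanes; // every lane reads pre-step values, like a shuffle
        for (int lane = 0; lane < kWarpSize; ++lane)
            lanes[lane] = prev[lane] + prev[(lane + offset) % kWarpSize];
    }
    return lanes[0]; // after log2(kWarpSize) steps every lane holds the sum
}
```

The hardcoded width is exactly the portability problem: the moment `kWarpSize` doesn't match the hardware, the code is silently wrong or slow.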
30. Unfortunately, writing fast compute shaders that are also portable just isn’t really possible right now. First, hardware SIMD widths vary, sometimes even between different kernels running on the same piece of hardware. Second, getting the “fast path” with no barriers and appropriate use of scalar and vector operations relies on a number of somewhat brittle compiler optimization steps. Beyond these practical issues, writing warp-synchronous code and omitting barriers is not spec compliant, even if it happens to work on the specific driver and hardware in your dev box. Similarly, there’s no legal way for any warp to ever wait for another one outside of barriers, so any sort of “locks” or producer/consumer patterns are not allowed. If writing the portable version was giving up 10 or 20% performance we’d probably just suck it up and write two paths, but in a lot of cases it can be an order of magnitude slower, which is unworkable. Given these realities, it’s worth asking whether the current compute models – and in particular their abstraction over SIMD widths – are actually doing more harm than good at this point.
31. Along these lines, one direction to go for games is to make SIMD a bit more explicit instead of implicit. If we give up on the SIMD width abstraction, as well as shared memory, groups and barriers, we can easily come up with a much cleaner and more portable syntax. Borrowing a bit from ISPC, you can create a pretty clean language based on nested parallel foreach statements. The code here is just an example – it doesn’t have to be this exact syntax or semantics, but the point is we should really be exploring some of these avenues given the issues with the current compute execution models.
32. In terms of dispatch, let’s think about how we might want to express the algorithm in a more direct manner. I’ve borrowed a Cilk-like syntax for spawning recursive work here, but you can easily substitute TBB, PPL or a home-grown work-stealing task scheduler to your liking. The key point here is that this *style* of code is fine for a massively data-parallel processor like a GPU. Unfortunately, if you try to write this in HLSL you’re going to get a rude awakening.
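Sketched on the CPU with std::async standing in for the Cilk-style spawn (the spawn granularity and the trivial node payload are illustrative), the recursive structure reads the way we would like to write it:

```cpp
#include <cassert>
#include <future>

// Count the nodes of a quadtree by spawning the four children in parallel
// and joining them: the shape of a spawn/sync tree traversal.
int processNode(int depth, int maxDepth) {
    if (depth == maxDepth) return 1;                 // base case: leaf
    std::future<int> children[4];
    for (auto& f : children)                         // "spawn"
        f = std::async(std::launch::async, processNode, depth + 1, maxDepth);
    int total = 1;                                   // this node's own work
    for (auto& f : children) total += f.get();       // "sync"
    return total;
}
```

The width of parallelism grows and shrinks with the tree naturally, with no per-level dispatches and no manual state save/restore through main memory.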
  33. But… you actually can write something similar to this code in CUDA, which is pretty awesome.
34. CUDA cooperative groups are another interesting direction for improvements. They attack the composability and warp width abstraction problems by generalizing thread groups into an explicit nested-parallel-style primitive that can be subdivided, synchronized and used to share data safely. This allows writing reusable code that is relatively independent of the calling context, both in terms of the nature of the cooperating threads and in terms of any dynamic control flow at the call site. The abstraction extends beyond warps and thread groups to cooperation across the whole GPU, or even between multiple GPUs. While the syntax may or may not be a good fit for graphics, the fundamental hardware capabilities that it exposes are really required for writing fast, portable compute code in the future.
  35. In addition to the larger items, there’s a bunch of smaller items that inhibit our ability to write reusable code that we should just fix ASAP.
  36. UAV counters and their weird semantics in DirectX are a perfect example of a feature that might have felt like a good idea at the time when an IHV had some shiny new hardware, but the semantics are strange, and now we’ll have to replicate them long after the underlying hardware has changed.
  37. This is a common meme, but it’s an appropriate one. We’ve convinced ourselves that the compute ecosystem is workable, as long as we just work around a bunch of minor things. But in the back of our minds when we’re rewriting an optimized reduction for the 100th time, we’re at least vaguely aware that things might not actually be fine. Hopefully most of us are ready to admit there are problems, as we all have a part in fixing this.
  38. Whether or not we make heavy use of shared virtual memory, it needs to be supported for orthogonality.
39. Recently the industry made a large shift toward more explicit graphics APIs with data-parallel intermediate languages. I hope I’ve convinced you that the next big problem for the industry to solve in this area is compute. We really want to take real-time graphics further, and realistically we’re not going to get to things like 8K VR rendering with brute-force algorithms. We’re also not going to get anywhere by finger-pointing about whose responsibility it is to fix this. We all need to work together.