5. Control
New model
Traditional Model: Black Box
‒ Middle-ground abstraction – compromise between performance & “usability”
‒ Hidden resource memory & state
‒ Resource CPU access tied to device context
‒ Driver analyzes & synchronizes implicitly
Explicit Model: Mantle
‒ Thin low-level abstraction to expose how hardware works
‒ App explicit memory management
‒ Resources are globally accessible
‒ App explicit resource state transitions
6. Control
App responsibility
Tell when render target will be used as a texture
‒ And many more resource state transitions
Don’t destroy resources that GPU is using
‒ Keep track with fences or frames
Manual dynamic resource renaming
‒ No DISCARD for driver resource renaming
Resource memory tiling
Powerful validation layer will help!
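One way to keep track "with fences or frames" is to queue destruction requests per frame and release them only after the GPU has signaled that frame as complete. The sketch below simulates the idea in plain Python. `DeferredDeleter`, `destroy_later`, `retire` and the frame counter are hypothetical names for illustration, not Mantle API.

```python
from collections import deque


class DeferredDeleter:
    """Delays destruction until the GPU has finished the frame that last used a resource."""

    def __init__(self):
        self._pending = deque()   # (frame_index, resource), oldest first
        self.destroyed = []       # stand-in for actually freeing GPU memory

    def destroy_later(self, resource, current_frame):
        # Record the frame in which the app stopped using the resource.
        self._pending.append((current_frame, resource))

    def retire(self, last_completed_frame):
        # Call once a fence shows the GPU has finished `last_completed_frame`.
        while self._pending and self._pending[0][0] <= last_completed_frame:
            _, resource = self._pending.popleft()
            self.destroyed.append(resource)


deleter = DeferredDeleter()
deleter.destroy_later("shadow_map", current_frame=10)
deleter.destroy_later("old_vertex_buffer", current_frame=11)

deleter.retire(last_completed_frame=9)    # GPU still behind: nothing is freed
freed_early = list(deleter.destroyed)
deleter.retire(last_completed_frame=10)   # frame 10 done: shadow_map can go
freed_after_10 = list(deleter.destroyed)
```

The same queue also works for manual renaming: a renamed buffer is simply handed to `destroy_later` (or to a reuse pool) instead of being overwritten in place.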
7. Control
Explicit control enables
App high-level decisions & optimizations
‒ Has full scene information
‒ Easier to optimize performance & memory
Flexible & efficient memory management
‒ Linear frame allocators
‒ Memory pools
‒ Pinned memory
Reduced development time
‒ For advanced game engines & apps
‒ Easier to get to target performance & robustness
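A linear frame allocator is the simplest of the strategies listed above: bump an offset through one pre-allocated block and reset it once per frame. The sketch below is a minimal Python model of that idea. The class name, the 256-byte default alignment and the capacity are assumptions for illustration only.

```python
class LinearFrameAllocator:
    """Bump allocator over one fixed-size memory block, reset once per frame."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0

    def allocate(self, size, alignment=256):
        # Round the current offset up to the requested alignment.
        start = (self.offset + alignment - 1) // alignment * alignment
        if start + size > self.capacity:
            raise MemoryError("frame allocator exhausted")
        self.offset = start + size
        return start

    def reset(self):
        # Only safe once the GPU has finished reading this frame's data.
        self.offset = 0


frame_memory = LinearFrameAllocator(capacity=1024)
constants_offset = frame_memory.allocate(100)   # starts at 0
vertices_offset = frame_memory.allocate(300)    # rounded up to the next 256 boundary
frame_memory.reset()
reused_offset = frame_memory.allocate(64)       # memory is reused from the start
```

In practice an app would keep one such allocator per in-flight frame, so that a reset never touches memory the GPU is still reading.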
8. Control
Explicit control enables
Transient resources
‒ Alias render targets within frame
‒ Major memory savings
‒ No need to pre-allocate everything
Light-weight driver
‒ Easier to develop & maintain
‒ Reduced CPU draw call overhead
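Aliasing render targets within a frame can be sketched as interval packing: two targets may share memory when the pass ranges in which they are live do not overlap. The greedy Python sketch below assumes the app knows each target's first and last pass. The target names, sizes and pass indices are made up, and a real allocator would also handle alignment and resource state transitions.

```python
def alias_render_targets(targets):
    """Greedy aliasing: targets whose pass ranges do not overlap share one memory slot.

    `targets` is a list of (name, size, first_pass, last_pass).
    Returns (name -> slot index, list of slot sizes).
    """
    slots = []        # each slot: [size, last_pass_in_use]
    placement = {}
    for name, size, first, last in sorted(targets, key=lambda t: t[2]):
        for index, slot in enumerate(slots):
            if slot[1] < first:              # slot is idle again: reuse it
                slot[0] = max(slot[0], size)
                slot[1] = last
                placement[name] = index
                break
        else:
            slots.append([size, last])       # no idle slot: open a new one
            placement[name] = len(slots) - 1
    return placement, [slot[0] for slot in slots]


# Hypothetical frame: pass indices are the order in which passes execute.
targets = [
    ("gbuffer_albedo", 8, 0, 2),
    ("ssao", 4, 1, 2),
    ("bloom", 4, 3, 4),
    ("tonemap", 8, 4, 5),
]
placement, slot_sizes = alias_render_targets(targets)
aliased_total = sum(slot_sizes)
unaliased_total = sum(size for _, size, _, _ in targets)
```

With these invented numbers the four targets fit into two slots instead of four separate allocations.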
11. CPU perf
Descriptor sets
Table with resource references to bind to graphics or compute pipeline
[Diagram: descriptor set slot types: Image, Memory, Sampler, Link]
Replaces traditional resource stage binding
‒ Major performance & flexibility advantage
‒ Closer to how the hardware works
App managed - lots of strategies possible!
‒ Tiny vs huge sets
‒ Single vs multiple
‒ Static vs semi-static vs dynamic
Example 1: Single simple dynamic descriptor set
‒ Bind everything you need for a single draw call
‒ Close to DX/GL model but share between stages
[Diagram: Dynamic descriptor set with entries VertexBuffer (VS), Texture0 (VS+PS), Constants (VS), Texture1 (PS), Texture2 (PS), Sampler0 (VS+PS)]
12. CPU perf
Descriptor sets
Example 2: Reuse static set with nesting
‒ Reduce update time & memory usage
[Diagram: a dynamic descriptor set linked to a static descriptor set; entries shown: Constants (VS), Link, VertexBuffer (VS), Texture0 (VS+PS), Texture1 (PS), Texture2 (PS), Texture3 (PS), Texture4 (PS), Sampler0 (VS+PS), Sampler1 (PS)]
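The nesting idea can be modeled as a table whose slots hold either a resource or a link to another table. The Python sketch below is only a conceptual model. `DescriptorSet`, `resolve` and `per_draw_set` are hypothetical names, and which entries go into the static set versus the dynamic set is an assumption, not taken from the slide.

```python
class DescriptorSet:
    """Table of slots; a slot holds a resource name or a link to another set."""

    def __init__(self, slots):
        self.slots = list(slots)

    def resolve(self):
        # Flatten links depth-first, the way a shader would follow them.
        resources = []
        for slot in self.slots:
            if isinstance(slot, DescriptorSet):
                resources.extend(slot.resolve())
            else:
                resources.append(slot)
        return resources


# Built once and shared by every draw that uses this material.
static_set = DescriptorSet(["Texture0", "Texture1", "Sampler0"])


def per_draw_set(constants, vertex_buffer):
    # Only two slots change per draw; the third links to the static set.
    return DescriptorSet([constants, vertex_buffer, static_set])


draw_a = per_draw_set("ConstantsA", "VertexBufferA")
draw_b = per_draw_set("ConstantsB", "VertexBufferB")
```

Each per-draw set writes three slots rather than five, which is where the reduced update time and memory usage would come from in this model.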
13. CPU perf
Monolithic pipelines
Shader stages & select graphics state combined into single object
‒ No runtime compilation or patching needed!
‒ Significantly less runtime overhead to use
Supports parallel building & caching
‒ Fast loading times
Usage & management up to the app
‒ Static vs dynamic creation
‒ Number of pipelines
‒ State usage
[Diagram: pipeline state object spanning the stages IA, VS, HS, Tessellator, DS, GS, RS, PS, DB, CB]
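Because a monolithic pipeline is fully described up front, an app can hash the description, build missing pipelines on worker threads and cache the results. The Python sketch below shows that structure only. `pipeline_key`, `build_pipeline` and `build_all` are hypothetical helpers, and the build step is a stand-in for the real compile and link work.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def pipeline_key(description):
    # Stable key derived from shader names and fixed-function state.
    text = repr(sorted(description.items()))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_pipeline(description):
    # Stand-in for the expensive compile/link step.
    return {"key": pipeline_key(description), "state": dict(description)}


def build_all(descriptions, cache, workers=4):
    """Builds every pipeline not yet in `cache`; returns how many were built."""
    missing = {}
    for description in descriptions:
        key = pipeline_key(description)
        if key not in cache and key not in missing:
            missing[key] = description
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pipeline in pool.map(build_pipeline, missing.values()):
            cache[pipeline["key"]] = pipeline
    return len(missing)


cache = {}
descriptions = [
    {"vs": "mesh_vs", "ps": "lit_ps", "depth_test": True},
    {"vs": "mesh_vs", "ps": "shadow_ps", "depth_test": True},
    {"vs": "mesh_vs", "ps": "lit_ps", "depth_test": True},   # duplicate of the first
]
built_first = build_all(descriptions, cache)    # builds the two unique pipelines
built_second = build_all(descriptions, cache)   # everything is already cached
```

Persisting `cache` to disk between runs is one way the "fast loading times" point could be realized.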
14. CPU perf
Command buffers
Issue pipelined graphics & compute commands into a command buffer
‒ Bind graphics state, descriptor sets, pipeline
‒ Draw calls
‒ Render targets
‒ Clears
‒ Memory transfers
‒ NOT: resource mapping
Fully independent objects
‒ Create multiple every frame
‒ Or pre-build up front and reuse
15. CPU perf
DX/GL parallelism
[Diagram: CPU 0 to CPU 2 timelines showing Game, Render and Driver Render work]
Automatically extracts parallelism out of most apps
Doesn’t scale beyond 2-3 cores
Additional latency
Driver thread often bottleneck – can collide with app threads
16. CPU perf
Parallel dispatch with Mantle
[Diagram: CPU 0 runs Game; CPU 1 to CPU 4 each run Render]
App can go fully wide with its rendering – minimal latency
Close to linear scaling with CPU cores
No driver threads – no overhead – no contention
Frostbite’s approach on all consoles – and on PC with Mantle!
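Going wide works because command buffers are fully independent objects: each worker records into its own buffer, and the app submits the buffers in a fixed order. The Python sketch below shows that shape only. The command tuples and function names are hypothetical, and Python threads will not show real scaling because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_buffer(draws):
    # Each worker records into its own command buffer: no shared state, no locks.
    commands = []
    for draw in draws:
        commands.append(("bind_pipeline", draw["pipeline"]))
        commands.append(("draw", draw["mesh"]))
    return commands


def record_frame(draws, workers):
    # Split the draw list into contiguous chunks, one per worker.
    chunk = max(1, (len(draws) + workers - 1) // workers)
    chunks = [draws[i:i + chunk] for i in range(0, len(draws), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so submission order stays deterministic.
        return list(pool.map(record_command_buffer, chunks))


draws = [{"pipeline": "opaque", "mesh": f"mesh{i}"} for i in range(10)]
command_buffers = record_frame(draws, workers=4)

# "Submit": concatenate the buffers in order, as a queue would execute them.
submitted = [command for buffer in command_buffers for command in buffer]
```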
18. GPU perf
GPU optimizations
Thanks to improved CPU performance – CPU will rarely be a bottleneck for the GPU
‒ CPU could help GPU more:
‒ Less brute force rendering
‒ Improve culling
Resource states
‒ Gives driver a lot more knowledge & flexibility
‒ Apps can avoid expensive/redundant transitions, such as surface decompression
Shader pipeline object – driver optimizations
‒ Can optimize with pipeline state knowledge
‒ Can optimize across all shader stages
Expose existing GPU functionality
‒ Quad & Rect-lists
‒ HW-specific MSAA & depth data access
‒ Programmable sample patterns
‒ And more..
19. GPU perf
Queues
Modern GPUs are heterogeneous machines with multiple engines
‒ Graphics pipeline
‒ Compute pipeline(s)
‒ DMA transfer
‒ Video encode/decode
‒ More…
Mantle exposes queues for the engines + synchronization primitives
[Diagram: GPU with queues: Graphics, Compute, DMA, ...]
21. GPU perf
Queue use cases
Async DMA transfers
‒ Copy resources in parallel with graphics or compute
[Diagram: DMA queue: Copy. Graphics queue: Render, Other render, Use copy]
22. GPU perf
Queue use cases
Async compute together with graphics
‒ ALU heavy compute work at the same time as memory/ROP bound work to utilize idle units
[Diagram: Compute and Graphics queue timelines; passes shown: GBuffer, Non-shadowed lighting, Shadowmap 0, Shadowmap 1, Final lighting]
23. GPU perf
Queue use cases
Multiple compute kernels collaborating
‒ Can be faster than über-kernel
‒ Example: Compute geometry backend & compute rasterizer
[Diagram: queues Compute 0, Compute 1 and Graphics; work shown: Compute Geometry, Compute Rasterizer, Ordinary Rendering]
24. GPU perf
Queue use cases
Compute as frontend for graphics pipeline
‒ Compute runs asynchronously ahead and prepares & optimizes geometry for graphics pipeline
[Diagram: Compute and Graphics queue timelines; work shown: Process0, Process1, Draw0, Draw1, Draw2]
25. GPU perf
Queue use cases
Async DMA transfers
‒ Copy resources in parallel with graphics or compute
Multiple compute kernels collaborating
‒ Can be faster than über-kernel
‒ Example: Compute geometry backend & compute rasterizer
Async compute together with graphics
‒ ALU heavy compute work at the same time as memory/ROP bound work to utilize idle units
Compute as frontend for graphics pipeline
‒ Compute runs asynchronously ahead and prepares & optimizes geometry for graphics pipeline
Game engines will build large GPU job graphs
‒ Move away from single sequential submission
‒ Just as we already have done on CPU
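A GPU job graph can be modeled as jobs with a queue, a duration and dependencies. Jobs on one queue run in order, and cross-queue dependencies play the role of the synchronization primitives. The Python sketch below computes start and end times for such a graph. The pass names are borrowed from the async compute slide, but the queue assignment and durations are invented for illustration.

```python
def schedule(jobs):
    """jobs: name -> (queue, duration, [dependencies]), listed with dependencies first.

    Jobs on the same queue run in listing order; a job also waits for its
    dependencies, which may sit on other queues. Returns name -> (start, end).
    """
    queue_free = {}   # queue name -> time at which the queue becomes idle
    times = {}
    for name, (queue, duration, deps) in jobs.items():
        ready = max((times[dep][1] for dep in deps), default=0)
        start = max(ready, queue_free.get(queue, 0))
        times[name] = (start, start + duration)
        queue_free[queue] = start + duration
    return times


jobs = {
    "gbuffer":              ("graphics", 4, []),
    "shadowmap0":           ("graphics", 3, []),
    "shadowmap1":           ("graphics", 3, []),
    "nonshadowed_lighting": ("compute",  5, ["gbuffer"]),
    "final_lighting":       ("graphics", 2, ["shadowmap1", "nonshadowed_lighting"]),
}
times = schedule(jobs)
frame_time = max(end for _, end in times.values())
serial_time = sum(duration for _, duration, _ in jobs.values())
```

In this toy graph the compute job overlaps the shadow map passes, so the frame ends earlier than a single sequential submission of the same work would.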
27. Programmability
Explicit Multi-GPU
Explicit control of GPU queues and synchronization, finally!
‒ Implement your own Alternate-Frame-Rendering
‒ Or something more exotic..
Use case: Workstation rendering with 4-8 GPUs
‒ Super high-quality rendering & simulation
‒ Load balance graphics & compute job graphs across GPUs
‒ 20-40 TFlops in a single machine!
Use case: Low-latency rendering
‒ Important for VR and competitive games
‒ Latency optimized GPU job graph scheduling
‒ VR: Simultaneously drive 2 GPUs (1 per eye)
28. Programmability
New mechanisms
Command buffer predication & flow control
‒ GPU affecting/skipping submitted commands
‒ Go beyond DrawIndirect / DispatchIndirect
‒ Advanced variable workloads
‒ Advanced culling optimizations
Write occlusion query results into GPU buffer
‒ No CPU roundtrip needed
‒ Can drive predicated rendering
‒ Or use results directly in shaders (lens flares)
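Predicated rendering driven by occlusion queries can be sketched as follows: queries write their results into a GPU buffer, and later draws carry a predicate slot that the command processor checks, with no CPU readback in between. The Python simulation below uses hypothetical function names and an invented command layout; it is not Mantle's actual interface.

```python
def run_occlusion_queries(objects, query_buffer):
    # GPU side: each query writes its visible-sample count into the buffer.
    for index, obj in enumerate(objects):
        query_buffer[index] = obj["visible_samples"]


def execute(commands, query_buffer):
    # The command processor skips a draw when its predicate slot reads zero.
    executed = []
    for command in commands:
        slot = command.get("predicate_slot")
        if slot is not None and query_buffer[slot] == 0:
            continue
        executed.append(command["draw"])
    return executed


objects = [
    {"name": "tank", "visible_samples": 120},
    {"name": "crate", "visible_samples": 0},      # fully occluded
    {"name": "soldier", "visible_samples": 35},
]
query_buffer = [0] * len(objects)

# The CPU records these commands before the query results exist.
commands = [{"draw": obj["name"], "predicate_slot": i} for i, obj in enumerate(objects)]
commands.append({"draw": "sky"})                  # unpredicated draw

run_occlusion_queries(objects, query_buffer)
executed = execute(commands, query_buffer)
```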
29. Programmability
Bindless resources
Mantle supports bindless resources
‒ Shaders can select resources to use instead of
static binding from CPU
‒ Extension of the descriptor set support
Examples
‒ Performance optimizations – less data to update
‒ Logic & data structures that live fully on the GPU
‒ Scene culling & rendering
‒ Material representations
Key component that will open up a lot of
opportunities!
‒ Deferred shading
‒ Raytracing
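The bindless idea, where the shader selects its resources from data instead of the CPU binding them per draw, can be modeled with a resource table and per-material indices that both live in GPU memory. The Python sketch below is a conceptual stand-in. The table contents, the `albedo_index` field and the `shade` function are all hypothetical.

```python
# Both tables would live in GPU memory and be written rarely.
resource_table = ["brick_albedo", "metal_albedo", "grass_albedo"]

materials = [
    {"albedo_index": 2},
    {"albedo_index": 0},
]


def shade(material_id):
    # "Shader" code: pick the texture through an index read from GPU-side data.
    material = materials[material_id]
    return resource_table[material["albedo_index"]]


# One submission covers draws with different materials; no per-draw rebinding.
draw_material_ids = [0, 1, 0]
sampled = [shade(material_id) for material_id in draw_material_ids]
```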
31. Platforms
Today
Mantle gives us strong benefits on Windows today
‒ Console-like performance & programmability on both Windows 7 and Windows 8
‒ For us, well worth the dev time!
DX & GL are the industry standards
‒ Needed for platforms that do not support Mantle
‒ Needed by devs who do not want/need more control
‒ Have to have fallback paths for GL/DX, but should not limit oneself to them
Mantle and PlayStation 4 will drive our future Frostbite designs & optimizations
‒ PS4 graphics API has great programmability & performance as well
‒ Share concepts, methods & optimization strategies
32. Platforms
Linux & Mac
Want to see Mantle on Linux and Mac!
‒ Would enable support for our full engine & rendering
‒ Significantly easier to build an efficient renderer with Mantle than with OpenGL
Use cases:
‒ Workstations
‒ R&D
‒ Not limited by WDDM
‒ Games
‒ Mantle + SteamOS = powerful combination!
33. Platforms
Mobile
Mobile architectures are getting closer in capabilities to desktop GPUs
Want graphics API that allows apps to fully utilize the hardware
‒ Power efficient
‒ High performance
‒ Programmable
Major opportunity with Mantle – leapfrog GL4, DX11
‒ For mobile SoC vendors
‒ For Google and Apple
34. Platforms
Multi-vendor?
Mantle is designed to be a thin hardware abstraction
‒ Not tied to AMD’s GCN architecture
‒ Forward compatible
‒ Extensions for architecture- and platform-specific functionality
Mantle would be a much more efficient graphics API for other vendors as well
‒ Most Mantle functionality can be supported on today’s modern GPUs
Want to see future version of Mantle supported on all platforms and on all modern GPUs!
‒ Become an active industry standard with IHVs and ISVs collaborating
‒ Enable us developers to innovate with great performance & programmability everywhere
36. Frostbite
Battlefield 4
Mantle support is in development
‒ Core renderer (closer to PS4 than DX11)
‒ Implement all rendering techniques used in BF4 (many!)
‒ CPU optimizations (parallel dispatch, descriptor sets)
‒ GPU optimizations (minimize transitions, MSAA)
‒ R&D for advanced GPU optimizations
‒ Memory management
‒ Multi-GPU support
‒ ~2 months of work
Update targeting late December
37. Frostbite
Plants vs Zombies: Garden Warfare
Very different rendering compared to BF4
Frostbite Mantle renderer will work out of the box
Focus on APU performance
38. Frostbite
Future
All Frostbite games designed with Mantle
‒ 15 games in development across all of EA
Advanced Mantle rendering & use cases
‒ Lots of exciting R&D opportunities!
Want multi-vendor & multi-platform support!
So what is Mantle? Mantle is a low-level graphics API. Its goals are to improve performance, make it easier to develop these really advanced applications, and give developers a lot of freedom to build innovative graphics solutions. And it is a bit of a challenge to the established order of things, which I think is fun and healthy for the industry.
We’ve been working with Mantle for some time now and adding support in our engine Frostbite and Battlefield 4. And I wanted to share some of our learnings and what Mantle can mean in general for developers
Can be of any size
‒ Single per-draw call small dynamic descriptor set
‒ Static + dynamic
‒ Multiple ones nested by update frequency
Not needed for: resource mapping
No implicit pipeline flushing
Much easier to track down stalls in the app itself
Also foveated rendering
Need to go beyond HLSL for pointer support in shaders