Advanced Scenegraph Rendering Pipeline
Markus Tavenrath – NVIDIA – matavenrath@nvidia.com
Christoph Kubisch – NVIDIA – ckubisch@nvidia.com
2
 Traditional approach is to render while
traversing a SceneGraph
 Scene complexity increases
– Deep hierarchies, traversal expensive
– Large objects split up into many
little pieces, increased draw call count
– Unsorted rendering, many state
changes
 CPU becomes the bottleneck when
rendering those scenes
SceneGraph Rendering
models courtesy of PTC
Introduction SceneGraph SceneTree ShapeList Renderer
3
Overview
[Diagram: pipeline stages SceneGraph -> SceneTree -> ShapeList -> Renderer. The SceneGraph (G0, G1, T0 to T3, S0 to S2) is expanded into a SceneTree with the additional instances G1‘, T2‘, T3‘, S1‘, S2‘; the ShapeList and the RendererCache hold S0, S1, S2, S1‘, S2‘. Gi = Group, Ti = Transform, Si = Shape]
4
SceneGraph
[Diagram: SceneGraph with groups G0, G1, transforms T0 to T3 and shapes S0 to S2]
 SceneGraph is DAG
 No unique path to a node
– Cannot efficiently cache path-dependent data per node
 Traversal runs over 14 nodes for rendering.
 Processed 6 Transform Nodes
– 6 matrix/matrix multiplications and inversions
 Nodes are usually ‚large‘ and not linear in memory
– Each node access generates at least one cache miss,
most likely several
Gi = Group, Ti = Transform, Si = Shape
5
SceneTree construction
[Diagram: the SceneGraph and the SceneTree built from it; the subgraph below G1 is duplicated in the tree as G1‘, T2‘, T3‘, S1‘, S2‘. Gi = Group, Ti = Transform, Si = Shape]
G0 -> (G0)
T0 -> (T0)
T1 -> (T1)
S0 -> (S0)
G1 -> (G1)
T2 -> (T2)
S1 -> (S1)
T3 -> (T3)
S2 -> (S2)
G0 -> (G0)
T0 -> (T0)
T1 -> (T1)
S0 -> (S0)
G1 -> (G1,G1‘)
T2 -> (T2,T2‘)
S1 -> (S1,S1‘)
T3 -> (T3,T3‘)
S2 -> (S2,S2‘)
 Observer based
synchronization
6
 SceneTree has unique path to each
node
 Store accumulated attributes like
transforms or visibility in each Node
 Trade memory for performance
– 64 bytes per node, 100k nodes ~6 MB
– Transforms stored in a separate vector
 Traversal still processes 14 nodes.
[Diagram: SceneTree with the shared subgraph duplicated as G1‘, T2‘, T3‘, S1‘, S2‘. Gi = Group, Ti = Transform, Si = Shape]
SceneTree
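The expansion of a SceneGraph (DAG) into a SceneTree with accumulated per-instance data can be sketched as follows. This is only an illustration under simplifying assumptions: the types GraphNode and TreeNode and the functions expand and buildSceneTree are hypothetical names, and a transform is reduced to a single offset instead of a matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node types: a transform is reduced to one
// translation value so the example stays short.
struct GraphNode {
    std::string name;
    float localOffset = 0.0f;           // 0 for groups and shapes
    std::vector<std::size_t> children;  // indices into the graph array (DAG)
};

struct TreeNode {
    std::size_t graphNode;  // SceneGraph node mirrored by this instance
    int parent;             // index of the parent TreeNode, -1 for the root
    float worldOffset;      // accumulated along the unique path from the root
};

// Expand the DAG below `node`; subgraphs with several parents are duplicated.
void expand(const std::vector<GraphNode>& graph, std::size_t node, int parent,
            float parentOffset, std::vector<TreeNode>& tree) {
    const float world = parentOffset + graph[node].localOffset;
    tree.push_back({node, parent, world});
    const int self = static_cast<int>(tree.size()) - 1;
    for (std::size_t child : graph[node].children)
        expand(graph, child, self, world, tree);
}

std::vector<TreeNode> buildSceneTree(const std::vector<GraphNode>& graph,
                                     std::size_t root) {
    assert(root < graph.size());
    std::vector<TreeNode> tree;
    expand(graph, root, -1, 0.0f, tree);
    return tree;
}
```

Because every TreeNode has exactly one parent, the accumulated value can be cached per instance, which is the memory-for-performance trade described on the slide.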
7
SceneTree invalidate attributes cache
 Keep dirty flags per node
 Keep dirty vector per flag
 SceneGraph change notifications
invalidate nodes
– If not dirty, mark dirty and add to
dirty vector
– O(1) operation, no sorting required
upon changes
 Before rendering a frame, process the
dirty vector
[Diagram: SceneTree with T1, T3 and T3‘ in the dirty vector]
8
SceneTree validate attribute cache
 Walk through dirty vector
– Node marked dirty -> search for its topmost dirty ancestor
– Validate the subtree starting at that topmost dirty node
 Validation example
– T3 dirty, traverse up to the root node
 T3 is the topmost dirty node, validate the T3 subtree
– T3‘ dirty, traverse up to the root node
 T1 is the topmost dirty node, validate the T1 subtree
– T1 no longer dirty
 No work to do
[Diagram: SceneTree with T3, T3‘ and T1 in the dirty vector]
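The invalidate and validate steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the SceneTree struct, its members and the validatedRoots log are hypothetical, and validateSubtree only clears flags where real code would recompute accumulated transforms or visibility.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SceneTree storage: parent links plus one dirty flag per node.
struct SceneTree {
    std::vector<int> parent;                 // -1 for the root
    std::vector<std::vector<int>> children;
    std::vector<bool> dirty;
    std::vector<int> dirtyVector;            // nodes flagged since the last frame
    std::vector<int> validatedRoots;         // log of validated subtrees (illustration only)

    // O(1): set the flag and append once, no sorting.
    void markDirty(int node) {
        assert(node >= 0 && static_cast<std::size_t>(node) < dirty.size());
        if (!dirty[node]) {
            dirty[node] = true;
            dirtyVector.push_back(node);
        }
    }

    // Clears flags in the subtree; attribute recomputation would happen here.
    void validateSubtree(int node) {
        dirty[node] = false;
        for (int c : children[node]) validateSubtree(c);
    }

    // Run once before rendering a frame.
    void validate() {
        for (int node : dirtyVector) {
            if (!dirty[node]) continue;      // already covered by an ancestor
            int top = node;
            for (int p = parent[node]; p != -1; p = parent[p])
                if (dirty[p]) top = p;       // remember the topmost dirty ancestor
            validatedRoots.push_back(top);
            validateSubtree(top);
        }
        dirtyVector.clear();
    }
};
```

With the nodes of the validation example (T3, then T3‘ below T1, then T1), this visits the T3 subtree and the T1 subtree once each and skips T1 when it is reached in the dirty vector.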
9
SceneTree to ShapeList
 Add Events for ShapeList generation
– addShape(Shape)
– removeShape(Shape)
[Diagram: the ShapeList S0, S1, S2, S1‘, S2‘ generated from the shapes of the SceneTree. Gi = Group, Ti = Transform, Si = Shape]
10
 SceneGraph to SceneTree synchronization
– Store accumulated data per node instance
 SceneTree to ShapeList synchronization
– Avoid SceneTree traversal
 Next: Efficient data structure for renderer based on
ShapeList
Summary
11
Renderer Data Structures
Program
– Shader: Vertex
– Shader: Fragment
– ParameterDescription: Camera
– ParameterDescription: Light
– ParameterDescription: Matrices
– ParameterDescription: Material

ParameterDescription (Material)
Name      Type     Arraysize
ambient   vec3     2
diffuse   vec3     2
specular  vec3     2
texture   Sampler  0

Shape
– Program: ‚colored‘
– Geometry: Shape1
– ParameterData: Camera1
– ParameterData: Lightset 1
– ParameterData: Transform 1
– ParameterData: red
12
Example Parameter Grouping
Parameters
Shader independent globals, e.g. camera
Object parameters, e.g. position/rotation/scaling
Material handles, e.g. textures and buffers
Material raw values, e.g. float, int and bool
Light, e.g. light sources and shadow maps
Shader dependent globals, e.g. environment map
Frequency
always
frequent
constant
rare
13
Rendering structures
ShapeList: ‚colored‘ ‚textured‘ ‚colored‘ ‚colored‘ ‚textured‘
Group by Program
– Shapes colored: ‚colored‘ ‚colored‘ ‚colored‘
– Shapes textured: ‚textured‘ ‚textured‘
Sort by ParameterData
14
ParameterData Cache
 Cache is a big char[] with all ParameterData.
 ParameterData are sorted by first usage.
 Parameters are converted to the target API datatype, e.g.
– Int8 to int32, TextureHandle to bindless texture...
 Updating parameters is only playback of data in memory,
no conditionals.
 Filter for used parameters to reduce cache size
[Diagram: parameter cache for program ‚colored‘ holding red and blue; parameter cache for program ‚textured‘ holding wood and marble]
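A byte-buffer cache of this kind can be sketched as follows. The class ParameterCache and its methods are hypothetical names chosen for illustration; the sketch only shows the idea of converting once at insertion time and then replaying raw bytes, assuming the destination is plain memory rather than a real graphics API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cache: all ParameterData live in one contiguous byte array.
class ParameterCache {
public:
    // Append bytes that are already in the target layout; returns their offset.
    std::size_t append(const void* data, std::size_t size) {
        const std::size_t offset = bytes_.size();
        const unsigned char* p = static_cast<const unsigned char*>(data);
        bytes_.insert(bytes_.end(), p, p + size);
        return offset;
    }

    // Conversion happens once, at insertion: an 8-bit int is widened to 32 bits.
    std::size_t appendInt8AsInt32(std::int8_t value) {
        const std::int32_t wide = value;
        return append(&wide, sizeof wide);
    }

    // "Playback": copy a stored range to its destination, no per-type branching.
    void playback(std::size_t offset, std::size_t size, void* dst) const {
        assert(offset + size <= bytes_.size());
        std::memcpy(dst, bytes_.data() + offset, size);
    }

private:
    std::vector<unsigned char> bytes_;
};
```

Appending in order of first usage, as the slide describes, keeps the ranges that are replayed consecutively next to each other in memory.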
15
Vertex Attribute Cache
 Big char[] with vertex attribute pointers
— Bindless pointers, VBOs or VAB streams
 Each set of attributes stored only once
 Ordered by first usage
 Attributes required by program are known
– Store only used attributes in the cache
– Useful for special passes such as a depth pass, where only pos is
required
[Diagram: attribute cache for program ‚colored‘ with three pos/normal entries]
16
Renderer Cache complete
[Diagram: complete renderer cache for program ‚colored‘: Parameters (Phong red, Phong blue), Shapes (three ‚colored‘ shapes), Attributes (three pos/normal sets)]
foreach(shape) {
  if (visible(shape)) {
    if (changed(parameters)) render(parameters);
    if (changed(attributes)) render(attributes);
    render(shape);
  }
}
17
 CPU boundedness improved (application)
– Recomputation of attributes (transforms)
– Deep hierarchies: traversal expensive
– Unsorted rendering, lot of state changes
 CPU boundedness remaining (OpenGL usage)
– Large objects split up into a lot of little pieces,
increased draw call count
Achievements
[Diagram labels: SceneTree, ShapeList, Renderer, RendererCache, OpenGL implementation]
18
 Avoid data redundancy
– Data stored once, referenced multiple times
– Update only once (less host to gpu transfers)
 Increase batching potential
– Further cuts api calls
– Less driver CPU work
 Minimize CPU/GPU interaction
– Allow GPU to update its own data
– Lower api usage when scene is changed little
– E.g. GPU-based culling
Enabling Hardware Scalability
19
 Avoids classic SceneGraph design
 Geometry
– Vertex/IndexBuffer
– BoundingBox
– Divided into parts (CAD features)
 Material
 Node Hierarchy
 Object
– Node and Geometry reference
– For each Geometry part
 Material reference
 Enabled state
model courtesy of PTC
OpenGL Research Framework
 99000 total parts, 3.8 Mtris, 5.1 Mverts
 700 geometries, 128 materials
 2000 objects
20
 Kepler Quadro K5000, i7
 vbo bind and drawcall per part, i.e. 99 000
drawcalls
scene draw time > 38 ms (CPU bound)
 vbo bind per geometry, drawcalls per part
scene draw time > 14 ms (CPU bound)
 All subsequent techniques raise perf significantly
scene draw time < 6 ms
1.8 ms with occlusion culling
Performance baseline
21
 MultiDraw (1.x)
– Render ranges from current VBO/IBO
– Single drawcall for many distinct objects
– Reduces overhead for low complexity objects
 ARB_draw_indirect (4.x)
 ARB_multi_draw_indirect
– Store drawcall information on GPU or HOST
– Let GPU create/modify GPU buffers
Drawcall Reduction
// layout of one indexed indirect draw command
struct DrawElementsIndirect
{
  GLuint count;
  GLuint instanceCount;
  GLuint firstIndex;
  GLint  baseVertex;
  GLuint baseInstance;
};
22
– All techniques use multidraw capabilities to
render across gaps
– BATCHED uses a CPU-generated list of
combined parts with the same state
 Object‘s part cache must be rebuilt
based on material/enabled state
– INDIVIDUAL stays on the per-part level
 No caches, can update assignment or
cmd buffers directly
Drawing Techniques
[Diagram captions]
a b c: Parts with different materials in geometry
a+b c: Grouped and „grown“ drawcalls
Single call, encode material/matrix
assignment via vertex attribute
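The BATCHED idea of grouping and growing drawcalls can be sketched on the CPU as below. Part, DrawRange and buildBatches are hypothetical names; the sketch assumes the parts of one geometry are sorted by their position in a shared index buffer, which the slides do not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical part record: an index range inside a shared index buffer.
struct Part {
    std::size_t firstIndex;
    std::size_t indexCount;
    int material;
    bool enabled;
};

struct DrawRange {
    std::size_t firstIndex;
    std::size_t indexCount;
    int material;
};

// Merge enabled parts that share a material and are adjacent in the buffer.
// Assumes `parts` is sorted by firstIndex and ranges do not overlap.
std::vector<DrawRange> buildBatches(const std::vector<Part>& parts) {
    std::vector<DrawRange> ranges;
    for (const Part& p : parts) {
        if (!p.enabled) continue;
        if (!ranges.empty()) {
            DrawRange& last = ranges.back();
            assert(last.firstIndex + last.indexCount <= p.firstIndex);
            if (last.material == p.material &&
                last.firstIndex + last.indexCount == p.firstIndex) {
                last.indexCount += p.indexCount;  // "grow" the previous drawcall
                continue;
            }
        }
        ranges.push_back({p.firstIndex, p.indexCount, p.material});
    }
    return ranges;
}
```

The resulting ranges map onto the counts and offsets arrays of a multidraw call. They have to be rebuilt whenever a part's material or enabled state changes, which is the cache cost the slide attributes to BATCHED.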
23
 Group parameters by frequency of change
 Generating shader strings allows different storage
backend for „uniforms“
Parameters
Effect "Phong" {
  Group "material" (many) {
    vec4 "ambient"
    vec4 "diffuse"
    vec4 "specular"
  }
  Group "view" (few) {
    mat4 "viewProjTM"
  }
  Group "object" (many) {
    mat4 "worldTM"
  }
  ... Code ...
}
 OpenGL 2 uniforms
 OpenGL 3,4 buffers
 NVIDIA bindless technology...
24
 GL2 approach:
– Avoid many small
uniforms
– Arrays of uniforms,
grouped by frequency of
update, tightly-packed
Parameters
uniform mat4 worldMatrices[2];
uniform vec4 materialData[8];
#define matrix_world worldMatrices[0]
#define matrix_worldIT worldMatrices[1]
#define material_diffuse materialData[0]
#define material_emissive materialData[1]
#define material_gloss materialData[2].x
// GL3 can use floatBitsToInt and friends
// for free reinterpret casts within
// macros
...
wPos = matrix_world * oPos;
...
// in fragment shader
color = material_diffuse +
material_emissive;
...
25
 GL4 approach:
– TextureBufferObject
(TBO) for matrices
– UniformBufferObject
(UBO) with array data
to save costly binds
– Assignment indices
passed as vertex
attribute
Parameters
in vec4 oPos;
uniform samplerBuffer matrixBuffer;
uniform materialBuffer {
Material materials[512];
};
in ivec2 vAssigns;
flat out ivec2 fAssigns;
// in vertex shader
fAssigns = vAssigns;
worldTM = getMatrix (matrixBuffer,
vAssigns.x);
wPos = worldTM * oPos;
...
// in fragment shader
color = materials[fAssigns.y].color;
...
26
setupSceneMatrixAndMaterialBuffer (scene);
foreach (obj in scene) {
  if ( isVisible(obj) ) {
    setupDrawGeometryVertexBuffer (obj);
    // iterate over different materials used
    foreach ( batch in obj.materialCaches) {
      // x = matrix index, y = material index, matching vAssigns in the GL4 shader
      glVertexAttribI2i (indexAttr, matrixIndex, batch.materialIndex);
      glMultiDrawElements (GL_TRIANGLES, batch.counts, GL_UNSIGNED_INT,
                           batch.offsets, batch.numUsed);
    }
  }
}
OpenGL 4.x approach
27
Per drawcall vertex attribute
glVertexAttribDivisor == 0 : VArray[ gl_VertexID + baseVertex ]
glVertexAttribDivisor != 0 : VArray[ gl_InstanceID / VDivisor + baseInstance ]
Example: VArray[ 0 / 1 + baseInstance ]
– VertexBuffer (divisor:1): Material & Matrix Index
– VertexBuffer (divisor:0): Position & Normal
MultiDrawIndirect Buffer
– First drawcall: ... instanceCount = 1, baseInstance = 0
– Second drawcall: ... instanceCount = 1, baseInstance = 1
[Diagram: vertex attributes fetched for the last vertex in the second drawcall, baseInstance = 1]
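The fetch rule for an instanced attribute and the way baseInstance selects one assignment per drawcall can be checked with a small host-side sketch. The names instancedFetchIndex and makeCommands are hypothetical; the command struct mirrors the five 32-bit fields listed on the Drawcall Reduction slide, and no OpenGL call is made.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Five 32-bit fields, as listed on the Drawcall Reduction slide.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 5 * sizeof(std::uint32_t),
              "commands are expected to be tightly packed");

// Element fetched from an attribute with a non-zero divisor, per the rule above.
std::uint32_t instancedFetchIndex(std::uint32_t instanceID, std::uint32_t divisor,
                                  std::uint32_t baseInstance) {
    assert(divisor != 0);
    return instanceID / divisor + baseInstance;
}

// One command per part, given as (indexCount, firstIndex). baseInstance is set
// to the draw number, so draw i reads entry i of the assignment buffer.
std::vector<DrawElementsIndirectCommand>
makeCommands(const std::vector<std::pair<std::uint32_t, std::uint32_t>>& parts) {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(parts.size());
    for (std::uint32_t i = 0; i < parts.size(); ++i)
        cmds.push_back({parts[i].first, 1u, parts[i].second, 0, i});
    return cmds;
}
```

With instanceCount = 1 the instance ID is always 0, so the fetched element is simply baseInstance, which is why a divisor of 1 is enough to give every drawcall its own material and matrix index.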
28
OpenGL 4.2+ indirect approach
...
foreach ( obj in scene.objects ) {
...
// instead of glVertexAttribI2i calls and a loop
// we use the baseInstance for the attribute
// bind special assignment buffer as vertex attribute
glBindBuffer ( GL_ARRAY_BUFFER, obj->assignBuffer);
glVertexAttribIPointer (indexAttr, 2, GL_INT, . . . );
// draw everything in one go
glMultiDrawElementsIndirect ( GL_TRIANGLES, GL_UNSIGNED_INT,
obj->indirectOffset, obj->numIndirects, 0 );
}
29
 ARB_vertex_attrib_binding
(VAB)
– Avoids many buffer changes
– Separates format from data
– Bind multiple vertex
attributes to one buffer
 NV_vertex_buffer_unified_memory (VBUM)
– Allows very fast switching
through GPU pointers
Vertex Setup
/* setup once, similar to glVertexAttribPointer
but with relative offset last */
glVertexAttribFormat (ATTR_NORMAL, 3,
GL_FLOAT, GL_TRUE, offsetof(Vertex,normal));
glVertexAttribFormat (ATTR_POS, 3,
GL_FLOAT, GL_FALSE, offsetof(Vertex,pos));
// bind to stream
glVertexAttribBinding (ATTR_NORMAL, 0);
glVertexAttribBinding (ATTR_POS, 0);
// switch single stream buffer
glBindVertexBuffer (0, bufID, 0, sizeof(Vertex));
// NV_vertex_buffer_unified_memory
// enable once and set stride
glEnableClientState (GL_VERTEX...NV);...
glBindVertexBuffer (0, 0, 0, sizeof(Vertex));
// switch single buffer via pointer
glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR,
bufSize);
30
Effect on CPU time
[Chart: CPU time in microseconds (us), lower is better, for VBO, VAB, VAB+VBUM, BINDLESS INDIRECT HOST (VAB+VBUM) and BINDLESS INDIRECT GPU (VAB+VBUM); ~2400 drawcalls, GL4 BATCHED style]
NV_bindless_multidraw_indirect
– Vertex/Index setup inside MultiDrawIndirect command
– One GL call to draw the entire scene
– GPU benefit depends on triangles per drawcall (> ~500)
31
[Chart: GPU and CPU time in microseconds (us), lower is better, for GL4 INDIRECT HOST INDIVIDUAL, GL4 INDIRECT GPU INDIVIDUAL, GL4 BATCHED and GL2 BATCHED, each measured as K and KB]
K = Kepler 5000, regular VBO
KB = Kepler 5000, VBUM + VAB
hw drawcalls: 99,000 (INDIVIDUAL), 2,400 (BATCHED)
sw drawcalls: 2,000 (INDIVIDUAL), 2,400 (BATCHED)
Bindless (green) always reduces CPU time and
may help framerate/GPU a bit
32
[Chart: same measurements and legend as on the previous slide (GPU and CPU time in microseconds, lower is better; K = Kepler 5000, regular VBO; KB = Kepler 5000, VBUM + VAB)]
MultiDrawIndirect achieves almost 20 million drawcalls per
second (2000 VBO changes, „only“ 1/3 perf lost).
GPU-buffered commands save lots of CPU time
Scene-dependent!
INDIVIDUAL could be as fast
if enough work per drawcall
33
[Chart: same measurements and legend as on the previous two slides (GPU and CPU time in microseconds, lower is better; K = Kepler 5000, regular VBO; KB = Kepler 5000, VBUM + VAB)]
GL2 uniforms beat paletted UBO a bit on the GPU, but are slower on the
CPU side (1 glUniform call with 8x vec4, vs indexed UBO).
Scene-dependent!
GL4 better when more
materials changed per object
34
 Share geometry buffers for batching
 Group parameters for fast updating
 MultiDraw/Indirect for keeping objects
independent or remove additional loops
– baseInstance to provide unique
index/assignments for drawcall
 Bindless to reduce validation
overhead/add flexibility
Recap
35
 GPU friendly processing
– Matrix and bbox buffer, object buffer
– XFB/Compute or „invisible“ rendering
– Vs. old techniques: Single GPU job for ALL objects!
 Results
– „Readback“ GPU to Host
 Can use GPU to pack into bit stream
– „Indirect“ GPU to GPU
 Set DrawIndirect‘s instanceCount to 0 or 1
GPU Culling Basics
0,1,0,1,1,1,0,0,0
buffer cmdBuffer{
Command cmds[];
};
...
cmds[obj].instanceCount = visible;
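Both result paths can be emulated on the CPU to make the data flow concrete. This is only a sketch under assumptions: on the GPU the same writes would come from a shader, Command is reduced to the one field used here, and applyVisibility and packBits are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Command {                 // only the field relevant here
    std::uint32_t instanceCount;
};

// "Indirect" path: visibility goes straight into the command buffer.
void applyVisibility(std::vector<Command>& cmds,
                     const std::vector<std::uint8_t>& visible) {
    assert(cmds.size() == visible.size());
    for (std::size_t i = 0; i < cmds.size(); ++i)
        cmds[i].instanceCount = visible[i] ? 1u : 0u;
}

// "Readback" path: pack one bit per object, object 0 in the lowest bit.
std::vector<std::uint32_t> packBits(const std::vector<std::uint8_t>& visible) {
    std::vector<std::uint32_t> words((visible.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < visible.size(); ++i)
        if (visible[i]) words[i / 32] |= (1u << (i % 32));
    return words;
}
```

Packing shrinks the readback to one bit per object, while the indirect path avoids the transfer entirely because a drawcall with instanceCount 0 draws nothing.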
36
 OpenGL 4.2+
– Depth-Pass
– Raster „invisible“ bounding boxes
 Disable Color/Depth writes
 Geometry Shader to create the three
visible box sides
 Depth buffer discards occluded fragments
(earlyZ...)
 Fragment Shader writes output:
visible[objindex] = 1
Occlusion Culling
// GLSL fragment shader
// from ARB_shader_image_load_store
layout(early_fragment_tests) in;
buffer visibilityBuffer{
int visibility[];
};
flat in int objID;
void main(){
visibility[objID] = 1;
}
// buffer would have been cleared
// to 0 before
[Diagram: bbox fragments that pass the depth buffer test enable the object]
Algorithm by
Evgeny Makarov, NVIDIA
37
Temporal Coherence
 Exploit that the majority of objects don‘t change
much relative to the camera
 Draw each object only once (vertex/drawcall-bound)
– Render last visible, fully shaded: (last)
– Test all against current depth: (visible)
– Render newly added visible: (~last) & (visible);
none, if no spatial changes were made
– (last) = (visible)
[Diagram: frame f – 1 and frame f after the camera moved; legend: last visible, bboxes occluded (invisible), bboxes pass depth (visible), new visible]
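The per-frame set logic of the temporal coherence scheme can be written down directly. The sketch below is a CPU-side illustration with hypothetical names (Mask, FrameLists, temporalStep) and one byte per object instead of a packed bit stream; the occlusion test that produces the visible mask is assumed to have run already.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Mask = std::vector<std::uint8_t>;  // one 0/1 entry per object

struct FrameLists {
    Mask drawFirst;   // (last): drawn fully shaded before the occlusion test
    Mask drawSecond;  // (~last) & (visible): objects that became visible
};

// Computes both draw lists for the current frame and then updates `last`.
FrameLists temporalStep(Mask& last, const Mask& visible) {
    assert(last.size() == visible.size());
    FrameLists out{last, Mask(last.size(), 0)};
    for (std::size_t i = 0; i < last.size(); ++i)
        out.drawSecond[i] = (!last[i] && visible[i]) ? 1 : 0;
    last = visible;  // (last) = (visible)
    return out;
}
```

If nothing moves between two frames, the second list is empty and every object is drawn exactly once, which matches the "none, if no spatial changes made" case on the slide.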
38
Culling Readback vs Indirect
[Chart: GPU and CPU time in microseconds (us), lower is better, for readback, indirect and NVindirect; GL4 BATCHED style]
– readback: 37% faster with culling. For readback results, the CPU has to
wait for GPU idle
– indirect: 33% faster with culling. In the „draw new visible“ phase indirect cannot
benefit from „nothing to setup/draw“ in advance and
still processes „empty“ lists
– NVindirect: 37% faster with culling. NV_bindless_multidraw_indirect
saves CPU time and a bit of GPU time
Scene-dependent, i.e. triangles per drawcall and # of „invisible“
39
 Temporal culling very useful for object/vertex-boundedness
– Can also apply for Z-pass...
 Readback vs Indirect
– Readback variant „easier“ to be faster (no setups...), but syncs!
– NV_bindless_multidraw benefit depends on scene (VBO changes
and primitives per drawcall)
 Working towards GPU autonomous system
– (NV_bindless)/ARB_multi_draw_indirect as mechanism for GPU
creating its own work, research and feature work in progress
Culling Results
40
 Thank you!
– Contact
 ckubisch@nvidia.com
 matavenrath@nvidia.com
glFinish();
41
 Family of extensions to use
native handles/addresses
 NV_vertex_buffer_unified_memory
 NV_bindless_multidraw_indirect
 NV_shader_buffer_load/store
– Pointers in GLSL
 NV_bindless_texture
– No more unit restrictions
– References inside buffers
NVIDIA Bindless Technology
// GLSL with true pointers
uniform MyStruct* mystructs;
// API
glUniformui64NV (bufferLocation, bufferADDR);
texHDL = glGetTextureHandleNV (tex);
// later instead of glBindTexture
glUniformHandleui64NV (texLocation, texHDL);
// GLSL
// can also store textures in resources
uniform materialBuffer {
  sampler2D manyTextures [LARGE];
};
42
Culling Readback vs Indirect
[Chart: time in microseconds for Readback, Indirect and NVIndirect, with one GPU and one CPU bar each, stacked by phase: 1. Frustum Cull, 2. Update Internals, 3. Draw Last Visible, 4. Occlusion Cull, 5. Update Internals, 6. Draw New Visible]
– Readback: 432 fps with culling, 315 without. For readback results, the CPU has to
wait for GPU idle
– Indirect: 387 fps with culling, 289 without. Nothing „new“ to draw, but the CPU doesn‘t
know; it still sets things up and the GPU runs
through an „empty“ cmd buffer
– NVIndirect: 429 fps with culling, 313 without. The special bindless indirect
version can save lots of CPU and a bit of GPU cost
for drawing the scene with a single big cmd buffer

Weitere ähnliche Inhalte

Was ist angesagt?

A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
guest11b095
 

Was ist angesagt? (20)

Frostbite on Mobile
Frostbite on MobileFrostbite on Mobile
Frostbite on Mobile
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
 
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable SystemTerrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable System
 
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunFive Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
 
Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)
 
Killzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo PostmortemKillzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo Postmortem
 
Stochastic Screen-Space Reflections
Stochastic Screen-Space ReflectionsStochastic Screen-Space Reflections
Stochastic Screen-Space Reflections
 
Star Ocean 4 - Flexible Shader Managment and Post-processing
Star Ocean 4 - Flexible Shader Managment and Post-processingStar Ocean 4 - Flexible Shader Managment and Post-processing
Star Ocean 4 - Flexible Shader Managment and Post-processing
 
Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
 
Hair in Tomb Raider
Hair in Tomb RaiderHair in Tomb Raider
Hair in Tomb Raider
 
Rendering AAA-Quality Characters of Project A1
Rendering AAA-Quality Characters of Project A1Rendering AAA-Quality Characters of Project A1
Rendering AAA-Quality Characters of Project A1
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
 
OpenGL 3.2 and More
OpenGL 3.2 and MoreOpenGL 3.2 and More
OpenGL 3.2 and More
 
Calibrating Lighting and Materials in Far Cry 3
Calibrating Lighting and Materials in Far Cry 3Calibrating Lighting and Materials in Far Cry 3
Calibrating Lighting and Materials in Far Cry 3
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 

Andere mochten auch

Game engine introduction and approach
Game engine introduction and approachGame engine introduction and approach
Game engine introduction and approach
Duy Tan Geek
 
Game Engine Overview
Game Engine OverviewGame Engine Overview
Game Engine Overview
Sharad Mitra
 
Game Architecture and Programming
Game Architecture and ProgrammingGame Architecture and Programming
Game Architecture and Programming
Sumit Jain
 

Andere mochten auch (15)

Game engine introduction and approach
Game engine introduction and approachGame engine introduction and approach
Game engine introduction and approach
 
Game Engine Overview
Game Engine OverviewGame Engine Overview
Game Engine Overview
 
Game Engine Architecture
Game Engine ArchitectureGame Engine Architecture
Game Engine Architecture
 
What Is A Game Engine
What Is A Game EngineWhat Is A Game Engine
What Is A Game Engine
 
Killzone Shadow Fall: Threading the Entity Update on PS4
Killzone Shadow Fall: Threading the Entity Update on PS4Killzone Shadow Fall: Threading the Entity Update on PS4
Killzone Shadow Fall: Threading the Entity Update on PS4
 
Design your 3d game engine
Design your 3d game engineDesign your 3d game engine
Design your 3d game engine
 
Ray tracing
Ray tracingRay tracing
Ray tracing
 
Ray tracing
Ray tracingRay tracing
Ray tracing
 
GRPHICS08 - Raytracing and Radiosity
GRPHICS08 - Raytracing and RadiosityGRPHICS08 - Raytracing and Radiosity
GRPHICS08 - Raytracing and Radiosity
 
Unity - Game Engine
Unity - Game EngineUnity - Game Engine
Unity - Game Engine
 
Game Architecture and Programming
Game Architecture and ProgrammingGame Architecture and Programming
Game Architecture and Programming
 
Ray tracing
 Ray tracing Ray tracing
Ray tracing
 
Pitfalls of Object Oriented Programming by SONY
Pitfalls of Object Oriented Programming by SONYPitfalls of Object Oriented Programming by SONY
Pitfalls of Object Oriented Programming by SONY
 
Migrating from OpenGL to Vulkan
Migrating from OpenGL to VulkanMigrating from OpenGL to Vulkan
Migrating from OpenGL to Vulkan
 
Ray tracing
Ray tracingRay tracing
Ray tracing
 

Ähnlich wie Advanced Scenegraph Rendering Pipeline

D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And Effects
Thomas Goddard
 
Introduction To Geometry Shaders
Introduction To Geometry ShadersIntroduction To Geometry Shaders
Introduction To Geometry Shaders
pjcozzi
 
Hardware Shaders
Hardware ShadersHardware Shaders
Hardware Shaders
gueste52f1b
 
Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)
Matthias Trapp
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
Johan Andersson
 

Ähnlich wie Advanced Scenegraph Rendering Pipeline (20)

D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And Effects
 
Penn graphics
Penn graphicsPenn graphics
Penn graphics
 
DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3
 
Hpg2011 papers kazakov
Hpg2011 papers kazakovHpg2011 papers kazakov
Hpg2011 papers kazakov
 
OpenGL 4 for 2010
OpenGL 4 for 2010OpenGL 4 for 2010
OpenGL 4 for 2010
 
Computer Graphics - Lecture 01 - 3D Programming I
Computer Graphics - Lecture 01 - 3D Programming IComputer Graphics - Lecture 01 - 3D Programming I
Computer Graphics - Lecture 01 - 3D Programming I
 
Introduction To Geometry Shaders
Introduction To Geometry ShadersIntroduction To Geometry Shaders
Introduction To Geometry Shaders
 
NVIDIA's OpenGL Functionality
NVIDIA's OpenGL FunctionalityNVIDIA's OpenGL Functionality
NVIDIA's OpenGL Functionality
 
Hardware Shaders
Hardware ShadersHardware Shaders
Hardware Shaders
 
Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)
 
Chapter-3.pdf
Chapter-3.pdfChapter-3.pdf
Chapter-3.pdf
 
NVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyNVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and Transparency
 
2DCompsitionEngine
2DCompsitionEngine2DCompsitionEngine
2DCompsitionEngine
 
Advanced Graphics Workshop - GFX2011
Advanced Graphics Workshop - GFX2011Advanced Graphics Workshop - GFX2011
Advanced Graphics Workshop - GFX2011
 
7nm "Navi" GPU - A GPU Built For Performance
7nm "Navi" GPU - A GPU Built For Performance 7nm "Navi" GPU - A GPU Built For Performance
7nm "Navi" GPU - A GPU Built For Performance
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
GPU - how can we use it?
GPU - how can we use it?GPU - how can we use it?
GPU - how can we use it?
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)

Advanced Scenegraph Rendering Pipeline

  • 1. Advanced Scenegraph Rendering Pipeline
      – Markus Tavenrath, NVIDIA, matavenrath@nvidia.com
      – Christoph Kubisch, NVIDIA, ckubisch@nvidia.com
  • 2. SceneGraph Rendering
      – The traditional approach is to render while traversing a SceneGraph
      – Scene complexity increases
          – Deep hierarchies make traversal expensive
          – Large objects are split into many small pieces, which increases the draw call count
          – Unsorted rendering causes many state changes
      – The CPU becomes the bottleneck when rendering such scenes
      – Models courtesy of PTC
  • 3. Overview
      – Pipeline diagram: SceneGraph -> SceneTree -> ShapeList -> Renderer (RendererCache)
      – Legend: Gi = Group, Ti = Transform, Si = Shape
      – The SceneGraph contains G0, G1, T0 to T3 and S0 to S2; the SceneTree additionally contains the instances G1', T2', T3', S1' and S2'
      – ShapeList and RendererCache both hold S0, S1, S2, S1', S2'
  • 4. SceneGraph
      – The SceneGraph is a DAG
      – No unique path to a node
          – Path-dependent data cannot be cached efficiently per node
      – Traversal runs over 14 nodes for rendering
      – 6 Transform nodes are processed
          – 6 matrix/matrix multiplications and inversions
      – Nodes are usually "large" and not linear in memory
          – Each node access generates at least one cache miss, most likely several
      – Legend: Gi = Group, Ti = Transform, Si = Shape
  • 5. SceneTree construction
      – Each SceneGraph node maps to the list of its SceneTree instances
      – First step: G0 -> (G0), T0 -> (T0), T1 -> (T1), S0 -> (S0), G1 -> (G1), T2 -> (T2), S1 -> (S1), T3 -> (T3), S2 -> (S2)
      – With the shared subgraph instanced twice: G0 -> (G0), T0 -> (T0), T1 -> (T1), S0 -> (S0), G1 -> (G1,G1'), T2 -> (T2,T2'), S1 -> (S1,S1'), T3 -> (T3,T3'), S2 -> (S2,S2')
      – Observer-based synchronization
      – Legend: Gi = Group, Ti = Transform, Si = Shape
  • 6. SceneTree
      – The SceneTree has a unique path to each node
      – Store accumulated attributes such as transforms or visibility in each node
      – Trade memory for performance
          – 64 bytes per node, 100k nodes ~6 MB
          – Transforms are stored in a separate vector
      – Traversal still processes 14 nodes
  • 7. SceneTree: invalidate attribute cache
      – Keep dirty flags per node
      – Keep one dirty vector per flag
      – SceneGraph change notifications invalidate nodes
          – If a node is not dirty, mark it dirty and add it to the dirty vector
          – O(1) operation, no sorting required upon changes
      – Before rendering a frame, process the dirty vector
      – Example: the dirty vector contains T1, T3 and T3'
  • 8. SceneTree: validate attribute cache
      – Walk through the dirty vector
          – Node marked dirty -> search for the topmost dirty node on the path to the root
          – Validate the subtree starting at that top dirty node
      – Validation example (dirty vector contains T1, T3 and T3')
          – T3 dirty, traverse up to the root node: T3 is the top dirty node, validate the T3 subtree
          – T3' dirty, traverse up to the root node: T1 is the top dirty node, validate the T1 subtree
          – T1 is no longer dirty: no work to do
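
The invalidate/validate scheme of slides 7 and 8 can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `TreeNode`, `SceneTree`, `setLocal` and the `validated` counter are hypothetical names, and a single float offset stands in for the accumulated transform matrix.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flat SceneTree node; a float offset stands in for a 4x4 transform.
struct TreeNode {
    int parent = -1;            // -1 marks the root
    std::vector<int> children;
    float local = 0.0f;         // local attribute (stand-in for a local matrix)
    float world = 0.0f;         // cached accumulated attribute
    bool dirty = false;         // dirty flag kept per node
};

struct SceneTree {
    std::vector<TreeNode> nodes;
    std::vector<int> dirtyVector;   // indices of nodes marked dirty
    int validated = 0;              // counts recomputed nodes, for illustration only

    int addNode(int parent, float local) {
        const int id = static_cast<int>(nodes.size());
        TreeNode n;
        n.parent = parent;
        n.local = local;
        n.world = (parent >= 0 ? nodes[parent].world : 0.0f) + local;
        nodes.push_back(n);
        if (parent >= 0) nodes[parent].children.push_back(id);
        return id;
    }

    // Change notification: O(1), no sorting.
    void setLocal(int id, float local) {
        assert(id >= 0 && id < static_cast<int>(nodes.size()));
        nodes[id].local = local;
        if (!nodes[id].dirty) {
            nodes[id].dirty = true;
            dirtyVector.push_back(id);
        }
    }

    // Recompute the cached value for a node and everything below it.
    void validateSubtree(int id) {
        const int parent = nodes[id].parent;
        nodes[id].world = (parent >= 0 ? nodes[parent].world : 0.0f) + nodes[id].local;
        nodes[id].dirty = false;
        ++validated;
        for (int child : nodes[id].children) validateSubtree(child);
    }

    // Run once before rendering a frame.
    void validate() {
        for (int id : dirtyVector) {
            if (!nodes[id].dirty) continue;          // already handled through an ancestor
            int top = id;
            for (int p = nodes[id].parent; p >= 0; p = nodes[p].parent)
                if (nodes[p].dirty) top = p;         // remember the topmost dirty ancestor
            validateSubtree(top);
        }
        dirtyVector.clear();
    }
};
```

Marking a child and then its parent dirty recomputes each affected node exactly once, because the child's entry already resolves to the parent as top dirty node and the parent's own entry is then skipped.
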
  • 9. SceneTree to ShapeList
      – Add events for ShapeList generation
          – addShape(Shape)
          – removeShape(Shape)
      – Resulting ShapeList: S0, S1, S2, S1', S2'
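
One possible way to back the `addShape`/`removeShape` events with a flat list is sketched below. The slides do not show the actual container, so the swap-and-pop vector, the `ShapeId` alias and the `slot_` index map are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using ShapeId = int;  // hypothetical handle for a SceneTree shape instance

// Flat ShapeList maintained by events, so the renderer never walks the SceneTree.
class ShapeList {
public:
    void addShape(ShapeId s) {
        assert(slot_.find(s) == slot_.end());  // each instance is added once
        slot_[s] = shapes_.size();
        shapes_.push_back(s);
    }

    // O(1) removal: move the last shape into the freed slot.
    void removeShape(ShapeId s) {
        auto it = slot_.find(s);
        if (it == slot_.end()) return;
        const std::size_t i = it->second;
        const ShapeId last = shapes_.back();
        shapes_[i] = last;
        slot_[last] = i;
        shapes_.pop_back();
        slot_.erase(s);
    }

    const std::vector<ShapeId>& shapes() const { return shapes_; }

private:
    std::vector<ShapeId> shapes_;
    std::unordered_map<ShapeId, std::size_t> slot_;
};
```

Swap-and-pop does not preserve order, which is acceptable here because the renderer sorts shapes by program and parameters anyway.
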
  • 10. Summary
      – SceneGraph to SceneTree synchronization
          – Store accumulated data per node instance
      – SceneTree to ShapeList synchronization
          – Avoid SceneTree traversal
      – Next: an efficient data structure for the renderer, based on the ShapeList
  • 11. Renderer Data Structures
      – Program: vertex shader, fragment shader, and ParameterDescriptions for Camera, Light, Matrices and Material
      – ParameterDescription Material (Name, Type, Arraysize)
          – ambient, vec3, 2
          – diffuse, vec3, 2
          – specular, vec3, 2
          – texture, Sampler, 0
      – Shape: Program "colored", Geometry Shape1, ParameterData Camera1, ParameterData Lightset 1, ParameterData Transform 1, ParameterData red
  • 12. Example Parameter Grouping
      – Parameter groups
          – Shader-independent globals, e.g. camera
          – Object parameters, e.g. position/rotation/scaling
          – Material handles, e.g. textures and buffers
          – Material raw values, e.g. float, int and bool
          – Light, e.g. light sources and shadow maps
          – Shader-dependent globals, e.g. environment map
      – Frequency labels shown next to the groups: always, frequent, constant, rare
  • 13. Rendering structures
      – The ShapeList mixes shapes using different programs ("colored", "textured")
      – Group by Program: Shapes colored, Shapes textured
      – Sort by ParameterData
  • 14. ParameterData Cache
      – The cache is one big char[] with all ParameterData
      – ParameterData are sorted by first usage
      – Parameters are converted to the target-API datatype, e.g.
          – int8 to int32, TextureHandle to bindless texture, ...
      – Updating parameters is only a playback of data in memory, no conditionals
      – Filter for used parameters to reduce the cache size
      – Example: Parameters colored: red, blue; Parameters textured: wood, marble
  • 15. Vertex Attribute Cache
      – Big char[] with vertex attribute pointers
          – Bindless pointers, VBOs or VAB streams
      – Each set of attributes is stored only once
      – Ordered by first usage
      – Attributes required by the program are known
          – Store only used attributes in the cache
          – Useful for special passes such as a depth pass, where only pos is required
      – Example: attributes colored: (pos, normal), (pos, normal), (pos, normal)
  • 16. Renderer Cache complete
      – Parameters colored: Phong red, Phong blue
      – Shapes colored: "colored", "colored", "colored"
      – Attributes colored: (pos, normal), (pos, normal), (pos, normal)
      – Render loop:
          foreach (shape) {
            if (visible(shape)) {
              if (changed(parameters)) render(parameters);
              if (changed(attributes)) render(attributes);
              render(shape);
            }
          }
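
To make the effect of grouping by program and sorting by ParameterData concrete, the following C++ sketch replays a shape list and counts how often state would have to be re-sent. It is an illustration only: `ShapeEntry`, `ReplayStats`, `sortForRendering` and `replay` are hypothetical names, and integer ids stand in for programs, ParameterData and attribute sets.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Integer ids stand in for program, ParameterData and vertex attribute set.
struct ShapeEntry {
    int program;
    int parameters;
    int attributes;
    bool visible;
};

struct ReplayStats {
    int programBinds = 0;
    int parameterUpdates = 0;
    int attributeBinds = 0;
    int draws = 0;
};

// Group by program, then sort by parameters and attributes.
void sortForRendering(std::vector<ShapeEntry>& shapes) {
    std::stable_sort(shapes.begin(), shapes.end(),
        [](const ShapeEntry& a, const ShapeEntry& b) {
            return std::tie(a.program, a.parameters, a.attributes)
                 < std::tie(b.program, b.parameters, b.attributes);
        });
}

// Same structure as the render loop on the slide: state is only re-sent when it changed.
ReplayStats replay(const std::vector<ShapeEntry>& shapes) {
    ReplayStats stats;
    int program = -1, parameters = -1, attributes = -1;
    for (const ShapeEntry& s : shapes) {
        if (!s.visible) continue;
        assert(s.program >= 0 && s.parameters >= 0 && s.attributes >= 0);
        if (s.program != program) {
            program = s.program;
            parameters = -1;   // a new program invalidates the other cached state
            attributes = -1;
            ++stats.programBinds;
        }
        if (s.parameters != parameters) { parameters = s.parameters; ++stats.parameterUpdates; }
        if (s.attributes != attributes) { attributes = s.attributes; ++stats.attributeBinds; }
        ++stats.draws;
    }
    return stats;
}
```

Replaying the same list before and after `sortForRendering` shows the reduction in state changes while the number of draws stays the same.
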
  • 17. Achievements
      – CPU boundedness improved (application)
          – Recomputation of attributes (transforms)
          – Deep hierarchies: traversal expensive
          – Unsorted rendering, many state changes
      – CPU boundedness remaining (OpenGL usage)
          – Large objects split into many small pieces, increased draw call count
      – Diagram labels: SceneTree, ShapeList, RendererCache, Renderer (OpenGL implementation)
  • 18. Enabling Hardware Scalability
      – Avoid data redundancy
          – Data stored once, referenced multiple times
          – Update only once (fewer host-to-GPU transfers)
      – Increase batching potential
          – Further cuts API calls
          – Less driver CPU work
      – Minimize CPU/GPU interaction
          – Allow the GPU to update its own data
          – Lower API usage when the scene changes little
          – E.g. GPU-based culling
  • 19. OpenGL Research Framework
      – Avoids the classic SceneGraph design
      – Geometry
          – Vertex/IndexBuffer
          – BoundingBox
          – Divided into parts (CAD features)
      – Material
      – Node hierarchy
      – Object
          – Node and Geometry reference
          – For each Geometry part: Material reference, enabled state
      – Test model (courtesy of PTC)
          – 99,000 total parts, 3.8 Mtris, 5.1 Mverts
          – 700 geometries, 128 materials
          – 2,000 objects
  • 20. Performance baseline
      – Kepler Quadro K5000, i7
      – VBO bind and drawcall per part, i.e. 99,000 drawcalls: scene draw time > 38 ms (CPU bound)
      – VBO bind per geometry, drawcalls per part: scene draw time > 14 ms (CPU bound)
      – All subsequent techniques raise performance significantly: scene draw time < 6 ms, 1.8 ms with occlusion culling
  • 21. Drawcall Reduction
      – MultiDraw (1.x)
          – Render ranges from the current VBO/IBO
          – Single drawcall for many distinct objects
          – Reduces overhead for low-complexity objects
      – ARB_draw_indirect (4.x), ARB_multi_draw_indirect
          – Store drawcall information on GPU or HOST
          – Let the GPU create/modify GPU buffers
      – Command layout:
          DrawElementsIndirect {
            GLuint count;
            GLuint instanceCount;
            GLuint firstIndex;
            GLint baseVertex;
            GLuint baseInstance;
          }
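
As a host-side illustration of the command layout above, the sketch below fills one command per part without calling any OpenGL function. `PartRange` and `buildCommands` are hypothetical helpers, and the fixed-width integer types are assumed to match `GLuint`/`GLint`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the five 32-bit fields listed on the slide.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands are expected to be tightly packed");

// Hypothetical description of one part inside a shared index/vertex buffer.
struct PartRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// One command per part; baseInstance carries the index of the part so that a
// per-drawcall attribute (see slide 27) can be fetched with it.
std::vector<DrawElementsIndirectCommand> buildCommands(const std::vector<PartRange>& parts) {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(parts.size());
    for (std::uint32_t i = 0; i < parts.size(); ++i) {
        assert(parts[i].indexCount % 3 == 0);  // triangle lists only in this sketch
        cmds.push_back({parts[i].indexCount, 1u, parts[i].firstIndex,
                        parts[i].baseVertex, i});
    }
    return cmds;
}
```

The resulting vector could be uploaded to a buffer or passed from host memory; either way one multi-draw call consumes all commands.
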
  • 22. Drawing Techniques
      – All techniques use multidraw capabilities to render across gaps
      – BATCHED: uses a CPU-generated list of combined parts with the same state
          – The object's part cache must be rebuilt based on material/enabled state
      – INDIVIDUAL: stays on the per-part level
          – No caches, can update assignment or cmd buffers directly
      – Diagram: parts a, b, c with different materials in one geometry
          – Grouped and "grown" drawcalls: a+b, c
          – Single call, with the material/matrix assignment encoded via a vertex attribute
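
The BATCHED idea of grouping parts by material and "growing" adjacent ranges can be sketched as follows. This is an assumption-laden illustration: `Part`, `MaterialBatch` and `buildMaterialCaches` are hypothetical names, and offsets are kept in indices here, whereas a real `glMultiDrawElements` call expects byte offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct Part {
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
    int material;
    bool enabled;
};

// Per-material range lists in the shape a multidraw call consumes.
struct MaterialBatch {
    std::vector<std::uint32_t> offsets;  // in indices; a GL call would need byte offsets
    std::vector<std::uint32_t> counts;
};

// Parts must be sorted by firstIndex and must not overlap.
std::map<int, MaterialBatch> buildMaterialCaches(const std::vector<Part>& parts) {
    std::map<int, MaterialBatch> caches;
    std::uint32_t previousEnd = 0;
    for (const Part& p : parts) {
        assert(p.firstIndex >= previousEnd);
        previousEnd = p.firstIndex + p.indexCount;
        if (!p.enabled) continue;  // disabled parts leave a gap
        MaterialBatch& b = caches[p.material];
        if (!b.offsets.empty() && b.offsets.back() + b.counts.back() == p.firstIndex) {
            b.counts.back() += p.indexCount;      // grow the previous range
        } else {
            b.offsets.push_back(p.firstIndex);    // start a new range across the gap
            b.counts.push_back(p.indexCount);
        }
    }
    return caches;
}
```

Whenever a part's material or enabled state changes, this cache has to be rebuilt for the object, which is the cost the INDIVIDUAL technique avoids.
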
  • 23. Parameters
      – Group parameters by frequency of change
      – Generating shader strings allows different storage backends for "uniforms"
          – OpenGL 2 uniforms
          – OpenGL 3, 4 buffers
          – NVIDIA bindless technology...
      – Effect description:
          Effect "Phong" {
            Group "material" (many) {
              vec4 "ambient"
              vec4 "diffuse"
              vec4 "specular"
            }
            Group "view" (few) {
              vec4 "viewProjTM"
            }
            Group "object" (many) {
              mat4 "worldTM"
            }
            ... Code ...
          }
  • 24. Parameters
      – GL2 approach:
          – Avoid many small uniforms
          – Arrays of uniforms, grouped by frequency of update, tightly packed
      – Shader code:
          uniform mat4 worldMatrices[2];
          uniform vec4 materialData[8];
          #define matrix_world worldMatrices[0]
          #define matrix_worldIT worldMatrices[1]
          #define material_diffuse materialData[0]
          #define material_emissive materialData[1]
          #define material_gloss materialData[2].x
          // GL3 can use floatBitsToInt and friends
          // for free reinterpret casts within macros
          ...
          wPos = matrix_world * oPos;
          ...
          // in fragment shader
          color = material_diffuse + material_emissive;
          ...
  • 25. Parameters
      – GL4 approach:
          – TextureBufferObject (TBO) for matrices
          – UniformBufferObject (UBO) with array data to save costly binds
          – Assignment indices passed as vertex attribute
      – Shader code:
          in vec4 oPos;
          uniform samplerBuffer matrixBuffer;
          uniform materialBuffer {
            Material materials[512];
          };
          in ivec2 vAssigns;
          flat out ivec2 fAssigns;
          // in vertex shader
          fAssigns = vAssigns;
          worldTM = getMatrix (matrixBuffer, vAssigns.x);
          wPos = worldTM * oPos;
          ...
          // in fragment shader
          color = materials[fAssigns.y].color;
          ...
  • 26. OpenGL 4.x approach
          setupSceneMatrixAndMaterialBuffer (scene);
          foreach (obj in scene) {
            if ( isVisible(obj) ) {
              setupDrawGeometryVertexBuffer (obj);
              // iterate over different materials used
              foreach ( batch in obj.materialCaches) {
                glVertexAttribI2i (indexAttr, batch.materialIndex, matrixIndex);
                glMultiDrawElements (GL_TRIANGLES, batch.counts, GL_UNSIGNED_INT,
                                     batch.offsets, batch.numUsed);
              }
            }
          }
  • 27. Per drawcall vertex attribute
      – glVertexAttribDivisor == 0: VArray[ gl_VertexID + baseVertex ]
      – glVertexAttribDivisor != 0: VArray[ gl_InstanceID / VDivisor + baseInstance ]
          – With a single instance and divisor 1: VArray[ 0 / 1 + baseInstance ]
      – Position & Normal VertexBuffer (divisor: 0)
      – Material & Matrix Index VertexBuffer (divisor: 1)
      – MultiDrawIndirect buffer: first command with instanceCount = 1, baseInstance = 0; second command with instanceCount = 1, baseInstance = 1
      – Example: vertex attributes fetched for the last vertex in the second drawcall use baseInstance = 1
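
The fetch rule for the instanced attribute can be checked on the CPU with a small emulation. This is a sketch, not driver behaviour: `Assignment`, `instancedElement` and `assignmentForDraw` are hypothetical names, and the field order of `Assignment` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry of the "Material & Matrix Index" vertex buffer (field order assumed).
struct Assignment {
    std::int32_t matrixIndex;
    std::int32_t materialIndex;
};

// Element index used for an attribute with a non-zero divisor, as on the slide:
// gl_InstanceID / divisor + baseInstance.
std::size_t instancedElement(std::uint32_t instanceID, std::uint32_t divisor,
                             std::uint32_t baseInstance) {
    assert(divisor != 0);
    return static_cast<std::size_t>(instanceID / divisor) + baseInstance;
}

// With instanceCount = 1 and divisor 1 every vertex of a drawcall reads the
// same element, selected purely by baseInstance.
const Assignment& assignmentForDraw(const std::vector<Assignment>& buffer,
                                    std::uint32_t baseInstance) {
    const std::size_t element = instancedElement(0, 1, baseInstance);
    assert(element < buffer.size());
    return buffer[element];
}
```

This is why a unique `baseInstance` per indirect command is enough to give each drawcall its own material and matrix index without extra API calls.
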
  • 28. OpenGL 4.2+ indirect approach
          ...
          foreach ( obj in scene.objects ) {
            ...
            // instead of glVertexAttribI2i calls and a loop
            // we use the baseInstance for the attribute
            // bind special assignment buffer as vertex attribute
            glBindBuffer ( GL_ARRAY_BUFFER, obj->assignBuffer);
            glVertexAttribIPointer (indexAttr, 2, GL_INT, . . . );
            // draw everything in one go
            glMultiDrawElementsIndirect ( GL_TRIANGLES, GL_UNSIGNED_INT,
                                          obj->indirectOffset, obj->numIndirects, 0 );
          }
  • 29. Vertex Setup
      – ARB_vertex_attrib_binding (VAB)
          – Avoids many buffer changes
          – Separates format from data
          – Bind multiple vertex attributes to one buffer
      – NV_vertex_buffer_unified_memory (VBUM)
          – Allows very fast switching through GPU pointers
      – Code:
          /* setup once, similar to glVertexAttribPointer
             but with relative offset last */
          glVertexAttribFormat (ATTR_NORMAL, 3, GL_FLOAT, GL_TRUE, offsetof(Vertex,normal));
          glVertexAttribFormat (ATTR_POS, 3, GL_FLOAT, GL_FALSE, offsetof(Vertex,pos));
          // bind to stream
          glVertexAttribBinding (ATTR_NORMAL, 0);
          glVertexAttribBinding (ATTR_POS, 0);
          // switch single stream buffer
          glBindVertexBuffer (0, bufID, 0, sizeof(Vertex));
          // NV_vertex_buffer_unified_memory
          // enable once and set stride
          glEnableClientState (GL_VERTEX...NV);...
          glBindVertexBuffer (0, 0, 0, sizeof(Vertex));
          // switch single buffer via pointer
          glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR, bufSize);
  • 30. Effect on CPU time
      – Chart: CPU time in microseconds [us], lower is better, for VBO, VAB, VAB+VBUM, BINDLESS INDIRECT HOST (VAB+VBUM) and BINDLESS INDIRECT GPU (VAB+VBUM); ~2400 drawcalls, GL4 BATCHED style
      – NV_bindless_multidraw_indirect
          – Vertex/Index setup inside the MultiDrawIndirect command
          – One GL call to draw the entire scene
          – GPU benefit depends on triangles per drawcall (> ~500)
  • 31. Chart: GPU and CPU time in microseconds [us], lower is better
      – Techniques: GL4 INDIRECT HOST INDIVIDUAL, GL4 INDIRECT GPU INDIVIDUAL, GL4 BATCHED, GL2 BATCHED
      – K = Kepler 5000, regular VBO; KB = Kepler 5000, VBUM + VAB
      – hw drawcalls: 99,000 (INDIVIDUAL) and 2,400 (BATCHED); sw drawcalls: 2,000 (INDIVIDUAL) and 2,400 (BATCHED)
      – Bindless (green) always reduces CPU time, and may help framerate/GPU a bit
  • 32. Chart: same measurements as on slide 31
      – MultiDrawIndirect achieves almost 20 million drawcalls per second (2000 VBO changes, "only" 1/3 of the performance lost)
      – GPU-buffered commands save a lot of CPU time
      – Scene-dependent! INDIVIDUAL could be as fast if there is enough work per drawcall
  • 33. Chart: same measurements as on slide 31
      – GL2 uniforms beat the paletted UBO a bit on the GPU, but are slower on the CPU side (1 glUniform call with 8x vec4, vs indexed UBO)
      – Scene-dependent! GL4 is better when more materials change per object
  • 34. Recap
      – Share geometry buffers for batching
      – Group parameters for fast updating
      – MultiDraw/Indirect to keep objects independent or to remove additional loops
          – baseInstance provides a unique index/assignment per drawcall
      – Bindless to reduce validation overhead and add flexibility
  • 35. GPU Culling Basics
      – GPU-friendly processing
          – Matrix and bbox buffer, object buffer
          – XFB/Compute or "invisible" rendering
          – Vs. old techniques: single GPU job for ALL objects!
      – Results
          – "Readback" GPU to Host: can use the GPU to pack the results into a bit stream
          – "Indirect" GPU to GPU: set DrawIndirect's instanceCount to 0 or 1
      – Example visibility results: 0,1,0,1,1,1,0,0,0
      – Shader code:
          buffer cmdBuffer{
            Command cmds[];
          };
          ...
          cmds[obj].instanceCount = visible;
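
Both result paths can be emulated on the CPU to see what the GPU job produces. The sketch below is an illustration only: `DrawCommand`, `applyVisibility` and `packBits` are hypothetical names, and the bit order (object i in bit i % 32 of word i / 32) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// "Indirect" path: culling writes 0 or 1 into instanceCount, so culled
// objects stay in the command buffer but draw nothing.
void applyVisibility(std::vector<DrawCommand>& cmds, const std::vector<int>& visible) {
    assert(cmds.size() == visible.size());
    for (std::size_t i = 0; i < cmds.size(); ++i)
        cmds[i].instanceCount = visible[i] ? 1u : 0u;
}

// "Readback" path: pack one bit per object to keep the transfer small.
std::vector<std::uint32_t> packBits(const std::vector<int>& visible) {
    std::vector<std::uint32_t> words((visible.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < visible.size(); ++i)
        if (visible[i]) words[i / 32] |= 1u << (i % 32);
    return words;
}
```

For the nine example results above, the packed stream is a single 32-bit word with bits 1, 3, 4 and 5 set.
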
  • 36. Occlusion Culling
      – OpenGL 4.2+
          – Depth pass
          – Rasterize "invisible" bounding boxes
              – Disable color/depth writes
              – Geometry shader creates the three visible box sides
              – Depth buffer discards occluded fragments (earlyZ...)
              – Fragment shader writes output: visible[objindex] = 1
      – Passing bbox fragments enable the object
      – Algorithm by Evgeny Makarov, NVIDIA
      – Shader code:
          // GLSL fragment shader
          // from ARB_shader_image_load_store
          layout(early_fragment_tests) in;
          buffer visibilityBuffer{
            int visibility[];
          };
          flat in int objID;
          void main(){
            visibility[objID] = 1;
          }
          // buffer would have been cleared
          // to 0 before
  • 37. Temporal Coherence
      – Exploit that the majority of objects don't change much relative to the camera
      – Draw each object only once (vertex/drawcall-bound)
          – Render last visible, fully shaded: (last)
          – Test all against current depth: (visible)
          – Render newly added visible, none if no spatial changes were made: (~last) & (visible)
          – (last) = (visible)
      – Diagram: frame f - 1 shows the last visible objects; in frame f the camera has moved, bboxes that pass the depth test become (visible), occluded bboxes stay invisible, and the new visible objects are drawn
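
The mask logic of this scheme is small enough to write out. The following C++ sketch assumes visibility is kept as packed 32-bit words; `BitMask`, `FrameLists`, `newlyVisible` and `cullFrame` are hypothetical names, and the occlusion test itself is replaced by a precomputed `visibleNow` mask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitMask = std::vector<std::uint32_t>;  // one bit per object

// (~last) & (visible): objects that pass the test now but were not drawn yet.
BitMask newlyVisible(const BitMask& last, const BitMask& visible) {
    assert(last.size() == visible.size());
    BitMask result(last.size());
    for (std::size_t i = 0; i < last.size(); ++i)
        result[i] = ~last[i] & visible[i];
    return result;
}

struct FrameLists {
    BitMask drawFirst;   // drawn fully shaded before the occlusion test
    BitMask drawSecond;  // drawn after the test, usually (almost) empty
};

// One frame: draw (last), test everything against the resulting depth,
// draw the newly visible objects, then carry (visible) over as (last).
FrameLists cullFrame(BitMask& last, const BitMask& visibleNow) {
    FrameLists lists;
    lists.drawFirst = last;
    lists.drawSecond = newlyVisible(last, visibleNow);
    last = visibleNow;
    return lists;
}
```

If the camera and the scene do not change between two frames, the second list is empty and every object is drawn exactly once.
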
  • 38. Culling Readback vs Indirect
      – Chart: GPU and CPU time in microseconds [us], lower is better, GL4 BATCHED style, for readback, indirect and NVindirect
      – Speedup with culling: readback 37% faster, indirect 33% faster, NVindirect 37% faster
      – In the "draw new visible" phase, indirect cannot benefit from knowing in advance that there is nothing to set up or draw; it still processes "empty" lists
      – For readback results, the CPU has to wait for GPU idle
      – NV_bindless_multidraw_indirect saves CPU time and a bit of GPU time
      – Scene-dependent, i.e. triangles per drawcall and number of "invisible" objects
  • 39. Culling Results
      – Temporal culling is very useful for object/vertex-bound scenes
          – Can also be applied to the Z-pass...
      – Readback vs Indirect
          – The readback variant is "easier" to make faster (no setups...), but it syncs!
          – The NV_bindless_multidraw benefit depends on the scene (VBO changes and primitives per drawcall)
      – Working towards a GPU-autonomous system
          – (NV_bindless)/ARB_multidraw_indirect as the mechanism for the GPU to create its own work; research and feature work in progress
  • 40. Thank you!
      – Contact
          – ckubisch@nvidia.com
          – matavenrath@nvidia.com
      – glFinish();
  • 41. NVIDIA Bindless Technology
      – Family of extensions to use native handles/addresses
      – NV_vertex_buffer_unified_memory
      – NV_bindless_multidraw_indirect
      – NV_shader_buffer_load/store
          – Pointers in GLSL
      – NV_bindless_texture
          – No more unit restrictions
          – References inside buffers
      – Code:
          // GLSL with true pointers
          uniform MyStruct* mystructs;
          // API
          glUniformui64NV (bufferLocation, bufferADDR);
          texHDL = glGetTextureHandleNV (tex);
          // later instead of glBindTexture
          glUniformHandleui64NV (texLocation, texHDL);
          // GLSL
          // can also store textures in resources
          uniform materialBuffer {
            sampler2D manyTextures [LARGE];
          }
  • 42. Culling Readback vs Indirect
      – Chart: time in microseconds, GPU and CPU bars for Readback, Indirect and NVIndirect, broken down by phase: 1. Frustum Cull, 2. Update Internals, 3. Draw Last Visible, 4. Occlusion Cull, 5. Update Internals, 6. Draw New Visible
      – Frame rates: Readback 432 fps with culling, 315 without; Indirect 387 fps with culling, 289 without; NVIndirect 429 fps with culling, 313 without
      – Indirect: nothing "new" to draw, but the CPU doesn't know and still sets things up; the GPU runs through an "empty" cmd buffer
      – Readback: the CPU has to wait for GPU idle
      – The special bindless indirect version can save a lot of CPU cost and a bit of GPU cost by drawing the scene with a single big cmd buffer