Efficient occlusion culling in dynamic scenes is important to the game and real-time graphics community because it accelerates rendering. We present a novel algorithm inspired by recent advances in depth culling for graphics hardware, but adapted and optimized for SIMD-capable CPUs. Our algorithm has very low memory overhead and is three times faster than previous work, while culling 98% of all triangles culled by a full resolution depth buffer approach. It supports interleaving occluder rasterization and occlusion queries without penalty, making it easy
5. 5
Hardware Fixed-function Occlusion Culling
Handled automatically under the hood
Per-tile culling granularity
– Semi-occluded triangles can be partially culled
Very late in the pipeline
Figure: pipeline from CPU side to GPU side: Game logic, Upload frame data, Draw call, Geometry processing, Z Tile Culling, Pixel processing
6. 6
Software Occlusion Culling
Cull very early in the pipeline
– Cull both CPU and GPU work
Short delay
– Can be integrated with scene traversal
Figure: pipeline from CPU side to GPU side: Game logic + SW culling, Upload frame data, Draw call, Geometry processing, Z Tile Culling, Pixel processing
7. 7
Potentially Visible Sets (PVS)
Binary Space Partitioning (BSP) trees & portals
Precomputed – very efficient
Scene (occluders) must be static
Difficult to handle general scenes
Images: Quake II, id Software, 1997; Half-Life 2, Valve Corporation, 2004
8. 8
Potentially Visible Sets (PVS)
Figure: PVS example with labels: Player, Not part of PVS, Leaf boundaries
9. 9
Dynamic Occlusion Culling
Increasingly popular
Modern games have more complex and dynamic worlds
No complex pre-computation
– Simpler content pipeline
Images: Assassin’s Creed Unity, Ubisoft, 2014 [HA15]; Battlefield 4, EA DICE, 2013 [Col11]
10. 10
Z-buffer Based Culling
Hierarchical Z Buffer (HiZ) [Greene93]
Rasterize to full resolution z buffer
Create HiZ buffer
– Find the maximum depth in each NxN tile
Perform occlusion query with HiZ buffer
General algorithm works for both SW and HW occlusion culling
Figure: a complex object and its bounding shape, shown with the full resolution depth buffer and the HiZ buffer
Dragon model courtesy of Stanford University Computer Graphics Laboratory
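The HiZ steps on this slide can be sketched in scalar C++ as follows. This is a minimal illustration, not code from any of the cited frameworks: the names HiZBuffer, BuildHiZ and IsOccluded are hypothetical, and the sketch assumes that larger depth means farther away, that the buffer dimensions are multiples of the tile size, and that the query rectangle lies inside the screen.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical one-level HiZ buffer: one farthest depth per NxN tile.
struct HiZBuffer {
    int tilesX = 0, tilesY = 0, tileSize = 0;
    std::vector<float> zMax;
};

// Reduce a full resolution depth buffer to the maximum depth of each NxN tile.
// Assumes width and height are multiples of n.
HiZBuffer BuildHiZ(const std::vector<float>& depth, int width, int height, int n) {
    assert(n > 0 && width % n == 0 && height % n == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    HiZBuffer hiz;
    hiz.tilesX = width / n;
    hiz.tilesY = height / n;
    hiz.tileSize = n;
    hiz.zMax.assign(static_cast<std::size_t>(hiz.tilesX) * hiz.tilesY,
                    std::numeric_limits<float>::lowest());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& tile = hiz.zMax[static_cast<std::size_t>(y / n) * hiz.tilesX + x / n];
            tile = std::max(tile, depth[static_cast<std::size_t>(y) * width + x]);
        }
    }
    return hiz;
}

// Conservative query for the pixel rectangle [x0, x1] x [y0, y1] (inclusive)
// whose nearest depth is zMin: it is reported occluded only if it lies behind
// the farthest depth of every tile it touches.
bool IsOccluded(const HiZBuffer& hiz, int x0, int y0, int x1, int y1, float zMin) {
    assert(x0 >= 0 && y0 >= 0 && x0 <= x1 && y0 <= y1);
    assert(x1 / hiz.tileSize < hiz.tilesX && y1 / hiz.tileSize < hiz.tilesY);
    for (int ty = y0 / hiz.tileSize; ty <= y1 / hiz.tileSize; ++ty)
        for (int tx = x0 / hiz.tileSize; tx <= x1 / hiz.tileSize; ++tx)
            if (zMin <= hiz.zMax[static_cast<std::size_t>(ty) * hiz.tilesX + tx])
                return false;  // may be visible in this tile
    return true;
}
```

Testing a bounding shape instead of the complex object keeps the query cheap; the answer stays conservative because the bounding shape's nearest depth is never behind the object's.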
11. 11
Intel Software Occlusion Culling Framework [CMK16]
Algorithm phases:
1. Rasterize a few designated occluder objects to z buffer
– Heavily SSE/AVX optimized
– Parallel triangle setup
– Parallel pixel depth computation
2. Compute 1-level HiZ buffer (and throw away z buffer)
3. Perform queries and render surviving objects
12. 12
Z-buffer Based Culling
Rendering to z-buffer per pixel
Updating HiZ tile needs all pixels within the tile
Occlusion Query per tile
Wouldn’t it be nice to compute HiZ directly?
– Being conservative is the only requirement
Idea: use alternative HiZ representation
Figure: full resolution depth buffer and HiZ buffer
24. 24
Masked Occlusion Culling [AHAM15]
Originally designed for graphics hardware
Directly update HiZ buffer without computing a full res z buffer
Decouples coverage sampling (rasterization) and depth computation
Figure: depth buffer and the approximate, conservative HiZ buffer
25. 25
Masked Software Occlusion Culling
Could Masked Occlusion Culling [AHAM15] be really fast for software occlusion culling?
Much less memory to read/write than full res z-buffer
Updates use bitmasks – can process many pixels in parallel (e.g. with SSE/AVX)
No need to compute per-pixel depths
– Would need a fast SW rasterizer to compute coverage
Turns out it can
Paper presented at High Performance Graphics this year [HAAM16]
Source code available!
33. 33
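Compute Depth Plane
Depth = ax + by + c
– Conservative tile depth: check the signs of a and b
– Can be incrementally updated (+a per step in x, +b per step in y)
– Clamp to vertex depths
Figure: tile corners labeled with the signs of (a, b): (-, -), (+, -), (-, +), (+, +)
Figure: pipeline stages shown on each algorithm slide: Transform, Clip, Compute bounds, Depth plane, Traversal setup, Compute coverage, Depth test, Update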
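The depth plane idea can be illustrated with a small scalar sketch. The names DepthPlane, MakeDepthPlane and TileZMax are hypothetical and not taken from the library; the sketch assumes screen-space vertices given as (x, y, z), a non-degenerate triangle, and larger z meaning farther away.

```cpp
#include <algorithm>
#include <cassert>

// depth(x, y) = a*x + b*y + c
struct DepthPlane { float a, b, c; };

// Plane through three screen-space vertices (x, y, z), solved with Cramer's rule.
DepthPlane MakeDepthPlane(const float v0[3], const float v1[3], const float v2[3]) {
    const float x1 = v1[0] - v0[0], y1 = v1[1] - v0[1], z1 = v1[2] - v0[2];
    const float x2 = v2[0] - v0[0], y2 = v2[1] - v0[1], z2 = v2[2] - v0[2];
    const float det = x1 * y2 - x2 * y1;
    assert(det != 0.0f);  // zero-area triangles have no unique plane
    const float a = (z1 * y2 - z2 * y1) / det;
    const float b = (x1 * z2 - x2 * z1) / det;
    return {a, b, v0[2] - a * v0[0] - b * v0[1]};
}

// Farthest plane depth over the tile [x0, x0+w] x [y0, y0+h]. The signs of a
// and b select the corner where the plane is farthest; the result is clamped
// to the vertex depth range because the plane extends beyond the triangle.
float TileZMax(const DepthPlane& p, float x0, float y0, float w, float h,
               float triZMin, float triZMax) {
    const float x = (p.a >= 0.0f) ? x0 + w : x0;
    const float y = (p.b >= 0.0f) ? y0 + h : y0;
    const float z = p.a * x + p.b * y + p.c;
    return std::min(std::max(z, triZMin), triZMax);
}
```

Before clamping, moving one tile to the right changes the value by exactly a*w and moving one tile down changes it by b*h, which is what makes the incremental update possible.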
36. 36
AVX Register Layout
One scanline per SIMD lane
37. 37
Edge Slopes
Compute slopes (∆y/∆x) once
– Similar to regular scanline rasterizers
38. 38
Compute Intersections
Compute intersections for each scanline
– Eight scanlines in parallel using AVX
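A scalar sketch of the slope and intersection steps, under stated assumptions: EdgeIntersections is a hypothetical name, the loop over eight lanes stands in for what the AVX version does in one register, and the per-scanline x increment used here is the reciprocal of the ∆y/∆x slope named on the slide.

```cpp
#include <array>
#include <cassert>

// x coordinate where the edge (x0, y0)-(x1, y1) crosses each of eight
// consecutive scanlines, starting at scanline yStart.
std::array<float, 8> EdgeIntersections(float x0, float y0, float x1, float y1,
                                       float yStart) {
    assert(y0 != y1);  // a horizontal edge never crosses a scanline
    const float dxPerScanline = (x1 - x0) / (y1 - y0);  // computed once per edge
    std::array<float, 8> xs{};
    float x = x0 + (yStart - y0) * dxPerScanline;  // intersection on the first scanline
    for (int lane = 0; lane < 8; ++lane) {         // one scanline per SIMD lane
        xs[lane] = x;
        x += dxPerScanline;                        // step to the next scanline
    }
    return xs;
}
```

Only one division is needed per edge; every further scanline costs a single addition.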
39. 39
Compute Coverage Mask
Start with full coverage mask
40. 40
Compute Coverage Mask
Start with full coverage mask
– Shift each lane (scanline) to intersection
– AVX2 and later have per-lane shift instruction
Figure: each of the eight lanes is shifted right (>>) to its intersection
41. 41
Compute Coverage Mask
Repeat the same process for the next edge
Figure labels: Left edge, Right edge, Right edge
42. 42
Compute Coverage Mask
Repeat the same process for the next edge
– If the edge is facing right, invert the mask
43. 43
Compute Coverage Mask
Combine masks of all overlapping edges
44. 44
Compute Coverage Mask
Combine masks of all overlapping edges
– Using bitwise AND
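The coverage mask steps (shift a full mask to the intersection, invert for right-facing edges, AND the per-edge masks) can be sketched for a single 32-pixel scanline as below. EdgeMask, EdgeHit and ScanlineCoverage are hypothetical names, the bit order (pixel x stored in bit 31 - x) is an assumption chosen so that a right shift clears the leftmost pixels, and the intersection is assumed to be already rounded to a pixel column.

```cpp
#include <cassert>
#include <cstdint>

// Mask for one edge on one scanline. Starting from a full mask, shifting right
// by x keeps the pixels at or to the right of column x. For a right-facing
// edge the triangle lies on the other side, so the mask is inverted.
std::uint32_t EdgeMask(int x, bool facesRight) {
    if (x < 0) x = 0;
    if (x > 32) x = 32;
    // Shifting a 32-bit value by 32 is undefined in C++, so handle it explicitly.
    const std::uint32_t rightOfEdge = (x == 32) ? 0u : (0xFFFFFFFFu >> x);
    return facesRight ? ~rightOfEdge : rightOfEdge;
}

// Intersection column and orientation of one edge on the current scanline.
struct EdgeHit {
    int x;
    bool facesRight;
};

// Combine the masks of all edges overlapping the scanline using bitwise AND.
std::uint32_t ScanlineCoverage(const EdgeHit* edges, int count) {
    assert(edges != nullptr || count == 0);
    std::uint32_t mask = 0xFFFFFFFFu;
    for (int i = 0; i < count; ++i)
        mask &= EdgeMask(edges[i].x, edges[i].facesRight);
    return mask;
}
```

In the AVX version described on the slides the same operations run on eight scanlines at once, with the per-lane shift supplying a different shift amount to each lane.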
46. 46
Shuffle Mask
Shuffle mask to form better shaped tiles
– Before: each SIMD lane is a scanline
47. 47
Shuffle Mask
Shuffle mask to form better shaped tiles
– Before: each SIMD lane is a scanline
– After: each SIMD lane is an 8x4 tile
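One way to picture the shuffle is as a byte transpose: eight lanes that each hold a 32x1 scanline become eight lanes that each hold an 8x4 tile. The sketch below is a scalar illustration under assumed conventions (byte j of a scanline lane holds the j-th group of eight pixels, and tiles are numbered row-major in a 4x2 grid); ShuffleToTiles is a hypothetical name and the bit layout used by the actual library may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Input:  8 lanes, lane r = scanline r of a 32x8 pixel block (one bit per pixel).
// Output: 8 lanes, lane t = one 8x4 tile; byte k of a tile is its k-th row.
std::array<std::uint32_t, 8> ShuffleToTiles(const std::array<std::uint32_t, 8>& scanlines) {
    std::array<std::uint32_t, 8> tiles{};
    for (int row = 0; row < 8; ++row) {
        for (int col = 0; col < 4; ++col) {
            // Eight horizontally adjacent pixels of this scanline.
            const std::uint32_t rowBits = (scanlines[row] >> (8 * col)) & 0xFFu;
            const int tile = (row / 4) * 4 + col;     // 4 tiles across, 2 down
            tiles[tile] |= rowBits << (8 * (row % 4));  // row within the tile
        }
    }
    return tiles;
}
```

After the shuffle, a single depth value per lane describes a compact 8x4 pixel region rather than a 32-pixel-wide strip, which suits a per-tile depth test.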
48. 48
Depth Test
Interpolate conservative depth (per 8x4 tile)
Test against buffer
49. 49
Update Tile
Two code paths (can be switched at compile time)
– Original update method [AHAM15]
– New update method tailored for SW [HAAM16]
Why use a new update method?
– Faster – same culling power
– Less accurate than original, more dependent on render order
– Works best if you render front-to-back
50. 50
Update Tile, New Method [HAAM16]
zmax_0 is the reference layer
– Maximum value for the entire tile
zmax_1 is the working layer
– Maximum value for a subset of the tile
– Updated as
– New depth = max(zmax_1, zmax_tri)
– New mask = TriangleMask OR LayerMask
Whenever working layer mask is full, overwrite reference layer
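The update rule on this slide can be written out as a short sketch. It covers only the steps listed above and leaves out any further heuristics the published method may use. Tile, UpdateTile and IsOccluded are hypothetical names, depths are assumed to lie in [0, 1] with larger values farther away, and one bit of the mask corresponds to one pixel of an 8x4 tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Tile {
    float zMax0 = 1.0f;      // reference layer: farthest depth for the entire tile
    float zMax1 = 0.0f;      // working layer: farthest depth for the pixels in mask
    std::uint32_t mask = 0;  // pixels currently covered by the working layer
};

// Merge a triangle (coverage mask and conservative farthest depth) into the tile.
void UpdateTile(Tile& tile, std::uint32_t triMask, float triZMax) {
    if (triMask == 0) return;
    tile.zMax1 = std::max(tile.zMax1, triZMax);  // new depth = max(zmax_1, zmax_tri)
    tile.mask |= triMask;                        // new mask = TriangleMask OR LayerMask
    if (tile.mask == 0xFFFFFFFFu) {              // working layer covers the whole tile
        tile.zMax0 = tile.zMax1;                 // overwrite the reference layer
        tile.zMax1 = 0.0f;
        tile.mask = 0;
    }
}

// A query only needs the reference layer: something whose nearest depth is
// behind zMax0 is hidden everywhere in the tile.
bool IsOccluded(const Tile& tile, float zMin) {
    return zMin > tile.zMax0;
}
```

Overwriting the reference layer stays conservative because, once the working mask is full, every pixel of the tile is covered by geometry no farther away than zMax1.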
60. 60
Update Tile
Update is quicker than original [AHAM15]
Test is also quicker
– Need only to test against reference layer (zmax_0)
62. 62
Results
Intel Occlusion Culling Sample
Clear: Clearing the depth buffer
Geom: Transform & project geometry
Rast: Triangle setup & occluder rasterization
Gen: Compute HiZ buffer from full resolution z buffer
Test: Perform occlusion queries
Figure: timing chart in μs comparing Old [CMK16] and New [HAAM16], annotated 3.7x and 16x
65. 65
Masked Occlusion Culling API
Setup:
void SetResolution();
void SetNearClipPlane();
void ClearBuffer();
Render & query:
static void TransformVertices();
Result RenderTriangles();
Result TestTriangles();
Result TestRect();
Debug:
void ComputePixelDepthBuffer();
OcclusionCullingStatistics GetStatistics();
66. 66
Masked Occlusion Culling API
Render to the software HiZ buffer
Result RenderTriangles(
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices to inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);
67. 67
Masked Occlusion Culling API
RenderTriangles: the ClipPlanes mask parameter
Clipping is not free...
– If you’re already doing frustum culling, let the API know the outcome
Figure: eye, view frustum and near plane, with two example objects labeled mask = 0 and mask = leftPlane | nearPlane
68. 68
Masked Occlusion Culling API
RenderTriangles: the ScissorRect *scissor parameter
Scissor region (screen space AABB)
Can be used for threading
– One scissor region per thread
Figure: eye and view frustum with the scissor region marked
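As an illustration of one scissor region per thread, the screen could be cut into vertical strips. This is only a sketch: Rect and SplitScreen are hypothetical and are not the library's ScissorRect type, the exclusive maximum and the alignment parameter are assumptions, and the slides do not say which alignment, if any, the API requires.

```cpp
#include <cassert>
#include <vector>

// Hypothetical screen-space rectangle; max coordinates are exclusive.
struct Rect {
    int minX, minY, maxX, maxY;
};

// Split the screen into at most `threads` vertical strips whose inner
// boundaries fall on multiples of `align` pixels.
std::vector<Rect> SplitScreen(int width, int height, int threads, int align) {
    assert(width > 0 && height > 0 && threads > 0 && align > 0);
    const int columns = (width + align - 1) / align;  // last column may be partial
    std::vector<Rect> strips;
    for (int i = 0; i < threads; ++i) {
        const int begin = (columns * i / threads) * align;
        int end = (columns * (i + 1) / threads) * align;
        if (end > width) end = width;
        if (begin < end) strips.push_back({begin, 0, end, height});  // skip empty strips
    }
    return strips;
}
```

Each thread would then render every occluder with its own strip as the scissor region, so no two threads write to the same part of the buffer.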
69. 69
Masked Occlusion Culling API
Test triangles against the software HiZ buffer
– Does not update the buffer
Result TestTriangles(      // Returns the collective culling outcome of the triangles
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices to inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);
70. 70
Masked Occlusion Culling API
Test rectangle against the software HiZ buffer
– Does not update the buffer
Result TestRect(   // Returns the culling outcome of the screen space rectangle
    float xmin,    // Screen space bounds:
    float ymin,    //   [xmin, ymin] – [xmax, ymax]
    float xmax,
    float ymax,
    float wmin     // Conservative clip space w (typically the w-component of the nearest bbox vertex in clip space)
);
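The five TestRect arguments could be derived from the clip-space corners of a bounding box roughly as follows. This is a sketch under assumptions: Vec4, RectQuery and ComputeRectQuery are hypothetical helpers, the bounds are produced in normalized coordinates after the perspective divide, and the slides do not state which screen-space units TestRect expects.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <limits>

struct Vec4 {
    float x, y, z, w;  // clip-space position
};

struct RectQuery {
    float xmin, ymin, xmax, ymax;  // bounds after the perspective divide
    float wmin;                    // smallest (nearest) clip-space w
    bool valid;                    // false if the box reaches the eye plane
};

// Conservative rectangle and nearest w for the eight corners of a bounding box.
RectQuery ComputeRectQuery(const std::array<Vec4, 8>& corners) {
    const float big = std::numeric_limits<float>::max();
    RectQuery q{big, big, -big, -big, big, true};
    for (const Vec4& c : corners) {
        if (c.w <= 0.0f) {
            // The projection is not meaningful for this corner; the caller
            // should treat the box as visible instead of querying.
            q.valid = false;
            return q;
        }
        q.xmin = std::min(q.xmin, c.x / c.w);
        q.xmax = std::max(q.xmax, c.x / c.w);
        q.ymin = std::min(q.ymin, c.y / c.w);
        q.ymax = std::max(q.ymax, c.y / c.w);
        q.wmin = std::min(q.wmin, c.w);
    }
    return q;
}
```

Using the smallest w for the whole rectangle makes the query conservative: the rectangle is tested as if it sat entirely at the depth of the nearest corner.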
71. 71
Example use case: Scene Bounding Volume Hierarchy (BVH) traversal and culling
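// Pseudocode: traverse the BVH front to back and cull against the HiZ buffer.
// distance() is an illustrative helper giving a node's distance to the camera.
ClearBuffer();
prioQueue.push(root, 0);
while (!prioQueue.empty()) {
    Node node = prioQueue.pop();                 // nearest node first
    if (FrustumTest(node) == Culled)
        continue;
    bounds = compute_screen_space_bounds(node);
    if (TestRect(bounds) == Culled)
        continue;                                // hidden by occluders rendered so far
    if (node is InnerNode) {
        prioQueue.push(node.left, distance(node.left));
        prioQueue.push(node.right, distance(node.right));
    } else {                                     // node is Leaf
        xfVertices = TransformVertices(node.vertices);
        RenderTriangles(xfVertices);             // the leaf now acts as an occluder
        send_leaf_to_GPU(node);
    }
}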
72. 72
Essential Tools We Have Relied On
Intel® VTune™
– https://software.intel.com/en-us/intel-vtune-amplifier-xe
SSE/AVX intrinsics guide
– https://software.intel.com/sites/landingpage/IntrinsicsGuide/
73. 73
References
[AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9
[CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016)
[Col11] COLLIN D.: Culling the Battlefield. Game Developer’s Conference (presentation), (2011)
[Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238
[HA15] HAAR U., AALTONEN S.: GPU-Driven Rendering Pipelines. SIGGRAPH Advances in Real-Time Rendering in Games course, (2015)
[HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016)
74. 74
Check it out!
GitHub: Lightweight library
– https://github.com/GameTechDev/MaskedOcclusionCulling
GitHub: Example integrated in Intel’s Software Occlusion Culling demo
– https://github.com/GameTechDev/OcclusionCulling
Project page: Masked Software Occlusion Culling
– https://software.intel.com/en-us/articles/masked-software-occlusion-culling
Questions and feedback welcome
– magnus.andersson@intel.com