This session provides a detailed analysis of next-generation mobile graphics hardware and points out special caveats and opportunities that are specific to mobile. This is followed by an in-depth explanation of advanced rendering techniques that were previously only considered for high-end PCs or dedicated gaming consoles, but which have now been brought over to mobile devices in Unreal Engine 4 - including features such as physically-based rendering, HDR and image-based lighting.
Next generation mobile gp us and rendering techniques - niklas smedberg
1. Next-gen Mobile GPUs and Rendering Techniques
Game Connection, October 29-31, 2014
Niklas Smedberg
Senior Engine Programmer, Epic Games
2. Game Connection, October 29-31, 2014
Introduction
• Niklas Smedberg, a.k.a. “Smedis”
– 17 years in the game industry
– Graphics programmer since the C64 demo scene
3. Game Connection, October 29-31, 2014
Content
• Next-gen Hardware
– Tile-based GPU vs direct-rendering GPU
• Next-gen Rendering Techniques
– Unreal Engine 4
– Previous goal: Bringing AAA Console graphics to mobile (DONE!)
– New goal: Bringing AAA PC graphics to mobile (DONE!??)
4. Next Generation Mobile Hardware
Game Connection, October 29-31, 2014
• Big leap in features and performance
• Full-featured (e.g. OpenGL ES 3.1+AEP, DirectX 10/11)
• Peak performance now comparable to consoles (Xbox 360/PS3)
– About 300+ GFLOPS and 26 GB/s
• New goal: Bring AAA PC graphics to mobile
6. Game Connection, October 29-31, 2014
iOS 8 Metal
• New rendering API for iOS
• Better match for the hardware
– Like a “game console API”
– Much more efficient on the CPU
– Exposes hardware features
• 20x faster on our rendering thread
– OpenGL ES API: 30% of renderthread
– Metal API: 1.6% of renderthread
• All power and perf responsibility is now in the hands of the developer!
7. Tech Demo: Zen Garden
Game Connection, October 29-31, 2014
8. Game Connection, October 29-31, 2014
Tile-based Mobile GPU
• Mobile GPUs are usually tile-based (next-gen too)
– Tile-based: ImgTec, Qualcomm*, ARM
– Direct: NVIDIA, Intel, Vivante
* Qualcomm Adreno can render either tile-based or direct to frame buffer
– Extension: GL_QCOM_binning_control
9. Game Connection, October 29-31, 2014
Tile-Based Mobile GPU
Summary:
• Split the screen into tiles
– E.g. 32x32 pixels (ImgTec) or 300x300 (Qualcomm)
• The whole tile fits within GPU, on chip
• Process all drawcalls for one tile
– Write out final tile results to RAM
• Repeat for each tile to fill the image in RAM
10. ImgTec Tile-based Rendering Process
Game Vertex Processing Tile Data (RAM)
Game Connection, October 29-31, 2014
Pixel Processing (Top-most
only)
Frame Buffer (RAM)
Cmd Buffer (RAM)
Hidden Surface
Removal
Tile
Memory
Per Tile:
11. Game Connection, October 29-31, 2014
ImgTec Series 6 G6430
• One GPU core
• Four shader units (USC)
– 16-way scalar
• FP16: Two SOP per clock
– (a*b + c*d)
• FP32: Two MADD per clock
– (a*b + c)
• 154 GFLOPS @ 400 MHz
– 16-bit floating point
12. Game Connection, October 29-31, 2014
FP16 Is Faster Than FP32
• ImgTec Series 6: 50% faster
– FP16 pipeline: Two SOP per clock
– FP32 pipeline: Two MADD per clock
• ImgTec Series 6XT: 100% faster
– FP16 pipeline: Four MADD per clock
– FP32 pipeline: Two MADD per clock
• Qualcomm Snapdragon: 100% faster
13. Game Connection, October 29-31, 2014
ImgTec Rendering Tips
• Hidden Surface Removal
– For opaque only
– Don’t keep alpha-test enabled all the time
– Don’t keep “discard” keyword in shader source, even if it’s not used
• Group opaque drawcalls together
• Sort on state, not distance
14. Game Connection, October 29-31, 2014
Framebuffer Resolve/Restore
• Expensive to switch Frame Buffer Object on Tile-based GPUs
– Saves the current FBO to RAM
– Reloads the new FBO from RAM
• Best performance:
– A single rendertarget for the entire frame
– No post-processing passes
• Does not apply to NVIDIA Tegra GPUs!
– This made it simpler for us to make our “Rivalry” tech demo for K1
15. Game Connection, October 29-31, 2014
Framebuffer Resolve/Restore
• Clear ALL FBO attachments after new frame/rendertarget
– Clear after eglSwapBuffers / glBindFramebuffer
– Avoids reloading FBO from RAM
– NOTE: Do NOT clear unnecessary on non-tile-based GPUs (e.g. NVIDIA)
• Discard unused attachments before new frame/rendertarget
– Discard before eglSwapBuffers / glBindFramebuffer
– Avoids saving unused FBO attachments to RAM
– glDiscardFramebufferEXT / glInvalidateFramebuffer
16. Game Connection, October 29-31, 2014
iOS Performance Profiling
• Screenshot from Xcode, which shows:
– How we clear FBO at the beginning of every render pass
– Other important performance info
17. Game Connection, October 29-31, 2014
Programmable Blending
• GL_EXT_shader_framebuffer_fetch (gl_LastFragData)
• Reads current pixel background “for free”
• Potential uses:
– Custom color blending
– Blend by background depth value (depth in alpha)
• E.g. Soft intersection against world geometry for particles
– Tone-mapping within tile-memory
– Deferred shading without resolving G-buffer
• Stay on GPU and avoid expensive round-trip to RAM
• See also: GL_EXT_shader_pixel_local_storage
19. Core Optimization: Opaque Draw Ordering
Game Connection, October 29-31, 2014
• All platforms,
1. Group draws by material (shader) to reduce state changes
• Then for all platforms except ImgTec,
2. Skybox last: 5 ms/frame savings (vs drawing skybox first)
3. Sort groups nearest first : extra 3 ms/frame savings
4. Sort inside groups nearest first : extra 7 ms/frame savings
• (Timings from a UE4 map on a Qualcomm-based device)
20. Game Connection, October 29-31, 2014
Optimization: Resolution
• Resolution does not match GPU performance!
• Use custom resolution to match perf/pixel
• Examples of same GPU:
– iPhone 5S = 0.7 Mpix (1136x640)
– iPad Air = 3.1 Mpix (2048x1536), more than 4 times slower!
• Other examples:
– Prev-Gen Console = 0.9 Mpix (1280x720)
– Current-Gen Console = 2.1 Mpix (1920x1080)
21. Single Content, Multiple Platforms
Game Connection, October 29-31, 2014
• Core motivating factor in designing UE4
– Authoring consistency between PC and Mobile
• Authoring environment for both platforms
– Physically-based shading model
– High dynamic range linear color space
– High quality post processing
22. Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Comparison: PC
24. Game Connection, October 29-31, 2014
Consistency: Material Editor
• One material for many platforms
– Artist authored feature levels to scale shader
perf from PC to mobile
• Using cross compiler tool to retarget
HLSL into GLSL
25. Directional Light + SDF Shadows
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
26. Consistency: HDR Directional Lightmaps
Game Connection, October 29-31, 2014
• Two compressed + One uncompressed textures:
1. HDR color with log luma encoding
2. World space 1st order spherical harmonic luma directionality
3. SDF shadow texture
• Optimized for Mobile and PVRTC compression:
– PVRTC (ImgTec) lacks separately compressed encoding for alpha
– PC color: RGB/Luma, LogLuma in Alpha
– Mobile color: RGB/Luma * LogLuma, no Alpha (LogLuma derived with ALU)
– Mobile: A single dynamic directional light as the “sun”
28. Game Connection, October 29-31, 2014
Image-Based Lighting
• Selects a mip-level in an IBL cubemap based on per-pixel roughness
– Same as PC, filtering same as PC
• PC: FP16
• Mobile: Decode(RGBM)^2
• PC: Blend multiple cubemaps per surface and parallax correction
• Mobile: One infinite-distance cubemap per object
29. 1/2 DOF
Downsample
1/2 DOF Filter
Tonemapping
Game Connection, October 29-31, 2014
Mobile Post Processing Pipeline
FP16 RGB=Color A=Depth
Anti-Aliasing
1/4 DOF Near Dilation
1/4 Lightshaft Filter, Pass 1
1/4 Final Filter Passes and Merge
{Light Shafts, Bloom, Vignette}
Bloom Filter Tree
1/4 Lightshaft Filter, Pass 2
1/4 Smart Reduction
Light Shaft (Sun) Mask
A=Depth to A=CoC+Sun
[conversion done on chip if possible]
1/8
1/16
1/32
1/8
1/16
1/32
1/64
30. Mobile Depth of Field
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
31. Packed Circle of Confusion and Sun Intensity
Game Connection, October 29-31, 2014
• Optimization done for Depth of Field and Light Shafts
– Represent sun intensity and circle of confusion (CoC) in one FP16 value
– Saves needing an extra render target
Depth (0 to 65504)
CoC (0 to 1) Sun Intensity (1 to 65504)
0=Max Near Bokeh
0.5=In Focus
1=Max Far Bokeh and No Sun
65504=Max Sun (still at Max Far Bokeh)
32. Vignette + Bloom + Light Shafts
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
33. Vignette + Bloom + Light Shafts: Optimized
Game Connection, October 29-31, 2014
• Composited at quarter resolution
– Using pre-multiplied alpha FP16
– Also performs the final filtering passes for bloom and light shafts
• Optimized to minimize rendertarget switches
• 3 effects applied to scene with one full-res TEX fetch
– Applied in the tonemap pass
34. Game Connection, October 29-31, 2014
Mobile Light Shafts
• Filtered at quarter resolution
• Monochromatic with artist controlled tint on final composite
– Bloom and light shaft down-sample pass shared
– RGB = color for bloom, A = light shaft intensity
• Light shaft filter runs in tonemapped space (8-bit/channel)
– Applied in linear (reverse tonemap before tint and composite)
• Filtering done by 3 passes of 8 taps
35. Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Mobile Bloom
36. Game Connection, October 29-31, 2014
Mobile Bloom
• Lower quality version to minimize resolves
– Limited effect radius and less passes
• Turns out to be faster and higher-quality
– Standard hierarchical algorithm with some optimizations
– Down-sample from 1:1/4 res first (shared with light shaft)
– Then down-sample in 1:1/2 resolution passes
– Single pass circle-based filter (instead of two-pass Gaussian)
• 15 taps on circle during down-sampling
• 7 taps for both circles during up-sample+merge pass
37. Film Post Tonemapping
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
38. Film Post Tonemapping: Mobile Shader
Game Connection, October 29-31, 2014
Film Post
RGBA8 LDR sRGB Output
Quantization Grain
(optional)
Shadow Tint
(optional)
Color Matrix
{Saturation, Channel Mixer}
(optional)
Tint, Tonemap Curve Linear to sRGB
RGBA16F HDR Linear Color
Blend in DOF
(optional) 1/2 DOF
Film Grain Jitter
(optional)
Blend in {Bloom, Vignette, Light Shafts}
(optional)
Film Grain
(optional)
1/4 Pre-multiplied Alpha
39. Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Anti-Aliasing
40. Game Connection, October 29-31, 2014
Anti-Aliasing on Mobile
• Using a spatial & temporal anti-aliasing filter
– 2x temporal super-sampling (higher quality surface shading)
– Blending two jittered frames after tonemapping (32 bpp,
perceptual color space)
– Some special logic to remove ghosting/judder (not using
motion re-projection)
• Not currently using MSAA
– Needed super-sampled shading for HDR physically based shading model
Pixel
41. All Techniques Combined
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
42. Game Connection, October 29-31, 2014
Android Extension Pack
• Released with Android “L”
• OpenGL ES 3.1 plus a specific set of extensions
– Compute shaders
– Tessellation
– ASTC
– Detailed spec on www.khronos.org: GL_ANDROID_extension_pack_es31a
• Full UE4 desktop rendering pipeline on Android!
– Deferred rendering with G-buffer
– Physically-based shading
– Image-based lighting
– Solid 30 FPS on NVIDIA K1 mobile GPU
The method of “processing all drawcalls for one tile” differ between GPU architectures. Details in later slides.
Hidden Surface Removal: Order-independent per pixel (opaque only)
G6430 diagram: Shows one GPU core, 4 SIMD, 16 fp16 pipelines and 16 fp32 pipelines (no dual-issue). Compare with SGX USSE.
Example of programmable blending in the “Soul” tech demo: Waterfall on the right
Before, we were rendering into gamma-space, but now because of fp16 we can run in linear color space, with lighting in HDR.
We need high-quality post-processing, because physically-based shading puts extra challenges on materials, especially when the camera is in motion.
Here is the mobile SunTemple map with some post process changes, as rendered using the PC path in UE4.
Resolution 1920x1200.
Same map rendered with the mobile renderer with DOF on.
Compared to PC: Lighting is slightly different - reflections are less accurate, anti-aliasing is not as good.
However, the same game content can run on all platforms and still look quite consistent between the two platforms.
You can create a single material that renders the same way on both platforms.
If you want to make something special on one platform, we support “Feature Level switches” which lets you can special graph paths that only execute on higher-end devices.
This makes it possible to create a single material that can scale from very low-end to very high-end devices.
The signed distance field shadows enables a clean silhouette.
This shot is rendered using the mobile renderer at 1920x1200 with AA, grain, depth of field, bloom,
and leveraging film post for the black and white conversion.
The B&W conversion has warm tint to the highlights and a cool tint to the shadows.
Resolution would be lower on a mobile device.
On mobile LogLuma is computed by taking Luminance of the RGB (extra math).
This shot is entirely in shadow. There is no direct light from the sun. All light is coming from reflection cubemaps that has been pre-captured.
Specular and reflections from the fire is generated from image based lighting.
This shot is 1920x1200 rendered in the mobile path with AA, grain, depth of field, bloom,
and leveraging film post for the black and white conversion.
The B&W conversion has warm tint to the highlights and a cool tint to the shadows.
Resolution would be lower on a mobile device.
Lots of managed complexity: many passes ended up getting merged to save on resolves.
Bloom, DOF, Light shafts, Vignette, Tonemapping, Anti-aliasing
We can’t do the large Bokeh that we do on PC, so we changed the mobile DoF to focus on the quality of the blur and the transitions, rather than just large Bokeh.
Supporting both near and far depth of field via separate focal planes.
This is not physically correct, but important for gameplay purposes.
Supporting only a small max circle of confusion size for performance reasons, equivalent to 16 taps.
Optimization done for Depth of Field and Light Shafts
Represent both sun intensity and CoC in one FP16 value
Saves needing an extra render target (saves both an FBO switch and an extra texture lookup)
Note CoC is effectively an alternative encoding for depth
Encode
Alpha = CoC size + sun intensity
Where CoC size is {0.0 = maximum near CoC size, 0.5 = in focus (CoC=0.0), 1.0 = maximum far CoC size (the infinite far plane)}
Decode
CoC size = clamp(Alpha, 0.0, 1.0)
sun intensity = max(0.0, Alpha - 1.0)
Works because sun is only at CoC = max far size, and light shafts are monochromatic
On GPUs with framebuffer fetch this transform is done inline without a resolve
All three effects are combined in a quarter-res buffer.
Warm light from outside, but cold lighting inside: Monochromatic light shafts with a blue tint + vignette with blue tint + bloom with no tint.
Filtering of the light shafts is done in tonemapped space as an optimization for bandwidth.
We can now run our high-quality bloom path on mobile, allowing a very wide bloom.
Down-sampled to 1/64, then back up to 1/4 resolution.
Ring-filter, where no sample is on the same vertical or horizontal axis.
Applied in a hierarchical fashion, which gives us very good results.
Running in FP16 linear HDR
32-bit RGBM does not filter
32-bit RGBT (RGB ratio, tonemapped luma) too expensive (ALU)
The same scene from two different camera positions and two completely different film post settings.
Left has film-grain, right does not.
Overview of features exposed by permutations of the tonemapping shader.
Uber-shader that combines many passes, for optimization.
Quantization grain can be used to break up darker shades in sRGB.
This shot uses anti-aliasing, depth of field, grain, and the film post controls.
Anti-aliasing is running in linear color space.
Bilinear fetch to remove jittered sub-pixel offset (small spatial filter)
Neighborhood clamp to remove ghosting
Thin features may only show in one of the two frame, so need to allow a fraction of prior frame outside neighborhood to remove flickering of these thin features
Reduce allowance as camera rotation increases to remove judder
Not using full motion re-projection, e.g. using a motion vector field (too expensive)