Presentation I gave at Unite Boston 2015. I'll cover a few techniques we used to optimize our Unity mobile game - Jake & Tess' Finding Monsters Adventure
8. Our budget is the limit
• Push as much content as possible with smooth gameplay and no overheating
– Can we get the same quality with a similar approach?
– Are we doing something we don’t need to?
9. What if we hit our budget?
• What happens when we fail?
– Either gameplay or visual quality will be impacted
• When it comes to removing effects, trust is important
11. Optimization Process
• Do not make any assumptions.
• A profiler will tell you where the bottleneck is.
Profile -> Optimize -> Test
12. Optimization Process
• Rewrite code to use resources more efficiently
• Often we can fake or simplify effects
• Experience comes into play here.
Profile -> Optimize -> Test
14. How to find our bottleneck?
• Unity comes with a built-in profiler that does most of the work
• We wanted to have more detailed GPU info
– Adreno Profiler – Snapdragon GPUs
– Mali Graphics Debugger (MGD) and DS-5 Streamline – Mali GPUs
19. Case Study – Royale Moon
• Triangles 106k
• Drawcalls 87
• Overdraw 2.51x
• Shader Stats:
– Up to 160 ALU/Frag
– Up to 7 texture samples
• Adreno %Time Shading Fragment - max
– Fragment bound
21. Case Study – Royale Moon
• Early Z-Test Discards occluded fragments
• Render Order Matters
• Optimized Render Order
– Opaques – Front to Back
– Skybox
– Transparent – Back to Front
– Overlay (UI / HUD)
We need to improve this
22. How to assign objects to sorting layers?
• Per Shader
– Have to duplicate shader files. Hard to maintain because we have to make changes individually to each duplicate.
• Per Mesh
– Not scalable, requires a lot of work.
– Risky! May break batches by mistake.
• Per Material
– YES!
– In that case, do not use the same material across different scenes
• Fixing the sort for one scene might break it for another.
23. Custom Material Inspector
• Created an editor script, BRSMaterialEditor, to set Material.renderQueue
• Add CustomEditor “BRSMaterialEditor” to the end of the shader file.
34. Improving Shader Instructions
– Model: ops that can be done once per drawcall
• Use scripts to compute and pass values to the shader
• Input vector normalization (e.g. rim light)
• Scroll offset
– Vertex: ops that can be done per vertex
• Uniform texture tile & offset
– Fragment: ops that need to be done per pixel
• Equation simplification
• Half & fixed precision for better thermal behavior
• saturate vs. max(0.0, dot)
[Diagram: how to optimize fragment shaders – complexity pyramid, Model -> Vertex -> Fragment, with complexity increasing toward Fragment]
35. Optimizing Shaders
• Many custom shaders done in ShaderForge
– ShaderForge does heavy work in the fragment shader
• Many variants and not exactly the same code structure
• How to optimize them all?
– 1st pass optimizing in ShaderForge
– 2nd pass optimizing in Code
36. 1st Pass: ShaderForge
• Identify core changes to lighting model
– BlinnPhongWrapped
– BlinnPhongRamp
• Created a custom code node
– Artists helped with the process of replacing the old nodes with this code
– This made shader code common and more organized
38. Custom Lightmap in ShaderForge
• One major art complaint was the lack of lightmap support in custom lighting
• Created a Lightmap node for them
• Problem 1: Need to enable lightmaps in the shader’s config header.
• Problem 2: ShaderForge does not expose interpolated data.
39. 2nd Pass: Shader Code
Created a cginc file with macros for optimized code
• ShaderForge follows a naming convention for input data
40. The results - Ground Shader
Before optimization / after optimization (comparison screenshots)
• Avg ALU/Frag – ~21% reduction
• Fragments Shaded – ~45% reduction
• Fragment Instructions – ~64% reduction
Overall improvement: ~7 ms
41. Further Improvements
• Fallback Shader
– We came across some problems with shaders not being supported on some configurations
– Vertex animation with a noise texture (tex2Dlod) is not supported on OpenGL ES 2.0 profiles
– A fallback shader makes those cases stand out
– Makes it easy to differentiate from other errors
43. ASTC
• Optimal performance with high quality
• Improves bandwidth and power consumption
• Galaxy Note 4, Galaxy S6 and above support it
• Supported with Unity’s OpenGL ES 3.0 profile
45. ASTC
Recommended settings:

                       RGB        RGBA       Normal Map
Codec                  ASTC 6x6   ASTC 4x4   ASTC 4x4
BPP                    3.56       8          8
Size vs Uncompressed   14.8%      50%        50%
Size vs ETC2           89%        100%       100%
46. Review
• Do not make assumptions; use a profiler.
• GPU profilers will give you in-depth data per drawcall.
• Assign objects to sorting layers at the material level for the best workflow.
• Reduce the amount of work needed to optimize shaders by creating means to reuse optimized code.
• ASTC texture compression is the best option available for quality, but it is only supported on a few devices.
We will play the Finding Monsters release trailer here.
Optimization allows us to push more content at higher framerates. We want to push as much content as possible without impacting gameplay.
On mobile we are also concerned with overheating and battery life. Optimizing for thermals gives players more gameplay time.
While optimizing, we frequently ask ourselves the following: * Can we get the same quality with a similar effect? For instance, if you want to take a screenshot, a RenderTexture is faster in most cases than doing a ReadPixels. Or sometimes we can make some simplifications in the shader to achieve a similar effect. However, there’s no free lunch. I often go to the technical artists and say: “Hey, we can achieve a very similar effect, but we’ll have to change some material properties and/or maps.”
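As a concrete sketch of the screenshot example above (class and field names are hypothetical, not from our game), the camera can be rendered into a RenderTexture that stays on the GPU, avoiding the blocking readback of Texture2D.ReadPixels:

```csharp
using UnityEngine;

// Hypothetical sketch: capture a frame by rendering the camera into a
// RenderTexture instead of stalling the GPU with Texture2D.ReadPixels.
// The result stays on the GPU unless you actually need the pixels on CPU.
public class FastScreenshot : MonoBehaviour
{
    public Camera sourceCamera;   // hypothetical field names
    public RenderTexture capture;

    void Awake()
    {
        capture = new RenderTexture(Screen.width, Screen.height, 24);
    }

    public void Capture()
    {
        sourceCamera.targetTexture = capture;
        sourceCamera.Render();               // render one frame into the RT
        sourceCamera.targetTexture = null;   // restore rendering to screen
        // 'capture' can now be used directly as a material texture.
    }
}
```
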
* Are we doing something we don’t need to? For instance, creating and destroying game objects when you could be pre-allocating and caching them.
If we fail at further optimizing our game and still consume more resources than the GPU can offer, either gameplay (lower framerates) or visual quality will be impacted. Usually we favor smooth gameplay over visual quality and end up removing effects.
When it comes to that, trust plays an important role. At Blackriver Studios we built a team upon trust. We look out for each other. I know the effort, dedication and passion the art team puts into our games, and I do my best to optimize them. When it comes to the point where I say we must make adjustments that will impact visuals, they know that we really must.
We need to find what is responsible for consuming those precious ms of your game. Engineers often tend to make assumptions about what might be slowing the game down, and of course those assumptions get better with experience. However, one golden rule of optimization is to never assume anything. Sometimes the culprit is something that looks fairly simple, like a blob shadow.
Use a profiler to tell where your bottleneck is. If you’re optimizing something that’s not your bottleneck then you’re wasting time.
Once you find your bottleneck, it’s time to actually get hands-on optimizing the hot zone. Experience plays an important role here and will give you a hint of what to do.
Finally, we want to test whether we actually made an improvement. It is very important to note that the test scenario has to have exactly the same conditions as the scenario we profiled, or you might sometimes get wrong results. That can be a little tricky.
At the end, we see how many ms we saved and repeat it all over again.
How to find the bottleneck? Profilers will timestamp your game to tell you what the hot zones are. Unity comes with a built-in profiler that can do most of the work. However, we wanted more detailed info on what’s going on in the GPU. We used GPU profilers for that.
They come with specific counters that can easily tell you the graphics pipeline hot zones, and they even allow you to replace a few resources at runtime to speed up your tests.
* Adreno Profiler is an all-in-one solution to profile Qualcomm’s Snapdragon GPUs.
* Mali Graphics Debugger and DS-5 Streamline are tools provided by ARM to debug and profile Mali GPUs.
Throughout this talk we will show how we profiled our game using the Adreno Profiler.
Our optimization workflow goes like this:
We fire up Adreno Profiler. There’s an override to disable all OpenGL calls submitted to the GPU.
Disable OpenGL calls -> Does it greatly improve fps?
No -> We’re CPU bound -> Go to the Unity profiler. (You might see render time there; in that case you have too much driver overhead.)
Yes -> We’re GPU bound -> Adreno also has many counters to tell which stage of the pipeline is stalled:
% time vertex (draw calls and triangles, index vs. triangle ratio, vertex shader)
% time fragment (fragment shader instructions, blending, overdraw, texture sampling & filtering)
memory stalls (blocking resolves, texture bandwidth)
One can breakdown the graphics pipeline into 3 macro stages: Vertex, Fragment, and Bandwidth.
Vertex Bound:
Improve index locality for better caching. (Unity does this for you if you toggle Optimize Mesh in the import settings.)
Use as few vertex attributes as possible (normals, color, tangent, etc.). Each additional attribute might split your vertices.
Decrease the number of triangles sent to the GPU by performing frustum & occlusion culling, and by using mesh LODs and impostors to render distant meshes.
Simplify Vertex Shaders:
Per-vertex lights.
GPU Skinning
Vertex Offset
Fragment Bound:
Simplify Fragment Shader
Amount of instructions and texture samples
Dependent Texture Reads
Blending
Decrease amount of per-pixel lights. (Forward Rendering)
Bandwidth
Use compression and mipmaps.
Avoid operations that stall the GPU (blocking resolves), ReadPixels for instance.
Royale Moon is one of the stages in our game. We’ll show a few techniques we used to optimize it.
This is the breakdown of our scene.
We’re clearly Fragment Bound.
Here’s an overdraw view captured with Adreno Profiler. Brighter pixels are hot zones and show how many times they have been written to. For opaque meshes, every time we redraw a pixel we’re wasting time. We need to sort our scene for optimal performance.
From this image we can see that we can do fewer fragment operations by reducing the overdraw ratio.
Whenever the GPU processes a fragment in the fragment shader, it already knows its depth, or z value. The GPU can then test whether the current fragment is occluded by a previously computed one and discard it, as it will not have any effect on the final image. That is called the depth test.
Thus, the order in which you render your objects matters for performance, as you can try to maximize the number of fragments that get discarded.
The best way to render your scene is:
Render Opaque objects from Front to Back.
Render Skybox
Render transparent objects from back to front. The reason is that, in order to blend correctly, alpha objects must have the value of the pixels behind them already computed.
Overlays (HUD/UI)
In order to reduce overdraw, we need to improve the render order of our opaque objects. Unity already does this for you based on the game object pivot. However, there are some special cases where that doesn’t work (as we can see in our image). What we can do is group objects into different sorting layers to handle those specific cases.
We have a few options when it comes to assigning objects to different sorting layers.
Per-Shader:
In Unity you can set a RenderQueue in the ShaderLab file. The problem is that you have to duplicate a shader file just to assign it to a different sorting layer. That increases shader compilation and warm-up time and is not easily managed. Plus, when a change is required in the shader, we have to propagate it to all the duplicates manually.
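For reference, the per-shader queue assignment looks roughly like this in ShaderLab (a generic sketch, not one of our shaders):

```shaderlab
Shader "Custom/OpaqueEarly" {
    SubShader {
        // The queue is baked into the shader file itself, which is why a
        // second sorting layer means duplicating the whole file.
        // Geometry = 2000; "Geometry-10" renders earlier among opaques.
        Tags { "Queue" = "Geometry-10" "RenderType" = "Opaque" }
        // ... passes ...
    }
}
```
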
Per-Mesh:
This is not scalable and requires a lot of work tweaking per-mesh settings. It is also risky, as assigning objects to different layers can break batches, and one might do it by mistake.
Per-Material
This seems a balanced approach. It’s easy to group and create materials. One can use per-scene materials to make sure the work done in one scene doesn’t affect others.
Unity allows you to extend the material inspector by creating a custom MaterialEditor script. We created one, BRSMaterialEditor, that exposes the render order and layer so they are easy to tweak.
To use it, one just needs to add the following line to the end of the ShaderLab file:
CustomEditor “BRSMaterialEditor”
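The talk does not show the script itself, but a minimal sketch of what such a material editor could look like (Material.renderQueue is the real Unity API; the rest is illustrative):

```csharp
using UnityEngine;
using UnityEditor;

// Sketch of an editor like BRSMaterialEditor (the real script is not shown
// in the talk). It extends the default material inspector with a
// renderQueue field, so sorting is tweaked per material instead of per
// shader or per mesh.
public class BRSMaterialEditor : MaterialEditor
{
    public override void OnInspectorGUI()
    {
        base.OnInspectorGUI();          // draw the regular material GUI
        if (!isVisible) return;

        Material mat = (Material)target;
        int queue = EditorGUILayout.IntField("Render Queue", mat.renderQueue);
        if (queue != mat.renderQueue)
        {
            mat.renderQueue = queue;    // e.g. 2000 = Geometry, 3000 = Transparent
            EditorUtility.SetDirty(mat);
        }
    }
}
```
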
We ended up having five opaque render layers for this scene:
1) Character and Props
2) The island top the camera is on.
3) Outer Islands
4) Planets
5) Skydome
This is a comparison of before and after improving the sorting.
You can see that the characters now have much less overdraw and the bottom islands no longer appear.
Note: the ground appears darker in the first image because I captured the frame without rendering shadows by mistake (shadows add an additional render pass over the ground).
This is a highlight of depth-test discards. You can think of this image as a negative of the previous one, where the more red, the better. You can see the characters and planets now have many more discards too.
Adreno Profiler provides a nice and fast way to see your fragment shader hot zones. You can query per-drawcall stats like Fragment Instructions, Textures/Frag and Math Ops/Frag, and Adreno will colorize each one of them.
This picture sorts drawcalls by the percentage of time spent shading fragments, which is our bottleneck. This counter accounts for the number of fragments shaded times the complexity of shading them.
It shows that the ground is the drawcall that spends the most time shading fragments, so it is a good candidate for improvement.
Here’s another interesting shader to look at: the characters. They have the most ALU/frag and textures/frag in the scene. So why isn’t this the drawcall that spends the most time shading fragments? Simply because its fragments shaded are about 1/3 of the ground’s.
Remember, the ground was rendered before the characters until we optimized for overdraw.
One good rule when optimizing fragment shaders is to do as few operations as possible in them. If there’s something we can do at the vertex or even at the model (per-drawcall) level, that is best. For instance, one of our monsters has an inner point light. This light flickers by adjusting the light intensity using a Fourier sum of sines. We don’t need to compute this light intensity per fragment nor per vertex; we do it at script level and pass the light intensity as a uniform to the shader.
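A sketch of that idea (class name, uniform name and coefficients are made up for illustration):

```csharp
using UnityEngine;

// Hypothetical sketch: the flicker intensity is a small Fourier-style sum
// of sines evaluated once per frame in a script, then uploaded as a single
// uniform, instead of being recomputed per fragment or per vertex.
public class FlickerLight : MonoBehaviour
{
    public Material monsterMaterial;

    void Update()
    {
        float t = Time.time;
        // Sum of sines; coefficients chosen purely for illustration.
        float intensity = 0.8f
            + 0.15f * Mathf.Sin(7.0f * t)
            + 0.05f * Mathf.Sin(13.0f * t + 1.3f);
        // One uniform update per drawcall, zero per-fragment cost.
        monsterMaterial.SetFloat("_LightIntensity", intensity);
    }
}
```
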
Another example: if we know all of our textures that use uv0 have the same tile & offset, we can apply it at the vertex instead of at the fragment. This will save us not only a few instructions in the fragment shader but will also be better for the GPU when sampling the textures.
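In Cg terms, the move looks roughly like this (a sketch; `_MainTex_ST` is Unity's standard per-texture tiling/offset vector, and the vertex-side math is what Unity's `TRANSFORM_TEX` macro expands to):

```cg
// Sketch: moving a uniform tile & offset from fragment to vertex.
sampler2D _MainTex;
float4 _MainTex_ST;               // Unity's per-texture tiling/offset

struct appdata { float4 vertex : POSITION; float2 texcoord : TEXCOORD0; };
struct v2f     { float4 pos : SV_POSITION; float2 uv : TEXCOORD0; };

v2f vert (appdata v)
{
    v2f o;
    o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
    // Apply tiling/offset once per vertex (same as TRANSFORM_TEX):
    o.uv = v.texcoord * _MainTex_ST.xy + _MainTex_ST.zw;
    return o;
}

fixed4 frag (v2f i) : SV_Target
{
    // The uv arrives pre-transformed: no per-fragment math, and the
    // sample is no longer a dependent texture read.
    return tex2D(_MainTex, i.uv);
}
```
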
At fragment level we can also do some micro-optimizations.
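Two examples of such micro-optimizations from the slide above, sketched (not our exact shader code):

```cg
// Illustrative fragment-level micro-optimizations.
sampler2D _MainTex;
half4 _LightColor0;

fixed4 frag (float2 uv : TEXCOORD0, half3 n : TEXCOORD1,
             half3 l : TEXCOORD2) : SV_Target
{
    // saturate() is usually a free modifier on GPUs, unlike max(0.0, x):
    half ndotl = saturate(dot(n, l));     // instead of max(0.0, dot(n, l))

    // Use half/fixed instead of float where the value range allows it;
    // lower-precision ALU draws less power, which helps thermals.
    fixed4 albedo = tex2D(_MainTex, uv);  // color data fits in fixed
    return fixed4(albedo.rgb * _LightColor0.rgb * ndotl, albedo.a);
}
```
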
One of the challenges we faced optimizing the shaders of this game was the fact that most shaders were authored by tech artists using a visual node tool called ShaderForge. Although ShaderForge is a nice tool to create and prototype shaders, it does heavy work in the fragment shader, which is far from ideal in our case.
Also, because the shaders were written in a visual node tool, there are frequently tons of mini-variations that don’t produce the same code.
At this point we came to the question of how to optimize all these shaders.
We did it in two passes: first a pass in ShaderForge, then a pass in the shader code to optimize for things ShaderForge doesn’t account for.
In the first pass we identified all the core lighting model functions. Most of the shaders were using variations of Blinn-Phong with diffuse wrap and ramps. There were some other variations not core to the lighting, like rim lights and custom fog.
ShaderForge allows one to create code nodes. We created a few code nodes implementing these core lighting functions uniformly and, with the help of the artists, replicated them across our shaders. We also came up with a solution to save/load these code nodes: if we ever needed a change to the core code, we could just change it once and replicate it to the other shaders, as opposed to making changes individually to each one.
This made shader code more uniform and easy to work on later.
This is an example of a code node we did. As you can see, ambient color is not applied in it, because not all of our shaders use it.
While optimizing the shaders, one major complaint from art was the lack of lightmap support for custom lighting shaders in ShaderForge. I sat down with our artists and we discussed how they wanted it implemented.
We came up with a code-node solution that required minimal changes. We found out the following:
Although ShaderForge doesn’t support lightmaps in custom lighting, one can open the shader file and change the variable lmpd:False to lmpd:True. ShaderForge does not rewrite the shader header when it recompiles, so we only needed to do this once per new shader.
Another problem we found was that there is no way to get the fragment shader’s interpolated input data. We had to pass it in as a node property and reapply the tile/offset to the lightmap uv in the fragment shader. Later, when we optimized the shader code, we moved this to the vertex shader.
In the second pass we optimized the shader code itself. First we created a cginc file holding all of our functions and macros.
Because the code is uniform, i.e. all vertex and fragment shaders follow the same naming conventions and the core functions all have the same parameter names, we can easily replace the ShaderForge-generated code with our optimized macros.
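Such a cginc could look along these lines (a sketch with hypothetical names; only `_LightColor0` is a real Unity uniform, and the wrapped-diffuse formula is one common variant):

```cg
// BRSOptimized.cginc (hypothetical name)
#ifndef BRS_OPTIMIZED_INCLUDED
#define BRS_OPTIMIZED_INCLUDED

half4 _LightColor0;    // Unity's main light color
half  _DiffuseWrap;    // hypothetical wrap amount uniform

// Optimized wrapped-diffuse Blinn-Phong term, shared by all shaders.
inline half3 BRSBlinnPhongWrapped(half3 normal, half3 lightDir)
{
    half ndotl = saturate((dot(normal, lightDir) + _DiffuseWrap)
                          / (1.0h + _DiffuseWrap));
    return _LightColor0.rgb * ndotl;
}

// Because ShaderForge emits the same input names everywhere, one macro
// can replace the generated lighting block in every shader.
#define BRS_LIGHTING(result, normalDir, lightDirection) \
    result = BRSBlinnPhongWrapped(normalDir, lightDirection)

#endif
```
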
In the picture you can see that the macros and functions we made keep the shader code really clean and lean. Plus, if we come up with an improvement, it is replicated to all shaders.
You can also see the custom material editor and our error fallback shader, which I will discuss later in this presentation.
Here’s a rough comparison of the results we got for the ground shader. We went from 90.50 ALU/frag down to 71.
Fragments shaded were reduced by ~45%, and total fragment instructions by ~64%.
Considering all the shaders optimized for this scene, the total improvement was roughly 7 ms.
As a further improvement in the shaders, we came up with a fallback error shader. We ran into a few shader errors; some of them were related to features not supported in OpenGL ES 2.0 profiles, like doing a vertex offset by sampling a texture in the vertex shader (with tex2Dlod). In those cases the shader was falling back to plain diffuse, which was hard to spot right away.
We then created a fallback error shader to make it stand out clearly when a shader is not supported in the current configuration. That makes it really easy to tell apart from other shader problems.
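A fallback error shader can be as simple as this sketch (the shader name is hypothetical): a fixed-function magenta shader that runs on any configuration, referenced from other shaders via `FallBack`.

```shaderlab
// Hypothetical "Hidden/BRSError" shader: solid magenta, supported on any
// hardware, so unsupported configurations light up instead of silently
// falling back to plain diffuse.
Shader "Hidden/BRSError" {
    SubShader {
        Pass {
            Color (1, 0, 1, 1)   // bright magenta, impossible to miss
        }
    }
}
// In each game shader, after the last SubShader:
//     FallBack "Hidden/BRSError"
```
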
Texture compression is important to reduce bandwidth and power consumption.
Block texture compression formats like ASTC, ETCn, DXTn, ATC and PVRTC are straightforward for GPUs; they don’t need to decompress the texture in order to read it. However, the algorithms lose information when compressing the texture (they are lossy compressions).
ASTC is a texture compression format developed by ARM that has the speed advantage of block compression without losing much texture quality. At the moment, Samsung’s Galaxy Note 4, Galaxy S6 and above support it. It is supported in Unity’s OpenGL ES 3.0 profile on Android Lollipop devices.
Unlike other texture compression formats, ASTC supports different block compression configurations, allowing one to tweak the tradeoff between performance and quality.
Here’s a comparison of ASTC 4x4, ASTC 6x6 and ETC2. It is important to notice that even though ASTC 6x6 has a larger block size than ETC2, the compression quality is still much better.
This table shows the texture configuration for our most common assets. It is also interesting to notice that ASTC 4x4 produces the same size as ETC2 but with greater quality.