Title: Parallelizing Conqueror’s Blade: Making the Most of Intel Core for the Best Gaming Experience
Session Description: Giving your players the best experience possible on all levels of hardware is the ultimate goal. However, with the quickly increasing core counts of modern mainstream CPUs, challenges inherent in game-engine development leave many potentially available cores sitting idle on the sidelines. In this talk, we'd like to share our experience and lessons from building the multicore-scalable game engine of Conqueror's Blade, an AAA game of ancient warfare from NetEase/BoomingGames. We'll detail how we multithreaded the game engine, especially the rendering system, which is typically the No. 1 CPU bottleneck in modern games, to squeeze out performance scalability. We'll also cover how, with the resulting performance headroom, we implemented perceptible visual differentiation to maximize the gaming experience on different CPU platforms.
User Experience = performance + effects (visual/audio)
OK, now developers from BoomingGames will share their experience and lessons from engineering practice.
Hello everyone, I’m Nan Mi, engineer lead at BoomingGames. We will first introduce our game’s background, then show our engine architecture evolution. After that we will go into detail about how we use a job system to build a scalable game engine, with case studies of parallelizing the engine’s subsystems. We will show the scaled gaming experience, and then the tips and tricks we learned from practice, plus future work.
Conqueror’s Blade is a PC online game, now in beta test and coming soon. The player controls both a hero and a legion in battle. Controlling the hero feels like action gameplay, while commanding the legion is a kind of tactical gameplay. The battlefield mixes cold and hot weapons, plus empowered war machines, for an immersive experience.
Let’s watch the game trailer to get a feel for the war.
OK, our game is logic heavy. It includes a huge number of individual soldiers, each with independent AI, animation, and state. It’s a dynamic battleground with rich battlefield elements such as explosions, destruction, legion melee, and so on.
Our legacy architecture has some problems: it is difficult to scale to more cores, and it is CPU bound. So we need a more multicore-scalable engine.
But this architecture is easy to understand
Our goals are to support more than 1K actors with individual AI and states, and a dynamic battlefield with destruction. The engine needs to be easy to scale and multi-thread-debug friendly.
Challenges: the game is still in development and test; the engine must be upgraded smoothly on the fly; and we were time-limited (~2.5 months).
So our technical choices are the Entity-Component-System model and a job system.
My colleague Lei Su will introduce the implementation details of the ECS and the job system.
Hello everyone, I’m Lei Su, senior engineer at BoomingGames. OK, let’s talk about the entity-component-system model. It’s a data-organization architecture, similar to the one in [Overwatch]. In the ECS model, data is everything. An entity is just an ID. A component holds only data. And a system contains all components of the same kind, plus their methods. You can think of it as changing our engine interfaces from C++ style to C style, and changing the design pattern from object-oriented to data-oriented. Why did we make these changes? We think the ECS model has at least three advantages, which we call parallelization friendly, cache friendly, and memory-management friendly. Let’s make an intuitive comparison between the original model and the ECS model.
In the original model, an entity holds all of its component data, and each component has its own interfaces. Data is organized by entity, so within an entity the data is heterogeneous. If we update all of an entity’s components first (updating from left to right in the picture), the memory access is contiguous but the methods differ. If we instead update the same component across all entities first (top to bottom in the picture), the memory access jumps around. Neither method is parallel friendly, and the second one also causes cache misses. In the ECS model, we update the systems one by one. The memory is contiguous, and the update method is the same. Obviously, this is parallel friendly and cache friendly.
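To make the layout difference concrete, here is a minimal sketch of the ECS idea described above. The names (`MovementSystem`, `Position`, `Velocity`) are illustrative, not the engine’s actual types:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Entity = std::uint32_t;  // an entity is just an ID

struct Position { float x, y; };
struct Velocity { float dx, dy; };

// A "system" owns one kind of component in contiguous arrays, plus the
// methods that operate on them. Parallel arrays are indexed by entity here
// for simplicity; a real engine would use a sparse-to-dense mapping.
struct MovementSystem {
    std::vector<Position> positions;
    std::vector<Velocity> velocities;

    Entity create(Position p, Velocity v) {
        positions.push_back(p);
        velocities.push_back(v);
        return static_cast<Entity>(positions.size() - 1);
    }

    // One homogeneous loop over contiguous memory: cache friendly, and
    // trivially splittable into parallel jobs over index ranges.
    void update(float dt) {
        for (std::size_t i = 0; i < positions.size(); ++i) {
            positions[i].x += velocities[i].dx * dt;
            positions[i].y += velocities[i].dy * dt;
        }
    }
};
```

Because the update loop touches only two flat arrays with one method, splitting it into jobs over `[begin, end)` index ranges needs no locking at all.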
So we chose the entity-component-system architecture to organize our data. Next, we will talk about our multi-threading architecture evolution.
As you can see in the picture, our original multi-threading mode is quite easy to understand. We have 3 fixed threads: one for rendering, one for simulation, and one for logic. The network and IO threads are always there, so we will not talk about them. As our game demanded an ever better experience, this architecture hit its bottleneck: it is hard to scale to more cores. So we changed to a fork/join mode.
We still have 3 heavy threads, but each thread can fork worker threads to do one kind of work in parallel, then join back to the original thread and continue. This is very similar to a single-threaded execution sequence. We gained some boost from this architecture, but we abandoned it quickly. Why? Before I give the reason, I’d like to share a little of my understanding of system design. When we design a system, we cannot consider only the system’s own efficiency; we should also take the system’s users’ efficiency into account. When we design a system for designers, we need to consider whether they can use it to quickly create many different game experiences. When we design a system for artists, we need to consider how to truly free their inspiration. Back to the multi-threading architecture: its users are programmers, so we should consider programmer efficiency. With this architecture, programmers have to think about thread fork and join, and the worker thread count may influence task splitting, and so on. None of this is friendly. Finally, we chose the job-based architecture, which is efficient both for the system itself and for the programmer.
In this architecture, the engine has a render backend thread, a network and IO thread, and the rest is the job system. The job system uses a thread pool to run jobs. This mode is well suited to multi-core architectures and scales naturally as the CPU core count increases. It’s also programmer friendly: programmers no longer need to consider the worker thread count, they can split jobs with nearly zero extra thought, and job dependencies are much easier and freer to express. In theory, this architecture is also more efficient than the previous one. Let’s look inside the job system.
We use a fiber-based job system implementation, the same approach as Naughty Dog’s. OK, let’s see what a fiber is. In my opinion, a fiber has two key features. First, it’s a lightweight execution context, including a user-provided stack, registers, and so on. Second, fiber execution is cooperative: a fiber can switch to another explicitly, and in theory the switch is fast. This makes fibers a wonderful choice for implementing a job system. Being easily switched in and out means task scheduling is easy to implement and task dependencies are easy to build. A user-provided stack means each fiber can have its own stack, so a job running on its fiber’s stack is isolated. Manually controlling fiber switches means we can easily solve the task-chaining effect, and so avoid context switches. The task-chaining effect: A depends on B and C; while A waits, the thread can choose to run D, but D launches E and F. When B and C finish, in theory we can run A, but A is buried in the call chain, so it must wait for D to finish or suffer a context switch. OK, fibers are beautiful, but they also have some problems. They are not natively supported at the C++ language level, and even at the OS level the implementations differ. And if we use fibers, job code running in a fiber must obey some restrictions, like not using thread_local. To solve these problems, we chose Boost.Context to implement our fibers: it is cross-platform, industry proven, and fast. And we wrote the job-code restrictions into our coding standards. OK, that is the foundation of our job system. Next we will talk about its core, the job scheduler.
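The switch-out/switch-in behavior can be sketched with POSIX ucontext (the engine described here uses Boost.Context instead; ucontext is used below only because it ships with POSIX systems, and this demo is illustrative, not engine code). A "job" runs on its own user-provided stack, yields back to the scheduler to "wait", and is later resumed exactly where it left off:

```cpp
#include <ucontext.h>
#include <vector>

static ucontext_t g_main_ctx, g_job_ctx;
static std::vector<int>* g_trace = nullptr;

static void job_fn() {
    g_trace->push_back(1);                 // job starts on its own stack
    swapcontext(&g_job_ctx, &g_main_ctx);  // yield: "wait for dependencies"
    g_trace->push_back(3);                 // resumed after dependencies finish
}

std::vector<int> run_demo() {
    std::vector<int> trace;
    g_trace = &trace;

    static char stack[64 * 1024];          // user-provided fiber stack
    getcontext(&g_job_ctx);
    g_job_ctx.uc_stack.ss_sp = stack;
    g_job_ctx.uc_stack.ss_size = sizeof(stack);
    g_job_ctx.uc_link = &g_main_ctx;       // where to go when the job ends
    makecontext(&g_job_ctx, job_fn, 0);

    swapcontext(&g_main_ctx, &g_job_ctx);  // schedule the job fiber
    trace.push_back(2);                    // scheduler runs other work here
    swapcontext(&g_main_ctx, &g_job_ctx);  // dependencies done: resume job
    return trace;
}
```

The trace comes out as 1, 2, 3: the job ran, yielded while the "scheduler" did other work, then continued with its stack intact. No OS thread was blocked during the wait, which is the point of using fibers for a job system.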
When we designed our job scheduler, we considered the game engine’s peculiarities. We use 2 types of queues. One is the thread-independent queue: each worker thread has its own job queue, and jobs generated on that thread are added to it. This reduces conflicts when taking jobs. We also have a separate global job queue, so that threads outside the job system can submit jobs to run in it. For example, at frame begin, the render backend adds an initial update job to the job system. For now, we haven’t taken over the multi-threading of all third-party middleware, so jobs from those threads are also added to the global job queue. Maybe in the future, when all middleware multi-threading is under our control, we can treat the whole engine update as one big initial job and remove the global queue. Our engine generates jobs in a tree-like fashion, and some systems add jobs occasionally but wait for them to complete immediately. Considering this, we chose a stack-like, last-in-first-out schedule mode. When we put jobs into the job system, we cannot ensure the jobs are split fairly, so some worker threads will finish all their jobs while others still have many to run. To balance the workload between worker threads, a fast worker thread can steal jobs from another. A verbal description is abstract, so let’s take a visual look at the scheduler!
OK, global jobs are generated from outside threads, and worker threads take global jobs from the global queue.
In this picture, a job is running; it generates 2 new jobs and chooses to wait for them to complete. The job now depends on the 2 newly generated jobs, and its state changes from running to waiting, so it is switched out to the waiting queue. Because of the stack-like LIFO mode, the newly added jobs run first. When both of the 2 jobs finish, the parent job becomes ready, and the scheduler switches it back in to continue. That is a thumbnail of job scheduling. OK, next: job stealing.
It’s quite simple. As you can see, one worker thread has finished all its jobs, so it steals a job from the tail of another worker thread’s queue. Now it has work to do.
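The per-worker queue behavior can be sketched as follows. This is a deliberately simplified, mutex-guarded version for clarity; a production scheduler would use a lock-free deque (for example Chase-Lev), and `Job` here is just a stand-in for a real job handle:

```cpp
#include <deque>
#include <mutex>
#include <optional>

using Job = int;  // stand-in for a job handle

// The owner pushes and pops LIFO at the back, so the newest jobs run first,
// matching the tree-like job graph and the stack-like schedule mode.
// An idle worker steals from the front: the oldest job in the queue,
// i.e. the "tail" from the owner's point of view.
class WorkerQueue {
    std::deque<Job> jobs_;
    std::mutex m_;
public:
    void push(Job j) {                        // owner: LIFO push
        std::lock_guard<std::mutex> g(m_);
        jobs_.push_back(j);
    }
    std::optional<Job> pop() {                // owner: newest job first
        std::lock_guard<std::mutex> g(m_);
        if (jobs_.empty()) return std::nullopt;
        Job j = jobs_.back();
        jobs_.pop_back();
        return j;
    }
    std::optional<Job> steal() {              // thief: oldest job first
        std::lock_guard<std::mutex> g(m_);
        if (jobs_.empty()) return std::nullopt;
        Job j = jobs_.front();
        jobs_.pop_front();
        return j;
    }
};
```

Stealing from the opposite end keeps owner and thief mostly out of each other's way and tends to hand the thief an older, larger job, which amortizes the cost of the steal.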
OK, we have introduced two powerful weapons (the entity-component-system model and job-based multi-threading) to optimize our engine. Let’s apply them step by step.
First, we decided to change our engine’s data organization; it’s the basis of parallelization. We changed the update order from by-entity to by-component. This changes the system’s behavior, but never mind, we fix that first. Meanwhile, performance drops a little, because this update method causes cache misses, but it’s not a big problem; we will soon win it back. Once the update method is stable, we gather the same components into systems and update in system order. Essentially this just changes where the component data lives, so it’s relatively simple and nearly bug free. At this point, our change to the ECS model is finished, and we can start to parallelize each system.
Game-engine systems have update or tick functions, so batch-and-swap is widely used in the engine. Complex lock-free data structures are difficult to debug.
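The batch-and-swap pattern can be sketched like this (an illustrative template, not the engine's actual code): producers append commands to a pending batch under a short lock, and at the system's tick the pending batch is swapped out in O(1) and processed without holding any lock:

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Cmd is whatever a system buffers between ticks (events, state changes...).
template <typename Cmd>
class BatchAndSwap {
    std::vector<Cmd> pending_;
    std::mutex m_;
public:
    void push(Cmd c) {                       // called from any job/thread
        std::lock_guard<std::mutex> g(m_);
        pending_.push_back(std::move(c));
    }
    std::vector<Cmd> swapOut() {             // called once, at the system tick
        std::vector<Cmd> batch;
        std::lock_guard<std::mutex> g(m_);
        batch.swap(pending_);                // O(1): pointers swap, no copy
        return batch;
    }
};
```

The lock is held only for a push or a pointer swap, so contention stays low, and the tick then iterates its private batch with no synchronization at all. This is the "lock is your friend in the first step" approach: simple, debuggable, and usually fast enough.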
This is the performance profile when we started to optimize. You can see only 3 heavy threads working, with a lot of empty holes in the other threads.
PhysX generates tasks recursively, so LIFO mode is suggested for its jobs. Optimization (the author then started to optimize this function): reduce shape usage, cutting each soldier from 60 shapes to 3. Originally, each soldier state used one shape to represent it. After optimization, each soldier has at most 3 shapes, and each pose moves the shapes.
For now we simply split jobs by actor count; in the future we can split jobs by animation calculation type.
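The "split by actor count" pattern above can be sketched as a parallel-for over contiguous actor ranges. This is illustrative only: the real engine submits these ranges as jobs to its fiber-based scheduler rather than spawning raw threads, and `parallel_for_actors`/`update_actor` are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Split actor_count animation updates into at most job_count contiguous
// ranges and run each range concurrently. update_actor is called exactly
// once per actor index; distinct ranges touch distinct actors, so the
// workers need no locking.
void parallel_for_actors(std::size_t actor_count, std::size_t job_count,
                         const std::function<void(std::size_t)>& update_actor) {
    if (job_count == 0) job_count = 1;
    std::size_t chunk = (actor_count + job_count - 1) / job_count;
    std::vector<std::thread> workers;
    for (std::size_t j = 0; j < job_count; ++j) {
        std::size_t begin = j * chunk;
        std::size_t end = std::min(actor_count, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([begin, end, &update_actor] {
            for (std::size_t i = begin; i < end; ++i) update_actor(i);
        });
    }
    // Join before the dependent system (e.g. physics) reads the results.
    for (auto& t : workers) t.join();
}
```

Equal-sized ranges are why the load balances poorly when per-actor costs differ; splitting by animation calculation type, as mentioned above, would group actors of similar cost.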
An MPSC queue solves the thread-safety problem. Cost differences are covered rather than truly balanced.
MPSC: Multi producer single consumer
No deeper technology
Jobifying all the systems above is relatively simple. Next I will hand the talk back to Nan Mi, who will give us some more complex job-splitting cases.
OK, let’s jobify the particle system. First I will introduce our particle system’s modules. We rely on two rules of thumb to split particle jobs: one by particle classification, the other by particle simulation phases. Then I’ll show the problems we met, job conflicts and workload balance, and our solutions.
One particle system has 3 modules. Particle emitters control spawning and deleting dead particles. The render module controls how we render the particles: billboard, trail, mesh, or beam. Each emitter may have several affectors; each affector controls how the particle data is modified over its lifetime, such as color over life, gravity, motion, and so on.
And we use a global particle pool to control the particle system budget, meaning that at initialization time we already know the particle limit.
So our first rule is to split particles by classification. Particle systems split naturally into two types: entity-relative ones, which depend on animation results, such as animation trails or some character skills; and non-entity-relative ones, such as smoke, explosions, or bombs in the scene, which are self-contained and do not rely on any actor in the scene.
This split lets us submit non-entity-relative jobs at the very beginning of the frame, while entity-relative jobs have to wait for the animation system to finish. This helps balance the job workload.
Inside each particle system, we simply split the whole particle simulation into 3 phases. The first is the spawn job: we run all emitters in parallel to spawn particles and delete dead ones. Then the update jobs, like the color or size affectors, refresh and update the particle properties. In the third phase we prepare particles for rendering, building GPU-friendly data such as vertex buffers, material info, and draw-call-ready data.
Each phase waits for the previous phase to finish.
But this causes two problems. The first is a conflict in the global pool: since we use one big particle pool to control the budget, the spawn jobs need to take and delete particles from the pool in parallel. Render prepare has the same problem, writing particle results into one big vertex buffer pool in parallel.
The other problem is that some particle system jobs may run much longer than others, causing bad workload balance.
The particle job conflict problem is easy to handle with a simple lock-free scheme. We allocate particles from the global pool block by block and use an atomic number to avoid multi-threading problems. One block is 64 particles, a size that works well with cache lines, and an atomic number holds the total number of particles used in the block. It’s very much like a linear allocator.
Spawning one particle is just an atomic add of the particle count, then allocating within the block. If a block is full, the particle system allocates a new block from the global pool. Particle death works the other way: swap the dead particle with the last one in the block and atomically decrement the count. When a whole block is empty, or the particle system is removed, the whole block is freed back to the pool.
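A minimal sketch of this block scheme, assuming a plain `Particle` struct (the names and the exact fields are illustrative, not the engine's code):

```cpp
#include <array>
#include <atomic>

struct Particle { float x, y, z, life; };

struct ParticleBlock {
    static constexpr int kCapacity = 64;   // one block = 64 particles
    std::atomic<int> count{0};             // live particles in this block
    std::array<Particle, kCapacity> slots;

    // Spawn: claim a slot with a single fetch_add. Returns the slot index,
    // or -1 if the block is full (the caller then grabs a fresh block from
    // the global pool).
    int spawn() {
        int idx = count.fetch_add(1, std::memory_order_relaxed);
        if (idx >= kCapacity) {
            count.fetch_sub(1, std::memory_order_relaxed);  // undo overshoot
            return -1;
        }
        return idx;
    }

    // Kill: swap the dead particle with the last live one and decrement.
    // Safe in this simple form only when one job owns the block during the
    // delete pass, as in the spawn phase described above.
    void kill(int idx) {
        int last = count.fetch_sub(1, std::memory_order_relaxed) - 1;
        if (idx != last) slots[idx] = slots[last];
    }
};
```

This behaves like a linear allocator per block: no locks, just one atomic per spawn or kill, and the render-prepare phase can reserve vertex-buffer write positions with the same fetch_add pattern.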
The prepare-phase conflict is handled the same way: each job uses an AtomicAdd to get a write position in the big vertex buffer, so all prepare jobs can write into the same big pool in parallel.
The other problem with particle jobs is that some update jobs are much heavier than others. You can see in the picture that one bad, heavy job blocks the whole job system, because the prepare phase has to wait for all particle updates to finish.
These heavy jobs may be weather particles or massive ammo animation trails, since we batch one kind of ammo’s particles into one big particle system. Such single particle systems need their jobs split deeper, by particle emitter: one job per emitter.
The last subsystem case is our render thread. Our legacy single-threaded renderer is built on D3D11 and uses a traditional deferred shading pipeline. As you can see, we handle visibility and the whole render pipeline on the main thread.
Our new multi-threaded renderer has two parts. One is the render backend thread you saw in the previous section. This thread does not run on our job system; it flushes the built command lists on the immediate context.
The other part is the render job contexts. Each render job context builds a D3D11 command list using a deferred context. We simply split the jobs by scene, as you can see. There are at most 6 jobs: one for shadow, 3 for the G-buffer, one for translucent, and one for forward.
This split strategy is very simple to implement; it was our choice given the time limit.
At the very beginning of the frame, the render backend thread deals with the Scaleform UI. The UI does not build a command list; instead it submits directly on the immediate context. This is because we want to keep the GPU happy and send it work as soon as possible.
At the same time, two jobs are submitted to the job system: one for eye visibility and one for shadow visibility.
After a visibility job finishes, it emits more jobs, such as the G-buffer passes and the cascaded shadow parts.
When a job finishes building its command list, execution goes back to the render backend thread for submission. Some command lists have ordering dependencies; for example, the deferred shading work has to wait until both the G-buffer and shadow command lists have been submitted.
For our stress-test scenario, this is our early performance result on a high-end Intel PC with more than 8 cores. We can see a lot of holes in the profile, and the CPU usage is quite low. One frame costs more than 50 milliseconds.
After the parallel evolution, the result is much better: the total frame time drops to about 19 ms.
But you can still see some holes in the picture. The holes show the dependencies between different systems. For example, the physics system waits for all animation results to finish, while actually only the ragdoll results need animation. And the render thread still emits some long jobs.
Our next step is to clarify system dependencies further, and make each job wait only at the first point another system actually uses its result.
Another chart shows the performance on different core counts. You can see the system now scales much better from 2 cores to 6 cores, but performance improves little beyond 6 cores. The main reason is that above 6 cores the whole system is bound by the render jobs: since we build command lists per scene, in theory we have only 6 render jobs. We need to jobify rendering at a finer granularity in the future.
Beyond the job system, another optimization weapon is the Intel Masked Occlusion Culling library. It’s a high-performance software occlusion culling library and easy to integrate. It helps a lot in reducing draw calls.
We replaced our original occlusion culling implementation with it.
For common cases, both in our main city and on the battlefield, we get about a 20% performance improvement. In some extreme situations, like being behind a wall, performance may double.
Parallelization on multi-core gives us a performance improvement, so we can enrich the visual effects on high-end PCs: more clothing simulation, destruction, more particles, and ragdoll effects.
Parallelizing Conqueror’s Blade*
Making the Most of Intel® Core™ for the Best Gaming Experience
Engineer Lead @BoomingGames
Senior Engineer @BoomingGames
Application Engineer @Intel.com
Multi-core: Opportunities to scale user experience
Conqueror’s Blade*: Case study to leverage multi-core
Building job system
Jobifying engine sub-systems
Scaling user experience
Next Generation Multi-Core Processor
Physical CPU/cores increasing quickly
4 cores: max install base
6 cores: mainstream shipping
8-18 cores: high-end shipping
Multicore utilization of games today
Most are multithreaded, but only with 2~3 threads
Insufficient CPU utilization
Steam Hardware & Software Survey: February 2018
What to Do with the Idle Cores
Software occlusion culling
Buffering load turbulence
Balancing load among cores
Global illumination Detailed animation
Realistic destruction Advanced particles
Wind & Weather
Additional rendering passes
More details of distant model
Ambient animation and background life
With Great User Experience Comes Great
Scale User Experience (Performance + Effects) with More Cores
Key Problems to Consider
Enable perceptible multi-core scaling w/o impacting game play
The quality of effects
The types of high-quality effects
The coverage of high-quality effects on all units
Decompose engine functionality into jobs
Build efficient job scheduler
Engine Architecture Evolution
Building Scalable Game Engine
Case study parallelization engine subsystems
Scaled Gaming Experience
Tips & Tricks
Conqueror’s Blade* is a PC online-game
Hero : Action gameplay
Legion : Tactic gameplay
Empowered war machines
Motivation For Multicore Scalable Engine
Game is Logic Heavy
Huge number of individual soldiers
Rich battlefield elements
Problems of Legacy Architecture
Difficult to scale to more cores
Goals & Challenges
Support more than 1K actors with individual AI and states
Easy to scale
Multi-thread debug friendly
Game is in development & test
On-the-fly upgrade engine
Time-limited (~2.5 months)
Data is everything
Entity is just ID
Component holds only data
System contains the same kind of component and methods
Memory management friendly
*[Timothy17] Overwatch Gameplay Architecture and Netcode, GDC 2017
Original vs ECS
Original Model ECS Model
Data organized by entity
Data organized by system
Fixed Multi-thread (Legacy)
Visibility GBuffer Shadow Lighting Forward
Lua AI Motor ...
Thread Fork/Join (Intermediate)
Fork/Join from fixed thread
Visibility GBuffer Shadow Lighting Forward
Physics Particle ...
Job Based (Final)
Job | Job | Job | Network
Fiber based implementation*
What is fiber
A lightweight execution context (includes a user-provided stack, registers…)
Fiber execution is cooperative: a fiber can switch to another explicitly
Easy to implement task schedule
Easy to handle task dependency
Job stack is isolated
Avoid frequent context switches
C++ does not natively support fiber
Implementation differs between OSes
Has some restrictions (thread_local is invalid)
Boost context: Cross-platform, Industry proven, Fast
*[Christian15] Parallelizing the Naughty Dog engine using fibers, GDC 2015
Thread Independent Job Queue
Each worker thread has its own job queue
The job generated from the thread will be added to the queue
Separate Global Job Queue
Job submit outside job system (frame begin, some middleware …)
In most cases, job dependencies are tree-like
Some systems add jobs occasionally but wait for them immediately
Worker thread load balance
add global jobs
Worker thread gets global job from global queue
running → waiting → ready
new added jobs
On-The-Fly Change Step
Change to ECS Model
Entity level update to component level update
Gather same component to system, system level update
Parallelize each system
Keep system tick order
Split jobs in self system and wait jobs to finish before system end
Modify system dependency
Clarify system dependency
Launch independent systems at the same time
Wait for a system's jobs in the system that really depends on them
System From Single-Thread To Multi-Thread
Always the first change step
Behaves well when there are few conflicts
Backup of lock-free version
Batch and Swap
Useful for polling system
Use the simplest lock-free data structure
Physics System build on PhysX/Apex Library
Jobify PhysX Knowhow
PhysX Library support task
Only need to implement the
Code is easy to be integrated
Details to consider
PhysX occasionally submits tasks and then immediately waits for them to complete, so the LIFO schedule mode is suggested
PhysX has synchronization stage
Trigger sync stage
Animation Tree Update
Each Animation Tree updates
Skeleton Transform Calculation
Simply split jobs by actor count!
Related with many other systems
Not yet thread-safe
Difficult to balance job load
Cost has huge difference between actors
Animation system Related system
Cover the cost difference rather than truly balance it
Split job by experience
Launch independent systems earlier
Wait animation results in another dependency system
Lua as script
Lua call engine c++ functions
Lua is not natively multi-threaded
Make heavy calculation in C++
Gather calculations together
Parallel only c++ codes
Script logic can tick at a fixed interval (like 100 ms)
Jobify Particle System
Particle System Module
Experience Job Split Rules
By particle classification
By particle simulation phases
Problems & Solutions
Particle job conflicts
Particle job workload balance
Particle System Module
Particle spawn and delete
Color over Life, gravity, motion …
Use global particle pool to control particle budget
Job Split Rule 2 – Particle Phases
Particle emit and delete
Particle property refresh
Render Prepare jobs
GPU friendly data
Conflicts in global pool
Simply splitting job by particle system count causes bad workload balance
Particle System …
Solve Particle Job Conflict
Conflict Case 1
Allocate particle block from pool with Atomic
Allocating from a block is just an AtomicAdd
New particle from block
Simple swap with the last particle in block
When block is empty, free whole block back to pool
Conflict Case 2
Particle render transfer into one big vertex buffer
Use AtomicAdd to get write position in linear pool
Block Particle count
Workload Balance Problem
Bad job, too heavy
Split by Emitter
Some particle jobs are too heavy
Massive ammo animation trails
Split by particle emitters
Intel Masked Occlusion Culling Library *
CPU Software Occlusion Culling
Easy to be integrated
Reduce draw call
*Masked Occlusion Culling, https://github.com/GameTechDev/MaskedOcclusionCulling
Masked Software Occlusion Culling Result
Performance (4 cores)
Level             | Rasterize & | MOC off  | MOC on | Speedup
Main City         | 2.7 ms      | 25 fps   | 30 fps | 1.2x
Siege Battlefield | 3.1 ms      | 23.2 fps | 29 fps | 1.25x
Enriching Visual Effects for More Cores
Tips & Tricks
Optimize the code itself first rather than parallelize
Lock is your friend in the first step
Pending and swap
Data-oriented is both optimization friendly and debug friendly
Simple structure means easier to parallelize and debug
Further data-oriented design
More clearly identified system dependencies
Chunk-based multi-thread rendering
Job based lock (no more mutex, lock…)