Title: Parallelizing Conqueror’s Blade: Making the Most of Intel Core for the Best Gaming Experience
Session Description: Giving your players the best experience possible on all levels of hardware is the ultimate goal. However, with the quickly increasing core counts of modern mainstream CPUs, challenges inherent in game-engine development leave many potentially available cores sitting idle on the sidelines. In this talk, we'd like to share our experience and lessons from building the multicore-scalable game engine of Conqueror's Blade, an AAA game of ancient warfare from NetEase/BoomingGames. We'll detail how we multithreaded the game engine, especially the rendering system, which is typically the No. 1 CPU bottleneck in modern games, to squeeze out performance scalability. We'll also cover how, with the resulting performance headroom, we implemented perceptible visual differentiation to maximize the gaming experience on different CPU platforms.
User Experience = performance + effects (visual/audio)
OK, now developers from BoomingGames will share their experience and lessons from engineering practice.
Hello everyone, I’m Nan Mi, engineer lead at BoomingGames. We will first introduce our game’s background, then show our engine architecture evolution. After that we will go into detail about how we use a job system to build a scalable game engine, with case studies of parallelizing the engine’s subsystems. We will show the scaled gaming experience, and then the tips and tricks we learned from practice, plus future work.
Conqueror’s Blade is a PC online game, now in beta test and coming soon. The player controls both a hero and a legion in battle. Controlling the hero feels like action gameplay, while commanding the legion is a kind of tactical gameplay. The battlefield mixes cold and hot weapons, plus empowered war machines, for an immersive experience.
Let’s watch the game trailer to get a feel for the war.
OK, our game is logic heavy. It includes a huge number of individual soldiers, each with independent AI, animation, and state. It’s a dynamic battleground with rich battlefield elements such as explosions, destruction, legion melee, and so on.
Our legacy architecture has some problems: it is difficult to scale to more cores, and it is CPU bound. So we need a more multicore-scalable engine.
But this architecture is easy to understand
Our goals are to support more than 1K actors with individual AI and states, and a dynamic battlefield with destruction. The engine needs to be easy to scale and multi-thread-debug friendly.
Challenges: the game is still in development and test; the engine must be upgraded smoothly on the fly; and we were time-limited (~2.5 months).
So our technical choices are the Entity-Component-System model and a job system.
My colleague Lei Su will introduce the implementation details of the ECS and the job system.
Hello everyone, I’m Lei Su, senior engineer at BoomingGames. OK, let’s talk about the entity-component-system model. It’s a data-organization architecture, similar to the one in [Overwatch]. In the ECS model, data is everything. An entity is just an ID. A component holds only data. And a system contains all components of the same kind, plus their methods. You can think of it as changing our engine interfaces from C++ style to C style, and changing the design pattern from object-oriented to data-oriented. Why did we make these changes? We think the ECS model has at least three advantages, which we call parallelization friendly, cache friendly, and memory-management friendly. Let’s make an intuitive comparison between the original model and the ECS model.
In the original model, an entity holds all of its component data, and each component has its own interfaces. Data is organized by entity, so within an entity the data is heterogeneous. If we update all of an entity’s components first (updating from left to right in the picture), the memory access is contiguous but the methods differ. If we instead update the same component across all entities first (top to bottom in the picture), the memory access jumps around. Neither method is parallel friendly, and the second one also causes cache misses. In the ECS model, we update the systems one by one. The memory is contiguous, and the update method is the same. Obviously, this is parallel friendly and cache friendly.
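To make the layout difference concrete, here is a minimal sketch of the ECS idea described above. The names (`MovementSystem`, `Position`, `Velocity`) are illustrative, not the engine’s actual types:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Entity = std::uint32_t;  // an entity is just an ID

struct Position { float x, y; };
struct Velocity { float dx, dy; };

// A "system" owns one kind of component in contiguous arrays, plus the
// methods that operate on them. Parallel arrays are indexed by entity here
// for simplicity; a real engine would use a sparse-to-dense mapping.
struct MovementSystem {
    std::vector<Position> positions;
    std::vector<Velocity> velocities;

    Entity create(Position p, Velocity v) {
        positions.push_back(p);
        velocities.push_back(v);
        return static_cast<Entity>(positions.size() - 1);
    }

    // One homogeneous loop over contiguous memory: cache friendly, and
    // trivially splittable into parallel jobs over index ranges.
    void update(float dt) {
        for (std::size_t i = 0; i < positions.size(); ++i) {
            positions[i].x += velocities[i].dx * dt;
            positions[i].y += velocities[i].dy * dt;
        }
    }
};
```

Because the update loop touches only two flat arrays with one method, splitting it into jobs over `[begin, end)` index ranges needs no locking at all.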
So we chose the entity-component-system architecture to organize our data. Next, we will talk about our multi-threading architecture evolution.
As you can see in the picture, our original multi-threading mode is quite easy to understand. We have 3 fixed threads: one for rendering, one for simulation, and one for logic. The network and IO threads are always there, so we will not talk about them. As our game demanded an ever better experience, this architecture hit its bottleneck: it is hard to scale to more cores. So we changed to a fork/join mode.
We still have 3 heavy threads, but each thread can fork worker threads to do one kind of work in parallel, then join back to the original thread and continue. This is very similar to a single-threaded execution sequence. We gained some boost from this architecture, but we abandoned it quickly. Why? Before I give the reason, I’d like to share a little of my understanding of system design. When we design a system, we cannot consider only the system’s own efficiency; we should also take the system’s users’ efficiency into account. When we design a system for designers, we need to consider whether they can use it to quickly create many different game experiences. When we design a system for artists, we need to consider how to truly free their inspiration. Back to the multi-threading architecture: its users are programmers, so we should consider programmer efficiency. With this architecture, programmers have to think about thread fork and join, and the worker thread count may influence task splitting, and so on. None of this is friendly. Finally, we chose the job-based architecture, which is efficient both for the system itself and for the programmer.
In this architecture, the engine has a render backend thread, a network and IO thread, and the rest is the job system. The job system uses a thread pool to run jobs. This mode is well suited to multi-core architectures and scales naturally as the CPU core count increases. It’s also programmer friendly: programmers no longer need to consider the worker thread count, they can split jobs with nearly zero extra thought, and job dependencies are much easier and freer to express. In theory, this architecture is also more efficient than the previous one. Let’s look inside the job system.
We use a fiber-based job system implementation, the same approach as Naughty Dog’s. OK, let’s see what a fiber is. In my opinion, a fiber has two key features. First, it’s a lightweight execution context, including a user-provided stack, registers, and so on. Second, fiber execution is cooperative: a fiber can switch to another explicitly, and in theory the switch is fast. This makes fibers a wonderful choice for implementing a job system. Being easily switched in and out means task scheduling is easy to implement and task dependencies are easy to build. A user-provided stack means each fiber can have its own stack, so a job running on its fiber’s stack is isolated. Manually controlling fiber switches means we can easily solve the task-chaining effect, and so avoid context switches. The task-chaining effect: A depends on B and C; while A waits, the thread can choose to run D, but D launches E and F. When B and C finish, in theory we can run A, but A is buried in the call chain, so it must wait for D to finish or suffer a context switch. OK, fibers are beautiful, but they also have some problems. They are not natively supported at the C++ language level, and even at the OS level the implementations differ. And if we use fibers, job code running in a fiber must obey some restrictions, like not using thread_local. To solve these problems, we chose Boost.Context to implement our fibers: it is cross-platform, industry proven, and fast. And we wrote the job-code restrictions into our coding standards. OK, that is the foundation of our job system. Next we will talk about its core, the job scheduler.
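The switch-out/switch-in behavior can be sketched with POSIX ucontext (the engine described here uses Boost.Context instead; ucontext is used below only because it ships with POSIX systems, and this demo is illustrative, not engine code). A "job" runs on its own user-provided stack, yields back to the scheduler to "wait", and is later resumed exactly where it left off:

```cpp
#include <ucontext.h>
#include <vector>

static ucontext_t g_main_ctx, g_job_ctx;
static std::vector<int>* g_trace = nullptr;

static void job_fn() {
    g_trace->push_back(1);                 // job starts on its own stack
    swapcontext(&g_job_ctx, &g_main_ctx);  // yield: "wait for dependencies"
    g_trace->push_back(3);                 // resumed after dependencies finish
}

std::vector<int> run_demo() {
    std::vector<int> trace;
    g_trace = &trace;

    static char stack[64 * 1024];          // user-provided fiber stack
    getcontext(&g_job_ctx);
    g_job_ctx.uc_stack.ss_sp = stack;
    g_job_ctx.uc_stack.ss_size = sizeof(stack);
    g_job_ctx.uc_link = &g_main_ctx;       // where to go when the job ends
    makecontext(&g_job_ctx, job_fn, 0);

    swapcontext(&g_main_ctx, &g_job_ctx);  // schedule the job fiber
    trace.push_back(2);                    // scheduler runs other work here
    swapcontext(&g_main_ctx, &g_job_ctx);  // dependencies done: resume job
    return trace;
}
```

The trace comes out as 1, 2, 3: the job ran, yielded while the "scheduler" did other work, then continued with its stack intact. No OS thread was blocked during the wait, which is the point of using fibers for a job system.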
When we designed our job scheduler, we considered the game engine’s peculiarities. We use 2 types of queues. One is the thread-independent queue: each worker thread has its own job queue, and jobs generated on that thread are added to it. This reduces conflicts when taking jobs. We also have a separate global job queue, so that threads outside the job system can submit jobs to run in it. For example, at frame begin, the render backend adds an initial update job to the job system. For now, we haven’t taken over the multi-threading of all third-party middleware, so jobs from those threads are also added to the global job queue. Maybe in the future, when all middleware multi-threading is under our control, we can treat the whole engine update as one big initial job and remove the global queue. Our engine generates jobs in a tree-like fashion, and some systems add jobs occasionally but wait for them to complete immediately. Considering this, we chose a stack-like, last-in-first-out schedule mode. When we put jobs into the job system, we cannot ensure the jobs are split fairly, so some worker threads will finish all their jobs while others still have many to run. To balance the workload between worker threads, a fast worker thread can steal jobs from another. A verbal description is abstract, so let’s take a visual look at the scheduler!
OK, global jobs are generated from outside threads, and worker threads take global jobs from the global queue.
In this picture, a job is running; it generates 2 new jobs and chooses to wait for them to complete. The job now depends on the 2 newly generated jobs, and its state changes from running to waiting, so it is switched out to the waiting queue. Because of the stack-like LIFO mode, the newly added jobs run first. When both of the 2 jobs finish, the parent job becomes ready, and the scheduler switches it back in to continue. That is a thumbnail of job scheduling. OK, next: job stealing.
It’s quite simple. As you can see, one worker thread has finished all its jobs, so it steals a job from the tail of another worker thread’s queue. Now it has work to do.
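The per-worker queue behavior can be sketched as follows. This is a deliberately simplified, mutex-guarded version for clarity; a production scheduler would use a lock-free deque (for example Chase-Lev), and `Job` here is just a stand-in for a real job handle:

```cpp
#include <deque>
#include <mutex>
#include <optional>

using Job = int;  // stand-in for a job handle

// The owner pushes and pops LIFO at the back, so the newest jobs run first,
// matching the tree-like job graph and the stack-like schedule mode.
// An idle worker steals from the front: the oldest job in the queue,
// i.e. the "tail" from the owner's point of view.
class WorkerQueue {
    std::deque<Job> jobs_;
    std::mutex m_;
public:
    void push(Job j) {                        // owner: LIFO push
        std::lock_guard<std::mutex> g(m_);
        jobs_.push_back(j);
    }
    std::optional<Job> pop() {                // owner: newest job first
        std::lock_guard<std::mutex> g(m_);
        if (jobs_.empty()) return std::nullopt;
        Job j = jobs_.back();
        jobs_.pop_back();
        return j;
    }
    std::optional<Job> steal() {              // thief: oldest job first
        std::lock_guard<std::mutex> g(m_);
        if (jobs_.empty()) return std::nullopt;
        Job j = jobs_.front();
        jobs_.pop_front();
        return j;
    }
};
```

Stealing from the opposite end keeps owner and thief mostly out of each other's way and tends to hand the thief an older, larger job, which amortizes the cost of the steal.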
OK, we have introduced two powerful weapons (the entity-component-system model and job-based multi-threading) to optimize our engine. Let’s apply them step by step.
First, we decided to change our engine’s data organization; it’s the basis of parallelization. We changed the update order from by-entity to by-component. This changes the system’s behavior, but never mind, we fix that first. Meanwhile, performance drops a little, because this update method causes cache misses, but it’s not a big problem; we will soon win it back. Once the update method is stable, we gather the same components into systems and update in system order. Essentially this just changes where the component data lives, so it’s relatively simple and nearly bug free. At this point, our change to the ECS model is finished, and we can start to parallelize each system.
Game-engine systems have update or tick functions, so batch-and-swap is widely used in the engine. Complex lock-free data structures are difficult to debug.
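The batch-and-swap pattern can be sketched like this (an illustrative template, not the engine's actual code): producers append commands to a pending batch under a short lock, and at the system's tick the pending batch is swapped out in O(1) and processed without holding any lock:

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Cmd is whatever a system buffers between ticks (events, state changes...).
template <typename Cmd>
class BatchAndSwap {
    std::vector<Cmd> pending_;
    std::mutex m_;
public:
    void push(Cmd c) {                       // called from any job/thread
        std::lock_guard<std::mutex> g(m_);
        pending_.push_back(std::move(c));
    }
    std::vector<Cmd> swapOut() {             // called once, at the system tick
        std::vector<Cmd> batch;
        std::lock_guard<std::mutex> g(m_);
        batch.swap(pending_);                // O(1): pointers swap, no copy
        return batch;
    }
};
```

The lock is held only for a push or a pointer swap, so contention stays low, and the tick then iterates its private batch with no synchronization at all. This is the "lock is your friend in the first step" approach: simple, debuggable, and usually fast enough.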
This is the performance profile when we started to optimize. You can see only 3 heavy threads working, with a lot of empty holes in the other threads.
PhysX generates tasks recursively, so LIFO mode is suggested for its jobs. Optimization (the author then started to optimize this function): reduce shape usage, cutting each soldier from 60 shapes to 3. Originally, each soldier state used one shape to represent it. After optimization, each soldier has at most 3 shapes, and each pose moves the shapes.
For now we simply split jobs by actor count; in the future we can split jobs by animation calculation type.
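The "split by actor count" pattern above can be sketched as a parallel-for over contiguous actor ranges. This is illustrative only: the real engine submits these ranges as jobs to its fiber-based scheduler rather than spawning raw threads, and `parallel_for_actors`/`update_actor` are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Split actor_count animation updates into at most job_count contiguous
// ranges and run each range concurrently. update_actor is called exactly
// once per actor index; distinct ranges touch distinct actors, so the
// workers need no locking.
void parallel_for_actors(std::size_t actor_count, std::size_t job_count,
                         const std::function<void(std::size_t)>& update_actor) {
    if (job_count == 0) job_count = 1;
    std::size_t chunk = (actor_count + job_count - 1) / job_count;
    std::vector<std::thread> workers;
    for (std::size_t j = 0; j < job_count; ++j) {
        std::size_t begin = j * chunk;
        std::size_t end = std::min(actor_count, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([begin, end, &update_actor] {
            for (std::size_t i = begin; i < end; ++i) update_actor(i);
        });
    }
    // Join before the dependent system (e.g. physics) reads the results.
    for (auto& t : workers) t.join();
}
```

Equal-sized ranges are why the load balances poorly when per-actor costs differ; splitting by animation calculation type, as mentioned above, would group actors of similar cost.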
An MPSC queue solves the thread-safety problem. Cost differences are covered rather than truly balanced.
MPSC: Multi producer single consumer
No deeper technology
Jobifying all the systems above is relatively simple. Next I will hand the talk back to Nan Mi, who will give us some more complex job-splitting cases.
OK, let’s jobify the particle system. First I will introduce our particle system’s modules. We rely on two rules of thumb to split particle jobs: one by particle classification, the other by particle simulation phases. Then I’ll show the problems we met, job conflicts and workload balance, and our solutions.
One particle system has 3 modules. Particle emitters control spawning and deleting dead particles. The render module controls how we render the particles: billboard, trail, mesh, or beam. Each emitter may have several affectors; each affector controls how the particle data is modified over its lifetime, such as color over life, gravity, motion, and so on.
And we use a global particle pool to control the particle system budget, meaning that at initialization time we already know the particle limit.
So our first rule is to split particles by classification. Particle systems split naturally into two types: entity-relative ones, which depend on animation results, such as animation trails or some character skills; and non-entity-relative ones, such as smoke, explosions, or bombs in the scene, which are self-contained and do not rely on any actor in the scene.
This split lets us submit non-entity-relative jobs at the very beginning of the frame, while entity-relative jobs have to wait for the animation system to finish. This helps balance the job workload.
Inside each particle system, we simply split the whole particle simulation into 3 phases. The first is the spawn job: we run all emitters in parallel to spawn particles and delete dead ones. Then the update jobs, like the color or size affectors, refresh and update the particle properties. In the third phase we prepare particles for rendering, building GPU-friendly data such as vertex buffers, material info, and draw-call-ready data.
Each phase waits for the previous phase to finish.
But this causes two problems. The first is a conflict in the global pool: since we use one big particle pool to control the budget, the spawn jobs need to take and delete particles from the pool in parallel. Render prepare has the same problem, writing particle results into one big vertex buffer pool in parallel.
The other problem is that some particle system jobs may run much longer than others, causing bad workload balance.
The particle job conflict problem is easy to handle with a simple lock-free scheme. We allocate particles from the global pool block by block and use an atomic number to avoid multi-threading problems. One block is 64 particles, a size that works well with cache lines, and an atomic number holds the total number of particles used in the block. It’s very much like a linear allocator.
Spawning one particle is just an atomic add of the particle count, then allocating within the block. If a block is full, the particle system allocates a new block from the global pool. Particle death works the other way: swap the dead particle with the last one in the block and atomically decrement the count. When a whole block is empty, or the particle system is removed, the whole block is freed back to the pool.
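A minimal sketch of this block scheme, assuming a plain `Particle` struct (the names and the exact fields are illustrative, not the engine's code):

```cpp
#include <array>
#include <atomic>

struct Particle { float x, y, z, life; };

struct ParticleBlock {
    static constexpr int kCapacity = 64;   // one block = 64 particles
    std::atomic<int> count{0};             // live particles in this block
    std::array<Particle, kCapacity> slots;

    // Spawn: claim a slot with a single fetch_add. Returns the slot index,
    // or -1 if the block is full (the caller then grabs a fresh block from
    // the global pool).
    int spawn() {
        int idx = count.fetch_add(1, std::memory_order_relaxed);
        if (idx >= kCapacity) {
            count.fetch_sub(1, std::memory_order_relaxed);  // undo overshoot
            return -1;
        }
        return idx;
    }

    // Kill: swap the dead particle with the last live one and decrement.
    // Safe in this simple form only when one job owns the block during the
    // delete pass, as in the spawn phase described above.
    void kill(int idx) {
        int last = count.fetch_sub(1, std::memory_order_relaxed) - 1;
        if (idx != last) slots[idx] = slots[last];
    }
};
```

This behaves like a linear allocator per block: no locks, just one atomic per spawn or kill, and the render-prepare phase can reserve vertex-buffer write positions with the same fetch_add pattern.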
The prepare-phase conflict is handled the same way: each job uses an AtomicAdd to get a write position in the big vertex buffer, so all prepare jobs can write into the same big pool in parallel.
The other problem with particle jobs is that some update jobs are much heavier than others. You can see in the picture that one bad, heavy job blocks the whole job system, because the prepare phase has to wait for all particle updates to finish.
These heavy jobs may be weather particles or massive ammo animation trails, since we batch one kind of ammo’s particles into one big particle system. Such single particle systems need their jobs split deeper, by particle emitter: one job per emitter.
The last subsystem case is our render thread. Our legacy single-threaded renderer is built on D3D11 and uses a traditional deferred shading pipeline. As you can see, we handle visibility and the whole render pipeline on the main thread.
Our new multi-threaded renderer has two parts. One is the render backend thread you saw in the previous section. This thread does not run on our job system; it flushes the built command lists on the immediate context.
The other part is the render job contexts. Each render job context builds a D3D11 command list using a deferred context. We simply split the jobs by scene, as you can see. There are at most 6 jobs: one for shadow, 3 for the G-buffer, one for translucent, and one for forward.
This split strategy is very simple to implement; it was our choice given the time limit.
At the very beginning of the frame, the render backend thread deals with the Scaleform UI. The UI does not build a command list; instead it submits directly on the immediate context. This is because we want to keep the GPU happy and send it work as soon as possible.
At the same time, two jobs are submitted to the job system: one for eye visibility and one for shadow visibility.
After a visibility job finishes, it emits more jobs, such as the G-buffer passes and the cascaded shadow parts.
When a job finishes building its command list, execution goes back to the render backend thread for submission. Some command lists have ordering dependencies; for example, the deferred shading work has to wait until both the G-buffer and shadow command lists have been submitted.
For our stress-test scenario, this is our early performance result on a high-end Intel PC with more than 8 cores. We can see a lot of holes in the profile, and the CPU usage is quite low. One frame costs more than 50 milliseconds.
After the parallel evolution, the result is much better: the total frame time drops to about 19 ms.
But you can still see some holes in the picture. The holes show the dependencies between different systems. For example, the physics system waits for all animation results to finish, while actually only the ragdoll results need animation. And the render thread still emits some long jobs.
Our next step is to clarify system dependencies further, and make each job wait only at the first point another system actually uses its result.
Another chart shows the performance on different core counts. You can see the system now scales much better from 2 cores to 6 cores, but performance improves little beyond 6 cores. The main reason is that above 6 cores the whole system is bound by the render jobs: since we build command lists per scene, in theory we have only 6 render jobs. We need to jobify rendering at a finer granularity in the future.
Beyond the job system, another optimization weapon is the Intel Masked Occlusion Culling library. It’s a high-performance software occlusion culling library and easy to integrate. It helps a lot in reducing draw calls.
We replaced our original occlusion culling implementation with it.
For common cases, both in our main city and on the battlefield, we get about a 20% performance improvement. In some extreme situations, like being behind a wall, performance may double.
Parallelization on multi-core gives us a performance improvement, so we can enrich the visual effects on high-end PCs: more clothing simulation, destruction, more particles, and ragdoll effects.
Parallelizing Conqueror’s Blade*
Making the Most of Intel® Core™ for the Best Gaming Experience
Engineer Lead @BoomingGames
Senior Engineer @BoomingGames
Application Engineer @Intel.com
Multi-core: Opportunities to scale user experience
Conqueror’s Blade*: Case study to leverage multi-core
Building job system
Jobifying engine sub-systems
Scaling user experience
Next Generation Multi-Core Processor
Physical CPU/cores increasing quickly
4 cores: max install base
6 cores: mainstream shipping
8-18 cores: high-end shipping
Multicore utilization of games today
Most are multithreaded, but only with 2~3 threads
Insufficient CPU utilization
Steam Hardware & Software Survey: February 2018
What to Do with the Idle Cores
Software occlusion culling
Buffering load turbulence
Balancing load among cores
Global illumination Detailed animation
Realistic destruction Advanced particles
Wind & Weather
Additional rendering passes
More details of distant model
Ambient animation and background life
With Great User Experience Comes Great
Scale User Experience (Performance + Effects) with More Cores
Key Problems to Consider
Enable perceptible multi-core scaling w/o impacting game play
The quality of effects
The types of high-quality effects
The coverage of high-quality effects on all units
Decompose engine functionality into jobs
Build efficient job scheduler
Engine Architecture Evolution
Building Scalable Game Engine
Case study parallelization engine subsystems
Scaled Gaming Experience
Tips & Tricks
Conqueror’s Blade* is a PC online-game
Hero : Action gameplay
Legion : Tactic gameplay
Empowered war machines
Motivation For Multicore Scalable Engine
Game is Logic Heavy
Huge number of individual soldiers
Rich battlefield elements
Problems of Legacy Architecture
Difficult to scale to more cores
Goals & Challenges
Support more than 1K actors with individual AI and states
Easy to scale
Multi-thread debug friendly
Game is in development & test
On-the-fly upgrade engine
Time-limited (~2.5 months)
Data is everything
Entity is just ID
Component holds only data
System contains the same kind of component and methods
Memory management friendly
*[Timothy17] Overwatch Gameplay Architecture and Netcode, GDC 2017
Original vs ECS
Original Model ECS Model
Data organized by entity
Data organized by system
Fixed Multi-thread (Legacy)
Visibility GBuffer Shadow Lighting Forward
Lua AI Motor ...
Thread Fork/Join (Intermediate)
Fork/Join from fixed thread
Visibility GBuffer Shadow Lighting Forward
Physics Particle ...
Job Based (Final)
Job | Job | Job | Network
Fiber based implementation*
What is fiber
A lightweight execution context (includes a user-provided stack, registers…)
Fiber execution is cooperative: a fiber can switch to another explicitly
Easy to implement task schedule
Easy to handle task dependency
Job stack is isolated
Avoid frequent context switches
C++ does not natively support fiber
Implementation differs between OSes
Has some restrictions (thread_local is invalid)
Boost context: Cross-platform, Industry proven, Fast
*[Christian15] Parallelizing the Naughty Dog engine using fibers, GDC 2015
Thread Independent Job Queue
Each worker thread has its own job queue
The job generated from the thread will be added to the queue
Separate Global Job Queue
Job submit outside job system (frame begin, some middleware …)
In most cases, job dependencies are tree-like
Some systems add jobs occasionally but wait for them immediately
Worker thread load balance
add global jobs
Worker thread gets global job from global queue
running → waiting → ready
new added jobs
On-The-Fly Change Step
Change to ECS Model
Entity level update to component level update
Gather same component to system, system level update
Parallelize each system
Keep system tick order
Split jobs in self system and wait jobs to finish before system end
Modify system dependency
Clarify system dependency
Launch independent systems at the same time
Wait for a system's jobs in the system that really depends on them
System From Single-Thread To Multi-Thread
Always the first change step
Behaves well when there are few conflicts
Backup of lock-free version
Batch and Swap
Useful for polling system
Use the simplest lock-free data structure
Physics System build on PhysX/Apex Library
Jobify PhysX Knowhow
PhysX Library support task
Only need to implement the
Code is easy to be integrated
Details to consider
PhysX occasionally submits tasks and then immediately waits for them to complete, so the LIFO schedule mode is suggested
PhysX has synchronization stage
Trigger sync stage
Animation Tree Update
Each Animation Tree updates
Skeleton Transform Calculation
Simply split jobs by actor count!
Related with many other systems
Not yet thread-safe
Difficult to balance job load
Cost has huge difference between actors
Animation system Related system
Cover the cost difference rather than truly balance it
Split job by experience
Launch independent systems earlier
Wait animation results in another dependency system
Lua as script
Lua call engine c++ functions
Lua is not natively multi-threaded
Make heavy calculation in C++
Gather calculations together
Parallel only c++ codes
Script logic can tick at a fixed interval (like 100 ms)
Jobify Particle System
Particle System Module
Experience Job Split Rules
By particle classification
By particle simulation phases
Problems & Solutions
Particle job conflicts
Particle job workload balance
Particle System Module
Particle spawn and delete
Color over Life, gravity, motion …
Use global particle pool to control particle budget
Job Split Rule 2 – Particle Phases
Particle emit and delete
Particle property refresh
Render Prepare jobs
GPU friendly data
Conflicts in global pool
Simply splitting job by particle system count causes bad workload balance
Particle System …
Solve Particle Job Conflict
Conflict Case 1
Allocate particle block from pool with Atomic
Allocating from a block is just an AtomicAdd
New particle from block
Simple swap with the last particle in block
When block is empty, free whole block back to pool
Conflict Case 2
Particle render transfer into one big vertex buffer
Use AtomicAdd to get write position in linear pool
Block Particle count
Workload Balance Problem
Bad job, too heavy
Split by Emitter
Some particle jobs are too heavy
Massive ammo animation trails
Split by particle emitters
Intel Masked Occlusion Culling Library *
CPU Software Occlusion Culling
Easy to be integrated
Reduce draw call
*Masked Occlusion Culling, https://github.com/GameTechDev/MaskedOcclusionCulling
Masked Software Occlusion Culling Result
Performance (4 cores)
Level             | Rasterize & | MOC off  | MOC on | Speedup
Main City         | 2.7 ms      | 25 fps   | 30 fps | 1.2x
Siege Battlefield | 3.1 ms      | 23.2 fps | 29 fps | 1.25x
Enriching Visual Effects for More Cores
Tips & Tricks
Optimize the code itself first rather than parallelize
Lock is your friend in the first step
Pending and swap
Data-oriented is both optimization friendly and debug friendly
Simple structure means easier to parallelize and debug
Further data-oriented design
More clearly identified system dependencies
Chunk-based multi-thread rendering
Job based lock (no more mutex, lock…)