Parallelizing Conqueror's Blade

Published in: Technology

  1. Parallelizing Conqueror’s Blade*: Making the Most of Intel® Core™ for the Best Gaming Experience
      • Nan Mi, Engineer Lead @BoomingGames
      • Lei Su, Senior Engineer @BoomingGames
      • Sheng Guo, Application Engineer @Intel.com
  2. Agenda
      • Multi-core: opportunities to scale user experience
      • Conqueror’s Blade*: case study to leverage multi-core
          • Optimization background
          • Building the job system
          • Jobifying engine sub-systems
          • Scaling user experience
  3. Next Generation Multi-Core Processor
      • The number of physical CPU cores is increasing quickly
          • 4 cores: largest install base
          • 6 cores: mainstream shipping
          • 8-18 cores: high-end shipping
      • Multi-core utilization of games today
          • Most are multithreaded, but with only 2-3 heavy threads
          • Insufficient CPU utilization
      • Source: Steam Hardware & Software Survey, February 2018
  4. What to Do with the Idle Cores
      • Boost Performance
          • Software occlusion culling
          • Buffering load turbulence
          • Balancing load among cores
      • Enrich Experience
          • Global illumination
          • Detailed animation
          • Realistic clothing
          • Realistic ragdoll
          • Realistic destruction
          • Advanced particles
          • Wind & weather
          • 3D audio
          • Additional rendering passes
          • More details of distant models
          • Ambient animation and background life
          • Decorative contents
  5. With Great User Experience Comes Great Parallelized Engine
      • Maximized User Experience
      • Parallelized Game Engine
      • Scale user experience (performance + effects) with more cores
  6. Key Problems to Consider
      • Maximize User Experience
          • Enable perceptible multi-core scaling without impacting gameplay
              • The quality of effects
              • The types of high-quality effects
              • The coverage of high-quality effects on all units
      • Parallelize Game Engine
          • Decompose engine functionality into fine-grained jobs
              • Rendering
              • Game logic
              • Simulation
          • Build an efficient job scheduler
  7. Case study: Conqueror’s Blade*
  8. Outline
      • Game Background
      • Engine Architecture Evolution
      • Building a Scalable Game Engine
          • Job system
          • Case study: parallelizing engine subsystems
      • Scaled Gaming Experience
      • Tips & Tricks
      • Future Work
  9. Game Background
      • Conqueror’s Blade* is a PC online game
          • Hero: action gameplay
          • Legion: tactical gameplay
          • Empowered war machines
          • Immersive battlefield
  10. Gameplay Trailer
  11. Motivation for a Multi-Core Scalable Engine
      • The game is logic heavy
          • Huge number of individual soldiers
          • Dynamic battleground
          • Rich battlefield elements
      • Problems of the legacy architecture
          • Difficult to scale to more cores
          • CPU bound
  12. Goals & Challenges
      • Goals
          • Support more than 1K actors with individual AI and states
          • Dynamic battlefield
          • Easy to scale
          • Friendly to multi-thread debugging
      • Challenges
          • The game is in development and test
          • The engine must be upgraded on the fly
          • Time-limited (~2.5 months)
      • Technique Choice
          • Entity-Component-System model
          • Job system
  13. ECS Model
      • Entity-Component-System*
          • Data is everything
          • An entity is just an ID
          • A component holds only data
          • A system contains all components of the same kind and the methods that operate on them
      • Pros
          • Parallelization friendly
          • Cache friendly
          • Memory management friendly
      • *[Timothy17] Overwatch Gameplay Architecture and Netcode, GDC 2017
  14. Original vs ECS
      • Original Model
          • Data organized by entity: each entity holds its own Animation, Physics, Transform, ... components
          • Data is heterogeneous
          • Memory jumping
          • Cache misses
      • ECS Model
          • Data organized by system: the Animation System, Physics System, and Transform System each hold all components of their kind
          • Data is homogeneous
          • Memory is contiguous
          • Cache friendly
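The "data organized by system" layout can be sketched in a few lines of C++. This is only an illustration of the idea, not the engine's code: `Entity`, `TransformComponent`, and `TransformSystem` are hypothetical names, and using the array index as the entity ID is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; not the engine's actual types.
using Entity = std::uint32_t;  // an entity is just an ID

struct TransformComponent {    // a component holds only data
    float x = 0.0f, y = 0.0f, z = 0.0f;
    float vx = 0.0f, vy = 0.0f, vz = 0.0f;
};

// A system owns all components of one kind in contiguous memory
// and holds the methods that operate on them.
class TransformSystem {
public:
    // Assumption for brevity: the entity ID is the index into the array.
    Entity add(const TransformComponent& c) {
        components_.push_back(c);
        return static_cast<Entity>(components_.size() - 1);
    }

    // Linear pass over homogeneous data. Any sub-range [begin, end)
    // could be handed to a separate job.
    void update(std::size_t begin, std::size_t end, float dt) {
        assert(begin <= end && end <= components_.size());
        for (std::size_t i = begin; i < end; ++i) {
            TransformComponent& c = components_[i];
            c.x += c.vx * dt;
            c.y += c.vy * dt;
            c.z += c.vz * dt;
        }
    }

    const TransformComponent& get(Entity e) const { return components_.at(e); }
    std::size_t size() const { return components_.size(); }

private:
    std::vector<TransformComponent> components_;
};
```

Because `update` takes a range, splitting the work across jobs only requires picking range boundaries; no per-entity object has to be touched.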
  15. Fixed Multi-thread (Legacy)
      • Fixed multi-thread: one fixed thread each for Render, Simulation, and Logic
      • Diagram:
          • Render: Visibility, GBuffer, Shadow, Lighting, Forward, Transparent, Postprocess, UI
          • Simulation: LOD, Animation, Physics, Particle, ...
          • Logic: Lua, AI, Motor, ...
          • Network
  16. Thread Fork/Join (Intermediate)
      • Fixed multi-thread
      • Thread fork/join
          • Thread pool
          • Fork/join from the fixed threads
      • Diagram: the fixed Render, Simulation, and Logic threads remain; AI tasks and Animation tasks are forked to a pool of work threads and joined back
  17. Job Based (Final)
      • Fixed multi-thread
      • Thread fork/join
      • Job based
          • Render backend
          • Job system
          • Network…
      • Diagram (engine architecture): Network, Render Backend, and the Job System; the job system has one global job queue plus work threads that each own a job queue, with some jobs in a waiting state
  18. Job System
      • Fiber-based implementation*
      • What is a fiber
          • A lightweight execution context (includes a user-provided stack, registers, …)
          • Fiber execution is cooperative, meaning a fiber explicitly switches to another fiber
      • Pros
          • Easy to implement task scheduling
          • Easy to handle task dependencies
          • Each job's stack is isolated
          • Avoids frequent context switches
      • Cons
          • C++ does not natively support fibers
          • Implementation differs between operating systems
          • Has some restrictions (thread_local is invalid)
      • Fiber implementation
          • Boost context: cross-platform, industry proven, fast
      • *[Christian15] Parallelizing the Naughty Dog engine using fibers, GDC 2015
  19. Job Scheduler
      • Per-thread job queue
          • Each work thread has its own job queue
          • A job generated on a thread is added to that thread's queue
      • Separate global job queue
          • For jobs submitted from outside the job system (frame begin, some middleware, …)
      • LIFO mode
          • In most cases, job dependencies are tree-like
          • Some systems add jobs occasionally but wait for them immediately
      • Job stealing
          • Balances load between work threads
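The queue policy above (own queue in LIFO order, then the global queue, then stealing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's scheduler: `JobQueues` and its methods are hypothetical names, the queues are guarded by plain mutexes rather than being lock-free, there are no fibers, and stealing the oldest job from the victim is a common convention that the slides do not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

using Job = std::function<void()>;

// Hypothetical sketch of the queue layout only; worker threads are omitted.
class JobQueues {
public:
    explicit JobQueues(std::size_t workerCount) : local_(workerCount) {}

    // Called from outside the job system (frame begin, middleware, ...).
    void submitGlobal(Job job) {
        std::lock_guard<std::mutex> lock(global_.mutex);
        global_.jobs.push_back(std::move(job));
    }

    // Called by a job running on `worker` to spawn a child job.
    void submitLocal(std::size_t worker, Job job) {
        assert(worker < local_.size());
        std::lock_guard<std::mutex> lock(local_[worker].mutex);
        local_[worker].jobs.push_back(std::move(job));
    }

    // Order: own queue (LIFO), then the global queue, then steal.
    std::optional<Job> next(std::size_t worker) {
        assert(worker < local_.size());
        if (auto job = popBack(local_[worker])) return job;
        if (auto job = popFront(global_)) return job;
        for (std::size_t i = 1; i < local_.size(); ++i) {
            std::size_t victim = (worker + i) % local_.size();
            // Assumption: steal the oldest job from the other end.
            if (auto job = popFront(local_[victim])) return job;
        }
        return std::nullopt;
    }

private:
    struct Queue {
        std::mutex mutex;
        std::deque<Job> jobs;
    };

    static std::optional<Job> popBack(Queue& q) {
        std::lock_guard<std::mutex> lock(q.mutex);
        if (q.jobs.empty()) return std::nullopt;
        Job job = std::move(q.jobs.back());
        q.jobs.pop_back();
        return job;
    }

    static std::optional<Job> popFront(Queue& q) {
        std::lock_guard<std::mutex> lock(q.mutex);
        if (q.jobs.empty()) return std::nullopt;
        Job job = std::move(q.jobs.front());
        q.jobs.pop_front();
        return job;
    }

    std::vector<Queue> local_;
    Queue global_;
};
```

LIFO on the owning thread means the most recently spawned child jobs run first, which suits a caller that submits a few jobs and waits for them immediately.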
  20. Global Job
      • Outside threads add global jobs
      • A work thread gets a global job from the global queue
      • Diagram: global job queue feeding the work threads, each with its own job queue
  21. Job Dependency
      • Job states shown: running, waiting, ready
      • Newly added jobs that a waiting job depends on run first
      • Diagram: global job queue and work threads, each with its own job queue
  22. Job Stealing
      • Diagram: a work thread whose job queue is empty takes jobs from another work thread's queue
  23. On-The-Fly Change Steps
      • Change to the ECS model
          • Move from entity-level update to component-level update
          • Gather components of the same kind into a system; update at system level
      • Parallelize each system
          • Keep the system tick order
          • Split jobs within the system and wait for them to finish before the system ends
      • Modify system dependencies
          • Clarify system dependencies
          • Launch independent systems at the same time
          • Wait for a system's jobs only in the systems that really depend on them
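The last step, launching independent systems together and waiting only where a real dependency exists, might look like the following. This is a sketch using `std::async` in place of the engine's job system; `FrameData`, `updateAnimation`, `updatePhysics`, `updateParticles`, and `tickFrame` are hypothetical stand-ins, and the dependency chosen (particles read animation results) is an illustrative assumption.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-ins for engine systems and their data.
struct FrameData {
    std::vector<float> bonePoses;      // written by animation
    std::vector<float> bodyPositions;  // written by physics
    std::vector<float> trailPoints;    // written by particles, reads bonePoses
};

inline void updateAnimation(FrameData& f) { f.bonePoses.assign(4, 1.0f); }
inline void updatePhysics(FrameData& f) { f.bodyPositions.assign(4, 2.0f); }
inline void updateParticles(FrameData& f) {
    assert(!f.bonePoses.empty());  // needs animation results
    f.trailPoints = f.bonePoses;
}

inline void tickFrame(FrameData& f) {
    // Animation and physics write disjoint data, so launch them together.
    std::future<void> animation =
        std::async(std::launch::async, updateAnimation, std::ref(f));
    std::future<void> physics =
        std::async(std::launch::async, updatePhysics, std::ref(f));

    // Wait only where the result is really needed.
    animation.get();
    updateParticles(f);  // may overlap with physics, which can still be running

    physics.get();       // join before the frame ends
}
```

Calling `tickFrame` once per frame keeps the visible tick order (animation before particles) while letting physics overlap with both.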
  24. System From Single-Thread To Multi-Thread
      • Lock
          • Always the first change step
          • Behaves well when there are few conflicts
          • Kept as a fallback for the lock-free version
      • Batch and swap
          • Useful for polling systems
      • Lock-free
          • Use the simplest lock-free data structure
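One plausible reading of "batch and swap" is a pending buffer that producers append to under a lock, and that the polling system swaps out in one step so it can process the batch without holding the lock. The sketch below assumes that reading; `BatchQueue` and its method names are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: producers append under a short lock, the polling
// consumer swaps the whole batch out and processes it lock-free afterwards.
template <typename T>
class BatchQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(item));
    }

    // The lock is held only for the swap, not while items are processed.
    std::vector<T> takeBatch() {
        std::vector<T> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        return batch;
    }

private:
    std::mutex mutex_;
    std::vector<T> pending_;
};
```

A system that polls once per tick would call `takeBatch()` at the start of its update and iterate over the returned vector.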
  25. Subsystems Overview
      • Lua
      • Physics
      • Animation
      • Particle
      • Motor
      • Render
  26. Physics System
      • The physics system is built on the PhysX/Apex libraries
      • Features
          • Rigid body
          • Cloth
          • Destruction
          • Ragdoll
  27. Jobify PhysX Know-How
      • The PhysX library supports tasks
          • Only PxCpuDispatcher needs to be implemented
          • The code is easy to integrate
      • Details to consider
          • PhysX occasionally submits tasks and then immediately waits for them to complete, so LIFO mode is recommended
          • PhysX has a synchronization stage
              • PxScene::flushQueryUpdates (annotation: triggers the sync stage)
      • Annotation: reduce shape usage!
  28. Animation Works
      • Animation tree update
          • Each animation tree updates independently
          • Triggers effect/particle/sound…
      • Skeleton transform calculation
      • Annotation: simply split jobs by actor count!
  29. Difficulties
      • Related to many other systems
          • Those systems are not yet thread-safe
      • Difficult to balance job load
          • Cost differs hugely between actors
      • Annotation: bad job
  30. MPSC Queue
      • Diagram: several animation workers each push operations (op) into a multi-producer single-consumer queue; stages shown are Pre, Fetch OP, and Post; the queue connects the animation system to the related system
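A multi-producer single-consumer queue of the kind shown in the diagram can be built from one atomic pointer. The sketch below is an assumption about the general technique, not the engine's queue: `MpscQueue`, `push`, and `drain` are hypothetical names. Producers push with a compare-and-swap loop, and the single consumer detaches the whole list with one atomic exchange.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of a lock-free MPSC queue.
template <typename T>
class MpscQueue {
public:
    MpscQueue() = default;
    MpscQueue(const MpscQueue&) = delete;
    MpscQueue& operator=(const MpscQueue&) = delete;
    ~MpscQueue() { drain(); }

    // Any thread: link the new node in front of the current head.
    void push(T value) {
        Node* node = new Node{std::move(value),
                              head_.load(std::memory_order_relaxed)};
        // On failure, compare_exchange_weak reloads the head into node->next.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Single consumer: take everything at once, returned oldest first.
    std::vector<T> drain() {
        Node* node = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<T> items;
        while (node != nullptr) {
            items.push_back(std::move(node->value));
            Node* next = node->next;
            delete node;
            node = next;
        }
        std::reverse(items.begin(), items.end());
        return items;
    }

private:
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};
};
```

Animation jobs would call `push` for each operation that touches another system, and that system would call `drain` once after the animation jobs have finished.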
  31. Load Balance
      • Cover the imbalance rather than truly balance the load
      • Split jobs based on experience
      • Launch independent systems earlier
      • Wait for animation results inside the dependent system
      • Annotation: cover job
  32. Script
      • Script usage
          • Lua is the scripting language
          • Lua calls engine C++ functions
      • Jobifying script
          • Lua is not natively multi-threaded
          • Do heavy calculations in C++
          • Gather calculations together
          • Parallelize only the C++ code
          • Script logic can tick at a fixed interval (for example, 100 ms)
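Ticking script logic at a fixed interval is commonly done with a time accumulator. The sketch below assumes that approach; `FixedTicker` and `advance` are hypothetical names, and integer milliseconds are used only to keep the arithmetic exact.

```cpp
#include <cassert>

// Hypothetical sketch: accumulate frame time and report how many fixed
// script ticks are due, carrying the remainder over to the next frame.
class FixedTicker {
public:
    explicit FixedTicker(int stepMs) : stepMs_(stepMs) { assert(stepMs > 0); }

    // Feed the elapsed frame time; returns the number of ticks to run now.
    int advance(int frameMs) {
        accumulatedMs_ += frameMs;
        int ticks = accumulatedMs_ / stepMs_;
        accumulatedMs_ -= ticks * stepMs_;
        return ticks;
    }

private:
    int stepMs_;
    int accumulatedMs_ = 0;
};
```

With a 100 ms step, a 16 ms frame yields no script tick, so most frames skip the single-threaded Lua work entirely.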
  33. Jobify Particle System
      • Particle system module
      • Experience-based job split rules
          • By particle classification
          • By particle simulation phase
      • Problems & solutions
          • Particle job conflicts
          • Particle job workload balance
  34. Particle System Module
      • Particle emitters
          • Particle spawn and delete
      • Particle renders
          • Billboard/Trail/Mesh/Beam …
      • Particle affectors
          • Color over life, gravity, motion …
      • A global particle pool is used to control the particle budget
      • Diagram: a particle system contains emitters; each emitter has affectors and a render
  35. Job Split Rule 1 - Particle Classification
      • Entity-related
          • Depends on animation results
          • Animation trails, etc.
      • Non-entity-related
          • Smoke, explosion, weather, etc.
  36. Job Split Rule 2 - Particle Phases
      • Spawn jobs
          • Particle emit and delete
      • Update jobs
          • Particle property refresh
      • Render prepare jobs
          • GPU-friendly data
      • Problems:
          • Conflicts in the global pool
          • Simply splitting jobs by particle system count gives a poorly balanced workload
      • Diagram: each particle system (emitters with affectors and a render) passes through the Spawn, Update, and Render Prepare phases
  37. Solve Particle Job Conflict
      • Conflict case 1
          • Particle spawn
              • Allocate a particle block from the pool atomically
              • Allocating a block is just an AtomicAdd
              • New particles are taken from the block
          • Particle death
              • Simply swap with the last particle in the block
              • When the block is empty, free the whole block back to the pool
      • Conflict case 2
          • Particle render data is transferred into one big vertex buffer
          • Use AtomicAdd to get the write position in the linear pool
      • Diagram: the pool holds blocks; each block has a block pointer, an atomic particle count, and its particles
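Both conflict cases reduce to an atomic add on a shared counter, which can be sketched as below. This is a simplified illustration, not the engine's pool: all names are hypothetical, the pool is a bump allocator that omits returning empty blocks, and the per-block count is a plain integer on the assumption that only one job touches a given block at a time (the slide's diagram shows an atomic count).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle {
    float life = 0.0f;
};

// Hypothetical block. Assumption: one job owns a block at a time.
struct ParticleBlock {
    static constexpr std::size_t kCapacity = 64;
    Particle particles[kCapacity];
    std::size_t count = 0;

    Particle* spawn() {
        return count < kCapacity ? &particles[count++] : nullptr;
    }

    // Particle death: overwrite slot i with the last live particle.
    void kill(std::size_t i) {
        assert(i < count);
        particles[i] = particles[--count];
    }
};

// Conflict case 1: block allocation is a single atomic add.
// Freeing blocks back to the pool is omitted in this sketch.
class ParticlePool {
public:
    explicit ParticlePool(std::size_t blockCount) : blocks_(blockCount) {}

    ParticleBlock* allocateBlock() {
        std::size_t index = next_.fetch_add(1, std::memory_order_relaxed);
        return index < blocks_.size() ? &blocks_[index] : nullptr;
    }

private:
    std::vector<ParticleBlock> blocks_;
    std::atomic<std::size_t> next_{0};
};

// Conflict case 2: each job reserves its write range in one big buffer.
class LinearVertexCursor {
public:
    static constexpr std::size_t kFull = static_cast<std::size_t>(-1);

    explicit LinearVertexCursor(std::size_t capacity) : capacity_(capacity) {}

    // Returns the start offset of the reserved range, or kFull.
    std::size_t reserve(std::size_t vertexCount) {
        std::size_t start =
            cursor_.fetch_add(vertexCount, std::memory_order_relaxed);
        if (start + vertexCount > capacity_) return kFull;
        return start;
    }

private:
    std::size_t capacity_;
    std::atomic<std::size_t> cursor_{0};
};
```

Since every job receives a disjoint block or a disjoint vertex range, the jobs can write their particles without taking any lock.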
  38. Workload Balance Problem
      • Figure annotations: good particle jobs; bad job, too heavy
  39. Split by Emitter
      • Some particle jobs are too heavy
          • Weather particles
          • Massive ammo animation trails
      • Split by particle emitter
  40. Render Thread
      • Legacy single-thread render
          • D3D11
          • Deferred shading pipeline
          • Visibility & render on the main thread
      • Pipeline: Visibility, Scaleform UI, GBuffer, Cascade Shadow, Deferred Shading, Forward, Transparent, PostProcess, Present
  41. Multi-threaded Render
      • Render backend thread
          • Flushes command lists on the immediate context
      • Render job context
          • Builds D3D11 command lists using deferred contexts
      • Split per scene
          • 6 render jobs
              • Shadow
              • GBuffer
                  • Terrain related
                  • Static object related
                  • Dynamic object related
              • Translucent
              • Forward
  42. Render Multi-Thread Workflow
      • Diagram (timeline): the render thread owns the immediate context; six work jobs each own a deferred context
      • Stages shown: Scaleform UI, Eye Visibility, Shadow Visibility, GBuffer Terrain, GBuffer Static, GBuffer Dynamic, Forward, Transparent, Cascade Shadow, GBuffer Command, Shadow Command, Deferred Shading, CLWait, PostProcessCL
  43. Performance Comparison - Before
      • > 50ms
  44. Performance Comparison - After
      • Much better: ~19ms
  45. CPU Scaling
      • Chart: scaling (vertical axis 0 to 3.5) for 2 cores, 4 cores, 6 cores, 8 cores, and more than 8 cores
      • Annotation: render needs to be better jobified
  46. Extra Optimization
      • Intel Masked Occlusion Culling library*
          • CPU software occlusion culling
          • Easy to integrate
          • Reduces draw calls
      • *Masked Occlusion Culling, https://github.com/GameTechDev/MaskedOcclusionCulling
  47. Masked Software Occlusion Culling Result
      • Performance (4 cores)

        Level              Rasterize & Visibility   MOC off    MOC on   Speedup
        Main City          2.7ms                    25 fps     30 fps   1.2x
        Siege Battlefield  3.1ms                    23.2 fps   29 fps   1.25x
  48. Enriching Visual Effects for More Cores
      • Clothing
      • Physics destruction
      • Particles
      • Ragdoll
      • Animation
  49. Tips & Tricks
      • Optimize the code itself first, before parallelizing it
      • A lock is your friend in the first step
      • Pending and swap
      • Data-oriented design is both optimization friendly and debug friendly
      • A simple structure is easier to parallelize and debug
  50. Future Work
      • Further data-oriented design
      • More clearly identified system dependencies
      • Chunk-based multi-thread rendering
      • Job-based lock (no more mutex, lock, …)
  51. Thanks
  52. Legal Disclaimer & Optimization Notice
      Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
      INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
      Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
      Copyright © 2018, Intel Corporation. All rights reserved.
      Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
