Sony Computer Entertainment Europe
     Research & Development Division


Pitfalls of Object Oriented Programming

                  Tony Albrecht – Technical Consultant
                                     Developer Services
What I will be covering
• A quick look at Object Oriented (OO)
  programming
• A common example
• Optimisation of that example
• Summary

                    Slide 2
Object Oriented (OO) Programming
•   What is OO programming?
     –   a programming paradigm that uses "objects" – data structures
         consisting of datafields and methods together with their interactions – to
         design applications and computer programs.
         (Wikipedia)

•   Includes features such as
     –   Data abstraction
     –   Encapsulation
     –   Polymorphism
     –   Inheritance




                                     Slide 3
What’s OOP for?
• OO programming allows you to think about
  problems in terms of objects and their
  interactions.
• Each object is (ideally) self contained
   – Contains its own code and data.
   – Defines an interface to its code and data.
• Each object can be perceived as a 'black box'.
                        Slide 4
Objects
• If objects are self contained then they can be
   – Reused.
   – Maintained without side effects.
   – Used without understanding internal
     implementation/representation.

• This is good, yes?

                       Slide 5
Are Objects Good?

• Well, yes

• And no.

• First some history.

                        Slide 6
A Brief History of C++



       C++ development started




1979                                2009


                         Slide 7
A Brief History of C++


            Named “C++”




1979 1983                            2009


                          Slide 8
A Brief History of C++


              First Commercial release




1979   1985                              2009


                        Slide 9
A Brief History of C++


                 Release of v2.0




1979      1989                     2009


                    Slide 10
A Brief History of C++
                                Added
                                • multiple inheritance,
                                • abstract classes,
                 Release of v2.0
                                • static member functions,
                                • const member functions
                                • protected members.




1979      1989                               2009


                    Slide 11
A Brief History of C++


                                 Standardised




1979                      1998                  2009


               Slide 12
A Brief History of C++


                                 Updated




1979                      2003      2009


               Slide 13
A Brief History of C++


                                   C++0x




1979                            2009   ?


               Slide 14
So what has changed since 1979?
• Many more features have
  been added to C++
• CPUs have become much
  faster.
• Transition to multiple cores
• Memory has become faster.
                                   http://www.vintagecomputing.com




                        Slide 15
CPU performance




     Slide 16   Computer architecture: a quantitative approach
                By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
CPU/Memory performance




         Slide 17   Computer architecture: a quantitative approach
                    By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
What has changed since 1979?
• One of the biggest changes is that memory
  access speeds are far slower (relatively)
   – 1980: RAM latency ~ 1 cycle
   – 2009: RAM latency ~ 400+ cycles


• What can you do in 400 cycles?

                      Slide 18
What has this to do with OO?
• OO classes encapsulate code and data.
• So, an instantiated object will generally
  contain all data associated with it.




                     Slide 19
My Claim
• With modern HW (particularly consoles),
  excessive encapsulation is BAD.

• Data flow should be fundamental to your
  design (Data Oriented Design)

                   Slide 20
Consider a simple OO Scene Tree

•   Base Object class
     – Contains general data
•   Node
     – Container class
•   Modifier
     – Updates transforms
•   Drawable/Cube
     – Renders objects


                               Slide 21
Object
• Each object
  –   Maintains bounding sphere for culling
  –   Has transform (local and world)
  –   Dirty flag (optimisation)
  –   Pointer to Parent


                       Slide 22
Objects
Class Definition                     Memory Layout
(each square is 4 bytes)




                          Slide 23
Nodes
• Each Node is an object, plus
   – Has a container of other objects
   – Has a visibility flag.




                      Slide 24
Nodes
Class Definition              Memory Layout




                   Slide 25
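The class-definition diagrams on these slides are images in the original deck. A minimal C++ sketch of what such classes might look like — m_Transform, m_WorldTransform, m_Dirty and m_Objects appear in the slide text; the remaining names and types are assumptions:

```cpp
#include <vector>

struct Matrix4 { float m[16]; };           // 4x4 transform, 64 bytes
struct Sphere  { float x, y, z, radius; }; // bounding sphere, 16 bytes

// Sketch of the slide's Object: transform pair, bounding spheres,
// dirty flag and parent pointer -- well over one cache line already.
class Object
{
public:
    virtual ~Object() {}
    virtual void Update() {} // a virtual call per object in the OO version
protected:
    Matrix4 m_Transform;       // local transform
    Matrix4 m_WorldTransform;  // cached world transform
    Sphere  m_Bounds;          // local-space bounding sphere
    Sphere  m_WorldBounds;     // world-space bounding sphere
    Object* m_Parent;          // pointer to parent
    bool    m_Dirty;           // the "optimisation" flag
};

// A Node is an Object plus a container of children and a visibility flag.
class Node : public Object
{
protected:
    std::vector<Object*> m_Objects;
    bool                 m_Visible;
};
```

With 128-byte cache lines, a single instance like this already spans more than one line, which is why each object touched during traversal costs multiple misses.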
Consider the following code…
• Update the world transform and world
  space bounding sphere for each object.




                  Slide 26
Consider the following code…
• Leaf nodes (objects) return transformed
  bounding spheres




                   Slide 27
Consider the following code…
• Leaf nodes (objects) return transformed
  bounding spheres

           What's wrong with this code?

                      Slide 28
Consider the following code…
• Leaf nodes (objects) return transformed
  bounding spheres

           If m_Dirty=false then we get a branch
           misprediction which costs 23 or 24
           cycles.

                      Slide 29
Consider the following code…
• Leaf nodes (objects) return transformed
  bounding spheres

           Calculation of the world bounding
           sphere takes only 12 cycles.

                      Slide 30
Consider the following code…
• Leaf nodes (objects) return transformed
  bounding spheres

           So using a dirty flag here is actually
           slower than not using one (in the case
           where it is false)

                            Slide 31
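The code on these slides is an image; below is a hedged reconstruction of the shape being discussed. The sphere maths is a trivial stand-in for the real ~12-cycle calculation, not the original code:

```cpp
struct Matrix { float m[16]; };  // translation assumed in m[12..14]
struct Sphere { float x, y, z, r; };

// Stand-in for the ~12-cycle world-bounding-sphere calculation.
static Sphere TransformSphere(const Matrix& world, const Sphere& s)
{
    Sphere out = s;
    out.x += world.m[12];
    out.y += world.m[13];
    out.z += world.m[14];
    return out;
}

struct Leaf
{
    Matrix m_WorldTransform;
    Sphere m_Bounds;       // local space
    Sphere m_WorldBounds;  // cached world space
    bool   m_Dirty;

    const Sphere& GetWorldBoundingSphere()
    {
        if (!m_Dirty)               // the if(!m_Dirty) comparison: ~23-24
            return m_WorldBounds;   // cycles when mispredicted
        m_WorldBounds = TransformSphere(m_WorldTransform, m_Bounds);
        m_Dirty = false;            // skips the ~12-cycle maths next time
        return m_WorldBounds;
    }
};
```

When the flag is usually false, the mispredicted branch costs roughly twice the calculation it avoids — the slides' point exactly.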
Let's illustrate cache usage
Main Memory    Each cache line is
               128 bytes                  L2 Cache




                               Slide 32
Cache usage
Main Memory                                L2 Cache

              parentTransform is already
              in the cache (somewhere)




                               Slide 33
Cache usage
               Assume this is a 128-byte
Main Memory    boundary (start of a cache line)   L2 Cache




                              Slide 34
Cache usage
Main Memory                                 L2 Cache




              Load m_Transform into cache




                               Slide 35
Cache usage
Main Memory                                    L2 Cache




              m_WorldTransform is stored via
                  cache (write-back)
                                Slide 36
Cache usage
Main Memory                 L2 Cache




                            Next it loads m_Objects

                 Slide 37
Cache usage
Main Memory                                 L2 Cache



 Then a pointer is pulled from
  somewhere else (Memory
   managed by std::vector)




                                 Slide 38
Cache usage
Main Memory                                  L2 Cache




  vtbl ptr loaded into Cache




                                  Slide 39
Cache usage
Main Memory                                 L2 Cache




   Look up virtual function




                                 Slide 40
Cache usage
Main Memory                               L2 Cache




    Then branch to that code
      (load in instructions)




                               Slide 41
Cache usage
 Main Memory                                 L2 Cache




New code checks dirty flag then
 sets world bounding sphere




                                  Slide 42
Cache usage
  Main Memory                                L2 Cache



Node's World Bounding Sphere
      is then expanded




                                  Slide 43
Cache usage
Main Memory                           L2 Cache




                Then the next Object is
                      processed
                  Slide 44
Cache usage
Main Memory                              L2 Cache




               First object costs at least 7
                      cache misses
                    Slide 45
Cache usage
Main Memory                                L2 Cache




              Subsequent objects cost at least
                  2 cache misses each



                        Slide 46
The Test
• 11,111 nodes/objects in a
  tree 5 levels deep
• Every node being
  transformed
• Hierarchical culling of tree
• Render method is empty


                         Slide 47
Performance

              This is the time
              taken just to
              traverse the tree!




   Slide 48
Why is it so slow?
                 ~22ms




      Slide 49
Look at GetWorldBoundingSphere()




              Slide 50
Samples can be a little
misleading at the source
code level




Slide 51
if(!m_Dirty) comparison




                          Slide 52
Stalls due to the load 2
           instructions earlier




Slide 53
Similarly with the matrix
           multiply




Slide 54
Some rough calculations
Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms




                                Slide 55
Some rough calculations
Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms
L2 Cache misses: 36,345 @ 400 cycles each      ~= 4.54ms




                                Slide 56
• From Tuner, ~ 3 L2 cache misses per
  object
  – These cache misses are mostly sequential
    (more than 1 fetch from main memory can
    happen at once)
  – Code/cache miss/code/cache miss/code…


                    Slide 57
Slow memory is the problem here


• How can we fix it?
• And still keep the same functionality and
  interface?



                     Slide 58
The first step
• Use homogeneous, sequential sets of data




                   Slide 59
Homogeneous Sequential Data




           Slide 60
Generating Contiguous Data
• Use custom allocators
   – Minimal impact on existing code
• Allocate contiguous
   – Nodes
   – Matrices
   – Bounding spheres


                        Slide 61
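One way to get that contiguity with minimal impact on calling code is a simple pool. The talk doesn't show its allocator, so this bump-style design is a sketch and an assumption:

```cpp
#include <cstddef>
#include <vector>

// Bump-style pool: every T comes from one contiguous block, so a
// traversal that allocates nodes/matrices/spheres in tree order will
// also read them back in address order.
template <typename T>
class Pool
{
public:
    explicit Pool(std::size_t capacity) { m_Items.reserve(capacity); }

    // Hands out the next contiguous slot (no per-item heap allocation).
    // Pointers stay valid as long as the reserved capacity isn't exceeded.
    T* Allocate()
    {
        m_Items.push_back(T());
        return &m_Items.back();
    }

    std::size_t Size() const { return m_Items.size(); }

private:
    std::vector<T> m_Items;
};
```

Separate pools for Nodes, Matrices and Bounding spheres keep each data type homogeneous and sequential, as the slide suggests.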
Performance
              19.6ms -> 12.9ms

              35% faster just by
              moving things around in
              memory!




   Slide 62
What next?
• Process data in order
• Use implicit structure for hierarchy
   – Minimise jumping to and fro between nodes.
• Group logic to optimally use what is already in
  cache.
• Remove regularly called virtuals.

                       Slide 63
Hierarchy
We start with a parent Node

                  Node

                  Slide 64
Hierarchy
            Node


Node   Which has child nodes   Node




            Slide 65
Hierarchy
            Node


Node                     Node
       And they have a
       parent




            Slide 66
Hierarchy
                            Node


          Node                                  Node




Node   Node      Node   Node           Node   Node     Node   Node

                        And they have
                        children
                            Slide 67
Hierarchy
                             Node


          Node                                  Node




Node   Node      Node   Node           Node   Node     Node   Node

                        And they all have
                        parents
                            Slide 68
Hierarchy
                             Node


           Node                                 Node


Node   Node   Node   Node           Node   Node   Node   Node

                          A lot of this
                      information can be
                            inferred
                              Slide 69
Hierarchy
Node

Node   Node

Node   Node   Node   Node    Node   Node   Node   Node

                         Use a set of arrays,
                        one per hierarchy level




                                 Slide 70
Hierarchy
Node :2                     Parent has 2 children

Node :4   Node :4           Children have 4 children each

Node   Node   Node   Node   Node   Node   Node   Node




                                    Slide 71
Hierarchy
Node :2                      Ensure nodes and their
                             data are contiguous in
Node :4   Node :4                   memory

Node   Node   Node   Node   Node   Node   Node   Node




                                    Slide 72
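The flat layout sketched above might be declared like this. This is an illustration, not the talk's actual code; the names are assumptions, with childCount standing in for the ":2" / ":4" annotations:

```cpp
#include <cstddef>
#include <vector>

// One entry per node. Parent/child pointers are gone: the children of
// the i-th node at level L sit, in order, in the array for level L+1.
struct FlatNode
{
    float    worldTransform[16];
    float    worldBounds[4]; // x, y, z, radius
    unsigned childCount;     // the ":2" / ":4" counts on the slide
};

// levels[0] holds the root, levels[1] its children, and so on,
// each level contiguous in memory.
typedef std::vector< std::vector<FlatNode> > Hierarchy;

// Index of node i's first child within the next level's array --
// recovered from the counts, so no stored links are needed.
inline std::size_t FirstChildIndex(const std::vector<FlatNode>& level,
                                   std::size_t i)
{
    std::size_t index = 0;
    for (std::size_t j = 0; j < i; ++j)
        index += level[j].childCount;
    return index;
}
```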
• Make the processing global rather than
  local
   – Pull the updates out of the objects.
      • No more virtuals
   – Easier to understand too – all code in one
     place.


                           Slide 73
Need to change some things…
• OO version
  – Update transform top down and expand WBS
    bottom up




                   Slide 74
Update transform
                               Node


              Node                                Node




Node      Node       Node   Node         Node   Node     Node   Node




                              Slide 75
Update transform
                               Node


              Node                                Node




Node      Node       Node   Node         Node   Node     Node   Node




                              Slide 76
Node


              Node                                   Node




Node     Node        Node      Node         Node   Node     Node   Node


       Update transform and
       world bounding sphere
                                 Slide 77
Node


              Node                                 Node




Node     Node        Node    Node         Node   Node     Node   Node


       Add bounding sphere
       of child
                               Slide 78
Node


              Node                                   Node




Node     Node        Node      Node         Node   Node     Node   Node


       Update transform and
       world bounding sphere
                                 Slide 79
Node


              Node                                 Node




Node     Node        Node    Node         Node   Node     Node   Node


       Add bounding sphere
       of child
                               Slide 80
Node


              Node                                   Node




Node     Node        Node      Node         Node   Node     Node   Node


       Update transform and
       world bounding sphere
                                 Slide 81
Node


              Node                                 Node




Node     Node        Node    Node         Node   Node     Node   Node


       Add bounding sphere
       of child
                               Slide 82
• Hierarchical bounding spheres pass info up
• Transforms cascade down
• Data use and code are 'striped'.
   – Processing is alternating



                       Slide 83
Conversion to linear
• To do this with a 'flat' hierarchy, break it
  into 2 passes
   – Update the transforms and bounding
     spheres (from top down)
   – Expand bounding spheres (bottom up)


                      Slide 84
Transform and BS updates

For each node at each level (top down)
{
    multiply world transform by parent's transform
    transform wbs by world transform
}




                                    Slide 85
Update bounding sphere hierarchies

For each node at each level (bottom up)
{
    add wbs to parent's wbs
    cull wbs against frustum
}




                                    Slide 86
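The bottom-up loop above could be sketched like this over the flat layout. The sphere maths is a simplified stand-in (centre kept fixed) and frustum culling is omitted; names are assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Sphere { float x, y, z, r; };

// Grow `parent` just enough to enclose `child` (simplified stand-in).
static void ExpandToFit(Sphere& parent, const Sphere& child)
{
    const float dx = child.x - parent.x;
    const float dy = child.y - parent.y;
    const float dz = child.z - parent.z;
    const float reach = std::sqrt(dx * dx + dy * dy + dz * dz) + child.r;
    parent.r = std::max(parent.r, reach);
}

struct Level
{
    std::vector<Sphere>   wbs;        // world bounding spheres, contiguous
    std::vector<unsigned> childCount; // implicit hierarchy structure
};

// The slides' second pass: bottom up, add each child's wbs to its
// parent's. Both passes walk the arrays strictly in order, which is
// what makes the accesses cache friendly and prefetchable.
static void ExpandBoundingSpheres(std::vector<Level>& levels)
{
    for (std::size_t l = levels.size(); l-- > 1; )
    {
        Level&      parents = levels[l - 1];
        Level&      kids    = levels[l];
        std::size_t k       = 0;
        for (std::size_t p = 0; p < parents.wbs.size(); ++p)
            for (unsigned c = 0; c < parents.childCount[p]; ++c)
                ExpandToFit(parents.wbs[p], kids.wbs[k++]);
    }
}
```

The top-down transform pass has the same loop shape, just walking the levels in the other direction.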
Update Transform and Bounding Sphere
          How many child nodes
          to process




                   Slide 87
Update Transform and Bounding Sphere




                           For each child, update
                           transform and bounding
                           sphere

                Slide 88
Update Transform and Bounding Sphere


                           Note the contiguous arrays




                Slide 89
So, what's happening in the cache?
 Parent
                                   Unified L2 Cache
     Children's data not needed
           Parent Data




      Children's Data

                        Slide 90
Load parent and its transform
Parent
                                  Unified L2 Cache



         Parent Data




    Children's Data

                       Slide 91
Load child transform and set world transform
      Parent
                                        Unified L2 Cache



               Parent Data




    Children's Data

                             Slide 92
Load child BS and set WBS
Parent
                                  Unified L2 Cache



         Parent Data




    Children's Data

                       Slide 93
Load child BS and set WBS
Parent
    Next child is calculated with
    no extra cache misses!          Unified L2 Cache



       Parent Data




    Children's Data

                        Slide 94
Load child BS and set WBS
Parent
    The next 2 children incur 2
    cache misses in total            Unified L2 Cache



      Parent Data




    Children's Data

                      Slide 95
Prefetching
Parent
     Because all data is linear, we
                                      Unified L2 Cache
     can predict what memory will
     be needed in ~400 cycles
     and prefetch

       Parent Data




    Children's Data

                        Slide 96
• Tuner scans show about 1.7 cache misses
  per node.
• But, these misses are much more frequent
  – Code/cache miss/cache miss/code
  – Less stalling


                   Slide 97
Performance

              19.6 -> 12.9 -> 4.8ms




   Slide 98
Prefetching
• Data accesses are now predictable
• Can use prefetch (dcbt) to warm the cache
   – Data streams can be tricky
   – Many reasons for stream termination
   – Easier to just use dcbt blindly
      • (look ahead x number of iterations)




                             Slide 99
Prefetching example
• Prefetch a predetermined number of
  iterations ahead
• Ignore incorrect prefetches




                   Slide 100
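A sketch of that pattern. The talk targets the PPU's dcbt instruction; the portable GCC/Clang builtin shown here is my substitution, and the look-ahead distance is an arbitrary assumption to be tuned per platform:

```cpp
#include <cstddef>

// Sum an array while blindly prefetching a fixed number of iterations
// ahead. Prefetches are advisory -- ones past the end are simply
// discarded, so no bounds handling is needed ("ignore incorrect
// prefetches", as the slide says).
static float SumWithPrefetch(const float* data, std::size_t count)
{
    const std::size_t kLookAhead = 16; // roughly far enough ahead to
                                       // cover main-memory latency
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(data + i + kLookAhead); // warm the cache
#endif
        sum += data[i];
    }
    return sum;
}
```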
Performance

               19.6 -> 12.9 -> 4.8 -> 3.3ms




   Slide 101
A Warning on Prefetching
• This example makes very heavy use of the
  cache
• This can affect other threads' use of the
  cache
   – Multiple threads with heavy cache use may
     thrash the cache


                     Slide 102
The old scan
               ~22ms




   Slide 103
The new scan

          ~16.6ms




    Slide 104
Up close

        ~16.6ms




  Slide 105
Looking at the code (samples)




            Slide 106
Performance counters
Branch mispredictions: 2,867 (cf. 47,000)
L2 cache misses: 16,064      (cf. 36,000)




                                     Slide 107
In Summary
• Just reorganising data locations was a win
• Data + code reorganisation = dramatic
  improvement.
• + prefetching equals even more WIN.



                    Slide 108
OO is not necessarily EVIL
• Be careful not to design yourself into a corner
• Consider data in your design
   – Can you decouple data from objects?
   –               …code from objects?
• Be aware of what the compiler and HW are
  doing

                      Slide 109
It's all about the memory
• Optimise for data first, then code.
   – Memory access is probably going to be your
     biggest bottleneck
• Simplify systems
   – KISS
   – Easier to optimise, easier to parallelise

                        Slide 110
Homogeneity
• Keep code and data homogeneous
   – Avoid introducing variations
   – Don't test for exceptions – sort by them.
• Not everything needs to be an object
   – If you must have a pattern, then consider
     using Managers

                       Slide 111
Remember
• You are writing a GAME
   – You have control over the input data
   – Don't be afraid to preformat it – drastically if
     need be.
• Design for specifics, not generics
  (generally).

                        Slide 112
Data Oriented Design Delivers
•   Better performance
•   Better realisation of code optimisations
•   Often simpler code
•   More parallelisable code


                       Slide 113
The END
• Questions?




                Slide 114

Weitere ähnliche Inhalte

Ähnlich wie Pitfalls of object_oriented_programming_gcap_09

32 dynamic linking nd overlays
32 dynamic linking nd overlays32 dynamic linking nd overlays
32 dynamic linking nd overlays
myrajendra
 
COMP1900 L3 Week3 2009
COMP1900 L3 Week3 2009COMP1900 L3 Week3 2009
COMP1900 L3 Week3 2009
Damia
 
blueMarine photographic workflow with Java
blueMarine photographic workflow with JavablueMarine photographic workflow with Java
blueMarine photographic workflow with Java
Fabrizio Giudici
 
blueMarine Sailing with NetBeans Platform
blueMarine Sailing with NetBeans PlatformblueMarine Sailing with NetBeans Platform
blueMarine Sailing with NetBeans Platform
Fabrizio Giudici
 

Ähnlich wie Pitfalls of object_oriented_programming_gcap_09 (20)

KVSの性能、RDBMSのインデックス、更にMapReduceを併せ持つAll-in-One NoSQL: MongoDB
KVSの性能、RDBMSのインデックス、更にMapReduceを併せ持つAll-in-One NoSQL: MongoDB KVSの性能、RDBMSのインデックス、更にMapReduceを併せ持つAll-in-One NoSQL: MongoDB
KVSの性能、RDBMSのインデックス、更にMapReduceを併せ持つAll-in-One NoSQL: MongoDB
 
Acceleo Code Generation
Acceleo Code GenerationAcceleo Code Generation
Acceleo Code Generation
 
32 dynamic linking nd overlays
32 dynamic linking nd overlays32 dynamic linking nd overlays
32 dynamic linking nd overlays
 
Vips 4mar09e
Vips 4mar09eVips 4mar09e
Vips 4mar09e
 
Creating a Machine Learning Model on the Cloud
Creating a Machine Learning Model on the CloudCreating a Machine Learning Model on the Cloud
Creating a Machine Learning Model on the Cloud
 
Get your Project back in Shape!
Get your Project back in Shape!Get your Project back in Shape!
Get your Project back in Shape!
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 
How shit works: the CPU
How shit works: the CPUHow shit works: the CPU
How shit works: the CPU
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
Java on the GPU: Where are we now?
Java on the GPU: Where are we now?Java on the GPU: Where are we now?
Java on the GPU: Where are we now?
 
Surge2012
Surge2012Surge2012
Surge2012
 
From Java code to Java heap: Understanding and optimizing your application's ...
From Java code to Java heap: Understanding and optimizing your application's ...From Java code to Java heap: Understanding and optimizing your application's ...
From Java code to Java heap: Understanding and optimizing your application's ...
 
COMP1900 L3 Week3 2009
COMP1900 L3 Week3 2009COMP1900 L3 Week3 2009
COMP1900 L3 Week3 2009
 
2013 syscan360 yuki_chen_syscan360_exploit your java native vulnerabilities o...
2013 syscan360 yuki_chen_syscan360_exploit your java native vulnerabilities o...2013 syscan360 yuki_chen_syscan360_exploit your java native vulnerabilities o...
2013 syscan360 yuki_chen_syscan360_exploit your java native vulnerabilities o...
 
Dojo for programmers (TXJS 2010)
Dojo for programmers (TXJS 2010)Dojo for programmers (TXJS 2010)
Dojo for programmers (TXJS 2010)
 
blueMarine photographic workflow with Java
blueMarine photographic workflow with JavablueMarine photographic workflow with Java
blueMarine photographic workflow with Java
 
blueMarine Sailing with NetBeans Platform
blueMarine Sailing with NetBeans PlatformblueMarine Sailing with NetBeans Platform
blueMarine Sailing with NetBeans Platform
 
Pulsar
PulsarPulsar
Pulsar
 
ContainerDays NYC 2015: "Easing Your Way Into Docker: Lessons From a Journey ...
ContainerDays NYC 2015: "Easing Your Way Into Docker: Lessons From a Journey ...ContainerDays NYC 2015: "Easing Your Way Into Docker: Lessons From a Journey ...
ContainerDays NYC 2015: "Easing Your Way Into Docker: Lessons From a Journey ...
 
Container Days
Container DaysContainer Days
Container Days
 

Pitfalls of object_oriented_programming_gcap_09

  • 1. Sony Computer Entertainment Europe Research & Development Division Pitfalls of Object Oriented Programming Tony Albrecht – Technical Consultant Developer Services
  • 2. What I will be covering • A quick look at Object Oriented (OO) programming • A common example • Optimisation of that example • Summary Slide 2
  • 3. Object Oriented (OO) Programming • What is OO programming? – a programming paradigm that uses "objects" – data structures consisting of datafields and methods together with their interactions – to design applications and computer programs. (Wikipedia) • Includes features such as – Data abstraction – Encapsulation – Polymorphism – Inheritance Slide 3
  • 4. What’s OOP for? • OO programming allows you to think about problems in terms of objects and their interactions. • Each object is (ideally) self contained – Contains its own code and data. – Defines an interface to its code and data. • Each object can be perceived as a „black box‟. Slide 4
  • 5. Objects • If objects are self contained then they can be – Reused. – Maintained without side effects. – Used without understanding internal implementation/representation. • This is good, yes? Slide 5
  • 6. Are Objects Good? • Well, yes • And no. • First some history. Slide 6
  • 7. A Brief History of C++ C++ development started 1979 2009 Slide 7
  • 8. A Brief History of C++ Named “C++” 1979 1983 2009 Slide 8
  • 9. A Brief History of C++ First Commercial release 1979 1985 2009 Slide 9
  • 10. A Brief History of C++ Release of v2.0 1979 1989 2009 Slide 10
  • 11. A Brief History of C++ Added • multiple inheritance, • abstract classes, Release of v2.0 • static member functions, • const member functions • protected members. 1979 1989 2009 Slide 11
  • 12. A Brief History of C++ Standardised 1979 1998 2009 Slide 12
  • 13. A Brief History of C++ Updated 1979 2003 2009 Slide 13
  • 14. A Brief History of C++ C++0x 1979 2009 ? Slide 14
  • 15. So what has changed since 1979? • Many more features have been added to C++ • CPUs have become much faster. • Transition to multiple cores • Memory has become faster. http://www.vintagecomputing.com Slide 15
  • 16. CPU performance Slide 16 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
  • 17. CPU/Memory performance Slide 17 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
  • 18. What has changed since 1979? • One of the biggest changes is that memory access speeds are far slower (relatively) – 1980: RAM latency ~ 1 cycle – 2009: RAM latency ~ 400+ cycles • What can you do in 400 cycles? Slide 18
  • 19. What has this to do with OO? • OO classes encapsulate code and data. • So, an instantiated object will generally contain all data associated with it. Slide 19
  • 20. My Claim • With modern HW (particularly consoles), excessive encapsulation is BAD. • Data flow should be fundamental to your design (Data Oriented Design) Slide 20
  • 21. Consider a simple OO Scene Tree • Base Object class – Contains general data • Node – Container class • Modifier – Updates transforms • Drawable/Cube – Renders objects Slide 21
  • 22. Object • Each object – Maintains bounding sphere for culling – Has transform (local and world) – Dirty flag (optimisation) – Pointer to Parent Slide 22
  • 23. Objects Each square is Class Definition 4 bytes Memory Layout Slide 23
  • 24. Nodes • Each Node is an object, plus – Has a container of other objects – Has a visibility flag. Slide 24
  • 25. Nodes Class Definition Memory Layout Slide 25
  • 26. Consider the following code… • Update the world transform and world space bounding sphere for each object. Slide 26
  • 27. Consider the following code… • Leaf nodes (objects) return transformed bounding spheres Slide 27
  • 28. Consider the following code… • Leaf nodes (objects)this What‟s wrong with return transformed bounding code? spheres Slide 28
  • 29. Consider the following code… • Leaf nodesm_Dirty=false thenreturn transformed If (objects) we get branch bounding misprediction which costs 23 or 24 spheres cycles. Slide 29
  • 30. Consider the following code… • Leaf nodes (objects) return transformed bounding Calculation12 cycles. bounding sphere spheresthe world takes only of Slide 30
  • 31. Consider the following code… • Leaf nodes (objects) return transformed bounding So using a dirty using oneis actually spheres flag here (in the case slower than not where it is false) Slide 31
  • 32. Lets illustrate cache usage Main Memory Each cache line is 128 bytes L2 Cache Slide 32
  • 33. Cache usage Main Memory L2 Cache parentTransform is already in the cache (somewhere) Slide 33
  • 34. Cache usage Assume this is a 128byte Main Memory boundary (start of cacheline) L2 Cache Slide 34
  • 35. Cache usage Main Memory L2 Cache Load m_Transform into cache Slide 35
  • 36. Cache usage Main Memory L2 Cache m_WorldTransform is stored via cache (write-back) Slide 36
  • 37. Cache usage Main Memory L2 Cache Next it loads m_Objects Slide 37
  • 38. Cache usage Main Memory L2 Cache Then a pointer is pulled from somewhere else (Memory managed by std::vector) Slide 38
  • 39. Cache usage Main Memory L2 Cache vtbl ptr loaded into Cache Slide 39
  • 40. Cache usage Main Memory L2 Cache Look up virtual function Slide 40
  • 41. Cache usage Main Memory L2 Cache Then branch to that code (load in instructions) Slide 41
  • 42. Cache usage Main Memory L2 Cache New code checks dirty flag then sets world bounding sphere Slide 42
  • 43. Cache usage Main Memory L2 Cache Node‟s World Bounding Sphere is then Expanded Slide 43
  • 44. Cache usage Main Memory L2 Cache Then the next Object is processed Slide 44
  • 45. Cache usage Main Memory L2 Cache First object costs at least 7 cache misses Slide 45
  • 46. Cache usage Main Memory L2 Cache Subsequent objects cost at least 2 cache misses each Slide 46
  • 47. The Test • 11,111 nodes/objects in a tree 5 levels deep • Every node being transformed • Hierarchical culling of tree • Render method is empty Slide 47
  • 48. Performance This is the time taken just to traverse the tree! Slide 48
  • 49. Why is it so slow? ~22ms Slide 49
  • 51. Samples can be a little misleading at the source code level Slide 51
  • 53. Stalls due to the load 2 instructions earlier Slide 53
  • 54. Similarly with the matrix multiply Slide 54
  • 55. Some rough calculations Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms Slide 55
  • 56. Some rough calculations Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms L2 Cache misses: 36,345 @ 400 cycles each ~= 4.54ms Slide 56
  • 57. • From Tuner, ~ 3 L2 cache misses per object – These cache misses are mostly sequential (more than 1 fetch from main memory can happen at once) – Code/cache miss/code/cache miss/code… Slide 57
  • 58. Slow memory is the problem here • How can we fix it? • And still keep the same functionality and interface? Slide 58
• 59. The first step • Use homogeneous, sequential sets of data Slide 59
  • 61. Generating Contiguous Data • Use custom allocators – Minimal impact on existing code • Allocate contiguous – Nodes – Matrices – Bounding spheres Slide 61
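A custom allocator in this spirit can be as simple as a typed pool that places all instances in one block; call sites only change where the object comes from. This is an illustrative sketch, not the allocator used in the talk:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal fixed-capacity pool: every T lives adjacent in one allocation,
// so iterating in creation order walks memory linearly.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t capacity) { m_Storage.reserve(capacity); }

    template <typename... Args>
    T* Create(Args&&... args) {
        // reserve() guarantees no reallocation up to 'capacity',
        // so returned pointers remain valid.
        m_Storage.emplace_back(std::forward<Args>(args)...);
        return &m_Storage.back();
    }

    std::size_t Size() const { return m_Storage.size(); }

private:
    std::vector<T> m_Storage;   // one contiguous block
};
```

Allocating all nodes from one pool, all matrices from another, and all bounding spheres from a third gives the contiguous layout the slide asks for.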
  • 62. Performance 19.6ms -> 12.9ms 35% faster just by moving things around in memory! Slide 62
  • 63. What next? • Process data in order • Use implicit structure for hierarchy – Minimise to and fro from nodes. • Group logic to optimally use what is already in cache. • Remove regularly called virtuals. Slide 63
• 64. Hierarchy We start with a parent node (tree diagram) Slide 64
• 65. Hierarchy Which has child nodes Slide 65
• 66. Hierarchy And they have a parent Slide 66
• 67. Hierarchy And they have children Slide 67
• 68. Hierarchy And they all have parents Slide 68
• 69. Hierarchy A lot of this information can be inferred Slide 69
• 70. Hierarchy Use a set of arrays, one per hierarchy level (tree diagram) Slide 70
• 71. Hierarchy Parent has 2 children (:2); each child has 4 children (:4) Slide 71
• 72. Hierarchy Ensure nodes and their data are contiguous in memory Slide 72
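A sketch of that flat layout (names are illustrative): each level of the tree becomes a set of parallel arrays, and because children are stored in parent order, the parent/child pointers shown in the diagrams never need to be stored:

```cpp
#include <cassert>
#include <vector>

struct Matrix { float m[16]; };
struct Sphere { float x, y, z, r; };

// One level of the flattened hierarchy.  A node's children occupy
// consecutive slots in the next level's arrays; childCount[i] says how
// many belong to node i, so parent links are implicit in the ordering.
struct HierarchyLevel {
    std::vector<Matrix> localTransform;
    std::vector<Matrix> worldTransform;
    std::vector<Sphere> worldBounds;
    std::vector<int>    childCount;   // e.g. {2} at the root level, {4, 4} below
};

// The whole tree is just its levels, processed top-down or bottom-up.
using Hierarchy = std::vector<HierarchyLevel>;
```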
  • 73. • Make the processing global rather than local – Pull the updates out of the objects. • No more virtuals – Easier to understand too – all code in one place. Slide 73
  • 74. Need to change some things… • OO version – Update transform top down and expand WBS bottom up Slide 74
• 75. Update transform (tree diagram) Slide 75
• 76. Update transform Slide 76
• 77. Update transform and world bounding sphere Slide 77
• 78. Add bounding sphere of child Slide 78
• 79. Update transform and world bounding sphere Slide 79
• 80. Add bounding sphere of child Slide 80
• 81. Update transform and world bounding sphere Slide 81
• 82. Add bounding sphere of child Slide 82
• 83. • Hierarchical bounding spheres pass info up • Transforms cascade down • Data use and code is 'striped'. – Processing is alternating Slide 83
  • 84. Conversion to linear • To do this with a „flat‟ hierarchy, break it into 2 passes – Update the transforms and bounding spheres(from top down) – Expand bounding spheres (bottom up) Slide 84
• 85. Transform and BS updates For each node at each level (top down) { multiply world transform by parent's; transform wbs by world transform } (per-level array diagram) Slide 85
• 86. Update bounding sphere hierarchies For each node at each level (bottom up) { add wbs to parent's; cull wbs against frustum } Slide 86
  • 87. Update Transform and Bounding Sphere How many children nodes to process Slide 87
  • 88. Update Transform and Bounding Sphere For each child, update transform and bounding sphere Slide 88
  • 89. Update Transform and Bounding Sphere Note the contiguous arrays Slide 89
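The bottom-up pass shown in those code screenshots can be sketched as follows (data layout and names are reconstructions, not the talk's source; the top-down transform pass has the same shape with the level loop reversed). Note that both indices only ever move forward through contiguous arrays:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Sphere { float x, y, z, r; };

// Grow 'parent' so it encloses 'child' (conservative merge).
static void Expand(Sphere& parent, const Sphere& child) {
    float dx = child.x - parent.x, dy = child.y - parent.y, dz = child.z - parent.z;
    float reach = std::sqrt(dx * dx + dy * dy + dz * dz) + child.r;
    if (reach > parent.r) parent.r = reach;
}

// One level of the flattened tree: bounds plus how many consecutive
// children each node owns in the level below.
struct Level {
    std::vector<Sphere> bounds;
    std::vector<int>    childCount;
};

// Bottom-up pass: fold each node's world bounding sphere into its parent.
// Both arrays are walked strictly linearly, which is what makes the
// traversal cache friendly.
void ExpandBoundsBottomUp(std::vector<Level>& levels) {
    for (int l = static_cast<int>(levels.size()) - 1; l > 0; --l) {
        Level& parents  = levels[l - 1];
        Level& children = levels[l];
        std::size_t c = 0;
        for (std::size_t p = 0; p < parents.bounds.size(); ++p)
            for (int k = 0; k < parents.childCount[p]; ++k)
                Expand(parents.bounds[p], children.bounds[c++]);
    }
}
```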
• 90. So, what's happening in the cache? Children's data is not needed yet (diagram: parent and children nodes in main memory; Unified L2 Cache; Parent Data; Children's Data) Slide 90
• 91. Load parent and its transform Slide 91
• 92. Load child transform and set world transform Slide 92
• 93. Load child BS and set WBS Slide 93
• 94. Load child BS and set WBS The next child is calculated with no extra cache misses! Slide 94
• 95. Load child BS and set WBS The next 2 children incur 2 cache misses in total Slide 95
• 96. Prefetching Because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch it Slide 96
  • 97. • Tuner scans show about 1.7 cache misses per node. • But, these misses are much more frequent – Code/cache miss/cache miss/code – Less stalling Slide 97
  • 98. Performance 19.6 -> 12.9 -> 4.8ms Slide 98
  • 99. Prefetching • Data accesses are now predictable • Can use prefetch (dcbt) to warm the cache – Data streams can be tricky – Many reasons for stream termination – Easier to just use dcbt blindly • (look ahead x number of iterations) Slide 99
  • 100. Prefetching example • Prefetch a predetermined number of iterations ahead • Ignore incorrect prefetches Slide 100
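On the PS3's PPU this is done with the dcbt cache-touch instruction; as a portable stand-in, the same "blindly prefetch a fixed distance ahead" pattern looks like this. The look-ahead distance is a tuning guess, and prefetches past the end of the array are harmless, which is what "ignore incorrect prefetches" means:

```cpp
#include <cassert>
#include <cstddef>

// Sum a linear array while touching data a fixed number of iterations
// ahead, so it is already in cache when the loop reaches it.
// __builtin_prefetch is the GCC/Clang intrinsic; on other compilers the
// loop simply runs without prefetching.
void SumWithPrefetch(const float* data, std::size_t count, float* out) {
    const std::size_t kLookahead = 16;  // tune to cover main-memory latency
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(data + i + kLookahead);  // may point past the end: ignored
#endif
        sum += data[i];
    }
    *out = sum;
}
```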
  • 101. Performance 19.6 -> 12.9 -> 4.8 -> 3.3ms Slide 101
• 102. A Warning on Prefetching • This example makes very heavy use of the cache • This can affect other threads' use of the cache – Multiple threads with heavy cache use may thrash the cache Slide 102
  • 103. The old scan ~22ms Slide 103
  • 104. The new scan ~16.6ms Slide 104
  • 105. Up close ~16.6ms Slide 105
  • 106. Looking at the code (samples) Slide 106
  • 107. Performance counters Branch mispredictions: 2,867 (cf. 47,000) L2 cache misses: 16,064 (cf 36,000) Slide 107
• 108. In Summary • Just reorganising data locations was a win • Data + code reorganisation = dramatic improvement. • + prefetching equals even more WIN. Slide 108
  • 109. OO is not necessarily EVIL • Be careful not to design yourself into a corner • Consider data in your design – Can you decouple data from objects? – …code from objects? • Be aware of what the compiler and HW are doing Slide 109
• 110. It's all about the memory • Optimise for data first, then code. – Memory access is probably going to be your biggest bottleneck • Simplify systems – KISS – Easier to optimise, easier to parallelise Slide 110
• 111. Homogeneity • Keep code and data homogeneous – Avoid introducing variations – Don't test for exceptions – sort by them. • Not everything needs to be an object – If you must have a pattern, then consider using Managers Slide 111
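"Sort by exceptions" means paying the branch once, up front, instead of on every element of a hot loop. A small illustrative sketch of the idea (not from the talk):

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

struct Particle { float x, v; bool frozen; };

// Instead of testing 'frozen' per element inside the hot loop, partition
// the data once so the loop body is branch-free and homogeneous.
void StepAll(std::vector<Particle>& ps, float dt) {
    auto live_end = std::partition(ps.begin(), ps.end(),
                                   [](const Particle& p) { return !p.frozen; });
    for (auto it = ps.begin(); it != live_end; ++it)
        it->x += it->v * dt;   // no per-element exception test
}
```

The same principle scales up: sorting render calls by material or objects by type removes per-item branches and keeps both the code and the data streams homogeneous.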
• 112. Remember • You are writing a GAME – You have control over the input data – Don't be afraid to preformat it – drastically if need be. • Design for specifics, not generics (generally). Slide 112
  • 113. Data Oriented Design Delivers • Better performance • Better realisation of code optimisations • Often simpler code • More parallelisable code Slide 113