High-Performance Physics Solver Design
for Next Generation Consoles
Vangelis Kokkevis
Steven Osman
Eric Larsen
Simulation Technology Group
Sony Computer Entertainment America
US R&D
This Talk
Optimizing physics simulation on a multi-core
architecture.
Focus on CELL architecture
Variety of simulation domains
Cloth, Rigid Bodies, Fluids, Particles
Practical advice based on real case-studies
Demos!
Basic Issues
Looking for opportunities to parallelize processing
High Level – Many independent solvers on multiple cores
Low Level – One solver, one/multiple cores
Coding with small memory in mind
Streaming
Batching up work
Software Caching
Speeding up processing within each unit
SIMD processing, instruction scheduling
Double-buffering
Parallelizing/optimizing existing code
What is not in this talk?
Details on specific physics algorithms
Too much material for a 1-hour talk
Will provide references to techniques
Much insight on non-CELL platforms
Concentrate on actual results
Concepts should be applicable beyond CELL
The Cell Processor Model
[Diagram: the PPU (with L1/L2 cache) and eight SPEs, each containing an SPU (SPU0–SPU7), a DMA engine and a 256K Local Store, all connected to Main Memory.]
Physics on CELL
Physics should happen mostly on SPUs
There’s more of them!
SPUs have greater bandwidth & performance
PPU is busy doing other stuff
[Diagram: the Cell layout again — PPU with L1/L2, eight SPUs each with a DMA engine and 256K Local Store, and Main Memory — with the physics work targeted at the SPUs.]
SPU Performance Recipe
Large bandwidth to and from main memory
Quick (1-cycle) LS memory access
SIMD instruction set
Concurrent DMA and processing
Challenges:
Limited LS size, shared between code and data
Random accesses of main memory are slow
Cloth Simulation
Cloth Simulation
Cloth mesh simulated as point masses
(vertices) connected via distance constraints
(edges).
[Figure: a mesh triangle with point masses m1, m2, m3 connected by distance constraints d1, d2, d3]
References:
T. Jakobsen, Advanced Character Physics, GDC 2001
A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005
Simulation Step
1. Compute external forces, fE, per vertex
2. Compute new vertex positions [ Integration ]:
   pt+1 = (2pt − pt−1) + ½ ∗ fE ∗ (1/m) ∗ Δt²
3. Fix edge lengths
   Adjust vertex positions
4. Correct penetrations with collision geometry
   Adjust vertex positions
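The integration step above can be sketched as plain scalar code. This is an illustrative stand-in (names like `integrate` and `Vec3` are my own); the real solver runs the same arithmetic 4-wide with SIMD over packed vertex arrays.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// One Verlet-style integration step for a single vertex, as in step 2:
//   p_{t+1} = (2*p_t - p_{t-1}) + 0.5 * fE * (1/m) * dt^2
Vec3 integrate(const Vec3& p, const Vec3& pPrev, const Vec3& fE,
               float invMass, float dt)
{
    const float k = 0.5f * invMass * dt * dt;   // 0.5 * (1/m) * dt^2
    return Vec3{ 2.0f * p.x - pPrev.x + fE.x * k,
                 2.0f * p.y - pPrev.y + fE.y * k,
                 2.0f * p.z - pPrev.z + fE.z * k };
}
```

Note that velocity never appears explicitly — it is implied by the difference between pt and pt−1, which is what makes the position-fixing steps 3 and 4 so convenient.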
How many vertices?
How many vertices fit in 256K (less actually)?
A lot, surprisingly…
Tips:
Look for opportunities to stream data
Keep in LS only data required for each step
Integration Step
Fewer than 4000 verts fit in 200K of memory:
16 + 16 + 16 + 4 = 52 bytes / vertex (pt, pt−1, fE, 1/m)
We don’t need to keep them all in LS
Keep vertex data in main memory and bring it in block by block
pt+1 = (2pt − pt−1) + ½ ∗ fE ∗ (1/m) ∗ Δt²
Streaming Integration
[Animation over several slides: the pt, pt−1, fE and 1/m arrays live in Main Memory, each split into blocks B0–B3. For each block in turn: DMA_IN the block into the Local Store, process it, DMA_OUT the results — B0, then B1, then B2, …]
Double-buffering
Take advantage of concurrent DMA and
processing to hide transfer times
Without double-buffering:
DMA_IN B0 → Process B0 → DMA_OUT B0 → DMA_IN B1 → Process B1 → DMA_OUT B1 → …
With double-buffering:
DMA_IN B0 → (Process B0 ∥ DMA_IN B1) → (DMA_OUT B0 ∥ Process B1 ∥ DMA_IN B2) → (DMA_OUT B1 ∥ Process B2 ∥ DMA_IN B3) → …
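A minimal double-buffering sketch, assuming a plain C++ stand-in: `memcpy` plays the role of the DMA engine, and the function name `streamScale` is mine. On the SPU the transfer of the *next* block is asynchronous and genuinely overlaps with processing of the *current* one; here the structure is the same but the copies are synchronous.

```cpp
#include <cassert>
#include <cstring>

const int BLOCK = 4;   // elements per streamed block (tiny, for illustration)

// Stream src -> dst in blocks, scaling each element, with two "LS" buffers.
void streamScale(const float* src, float* dst, int n, float s)
{
    float buf[2][BLOCK];                 // the two local-store buffers
    int nBlocks = n / BLOCK;             // assume n is a multiple of BLOCK
    std::memcpy(buf[0], src, BLOCK * sizeof(float));   // prime: "DMA_IN" B0
    for (int b = 0; b < nBlocks; ++b) {
        int cur = b & 1;
        // kick off "DMA_IN" for block b+1 into the other buffer
        if (b + 1 < nBlocks)
            std::memcpy(buf[cur ^ 1], src + (b + 1) * BLOCK,
                        BLOCK * sizeof(float));
        // process block b while (conceptually) the next transfer is in flight
        for (int i = 0; i < BLOCK; ++i)
            buf[cur][i] *= s;
        // "DMA_OUT" block b
        std::memcpy(dst + b * BLOCK, buf[cur], BLOCK * sizeof(float));
    }
}
```

The two buffers alternate via `b & 1`, so the transfer target is never the buffer being processed.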
Streaming Data
Streaming is possible when the data access
pattern is simple and predictable (e.g. linear)
Number of verts processed per frame depends on
processing speed and bandwidth but not LS size
Unfortunately, not every step in the cloth
solver can be fully streamed
Fixing edge lengths requires random memory
access…
Fixing Edge Lengths
Points coming out of the integration step don’t
necessarily satisfy edge distance constraints
struct Edge
{
    int   v1;
    int   v2;
    float restLen;
};

Vector3 d = p[v2] - p[v1];
float len = sqrt(dot(d,d));
float diff = (len - restLen) / len;
p[v1] += d * 0.5 * diff;
p[v2] -= d * 0.5 * diff;
Fixing Edge Lengths
An iterative process: Fix one edge at a time
by adjusting 2 vertex positions
Requires random access to particle positions
array
Solution:
Keep all particle positions in LS
Stream in edge data
Keeping only positions (16 bytes each) in LS: 200KB / 16B > 12K vertices fit in 200K
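A runnable version of the edge-fixing snippet (with `Vector3` replaced by a small struct so the sketch is self-contained): each application moves both endpoints half the error along the edge, which is why repeatedly visiting every edge relaxes the whole mesh toward its rest shape.

```cpp
#include <cassert>
#include <cmath>

struct P3 { float x, y, z; };

// Fix one edge: move both endpoints half the length error along the edge.
void fixEdge(P3& a, P3& b, float restLen)
{
    float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    float len  = std::sqrt(dx*dx + dy*dy + dz*dz);
    float diff = (len - restLen) / len;   // >0: too long, pull in; <0: push out
    a.x += dx * 0.5f * diff;  a.y += dy * 0.5f * diff;  a.z += dz * 0.5f * diff;
    b.x -= dx * 0.5f * diff;  b.y -= dy * 0.5f * diff;  b.z -= dz * 0.5f * diff;
}
```

For a single isolated edge one application is exact; in a mesh, each fix disturbs neighboring edges, hence the iterative sweep.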
Rigid Bodies
Our group is currently porting the AGEIA™
PhysX™ SDK to CELL
Large codebase written with a PC
architecture in mind
Assumes easy random access to memory
Processes tasks sequentially (no parallelism)
Interesting example on how to port existing
code to a multi-core architecture
Starting the Port
Determine all the stages of the rigid body
pipeline
Look for stages that are good candidates for
parallelizing/optimizing
Profile code to make sure we are focusing on
the right parts
Rigid Body Pipeline
[Diagram: Current body positions → Broadphase Collision Detection → potentially colliding body pairs → Narrowphase Collision Detection → points of contact between bodies → Constraint Prep → constraint equations → Constraint Solve → updated body velocities → Integration → new body positions]
Rigid Body Pipeline
[Diagram: the same pipeline with Broadphase Collision Detection feeding parallel batches — three Narrowphase (NP), Constraint Prep (CP), Constraint Solve (CS) and Integration (I) instances running side by side.]
Profiling Scenario
Profiling Results
[Chart: cumulative frame time in microseconds (0–70000) over roughly 2000 frames, broken down into BROADPHASE, NARROWPHASE, CONSTRAINT_PREP, SOLVER, INTEGRATION and Other]
Running on the SPUs
Three steps:
1. (PPU) Pre-process
“Gather” operation (extract data from PhysX data
structures and pack it in MM)
2. (SPU) Execute
DMA packed data from MM to LS
Process data and store output in LS
DMA output to MM
3. (PPU) Post-process
“Scatter” operation (unpack output data and put back in
PhysX data structures)
Why Involve the PPU?
Required PhysX data is not conveniently packed
Data is often not aligned
We need to use PhysX data structures to avoid
breaking features we haven’t ported
Solutions:
Use list DMAs to bring in data
Modify existing code to force alignment
Change PhysX code to work with new data structures
Batching Up Work
Create work batches for each task
[Diagram: PPU Pre-Process gathers data from PhysX data-structures into work batch buffers in MM, each holding a task description and batch inputs/outputs; SPU Execute consumes the batches; PPU Post-Process scatters results back into the PhysX data-structures.]
Narrow-phase Collision Detection
Problem:
A list of object pairs that may be colliding: (A,B), (A,C), (B,C), …
Want to do contact processing on SPUs
Pairs list has references to geometry
Narrow-phase Collision Detection
Data locality
Same bodies may be in several pairs
Geometry may be instanced for different bodies
SPU memory access
Can only access main memory with DMA
No hardware cache
Data reuse must be explicit
Software Cache
Idea: make a (read-only) software cache
Cache entry is one geometric object
Entries have variable size
Basic operation
SPU checks cache for object
If not in cache, object fetched with DMA
Cache returns a local address for object
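The basic operation above can be sketched in plain C++. This is an illustrative stand-in, not SPU code: `SoftCache` is my own name, a hash map replaces the fixed hash-table in LS, and `memcpy` plays the role of the DMA fetch.

```cpp
#include <cassert>
#include <cstring>
#include <unordered_map>
#include <vector>

// Minimal read-only software cache: keyed by main-memory address,
// returns the local copy's address, fetching on a miss.
class SoftCache {
public:
    const void* lookup(const void* mainMemAddr, size_t size) {
        auto it = table_.find(mainMemAddr);
        if (it != table_.end()) { ++hits_; return it->second.data(); }
        ++misses_;
        std::vector<char> entry(size);
        std::memcpy(entry.data(), mainMemAddr, size);   // "DMA" fetch
        return table_.emplace(mainMemAddr, std::move(entry))
                     .first->second.data();
    }
    int hits_ = 0, misses_ = 0;
private:
    std::unordered_map<const void*, std::vector<char>> table_;
};
```

The real cache appends entries to one of two fixed-size LS buffers instead of heap-allocating, which is what the replacement policy on the following slides is about.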
Software Cache
Data Structures
Two entry buffers
New entries appended to “current” buffer
Hash-table used to record and find loaded entries
[Diagram: entries A, B, C fill Buffer 0; the next DMA appends to the current buffer, with Buffer 1 standing by.]
Software Cache
Data Replacement
When space runs out in a buffer
Overwrite data in second buffer
Considerations
Does not fragment memory
No searches for free space
But does not prefer frequently used data
Software Cache
Hiding the DMA latency
Double-buffering
Start DMA for un-cached entries
Process previously DMA’d entries
Process/pre-fetch batches
Fetch and compute times vary
Batching may improve balance
DMA-lists useful
One DMA command
Multiple chunks of data gathered
[Diagram: entries D, E, F are DMA’d into the current buffer while the previously fetched entries A, B, C are processed.]
Software Caching
Conclusions
Simple cache is practical
Used for small convex objects in PhysX
Design considerations
Tradeoff of cache-logic cycles vs. bandwidth saved
Pre-fetching important to include
Single SPU Performance
[Timeline — PPU only: the PPU spends the whole slice in Exec.
PPU + SPU: the PPU does PreP, is Free while the SPU runs Exec, then does PostP.]
SPU Exec < PPU Exec: SIMD + fast mem access
Multiple SPU Performance
Pre- and Post- processing times determine
how many SPUs can be used effectively
Multiple SPU Performance
[Timeline: the PPU pre-processes batches 1–4 and post-processes them after a sync (S).
1 SPU: the four batches run back-to-back.
2 SPUs / 3 SPUs: batches overlap across SPUs, but the serial PPU pre-/post-processing limits how much they help.]
PPU vs SPU comparisons
Convex Stack (500 boxes)
[Chart: frame time in microseconds (0–80000) over roughly 1300 frames for PPU-only, 1-SPU, 2-SPUs, 3-SPUs and 4-SPUs runs]
Duck Demo
One of our first CELL demos (spring 2005)
Several interacting physics systems:
Rigid bodies (ducks & boats)
Height-field water surface
Cloth with ripping (sails)
Particle based fluids (splashes + cups)
Duck Demo (Lots of Ducks)
Duck Demo
Ambitious project with short deadline
Early PC prototypes of some pieces
Most straightforward way to parallelize:
Dedicate one SPU for each subsystem
Each piece could be developed and tested
individually
Duck Demo Resource Allocation
PPU – main loop
SPU thread synchronization, draw calls
SPU0 – height field water (<50%)
SPU1 – splashes iso-surface (<50%)
SPU2 – cloth sails for boat 1 (<50%)
SPU3 – cloth sails for boat 2 (<50%)
SPU4 – rigid body collision/response (95%)
[Timeline: over one frame, HF water, the iso-surface, the two cloth sails and rigid bodies each occupy their own SPU.]
Parallelization Recipe
One three-step approach to code
parallelization:
1. Find independent components
2. Run them side-by-side
3. Recursively apply recipe to components
Challenges
Step 1: Find independent components
Where do you look?
Maybe you need to break apart and overlap
your data?
e.g. Broad phase collision detection
Maybe you need to break apart your loop into
individual iterations?
e.g. Solving cloth constraints
Broad Phase Collision Detection
Need to test 600 rigid bodies against each other.
Split them into three groups: 200 Objects A, 200 Objects B, 200 Objects C
200 Objects A vs 200 Objects B
200 Objects A vs 200 Objects C
200 Objects B vs 200 Objects C
We can execute all three of these tests simultaneously
Cloth Solving
for (i = 1 to 5) {
    cloth = solve(cloth)
}

[Figure: the cloth split into pieces A, B and C]

for (i = 1 to 5) {
    solve_on_proc1(a);
    solve_on_proc2(b);
    wait_for_all();
    solve_on_proc1(c);
    wait_for_all();
}
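The schedule above can be sketched with `std::thread` standing in for two SPUs (a hypothetical toy `relax` replaces the real constraint solve): A and B share no particles, so they run concurrently; C shares particles with both, so it runs after the join.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Toy stand-in for one solve pass over a batch of constraints:
// move each edge length halfway toward the rest length.
void relax(std::vector<float>& lens, float rest) {
    for (float& l : lens) l = 0.5f * (l + rest);
}

// One iteration of the parallel schedule: A || B, barrier, then C.
void solveIteration(std::vector<float>& a, std::vector<float>& b,
                    std::vector<float>& c, float rest)
{
    std::thread t1(relax, std::ref(a), rest);   // batch A on "SPU 1"
    std::thread t2(relax, std::ref(b), rest);   // batch B on "SPU 2"
    t1.join(); t2.join();                       // wait_for_all()
    relax(c, rest);                             // C depends on A and B
}
```

The barrier before C is the whole point: it is the synchronization cost you pay for the dependency, which is why finding genuinely independent batches matters.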
…challenges
Step 2: Run them side-by-side
Bandwidth and cache issues
Need good data layout to avoid thrashing cache
or bus
Processor issues
Need efficient processor management scheme
What if the job sizes are very different?
e.g. a suit of cloth and a separate neck tie
Need further refinement of large jobs, or you only
save on the small neck tie time
…challenges
Step 3: Recurse
When do you stop?
Overhead of launching smaller jobs
Synchronization when a stage is done
e.g. Gather results from all collision detection before
solving
But this can go down to the instruction level
e.g. Using Structure-of-Arrays, transform four
independent vectors at once
High Level Parallelization:
Duck Demo
[Diagram: Fluid Simulation feeds Fluid Surface (a dependency exists); Rigid Bodies and Cloth Sails run independently, with the cloth split into Cloth Boat 1 and Cloth Boat 2 because cloth was for multiple boats.]
Note that the parts didn’t take an equal amount of time to run. We could have done better given time!
Lower Level Parallelization
Rigid Body Simulation — 600 bodies example
[Diagram: Broad Phase Collision Detection runs Objects A vs B, B vs C, A vs C on Proc 1–3; Narrow Phase Collision Detection and Constraint Solving then handle Objects A, B and C, one group per processor.]
Structure of Arrays
Array of Structures or “AoS”:
Data[0] = (X0 Y0 Z0 W0), Data[1] = (X1 Y1 Z1 W1), …, Data[7] = (X7 Y7 Z7 W7)
1 AoS vector holds one element’s X, Y, Z, W.

Structure of Arrays or “SoA”:
Separate arrays X0…X7, Y0…Y7, Z0…Z7, W0…W7
1 SoA vector holds the same component from four elements (e.g. X0 X1 X2 X3).
Bonus!
Since W is almost always 0 or 1, we can eliminate it with a
clever math library and save 25% memory and bandwidth!
Lowest Level Parallelization:
Structure-of-Array processing of Particles
Given:
pn(t)=position of particle n at time t
vn(t)=velocity of particle n at time t
p1(ti) = p1(ti−1) + v1(ti−1) ∗ dt + 0.5 ∗ G ∗ dt²
p2(ti) = p2(ti−1) + v2(ti−1) ∗ dt + 0.5 ∗ G ∗ dt²
…
Note they are independent of each other
So we can run four together using SoA
p{1-4}(ti) = p{1-4}(ti−1) + v{1-4}(ti−1) ∗ dt + 0.5 ∗ G ∗ dt²
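The four-at-a-time update can be sketched with plain float arrays standing in for SPU SIMD registers (function name `stepY` and the y-only simplification are mine): each loop iteration is one lane, and on the SPU the whole loop collapses to a handful of 4-wide vector instructions.

```cpp
#include <cassert>
#include <cmath>

const float G = -9.8f;   // gravity, y component only for brevity

// SoA update of the y positions of four particles at once:
//   py[i] = py[i] + vy[i]*dt + 0.5*G*dt^2   for lanes i = 0..3
void stepY(float py[4], const float vy[4], float dt)
{
    const float half_g_dt2 = 0.5f * G * dt * dt;   // scalar, splatted to all lanes
    for (int i = 0; i < 4; ++i)                    // one SIMD madd on the SPU
        py[i] = py[i] + vy[i] * dt + half_g_dt2;
}
```

Because the lanes never read each other, there is no dependency to break — this is the instruction-level end of the same "find independence" recipe.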
Failure Case
Gauss Seidel Solver
Consider a simple position-based solver that
uses distance constraints. Given:
p=current positions of all objects
solve(cn, p) takes p and constraint cn and computes a new p
that satisfies cn
p=solve(c0, p)
p=solve(c1, p)
…
Note that to solve c1, we need the result of c0.
Can’t solve c0 and c1 concurrently!
Failure Case
Possible Solutions
Generally, it’s you’re out of luck, but…
Some cases have very limited dependencies
e.g. particle-based cloth solving
Solution: Arrange constraints such that no four
adjacent constraints share cloth particles
Consider a different solver
e.g. Jacobi solvers don’t use updated values until all
constraints have been processed once
But they need more memory (pnew and pcurrent)
And may need more iterations to converge
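A minimal sketch of the Jacobi alternative for distance constraints on a 1-D chain (names are mine, not PhysX API): every constraint reads only the *old* positions and accumulates its correction into a separate buffer, so all constraints within one iteration are independent and could run in parallel — at the cost of the extra buffer and more iterations.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One Jacobi iteration over a chain of points p[0], p[1], ... with
// a common rest length between neighbors. Corrections go into a
// separate buffer (pnew = p + corr), never into p mid-iteration.
void jacobiIter(std::vector<float>& p, float rest)
{
    std::vector<float> corr(p.size(), 0.0f);
    for (size_t i = 0; i + 1 < p.size(); ++i) {
        float d    = p[i + 1] - p[i];
        float len  = std::fabs(d);
        float diff = (len - rest) * (d > 0.0f ? 1.0f : -1.0f);
        corr[i]     += 0.5f * diff;   // pull endpoints together / push apart
        corr[i + 1] -= 0.5f * diff;
    }
    for (size_t i = 0; i < p.size(); ++i) p[i] += corr[i];
}
```

A Gauss-Seidel solver would write into `p` inside the first loop, creating exactly the c0→c1 dependency chain described above.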
Duck Demo (EyeToy + SPH)
Smoothed Particle Hydrodynamics
(SPH) Fluid Simulation
Smoothed-particles
Mass distributed around a point
Density falls to 0 at a radius h
Forces between particles closer than 2h
h
SPH Fluid Simulation
High-level parallelism
Put particles in grid cells
Process on different SPUs
(Not used in duck demo)
Low-level parallelism
SIMD and dual-issue on SPU
Large n per cell may be better
Less grid overhead
Loops fast on SPU
SPH Loop
Consider two sets of particles P and Q
E.g., taken from neighbor grid cells
O(n2) problem
Can unroll (e.g., by 4)
for (i = 0; i < numP; i++)
for (j = 0; j < numQ; j+=4)
Compute force (pi, qj)
Compute force (pi, qj+1)
Compute force (pi, qj+2)
Compute force (pi, qj+3)
SPH Loop, SoA
Idea:
Increase SIMD throughput with structure-of-arrays
Transpose and produce combinations
[Diagram: pi (x y z) is replicated across an SoA register; the components of qj, qj+1, qj+2, qj+3 are transposed into SoA x, y and z registers, so one SIMD operation evaluates four particle-pair combinations.]
SPH Loop, Software Pipelined
Add software pipelining
Conversion instructions can dual-issue with math
[Diagram: even/odd pipe schedule. Pipe 0: Compute[i]. Pipe 1: Load[i+1], Store[i−1], To SoA[i+1], From SoA[i−1] — the loads, stores and SoA conversions for neighboring iterations issue alongside the math.]
Recap
Finding independence is hard!
Across subsystems or within subsystems?
Across iterations or within iterations?
Data level independence?
Instruction level independence?
How about “bandwidth level” independence?
Parallelization overhead
Sometimes running serially wins over overhead of
parallelization
Particle Simulation Demo
Questions?
http://www.research.scea.com/
Contacts:
Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com
Eric Larsen: eric_larsen@playstation.sony.com
Steven Osman: steven_osman@playstation.sony.com
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: Theory
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTM
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of Destruction
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM Power
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution Continues
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
MLAA on PS3
MLAA on PS3MLAA on PS3
MLAA on PS3
 
SPU gameplay
SPU gameplaySPU gameplay
SPU gameplay
 
Insomniac Physics
Insomniac PhysicsInsomniac Physics
Insomniac Physics
 
SPU Shaders
SPU ShadersSPU Shaders
SPU Shaders
 
SPU Physics
SPU PhysicsSPU Physics
SPU Physics
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
 

Kürzlich hochgeladen

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 

Kürzlich hochgeladen (20)

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 

High-Performance Physics Solver Design for Next Generation Consoles

  • 1. High-Performance Physics Solver Design for Next Generation Consoles Vangelis Kokkevis Steven Osman Eric Larsen Simulation Technology Group Sony Computer Entertainment America US R&D
  • 2. This Talk Optimizing physics simulation on a multi-core architecture. Focus on CELL architecture Variety of simulation domains Cloth, Rigid Bodies, Fluids, Particles Practical advice based on real case-studies Demos!
  • 3. Basic Issues Looking for opportunities to parallelize processing High Level – Many independent solvers on multiple cores Low Level – One solver, one/multiple cores Coding with small memory in mind Streaming Batching up work Software Caching Speeding up processing within each unit SIMD processing, instruction scheduling Double-buffering Parallelizing/optimizing existing code
  • 4. What is not in this talk? Details on specific physics algorithms Too much material for a 1-hour talk Will provide references to techniques Much insight on non-CELL platforms Concentrate on actual results Concepts should be applicable beyond CELL
  • 5. The Cell Processor Model [diagram: the PPU with its L1/L2 caches connects to main memory; eight SPEs (SPU0–SPU7), each with a DMA engine and 256K local store, share the same main memory]
  • 6. Physics on CELL Physics should happen mostly on SPUs There are more of them! SPUs have greater bandwidth & performance PPU is busy doing other stuff [diagram: the CELL block diagram again, with the eight SPUs highlighted]
  • 7. SPU Performance Recipe Large bandwidth to and from main memory Quick (1-cycle) LS memory access SIMD instruction set Concurrent DMA and processing Challenges: Limited LS size, shared between code and data Random accesses of main memory are slow
  • 9. Cloth Simulation Cloth mesh simulated as point masses (vertices) connected via distance constraints (edges). [diagram: a mesh triangle with masses m1, m2, m3 joined by distance constraints d1, d2, d3] References: T. Jakobsen, Advanced Character Physics, GDC 2001; A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005
  • 10. Simulation Step 1. Compute external forces, fE, per vertex 2. Compute new vertex positions [ Integration ]: p(t+1) = (2·p(t) − p(t−1)) + fE · (1/m) · Δt² 3. Fix edge lengths Adjust vertex positions 4. Correct penetrations with collision geometry Adjust vertex positions
  • 11. How many vertices? How many vertices fit in 256K (less actually)? A lot, surprisingly… Tips: Look for opportunities to stream data Keep in LS only data required for each step
  • 12. Integration Step Less than 4000 verts in 200K of memory We don't need to keep them all in LS Keep vertex data in main memory and bring it in block by block 16 + 16 + 16 + 4 = 52 bytes / vertex p(t+1) = (2·p(t) − p(t−1)) + fE · (1/m) · Δt²
  • 13–20. Streaming Integration [animation across slides 13–20: the p(t), p(t−1), fE, and 1/m arrays live in main memory, split into blocks B0–B3; each step DMAs a block into local store (DMA_IN B0), processes it (Process B0), DMAs the result back out (DMA_OUT B0), then moves on to the next block (DMA_IN B1, Process B1, DMA_OUT B1, …)]
  • 21. Double-buffering Take advantage of concurrent DMA and processing to hide transfer times Without double-buffering: DMA_IN B0, Process B0, DMA_OUT B0, DMA_IN B1, Process B1, DMA_OUT B1, … With double-buffering: DMA_IN B0; then DMA_IN B1 overlaps Process B0; DMA_IN B2 overlaps Process B1 and DMA_OUT B0; DMA_IN B3 overlaps Process B2 and DMA_OUT B1; …
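The double-buffered streaming pattern can be sketched on a plain CPU, with `memcpy` standing in for the (really asynchronous) DMA transfers. All names here are illustrative; a real SPU version would issue `mfc` DMA commands and wait on tags instead.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t BLOCK = 256;  // elements per block (assumes size % BLOCK == 0)

// "Process" a block in local store: here, just double each value in place.
static void processBlock(float* ls, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) ls[i] *= 2.0f;
}

// Stream a main-memory array through two local buffers: while block b is
// processed in buf[b & 1], block b+1 is fetched into the other buffer.
void streamProcess(std::vector<float>& mainMem) {
    float buf[2][BLOCK];
    std::size_t numBlocks = mainMem.size() / BLOCK;
    if (numBlocks == 0) return;
    std::memcpy(buf[0], &mainMem[0], BLOCK * sizeof(float));   // prime: DMA_IN B0
    for (std::size_t b = 0; b < numBlocks; ++b) {
        std::size_t cur = b & 1;
        if (b + 1 < numBlocks)                                 // DMA_IN next block
            std::memcpy(buf[1 - cur], &mainMem[(b + 1) * BLOCK],
                        BLOCK * sizeof(float));
        processBlock(buf[cur], BLOCK);                         // Process current
        std::memcpy(&mainMem[b * BLOCK], buf[cur],             // DMA_OUT current
                    BLOCK * sizeof(float));
    }
}
```

With synchronous `memcpy` there is no actual overlap; the point of the sketch is the buffer rotation that makes overlap possible once the copies become asynchronous DMAs.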
  • 22. Streaming Data Streaming is possible when the data access pattern is simple and predictable (e.g. linear) Number of verts processed per frame depends on processing speed and bandwidth but not LS size Unfortunately, not every step in the cloth solver can be fully streamed Fixing edge lengths requires random memory access…
  • 23. Fixing Edge Lengths Points coming out of the integration step don't necessarily satisfy edge distance constraints [diagram: the edge between p[v1] and p[v2] before and after the fix] struct Edge { int v1; int v2; float restLen; }; Vector3 d = p[v2] - p[v1]; float len = sqrt(dot(d,d)); float diff = (len - restLen) / len; p[v1] += d * 0.5 * diff; p[v2] -= d * 0.5 * diff; (note the signs: when the edge is too long, v1 moves along d toward v2 and v2 moves back toward v1)
  • 24. Fixing Edge Lengths An iterative process: Fix one edge at a time by adjusting 2 vertex positions Requires random access to particle positions array Solution: Keep all particle positions in LS Stream in edge data In 200K we can fit 200KB / 16B > 12K vertices
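A standalone, hedged version of the edge-relaxation pass described above, with the correction applied symmetrically so vertices move toward each other when an edge is too long (`Pt` and `fixEdgeLengths` are names invented for this sketch):

```cpp
#include <cmath>
#include <vector>

struct Pt   { float x, y, z; };
struct Edge { int v1, v2; float restLen; };

// One relaxation pass: for each edge, push/pull both endpoints halfway
// toward satisfying the rest length. Positions p stay resident (in LS);
// the edge list can be streamed.
void fixEdgeLengths(std::vector<Pt>& p, const std::vector<Edge>& edges) {
    for (const Edge& e : edges) {
        float dx = p[e.v2].x - p[e.v1].x;
        float dy = p[e.v2].y - p[e.v1].y;
        float dz = p[e.v2].z - p[e.v1].z;
        float len = std::sqrt(dx*dx + dy*dy + dz*dz);
        if (len < 1e-6f) continue;                 // guard degenerate edges
        float s = 0.5f * (len - e.restLen) / len;  // half the relative error
        p[e.v1].x += dx * s; p[e.v1].y += dy * s; p[e.v1].z += dz * s;
        p[e.v2].x -= dx * s; p[e.v2].y -= dy * s; p[e.v2].z -= dz * s;
    }
}
```

One pass satisfies each edge exactly in isolation; because edges share vertices, the solver iterates the pass several times per frame.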
  • 25. Rigid Bodies Our group is currently porting the AGEIA™ PhysX™ SDK to CELL Large codebase written with a PC architecture in mind Assumes easy random access to memory Processes tasks sequentially (no parallelism) Interesting example of how to port existing code to a multi-core architecture
  • 26. Starting the Port Determine all the stages of the rigid body pipeline Look for stages that are good candidates for parallelizing/optimizing Profile code to make sure we are focusing on the right parts
  • 27. Rigid Body Pipeline [diagram: Current body positions → Broadphase Collision Detection → Potentially colliding body pairs → Narrowphase Collision Detection → Points of contact between bodies → Constraint Prep → Constraint Equations → Constraint Solve → Updated body velocities → Integration → New body positions]
  • 28. Rigid Body Pipeline [diagram: the same pipeline, but after a single Broadphase Collision Detection pass, the Narrowphase (NP), Constraint Prep (CP), Constraint Solve (CS), and Integration (I) stages each run as parallel batches]
  • 30. Profiling Results [chart: cumulative frame time (microseconds) over roughly 2000 frames, stacked by stage: Broadphase, Narrowphase, Constraint Prep, Solver, Integration, and Other]
  • 31. Running on the SPUs Three steps: 1. (PPU) Pre-process “Gather” operation (extract data from PhysX data structures and pack it in MM) 2. (SPU) Execute DMA packed data from MM to LS Process data and store output in LS DMA output to MM 3. (PPU) Post-process “Scatter” operation (unpack output data and put back in PhysX data structures)
  • 32. Why Involve the PPU? Required PhysX data is not conveniently packed Data is often not aligned We need to use PhysX data structures to avoid breaking features we haven’t ported Solutions: Use list DMAs to bring in data Modify existing code to force alignment Change PhysX code to work with new data structures
  • 33. Batching Up Work Create work batches for each task [diagram: a PPU pre-process stage reads the PhysX data structures and fills work batch buffers in main memory (task description plus batch inputs/outputs); SPUs execute the batches; a PPU post-process stage scatters results back into the PhysX data structures]
  • 34. Narrow-phase Collision Detection Problem: A list of object pairs that may be colliding Want to do contact processing on SPUs Pairs list has references to geometry [diagram: objects A, B, C with candidate pairs (A,B), (A,C), (B,C)]
  • 35. Narrow-phase Collision Detection Data locality Same bodies may be in several pairs Geometry may be instanced for different bodies SPU memory access Can only access main memory with DMA No hardware cache Data reuse must be explicit
  • 36. Software Cache Idea: make a (read-only) software cache Cache entry is one geometric object Entries have variable size Basic operation SPU checks cache for object If not in cache, object fetched with DMA Cache returns a local address for object
  • 37. Software Cache Data Structures Two entry buffers New entries appended to “current” buffer Hash-table used to record and find loaded entries [diagram: Buffer 0 holding entries A, B, C with a Next-DMA pointer, and an empty Buffer 1]
  • 38. Software Cache Data Replacement When space runs out in a buffer Overwrite data in second buffer Considerations Does not fragment memory No searches for free space But does not prefer frequently used data
  • 39. Software Cache Hiding the DMA latency Double-buffering Start DMA for un-cached entries Process previously DMA’d entries Process/pre-fetch batches Fetch and compute times vary Batching may improve balance DMA-lists useful One DMA command Multiple chunks of data gathered [diagram: entries A, B, C are processed in the current buffer while DMA fetches the next batch D, E, F]
  • 40. Software Caching Conclusions Simple cache is practical Used for small convex objects in PhysX Design considerations Tradeoff of cache-logic cycles vs. bandwidth saved Pre-fetching important to include
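The software-cache idea from slides 36–40 can be sketched as a toy read-only cache: variable-size geometry blobs are fetched on a miss (`memcpy` standing in for DMA) and found through a hash table on later lookups. This is a deliberately simplified illustration — the real cache uses two rotating fixed buffers and DMA-list prefetching, both omitted here; `GeomCache` and its methods are invented names.

```cpp
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

class GeomCache {
public:
    // Returns a "local store" pointer for the object at mainMemAddr,
    // fetching (copying) it on a miss.
    const unsigned char* lookup(const void* mainMemAddr, std::size_t size) {
        auto it = table_.find(mainMemAddr);
        if (it != table_.end()) { ++hits_; return it->second.data(); }
        ++misses_;
        std::vector<unsigned char> entry(size);
        std::memcpy(entry.data(), mainMemAddr, size);   // stand-in for DMA_IN
        auto res = table_.emplace(mainMemAddr, std::move(entry));
        return res.first->second.data();
    }
    int hits() const   { return hits_; }
    int misses() const { return misses_; }
private:
    // Hash table from main-memory address to the locally cached copy.
    std::unordered_map<const void*, std::vector<unsigned char>> table_;
    int hits_ = 0, misses_ = 0;
};
```

The design trade-off named on slide 40 shows up directly: every `lookup` spends cycles on cache logic to save a (much more expensive) transfer on a hit.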
  • 41. Single SPU Performance SPU Exec < PPU Exec: SIMD + fast mem access [diagram: a PPU-only run is one long Exec block; a PPU + SPU run replaces it with PPU PreP, SPU Exec, and PPU PostP, leaving the PPU free while the SPU executes]
  • 42. Multiple SPU Performance Pre- and Post- processing times determine how many SPUs can be used effectively
  • 43. Multiple SPU Performance [diagram: four batches (1–4) processed on the PPU alone, then with 1, 2, and 3 SPUs; each added SPU shortens the span, bounded by the serial pre/post-processing]
  • 44. PPU vs SPU comparisons [chart: frame time in microseconds for a Convex Stack (500 boxes) scene over roughly 1300 frames, comparing PPU-only, 1-SPU, 2-SPUs, 3-SPUs, and 4-SPUs]
  • 45. Duck Demo One of our first CELL demos (spring 2005) Several interacting physics systems: Rigid bodies (ducks & boats) Height-field water surface Cloth with ripping (sails) Particle based fluids (splashes + cups)
  • 46. Duck Demo (Lots of Ducks)
  • 47. Duck Demo Ambitious project with short deadline Early PC prototypes of some pieces Most straightforward way to parallelize: Dedicate one SPU for each subsystem Each piece could be developed and tested individually
  • 48. Duck Demo Resource Allocation PPU – main loop, SPU thread synchronization, draw calls SPU0 – height field water (<50%) SPU1 – splashes iso-surface (<50%) SPU2 – cloth sails for boat 1 (<50%) SPU3 – cloth sails for boat 2 (<50%) SPU4 – rigid body collision/response (95%) [diagram: one frame with HF water, iso-surface, the two cloth jobs, and rigid bodies running concurrently]
  • 49. Parallelization Recipe One three-step approach to code parallelization: 1. Find independent components 2. Run them side-by-side 3. Recursively apply recipe to components
  • 50. Challenges Step 1: Find independent components Where do you look? Maybe you need to break apart and overlap your data? e.g. Broad phase collision detection Maybe you need to break apart your loop into individual iterations? e.g. Solving cloth constraints
  • 51. Broad Phase Collision Detection Need to test 600 rigid bodies against each other. [diagram: 600 objects split into groups A, B, C of 200 each, with the cross-group tests A vs B, A vs C, and B vs C] We can execute all three of these simultaneously
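The group split above can be made concrete as a counting argument. In this sketch (all names invented), the three within-group jobs plus the three cross-group jobs cover every pair of the 600 objects exactly once, and the cross-group jobs are independent of one another, so they can run on different SPUs:

```cpp
#include <cstddef>
#include <vector>

// A pair-testing job between two object groups; groupA == groupB means
// the all-pairs test within a single group.
struct Job { int groupA, groupB; };

// Split into 3 groups: 3 within-group jobs + 3 cross-group jobs.
std::vector<Job> makeBroadphaseJobs() {
    return { {0,0}, {1,1}, {2,2}, {0,1}, {0,2}, {1,2} };
}

// Number of pair tests a job performs, with n objects per group:
// n*(n-1)/2 within a group, n*n across two distinct groups.
std::size_t pairsInJob(const Job& j, std::size_t n) {
    return (j.groupA == j.groupB) ? n * (n - 1) / 2 : n * n;
}
```

Summed over all six jobs with n = 200, this equals 600·599/2, i.e. every one of the 600-body pairs is tested exactly once.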
  • 52. Cloth Solving for (i=1 to 5) { cloth = solve(cloth) } [diagram: the cloth split into pieces A, B, C] for (i=1 to 5) { solve_on_proc1(a); solve_on_proc2(b); wait_for_all(); solve_on_proc1(c); wait_for_all(); }
  • 53. …challenges Step 2: Run them side-by-side Bandwidth and cache issues Need good data layout to avoid thrashing cache or bus Processor issues Need efficient processor management scheme What if the job sizes are very different? e.g. a suit of cloth and a separate neck tie Need further refinement of large jobs, or you only save on the small neck tie time
  • 54. …challenges Step 3: Recurse When do you stop? Overhead of launching smaller jobs Synchronization when a stage is done e.g. Gather results from all collision detection before solving But this can go down to the instruction level e.g. Using Structure-of-Arrays, transform four independent vectors at once
  • 55. High Level Parallelization: Duck Demo Fluid Simulation Fluid Surface Rigid Bodies Cloth Sails Dependency exists Fluid Simulation Fluid Surface Rigid Bodies Cloth Sails Cloth Boat 1 Cloth Boat 2 Note that the parts didn’t take an equal amount of time to run. We could have done better given time! But cloth was for multiple boats
  • 56. Lower Level Parallelization Rigid Body Simulation 600 bodies example [diagram: the Broad Phase Collision Detection, Narrow Phase Collision Detection, and Constraint Solving stages each split the object groups A, B, C across Proc 1, Proc 2, and Proc 3]
  • 57. Structure of Arrays Array of Structures or “AoS”: Data[0..7] stored as X0 Y0 Z0 W0, X1 Y1 Z1 W1, … (one AoS vector = one structure). Structure of Arrays or “SoA”: X0 X1 X2 X3, …, Y0 Y1 Y2 Y3, … (one SoA vector = the same component from four structures). Bonus! Since W is almost always 0 or 1, we can eliminate it with a clever math library and save 25% memory and bandwidth!
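The AoS-to-SoA layout change is a 4×4 transpose. On the SPU this is done with shuffle instructions on quadword registers; here it is spelled out with scalar loads and stores as a hedged sketch (`aosToSoA` is an invented name):

```cpp
// Transpose 4 xyzw vectors from AoS to SoA layout.
//   in:  x0 y0 z0 w0 | x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3
//   out: x0 x1 x2 x3 | y0 y1 y2 y3 | z0 z1 z2 z3 | w0 w1 w2 w3
void aosToSoA(const float in[16], float out[16]) {
    for (int v = 0; v < 4; ++v)        // vector index within the group of 4
        for (int c = 0; c < 4; ++c)    // component index: x, y, z, w
            out[c * 4 + v] = in[v * 4 + c];
}
```

After the transpose, one SIMD instruction operates on the same component of four independent vectors at once, which is what makes the SoA form pay off.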
  • 58. Lowest Level Parallelization: Structure-of-Array processing of Particles Given: pn(t) = position of particle n at time t vn(t) = velocity of particle n at time t p1(ti) = p1(ti-1) + v1(ti-1) * dt + 0.5 * G * dt² p2(ti) = p2(ti-1) + v2(ti-1) * dt + 0.5 * G * dt² … Note they are independent of each other So we can run four together using SoA p{1-4}(ti) = p{1-4}(ti-1) + v{1-4}(ti-1) * dt + 0.5 * G * dt²
  • 59. Failure Case Gauss Seidel Solver Consider a simple position-based solver that uses distance constraints. Given: p=current positions of all objects solve(cn, p) takes p and constraint cn and computes a new p that satisfies cn p=solve(c0, p) p=solve(c1, p) … Note that to solve c1, we need the result of c0. Can’t solve c0 and c1 concurrently!
  • 60. Failure Case Possible Solutions Generally you’re out of luck, but… Some cases have very limited dependencies e.g. particle-based cloth solving Solution: Arrange constraints such that no four adjacent constraints share cloth particles Consider a different solver e.g. Jacobi solvers don’t use updated values until all constraints have been processed once But they need more memory (pnew and pcurrent) And may need more iterations to converge
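The Jacobi alternative can be sketched for the distance-constraint solver: every constraint computes its correction from the unmodified positions and the corrections are applied only after the whole pass, so all constraints in a pass are independent and can run in parallel. The extra `delta` array is the memory cost the slide mentions. Names (`P2`, `Con`, `jacobiPass`) are invented for this 2D sketch:

```cpp
#include <cmath>
#include <vector>

struct P2  { float x, y; };
struct Con { int a, b; float rest; };

// One Jacobi pass: accumulate corrections against the old positions,
// then apply them all at once.
void jacobiPass(std::vector<P2>& p, const std::vector<Con>& cs) {
    std::vector<P2> delta(p.size(), {0.f, 0.f});
    for (const Con& c : cs) {
        float dx = p[c.b].x - p[c.a].x, dy = p[c.b].y - p[c.a].y;
        float len = std::sqrt(dx*dx + dy*dy);
        if (len < 1e-6f) continue;
        float s = 0.5f * (len - c.rest) / len;
        delta[c.a].x += dx * s;  delta[c.a].y += dy * s;
        delta[c.b].x -= dx * s;  delta[c.b].y -= dy * s;
    }
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].x += delta[i].x;  p[i].y += delta[i].y;
    }
}
```

Because corrections from constraints sharing a particle are summed rather than chained, convergence is slower than Gauss-Seidel, as the slide notes.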
  • 62. Smoothed Particle Hydrodynamics (SPH) Fluid Simulation Smoothed-particles Mass distributed around a point Density falls to 0 at a radius h Forces between particles closer than 2h h
  • 63. SPH Fluid Simulation High-level parallelism Put particles in grid cells Process on different SPUs (Not used in duck demo) Low-level parallelism SIMD and dual-issue on SPU Large n per cell may be better Less grid overhead Loops fast on SPU
  • 64. SPH Loop Consider two sets of particles P and Q E.g., taken from neighbor grid cells O(n2) problem Can unroll (e.g., by 4) for (i = 0; i < numP; i++) for (j = 0; j < numQ; j+=4) Compute force (pi, qj) Compute force (pi, qj+1) Compute force (pi, qj+2) Compute force (pi, qj+3)
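The shape of the O(n²) loop above can be sketched with the inner loop stepped by 4 (written as a small k-loop here where the slide spells out the four bodies). The "force" is reduced to counting pairs within the 2h support radius so the sketch stays self-contained; it assumes numQ is a multiple of 4, and all names are invented:

```cpp
#include <cstddef>

// Count P-Q particle pairs closer than twoH (the SPH support radius 2h),
// inner loop advanced 4 at a time as in the unrolled slide-64 loop.
int countInteractions(const float* px, const float* py, std::size_t numP,
                      const float* qx, const float* qy, std::size_t numQ,
                      float twoH) {
    const float r2 = twoH * twoH;
    int count = 0;
    for (std::size_t i = 0; i < numP; ++i)
        for (std::size_t j = 0; j < numQ; j += 4)
            for (std::size_t k = 0; k < 4; ++k) {   // the 4 unrolled bodies
                float dx = px[i] - qx[j + k];
                float dy = py[i] - qy[j + k];
                if (dx * dx + dy * dy < r2) ++count;
            }
    return count;
}
```

In the real solver each of the four bodies computes an SPH force kernel, and the unrolling feeds the SoA transpose of the next slide.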
  • 65. SPH Loop, SoA Idea: Increase SIMD throughput with structure-of-arrays Transpose and produce combinations [diagram: the x y z components of pi and of qj, qj+1, qj+2, qj+3 transposed from AoS into SoA p and SoA q]
  • 66. SPH Loop, Software Pipelined Add software pipelining Conversion instructions can dual-issue with math [diagram: unpipelined, each iteration runs Load[i], To SoA[i], Compute[i], From SoA[i], Store[i] in sequence; pipelined, Pipe 0 runs Compute[i] while Pipe 1 runs Load[i+1], Store[i-1], To SoA[i+1], From SoA[i-1]]
  • 67. Recap Finding independence is hard! Across subsystems or within subsystems? Across iterations or within iterations? Data level independence? Instruction level independence? How about “bandwidth level” independence? Parallelization overhead Sometimes running serially wins over overhead of parallelization
  • 69. Questions? http://www.research.scea.com/ Contacts: Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com Eric Larsen: eric_larsen@playstation.sony.com Steven Osman: steven_osman@playstation.sony.com