1. Neil Brown – Senior Engineer
Developer Services
Sony Computer Entertainment Europe
Research & Development Division
PlayStation®: Cutting Edge Techniques
2. Slide 2
• PlayStation® Development Resources
• PSP Case study
– Motorstorm® : Arctic Edge
• PS3 Case Studies
– Maintaining Framerate
– SPU assisted graphics
– Other tips
What I Will Be Covering
3. Slide 3
SCEE R&D
Developer Services
• Great Marlborough Street in
Central London
• Cover ‘PAL’ Region
Developers
4. Slide 4
• Support – SCE DevNet
• Field Engineering
• Technical and Strategic Consultancy
• TRC Consultancy
• Technical Training
What We Cover
9. Slide 9
• An illustration of some techniques used in
real games
– Making use of the PSP® and PS3™
Hardware
– How to combine the Cell Broadband Engine™
with RSX™
Case Studies
10. Slide 10
• Migration from a PS3™ title
• Platform specific game engine, 30Hz
• Render targets are oversized to 512x304, 32-bit
– Soften jaggies
Motorstorm® : Arctic Edge
11. Slide 11
• Morph targets used for better realism
• 2 LODs
• Real-time lit, with specular and modulated
by track lighting
• Each skin is divided up into sub-skins that
are affected by a maximum of 8 different
bones
• Each skin’s skeleton has 16 bones
Vehicles and Character
12. Slide 12
• Streamed with 3 blocks (15kpolys) in
memory
• Lighting baked into vertices
• Maximum of 8.5MB
• Specular snow rendering uses two passes:
– an untextured, specularly-lit-only layer
– a textured layer with alpha "holes" which
modulate the addition of the lower layer
• Whole elements are clipped
Landscape
13. Slide 13
• Custom physics code to mimic PS3™ Motorstorm
• Vehicles are represented by convex hulls.
• The Landscape is represented by compressed
vertices of an arbitrary mesh in a grid based scene.
• Ragdoll physics is a verlet system.
Physics
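As an illustration of the Verlet approach mentioned for the ragdolls, here is a minimal sketch of position Verlet with distance constraints. The types `Particle` and `Stick`, the half-and-half correction, and the iteration count are all assumptions for the example, not details of the game's physics code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Hypothetical ragdoll joint: velocity is implicit in (pos - prev).
struct Particle { Vec3 pos, prev; };
// Hypothetical bone: keeps particles a and b at distance 'rest'.
struct Stick { int a, b; float rest; };

// Position Verlet step: next = pos + (pos - prev) + accel * dt^2.
void integrate(std::vector<Particle>& ps, Vec3 accel, float dt) {
    for (Particle& p : ps) {
        Vec3 next = p.pos + (p.pos - p.prev) + accel * (dt * dt);
        p.prev = p.pos;
        p.pos = next;
    }
}

// Relax each distance constraint by moving both ends half of the error.
void satisfy(std::vector<Particle>& ps, const std::vector<Stick>& sticks, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        for (const Stick& s : sticks) {
            Vec3 d = ps[s.b].pos - ps[s.a].pos;
            float len = length(d);
            if (len <= 0.0f) continue;  // coincident points: no direction to correct along
            Vec3 corr = d * (0.5f * (len - s.rest) / len);
            ps[s.a].pos = ps[s.a].pos + corr;
            ps[s.b].pos = ps[s.b].pos - corr;
        }
    }
}
```

A frame would typically call `integrate` once and then `satisfy` with a small number of iterations; more iterations give stiffer limbs at higher cost.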
14. Slide 14
• Pre-built display lists for most objects
• Compressed data for all geometry
• Z-fighting was heavily reduced by increasing the far
clipping plane massively beyond the actual draw distance
required
Graphics Optimisation
15. Slide 15
Graphics Optimisation
[Diagram: two Z-buffer ranges drawn between the near clip plane and the far clip plane, each with its most accurate section marked; one also marks the actual draw distance.]
17. Slide 17
• Different widths between 1280 and 1920 for a 1080p 60Hz game
– Or 1024 to 1280 for a 720p game
• Feedback loop adjusts render width based on prior frame durations
– Resize is hidden in post processing full screen passes
– Horizontal hardware scaler helps for 1080p modes
Dynamic resolution
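The feedback loop could look something like the sketch below. The `DynamicWidth` type, the 0.5 damping factor and the 16-pixel alignment are assumptions chosen for illustration; the slide only states that render width is adjusted from prior frame durations within a fixed range.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller; not taken from any shipped title.
struct DynamicWidth {
    int minWidth;     // e.g. 1280 for a 1080p game
    int maxWidth;     // e.g. 1920
    float targetMs;   // frame budget, e.g. 16.67 ms for 60Hz
    int width;        // current render width

    // Call once per frame with the measured duration of the previous frame.
    int update(float lastFrameMs) {
        assert(lastFrameMs > 0.0f);
        // Assume cost is roughly proportional to pixel count, so scale
        // the width by the ratio of budget to measured time.
        float scaled = static_cast<float>(width) * (targetMs / lastFrameMs);
        // Move only half of the way there to damp oscillation (assumed gain).
        float next = static_cast<float>(width) + 0.5f * (scaled - static_cast<float>(width));
        // Round down to a multiple of 16 pixels (assumed alignment).
        int aligned = (static_cast<int>(next) / 16) * 16;
        width = std::max(minWidth, std::min(maxWidth, aligned));
        return width;
    }
};
```

With a 16.67 ms budget, a 20 ms frame at 1920 wide drops the next frame to 1760; sustained fast frames bring the width back up to the maximum.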
18. Slide 18
• Framerate can vary over time
– Particle systems can eat GPU time
– Views affect render time
• Cull objects
– Small screen footprint
– Used on Motorstorm 2
Maintaining frame rate
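A small-footprint cull can be sketched as below, using a bounding sphere and a pixel threshold. The function names and the choice of a sphere are assumptions for the example; the slide does not say how the footprint was measured.

```cpp
#include <cassert>
#include <cmath>

// Approximate on-screen radius, in pixels, of a bounding sphere.
// fovY is the vertical field of view in radians; distance is measured
// along the view direction and must be positive.
float projectedRadiusPixels(float radius, float distance, float fovY, float viewportHeight) {
    assert(distance > 0.0f);
    float pixelsPerUnit = (0.5f * viewportHeight) / (distance * std::tan(0.5f * fovY));
    return radius * pixelsPerUnit;
}

// Skip the object if it would cover fewer than minPixels of radius on screen.
bool shouldCull(float radius, float distance, float fovY, float viewportHeight, float minPixels) {
    return projectedRadiusPixels(radius, distance, fovY, viewportHeight) < minPixels;
}
```

The threshold is a tuning value: raising it when the frame is over budget trades small distant detail for GPU time.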
19. Slide 19
• Dynamic geometry LOD
– Average per mesh polygon size computed offline
– Viewport polygon size used to pick lowest LOD
– For multiplayer, LODs automatically adjust
• (smaller viewports)
• Shader LOD
– Reduce shader instruction count in distance
• Lose parallax & normal – keep specular map
Reducing GPU load
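One way to read the geometry LOD bullets is sketched below: each LOD stores an average polygon area computed offline, that area is projected into viewport pixels, and the coarsest LOD whose polygons stay under a pixel budget is chosen. `MeshLod`, `pickLod` and the budget parameter are hypothetical names, and the selection rule itself is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Index 0 is the most detailed LOD; higher indices are coarser.
struct MeshLod { float avgPolyArea; };  // world units squared, computed offline

int pickLod(const std::vector<MeshLod>& lods, float distance, float fovY,
            float viewportHeight, float maxPolyPixels) {
    assert(!lods.empty() && distance > 0.0f);
    float pixelsPerUnit = (0.5f * viewportHeight) / (distance * std::tan(0.5f * fovY));
    float areaScale = pixelsPerUnit * pixelsPerUnit;  // world area -> pixel area
    // Walk from coarsest to finest and stop at the first LOD that is fine enough.
    for (int i = static_cast<int>(lods.size()) - 1; i > 0; --i) {
        if (lods[i].avgPolyArea * areaScale <= maxPolyPixels) return i;
    }
    return 0;  // even the coarser LODs look too blocky, so use full detail
}
```

Because the viewport height feeds the projection, a split-screen viewport of half the height picks coarser LODs at the same distance without any special-case code.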
22. Slide 22
• Many games are fragment shader bound
• Rendering Z only ‘primes’ the RSX™ Z-cull unit
– Very fast, 16 pixels/clock rather than 8
– Render entire scene,
– Or ‘large’ meshes only
– Easily save 10% GPU
Z pre-pass
26. Slide 26
• Processors are becoming increasingly parallel
• Frame time is bound by the slowest of these components
Game Loop
27. Slide 27
• If any part runs too slowly, offload tasks onto
a helper CPU
Helper CPU
28. Slide 28
• Problem: Update an array of objects and submit the visible
ones for rendering
• Object.update() method was the bottleneck on the PPE
– Determined using SN Tuner
– This function generates the world space matrices for each
object
– Embarrassingly parallel
• No data dependencies between objects
• If we had 1 processor for each object we could do that
Function offload example using a SPURS Task
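The "no data dependencies" property is what makes the split trivial. The sketch below shows the same idea in portable C++ with `std::thread` standing in for SPUs; `GfxObject`, its translation-only `update()` and `updateAll` are hypothetical stand-ins, not the talk's code or the SPURS API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical object: update() builds a translation-only world matrix.
struct GfxObject {
    float position[3];
    float world[16];
    void update() {
        for (int i = 0; i < 16; ++i) world[i] = (i % 5 == 0) ? 1.0f : 0.0f;  // identity
        world[12] = position[0];
        world[13] = position[1];
        world[14] = position[2];
    }
};

// Give each worker a contiguous slice; no locking is needed because
// no object reads or writes another object's data.
void updateAll(std::vector<GfxObject>& objects, unsigned workerCount) {
    assert(workerCount > 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (objects.size() + workerCount - 1) / workerCount;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(objects.size(), w * chunk);
        const std::size_t end = std::min(objects.size(), begin + chunk);
        if (begin == end) break;
        workers.emplace_back([&objects, begin, end] {
            for (std::size_t i = begin; i < end; ++i) objects[i].update();
        });
    }
    for (std::thread& t : workers) t.join();
}
```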
29. Slide 29
SPURS Tasks
//--- Conditionally calling SPU code
if (!usingSPU) //--- generic version
{
    for (int i = 0; i < OBJECT_COUNT; i++)
    {
        testObject[i].update();
    }
}
else //--- SPU version
{
    int test = spurs.ThrowTaskRunComplete(taskUpdateObject_elf_start,
                                          (int)testObject,
                                          OBJECT_COUNT, 0, 0);
    // could do something useful here…
    int result = spurs.CatchTaskRunComplete(test);
}
30. Slide 30
• Getting data in and out of the SPU is the first problem
• Get this working before worrying about actual processing
– Brute force often works just fine
• DMA entire object array into SPE memory
• Run update method
• DMA entire object array back to main memory
• List DMA allows you to get fancy
– Gather and Scatter operation
• If the data set is really big or if you need every last drop of performance
– Streaming model
• Overlap DMA transfers with processing
• Double buffering
SPURS Task
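The double-buffered streaming model can be sketched as follows. To stay portable, `dmaGet` and `dmaPut` here are plain synchronous copies standing in for asynchronous DMA, so the sketch shows the buffer rotation and where the waits would go rather than any real overlap. All names are hypothetical and this is not the Cell DMA API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-ins for asynchronous DMA (assumption: real versions return at once
// and completion is checked later with a separate wait).
static void dmaGet(float* local, const float* mainMem, std::size_t n) {
    std::memcpy(local, mainMem, n * sizeof(float));
}
static void dmaPut(float* mainMem, const float* local, std::size_t n) {
    std::memcpy(mainMem, local, n * sizeof(float));
}

// Placeholder for the real per-element work.
static void process(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= 2.0f;
}

void streamDoubleBuffered(std::vector<float>& mainMemory, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t total = mainMemory.size();
    if (total == 0) return;
    std::vector<float> buffers[2] = {std::vector<float>(chunk), std::vector<float>(chunk)};
    int cur = 0;
    dmaGet(buffers[cur].data(), mainMemory.data(), std::min(chunk, total));  // prime first buffer
    for (std::size_t offset = 0; offset < total; offset += chunk) {
        const std::size_t n = std::min(chunk, total - offset);
        const std::size_t nextOffset = offset + chunk;
        if (nextOffset < total) {
            // Start fetching the next chunk into the other buffer.
            dmaGet(buffers[cur ^ 1].data(), mainMemory.data() + nextOffset,
                   std::min(chunk, total - nextOffset));
        }
        // Real code: wait here for the get into buffers[cur] to finish.
        process(buffers[cur].data(), n);
        dmaPut(mainMemory.data() + offset, buffers[cur].data(), n);
        // Real code: wait for this put before buffers[cur] is fetched into again.
        cur ^= 1;
    }
}
```

The local footprint is two chunks regardless of data set size, which is the point of streaming when the whole array will not fit in local store.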
31. Slide 31
SPURS Task: Getting Data in and out
int cellSpuMain(.....)
{
    int sourceAddr = ....;                                 // source address
    int count = ....;                                      // number of data elements
    int dataSize = count * sizeof(gfxObject);              // amount of memory to get
    gfxObject *buf = (gfxObject*)memalign(128, dataSize);  // alloc mem
    DmaGet((void*)buf, sourceAddr, dataSize);
    DmaWait(....);
    //--- data is loaded at this point so do something interesting
    DmaPut((void*)buf, sourceAddr, dataSize);
    DmaWait(....);
    return(0);
}
32. Slide 32
• Keep code the same as PPU Version
– Conditional compilation based on processor type
– Might have to split code into a separate file
SPURS Task: Executing the Code
//--- SPU code to call the update method
//--- data is loaded at this point so do something interesting
//--- step through array of objects and update
gfxObject *tp;
for (int i = 0; i < count; i++)
{
    tp = &buf[i];
    tp->update(); // same method as on PPE compiled for SPU
}
33. Slide 33
• 5x Speed up in this instance
– Brute force solution
– Could add more SPUs
– Problem is embarrassingly parallel
SPURS Task: Results
34. Slide 34
• Simple pipeline
– No stalls such as load-hit-store (LHS)
• No Operating System, very few system calls
– Performance is high and deterministic
• Memory is very fast like L1 cache
– Memory transfers are asynchronous, so they can be hidden
SPUs are faster than the PPU
35. Slide 35
• PPU Stalled while waiting for SPU to Process
• SPU Data get and put were synchronous
– Not using hardware to its fullest extent
– Easy to get more if we need it
SPURS Task: Time Waster
[Timeline diagram: the PPU runs, then stalls while the SPU performs GET, Exec and PUT in sequence, then the PPU resumes.]
36. Slide 36
SPU Migration
• Look for large blocks of data with simple processing and
minimal synchronization requirements
– Vertex transform
– Animation
– Audio
– Image processing
• Entire subsystems that can be moved over
37. Slide 37
• Graphics API
• Multistream – Audio
• Bullet – Physics
• PlayStation®Edge
– Geometry/Compression/Animation/Post FX
• FIOS – Optimised disc access
Our Libraries that run on SPUs
38. Slide 38
• Animation
• AI Threat prediction
• AI Line of fire
• AI Obstacle avoidance
• Collision detection
• Physics
• Particle simulation
• Particle rendering
• Scene graph
Killzone®2 SPU Usage
• Display list building
• IBL Light probes
• Graphics post-processing
• Dynamic music system
• Skinning
• MP3 Decompression
• Edge Zlib
• etc.
44 Job types
39. Slide 39
• Accelerate both PPU and RSX™
• Allows the overall time to be shortened
SPU can do more than CPU tasks
40. Slide 40
• Visibility and
Geometry Culling
• Vertex and lighting
processing
• Post processing
Offloading the GPU
41. Slide 41
• ‘Edge’ SPU library to offload vertex work from RSX™
– Trade SPU time for GPU performance
• Animation, skinning, culling and transformation
– Pre-culled, so all RSX™ work generates pixels
Geometry processing
42. Slide 42
• Engine supports up to 256 vertex lights
• Light accumulation
– RGBE vertex colour
– Runs on SPU
• Zero impact on RSX™
Wipeout® weapon lighting
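For readers unfamiliar with RGBE, the sketch below packs a high-range colour into three 8-bit mantissas and a shared 8-bit exponent, following the layout used by the Radiance image format. Whether the game's vertex colours use exactly this layout is an assumption; the `Rgbe` type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Rgbe { std::uint8_t r, g, b, e; };

// Encode non-negative float RGB with a shared exponent taken from the largest channel.
Rgbe encodeRgbe(float r, float g, float b) {
    float maxc = std::max(r, std::max(g, b));
    if (maxc < 1e-32f) return {0, 0, 0, 0};
    int exponent = 0;
    float mantissa = std::frexp(maxc, &exponent);  // maxc = mantissa * 2^exponent, mantissa in [0.5, 1)
    float scale = mantissa * 256.0f / maxc;
    return {static_cast<std::uint8_t>(r * scale),
            static_cast<std::uint8_t>(g * scale),
            static_cast<std::uint8_t>(b * scale),
            static_cast<std::uint8_t>(exponent + 128)};
}

// Decode back to float RGB.
void decodeRgbe(Rgbe c, float out[3]) {
    if (c.e == 0) {
        out[0] = out[1] = out[2] = 0.0f;
        return;
    }
    float f = std::ldexp(1.0f, static_cast<int>(c.e) - (128 + 8));
    out[0] = c.r * f;
    out[1] = c.g * f;
    out[2] = c.b * f;
}
```

Light contributions would be summed in float first and encoded once per vertex, so values above 1.0 from many overlapping lights survive in four bytes.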
43. Slide 43
• Geometric occluders
– Tested using halfplanes
– Fast checks on SPU
• Low res software occlusion
– Coarse Zbuffer on SPU
– 256x114 float Z
– Low-poly conservative occluders
Occlusion Culling
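A coarse software Z-buffer test can be sketched as below. For brevity the occluders are screen-space rectangles rather than rasterised low-poly meshes, which is a simplification of what the slide describes; the `CoarseZBuffer` class and its methods are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class CoarseZBuffer {
public:
    CoarseZBuffer(int width, int height)
        : width_(width), height_(height),
          depth_(static_cast<std::size_t>(width) * height, 1.0f) {}  // 1.0 = far plane

    // Write an occluder at its FARTHEST depth so the test stays conservative.
    // Rectangles are half-open: [x0, x1) by [y0, y1).
    void drawOccluder(int x0, int y0, int x1, int y1, float farthestZ) {
        clip(x0, y0, x1, y1);
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x) {
                float& d = depth_[index(x, y)];
                d = std::min(d, farthestZ);
            }
    }

    // An object is hidden only if its NEAREST depth lies behind every texel it covers.
    bool isOccluded(int x0, int y0, int x1, int y1, float nearestZ) const {
        clip(x0, y0, x1, y1);
        if (x0 >= x1 || y0 >= y1) return false;  // off-screen: leave to frustum culling
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                if (nearestZ <= depth_[index(x, y)]) return false;
        return true;
    }

private:
    void clip(int& x0, int& y0, int& x1, int& y1) const {
        x0 = std::max(x0, 0);
        y0 = std::max(y0, 0);
        x1 = std::min(x1, width_);
        y1 = std::min(y1, height_);
    }
    std::size_t index(int x, int y) const { return static_cast<std::size_t>(y) * width_ + x; }

    int width_, height_;
    std::vector<float> depth_;
};
```

At 256x114 floats the buffer is about 114 KB, small enough to be worth considering for SPU local store, though how the original engine laid it out is not stated.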
44. Slide 44
• SPU/PPU
– Object culling – Offscreen or very small objects
– Occlusion Culling – Occluded objects
– Edge Geometry – Vertex culling
• RSX
– Early Z-cull – Coarse pixel culling
• Before fragment shader
• Remember to do a z pre-pass
– Depth test – Pixel culling
• After fragment shader
Culling
45. Slide 45
• SPU geometry code for landscape
– SPUs can generate or remove graphic primitives (procedural, sub-division
surfaces…)
– Allows sending only the triangles that contribute to the final scene
Continuous LOD
46. Slide 46
PhyreEngine™
• Free game engine including
– Modular run time
– Samples, docs and whitepapers
– Full source and art work
• Optimized for multi-core especially PS3™
• Extractable and reusable components
47. Slide 47
Procedural Foliage
• Final SPU program generates RSX™ geometry with movement
• LOD is calculated, with a transition to billboards
Original models are from the Xfrog Plant Library and used with permission of Greenworks Organic Software.
49. Slide 49
• Motion Blur, Bloom, Depth of field processed on SPU
– Quarter-res image
– Merged into final scene
Killzone® post processing
50. Slide 50
• SPUs assist RSX™
1. RSX™ prepares low-res image buffers in XDR
2. RSX™ triggers interrupt to start SPUs
3. SPUs perform image operations
4. RSX™ already starts on next frame
5. Result of SPUs processed by RSX™ early in next frame
Post Processing
52. Slide 52
A Few Samples
Exponential
shadow map
Motion Blur
53. Slide 53
Post-Effect Impl. Difference
RSX™ | SPUs
Bilinear filtering is nearly free | Bilinear filtering is expensive
Nearest filtering is nearly free | SPU only supports truncate mode
Sequential texture accesses can benefit from the texture cache | Sequential texture accesses need to be DMA transferred into LS
Random accesses are handled nicely, at the cost of thrashing the texture cache | Random accesses mean handling multiple small DMA transfers
54. Slide 54
• HD resolution is still expensive
– run post-processing at lower resolution!
Performance
Effect | Single SPU at 720p | Five SPUs at 720p
Depth of Field | ~22.76 ms | ~4.74 ms
ROP | ~14.36 ms | ~2.99 ms
Motion Blur (Intrinsics) | ~27.5 ms | ~5.73 ms
Motion Blur (hand tuned) | ~17.6 ms | ~3.67 ms
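The timings above imply close to linear scaling. A small helper makes the arithmetic explicit; `Scaling` and `scaling` are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cmath>

struct Scaling {
    float speedup;     // single-SPU time divided by multi-SPU time
    float efficiency;  // speedup divided by SPU count (1.0 = perfectly linear)
};

Scaling scaling(float singleMs, float multiMs, int spuCount) {
    assert(singleMs > 0.0f && multiMs > 0.0f && spuCount > 0);
    float s = singleMs / multiMs;
    return {s, s / static_cast<float>(spuCount)};
}
```

For depth of field, `scaling(22.76f, 4.74f, 5)` gives a speedup of about 4.8x, or roughly 96% efficiency, and the other three rows work out to the same figure. Hand tuning the motion blur is worth about 1.56x over the intrinsics version on a single SPU (27.5 ms against 17.6 ms).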
55. Slide 55
• Every PS3™ comes with an HDD
• Can access both HDD and Blu-ray simultaneously
– Helps when continuously streaming data: levels,
actors, sounds, music, textures…
– 2GB of System Cache
• Allows prefetching data in advance
Data Streaming
56. Slide 56
• Brute force method
– High quality images
– 128 FP16 frames
• Depth of field
• Motion Blur
– Use sub-pixel jitter to give 128x super-sampled AA
Photo-Mode
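The brute-force accumulation can be sketched as below: render the scene many times with a different sub-pixel offset each time and keep a running average. The Halton sequence used for the offsets is an assumption (the slide does not name a jitter pattern), and the accumulation here uses 32-bit floats rather than FP16.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Radical inverse of 'index' in the given base; returns a value in [0, 1).
float halton(unsigned index, unsigned base) {
    float result = 0.0f;
    float f = 1.0f;
    while (index > 0) {
        f /= static_cast<float>(base);
        result += f * static_cast<float>(index % base);
        index /= base;
    }
    return result;
}

struct Jitter { float dx, dy; };  // sub-pixel offsets in [-0.5, 0.5)

// One offset per frame, well spread over the pixel (assumed pattern).
std::vector<Jitter> makeJitter(unsigned frames) {
    std::vector<Jitter> out;
    out.reserve(frames);
    for (unsigned i = 1; i <= frames; ++i)
        out.push_back({halton(i, 2) - 0.5f, halton(i, 3) - 0.5f});
    return out;
}

// Running mean: after frame n (0-based) accum holds the average of frames 0..n.
void accumulate(std::vector<float>& accum, const std::vector<float>& frame, unsigned frameIndex) {
    assert(accum.size() == frame.size());
    const float weight = 1.0f / static_cast<float>(frameIndex + 1);
    for (std::size_t i = 0; i < accum.size(); ++i)
        accum[i] += (frame[i] - accum[i]) * weight;
}
```

Each of the 128 frames would be rendered with the projection shifted by its `Jitter` offset, then passed to `accumulate`; the same loop can vary lens and shutter samples to produce depth of field and motion blur.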
58. Slide 58
• Think of the system and your game as a whole
– Rearrange your rendering if necessary
– Rendering is not just a GPU problem
• Unlock the potential of multi-core programming
– PS3™ = PPU + SPUs + RSX™ working together in parallel
– Minimize contention
Summary
61. Slide 61
• Thanks to all the studios that let me talk about their games
ANY QUESTIONS?
www.scee.com
ps3.scedev.net
research.scee.net
tpr.scee.net
www.worldwidestudios.net/xdev
Finally