1. Design issues of IBM Cell Architecture
Vitthal Gutthe MEIT 1326
Pravin kumar Yadav MEIT 1338
Vyanktesh Dorlikar MEIT 1324
2. Contents
General Introduction
History of development
Technical overview of architecture
Detailed technical discussion of components
Design choices
Cell programming issues
3. History of Development
Sony PlayStation 2
• Released March 2000 in Japan
• 128-bit “Emotion Engine”
• MIPS CPU clocked at 294 MHz
• Capable of 6.2 GFLOPS (giga floating-point operations per second)
4. History Continued
Partnership between Sony, Toshiba, and IBM formed in summer of 2000
Initial goal of 1000 × PS2 power in a single machine
March 2001: Sony-IBM-Toshiba design center opened with an investment of $400 million
5. Overall Goals for Cell
High performance in multimedia apps
Real-time performance
Minimum power consumption
Cost as low as possible
Available by 2005
Avoid memory latency issues associated with control structures
6. The Cell itself
PowerPC-based main core (PPE)
Multiple SPEs (Synergistic Processing Elements)
On-die memory controller
Inter-core transport bus
High-speed I/O
8. Cell Implementation
Cell is an architecture
Preliminary implementation
• 1 PPE
• 8 SPEs (1 disabled on PS3 to increase yield, leaving 7 usable)
• 221 mm² die size on a 90 nm process
• Clocked at 3-4 GHz
• 256 GFLOPS single precision @ 4 GHz
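The 256 GFLOPS figure can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes eight SPEs contribute, each sustaining the 8 single-precision operations per cycle quoted on the SPE Execution slide, and it ignores the PPE. The function name `peak_gflops` is a hypothetical helper, not part of any Cell toolchain.

```python
def peak_gflops(spe_count: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Peak single-precision throughput in GFLOPS (SPEs only, PPE ignored)."""
    return spe_count * flops_per_cycle * clock_ghz


# Assumed inputs: 8 SPEs, 8 single-precision ops per cycle each, 4 GHz clock.
print(peak_gflops(8, 8, 4.0))  # 256.0
```

At the lower end of the quoted clock range, or with only seven SPEs enabled, the same formula gives a proportionally smaller peak.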
9. Why a Cell Architecture
Follows a trend in computing architecture
Natural extension of dual- and multi-core
Extremely low hardware overhead
Software controllable
Specialized hardware more useful for multimedia
11. Power Processing Element
PowerPC instruction set with AltiVec
Used for general-purpose computing and controlling SPEs
Simultaneous multithreading
Separate 32 KB L1 instruction and data caches and unified 512 KB L2 cache
12. PPE (cont.)
Slow but power-efficient PowerPC instruction set implementation
Two-issue, in-order instruction fetch
Conspicuous lack of instruction window
Compare to conventional PowerPC implementations (G5)
Performance depends on SPE utilization
13. Synergistic Processing Element (SPE)
Specialized hardware
Meant to be used in parallel
• (7 on PS3 implementation)
On-chip memory (256 KB)
No branch prediction
In-order execution
Dual issue
14. SPE Architecture
0.99 µm² on 90 nm process
128 registers (128 bits wide)
• Instructions assumed to operate on 4 × 32-bit values
Variant of VMX instruction set
• Modified for 128 registers
On-chip memory is NOT a cache
15. SPE Execution
Dual issue, in-order
Seven execution units
Vector logic
8 single-precision operations per cycle
Significant performance hit for double precision
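One way to read the "8 single-precision operations per cycle" figure: a 128-bit register holds four 32-bit floats, and a multiply-add on all four lanes counts as one multiply plus one add per lane. The sketch below models that in plain Python. The names `vector_fma` and `FLOPS_PER_FMA` are illustrative, and Python rounds the multiply and the add separately, whereas fused hardware would round once.

```python
LANES = 4  # a 128-bit register holds four 32-bit floats


def vector_fma(a, b, c):
    """Lane-wise a*b + c on 4-element vectors, mimicking one SIMD instruction."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]


FLOPS_PER_FMA = 2 * LANES  # one multiply and one add in each lane

result = vector_fma([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [0.5, 0.5, 0.5, 0.5])
print(result)         # [2.5, 4.5, 6.5, 8.5]
print(FLOPS_PER_FMA)  # 8
```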
17. SPE Local Storage Area
NOT a cache
256 KB, 4 × 64 KB ECC single-port SRAM
Completely private to each SPE
Directly addressable by software
Can be used as a cache, but only with software controls
No tag bits or any extra hardware
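Because the local store has no tag bits, any cache behaviour has to be built in software. The following is a minimal sketch of a direct-mapped software cache in which the tags are ordinary program state. The class name, line size, and line count are hypothetical, and the list slice stands in for a DMA transfer from main memory.

```python
class SoftwareCache:
    """Direct-mapped cache whose tags are kept by software, not hardware."""

    def __init__(self, main_memory, num_lines=4, line_size=4):
        self.main_memory = main_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # software-held tags
        self.lines = [None] * num_lines  # data resident in the "local store"
        self.misses = 0

    def read(self, address):
        block = address // self.line_size
        index = block % self.num_lines
        if self.tags[index] != block:
            # Miss: software fetches the whole line (stand-in for a DMA get).
            start = block * self.line_size
            self.lines[index] = self.main_memory[start:start + self.line_size]
            self.tags[index] = block
            self.misses += 1
        return self.lines[index][address % self.line_size]


memory = list(range(100, 164))
cache = SoftwareCache(memory)
values = [cache.read(a) for a in (0, 1, 2, 3, 16, 0)]
print(values)        # [100, 101, 102, 103, 116, 100]
print(cache.misses)  # 3
```

Addresses 0 and 16 map to the same line here, so the final read of address 0 misses again. Every lookup also costs extra instructions, which is the price of having no hardware tags.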
18. SPE LS Scheduling
Software-controlled DMA
DMA to and from main memory
Scheduling a HUGE problem
• Done primarily in software
• IBM predicts 80-90% usage ideally
Request queue handles 16 simultaneous requests
• Up to 16 KB transfer each
• Priority: DMA, L/S, Fetch
Fetch / execute parallelism
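The queue limits above shape how software has to schedule transfers. As a rough sketch, a large transfer can be split into requests of at most 16 KB and issued in batches of at most 16. The function `plan_dma` is a hypothetical helper for illustration and does not correspond to any real Cell SDK call.

```python
MAX_DMA_BYTES = 16 * 1024  # per-request limit from the slide
QUEUE_DEPTH = 16           # simultaneous requests from the slide


def plan_dma(total_bytes):
    """Split a transfer into (offset, size) requests grouped into queue-sized batches."""
    requests = [
        (offset, min(MAX_DMA_BYTES, total_bytes - offset))
        for offset in range(0, total_bytes, MAX_DMA_BYTES)
    ]
    return [requests[i:i + QUEUE_DEPTH] for i in range(0, len(requests), QUEUE_DEPTH)]


batches = plan_dma(300 * 1024)  # a 300 KB stream, chosen arbitrarily
print([len(batch) for batch in batches])  # [16, 3]
print(batches[-1][-1])                    # (294912, 12288)
```

The second batch can only be issued once slots in the queue free up, which is one reason the scheduling burden falls on software.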
19. SPE Control Logic
Very little in comparison with conventional processors
Represents shift in focus
Complete lack of branch prediction
• Software branch prediction
• Loop unrolling
• 18-cycle penalty on a mispredicted branch
Software-controlled DMA
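A simple cost model shows why loop unrolling matters when a branch can cost 18 cycles. The sketch assumes the worst case, in which every loop-back branch pays the full penalty, and uses a made-up body cost of 4 cycles per iteration. Both are assumptions for illustration, not measured SPE figures.

```python
BRANCH_PENALTY = 18  # cycles, from the slide


def loop_cycles(iterations, body_cycles, unroll=1):
    """Cycle estimate when every loop-back branch pays the full penalty."""
    branches = -(-iterations // unroll)  # ceiling: one back-edge per unrolled group
    return iterations * body_cycles + branches * BRANCH_PENALTY


print(loop_cycles(1024, 4))            # 22528
print(loop_cycles(1024, 4, unroll=8))  # 6400
```

Under these assumptions, unrolling by 8 removes seven of every eight branches and cuts the estimate by more than a factor of three.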
20. SPE Pipeline
Little ILP, and thus little control logic
Dual issue
Simple commit unit (no reorder buffer or other complexities)
Same execution unit for FP/int
21. SPE Summary
Essentially a small vector computer
Based on AltiVec/VMX ISA
• Extensions for DMA and LS management
• Extended for 128 × 128-bit register file
Uniquely suited for real-time applications
Extremely fast for certain FP operations
Offloads a large amount of work onto the compiler / software
22. Element Interconnect Bus
4 concentric rings connecting all Cell elements
128-bit wide interconnects
23. EIB (cont.)
Designed to minimize coupling noise
Rings of data traveling in alternating directions
Buffers and repeaters at each SPE boundary
Architecture can be scaled up at the cost of increased bus latency
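With rings running in both directions, a transfer can take whichever way around is shorter. The sketch below picks a direction by hop count. It is a toy model rather than the real EIB arbiter, and the element count of 12 used in the example is an assumption for illustration.

```python
def shortest_ring_route(src, dst, n_elements):
    """Return (direction, hops) for the shorter way around a ring."""
    clockwise = (dst - src) % n_elements
    counterclockwise = (src - dst) % n_elements
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)


print(shortest_ring_route(1, 4, 12))   # ('clockwise', 3)
print(shortest_ring_route(1, 10, 12))  # ('counterclockwise', 3)
```

The same model shows the scaling cost noted above: the worst-case hop count is half the number of elements, so adding elements lengthens the longest route.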
24. EIB (cont.)
Total bandwidth of ~200 GB/s
EIB controller located physically in center of chip, between SPEs
Controller reserves channels for each individual data transfer request
Implementation allows for SPE extension horizontally
25. Memory Interface
Rambus XDR memory to keep Cell at full utilization
3.2 Gbps data rate per pin on each device connected to the XDR interface
Cell uses dual-channel XDR with four devices and 16-bit wide buses to achieve 25.6 GB/s total memory bandwidth
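Multiplying out the figures on this slide gives the total memory bandwidth. The sketch reads 3.2 Gbps as a per-pin data rate, which is an assumption about how the slide's figure should be interpreted, and the function name is hypothetical.

```python
def xdr_bandwidth_gbytes(gbps_per_pin, bus_width_bits, devices):
    """Aggregate bandwidth in GB/s: pins times rate, converted from bits to bytes."""
    return gbps_per_pin * bus_width_bits * devices / 8


# 3.2 Gbps per pin, 16-bit bus per device, four devices.
print(xdr_bandwidth_gbytes(3.2, 16, 4))  # 25.6
```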
26. Input / Output Bus
Rambus FlexIO bus
I/O interface consists of 12 unidirectional byte lanes
Each lane supports 6.4 GB/s bandwidth
7 outbound lanes and 5 inbound lanes
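The lane counts translate into asymmetric I/O bandwidth. This short calculation uses only the figures on the slide; the variable names are illustrative.

```python
LANE_GBYTES = 6.4          # GB/s per byte lane, from the slide
OUTBOUND, INBOUND = 7, 5   # lane counts, from the slide

outbound_bw = OUTBOUND * LANE_GBYTES
inbound_bw = INBOUND * LANE_GBYTES
total_bw = outbound_bw + inbound_bw

print(f"{outbound_bw:.1f} GB/s out, {inbound_bw:.1f} GB/s in, {total_bw:.1f} GB/s total")
# 44.8 GB/s out, 32.0 GB/s in, 76.8 GB/s total
```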
27. Design Choices
In-order execution
• Abandoning ILP
• ILP yields only a 10-20% increase per generation
• Reducing control logic
• Real-time responsiveness
Cache design
• Software configuration on SPE
• Standard L2 cache on PPE
28. Cell Programming Issues
No Cell compiler in existence to manage utilization of SPEs at compile time
SPEs do not natively support context switching; it must be OS managed
SPEs are vector processors and are not efficient for general-purpose computation
PPE and SPEs use different instruction sets
29. Cell Programming (cont.)
Functional Offload Model
Simplest model for Cell programming
Optimize existing libraries for SPE computation
Requires no rebuild of main application logic, which runs on PPE
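The functional offload idea can be sketched by analogy on an ordinary machine: the application keeps calling the same library interface, while the library hands the work to a pool of workers. Here a thread pool stands in for the SPEs. The class `OffloadedMathLibrary` and its methods are hypothetical, and nothing in this sketch targets real Cell hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def dot_product(pair):
    """The compute kernel that would be tuned for an SPE."""
    a, b = pair
    return sum(x * y for x, y in zip(a, b))


class OffloadedMathLibrary:
    """Same interface the application already calls; work goes to a worker pool."""

    def __init__(self, workers=7):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def batch_dot(self, pairs):
        return list(self._pool.map(dot_product, pairs))

    def close(self):
        self._pool.shutdown()


# Main application logic (the "PPE side") is unchanged: it only calls the library.
lib = OffloadedMathLibrary()
results = lib.batch_dot([([1, 2, 3], [4, 5, 6]), ([1, 0], [0, 1])])
lib.close()
print(results)  # [32, 0]
```

Only the library internals change in this model, which is why the application itself needs no rebuild.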
30. References
• "Synergistic Processing in Cell's Multicore Architecture" (PDF). IEEE. Retrieved 2007-03-22.
• "Cell Designer talks about PS3 and IBM Cell Processors". Retrieved 2007-03-22.
• "Cell Broadband Engine Interconnect and Memory Interface" (PDF). IBM. Retrieved 2007-03-22.
• http://en.wikipedia.org/wiki/Cell_(microprocessor)