IBM Cell

Design Issues of IBM Cell Architecture
Vitthal Gutthe MEIT 1326
Pravin kumar Yadav MEIT 1338
Vyanktesh Dorlikar MEIT 1324

Contents
 General Introduction
 History of development
 Technical overview of architecture
 Detailed technical discussion of components
 Design choices
 Cell programming issues

History of Development
 Sony PlayStation 2
• Released March 2000 in Japan
• 128-bit “Emotion Engine”
• MIPS CPU clocked at 294 MHz
• Capable of 6.2 GFLOPS (giga floating-point operations per second)

History Continued
 Partnership between Sony, Toshiba, and IBM formed in summer 2000
 Initial goal of 1000× the PS2's processing power in a single machine
 March 2001: Sony-IBM-Toshiba design center opened with an investment of $400M

Overall Goals for Cell
 High performance in multimedia applications
 Real-time performance
 Minimal power consumption
 Cost as low as possible
 Available by 2005
 Avoid memory latency issues associated with control structures

The Cell Itself
 PowerPC-based main core (PPE)
 Multiple Synergistic Processing Elements (SPEs)
 On-die memory controller
 Inter-core transport bus
 High-speed I/O

Cell Die Layout
[Figure: Cell die layout]

Cell Implementation
 Cell is an architecture
 Preliminary implementation
• 1 PPE
• 8 SPEs (1 disabled to improve yield, leaving 7 usable in the PS3)
• 221 mm² die size on a 90 nm process
• Clocked at 3-4 GHz
• 256 GFLOPS single precision @ 4 GHz
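
The 256 GFLOPS figure can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not a benchmark: it assumes 8 SPEs on the die and 8 single-precision operations per SPE per cycle (the per-cycle figure quoted on the SPE Execution slide). The function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(num_spes: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak = SPEs x FLOPs per SPE per cycle x clock in GHz."""
    return num_spes * flops_per_cycle * clock_ghz


# Assumption: 8 SPEs, each retiring 8 single-precision FLOPs per cycle.
print(peak_gflops(num_spes=8, flops_per_cycle=8, clock_ghz=4.0))  # 256.0
print(peak_gflops(num_spes=8, flops_per_cycle=8, clock_ghz=3.0))  # 192.0
```

These are peak numbers only. Sustained throughput depends on how well the SPEs are kept busy.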

Why a Cell Architecture
 Follows a trend in computing architecture
 Natural extension of dual- and multi-core designs
 Extremely low hardware overhead
 Software controllable
 Specialized hardware is more useful for multimedia

Possible Uses
 PlayStation 3 (obviously)
 Blade servers (IBM)
• Very high single-precision FP performance
• Scientific applications
 Toshiba HDTV products

Power Processing Element (PPE)
 PowerPC instruction set with AltiVec
 Used for general-purpose computing and for controlling the SPEs
 Simultaneous multithreading
 Separate 32 KB L1 instruction and data caches, and a unified 512 KB L2 cache

PPE (cont.)
 Slow but power-efficient implementation of the PowerPC instruction set
 Two-issue, in-order instruction fetch
 Conspicuous lack of an instruction window
 Compare to conventional PowerPC implementations (G5)
 Overall performance depends on SPE utilization

Synergistic Processing Element (SPE)
 Specialized hardware
 Meant to be used in parallel
• (7 on the PS3 implementation)
 On-chip memory (256 KB)
 No branch prediction
 In-order execution
 Dual issue

SPE Architecture
 0.99 µm² on 90 nm process
 128 registers (128 bits wide)
• Instructions assumed to operate on 4 × 32-bit values
 Variant of the VMX instruction set
• Modified for 128 registers
 On-chip memory is NOT a cache

SPE Execution
 Dual issue, in-order
 Seven execution units
 Vector logic
 8 single-precision operations per cycle
 Significant performance hit for double precision
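
One way to read the "8 single-precision operations per cycle" figure is as a multiply-add across four 32-bit lanes of a 128-bit register, counted as two operations per lane. That reading is an assumption. The sketch below models it in plain Python, and `vector_madd` is an illustrative name, not an SPE instruction.

```python
LANES = 128 // 32  # four 32-bit lanes in one 128-bit register


def vector_madd(a, b, c):
    """Lane-wise a*b + c on 4-element vectors; returns (result, flops)."""
    assert len(a) == len(b) == len(c) == LANES
    result = [x * y + z for x, y, z in zip(a, b, c)]
    flops = 2 * LANES  # one multiply and one add per lane
    return result, flops


result, flops = vector_madd(
    [1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [0.5, 0.5, 0.5, 0.5]
)
print(result, flops)  # [2.5, 4.5, 6.5, 8.5] 8
```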

SPE Execution Diagram
[Figure: SPE execution diagram]

SPE Local Storage Area
 NOT a cache
 256 KB, organized as 4 × 64 KB ECC single-port SRAM
 Completely private to each SPE
 Directly addressable by software
 Can be used as a cache, but only under software control
 No tag bits or any other cache hardware
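
Since the local store has no tag bits, any cache behaviour has to be built in software. The sketch below is a minimal direct-mapped software cache that keeps its tags in an ordinary list. The line size, the number of lines, and the class name `SoftwareCache` are assumptions for illustration, and the slice copy stands in for a DMA transfer from main memory.

```python
LINE_SIZE = 128   # bytes per cached line (assumed)
NUM_LINES = 16    # lines reserved in the local store (assumed)


class SoftwareCache:
    """Direct-mapped cache whose tags are ordinary software data."""

    def __init__(self, main_memory: bytes):
        self.main_memory = main_memory
        self.local_store = bytearray(LINE_SIZE * NUM_LINES)  # stand-in for part of the LS
        self.tags = [None] * NUM_LINES  # maintained by software, not hardware
        self.misses = 0

    def read_byte(self, address: int) -> int:
        line_number = address // LINE_SIZE
        slot = line_number % NUM_LINES
        if self.tags[slot] != line_number:
            # Miss: copy the whole line in, as a DMA transfer would.
            start = line_number * LINE_SIZE
            line = self.main_memory[start:start + LINE_SIZE]
            base = slot * LINE_SIZE
            self.local_store[base:base + len(line)] = line
            self.tags[slot] = line_number
            self.misses += 1
        return self.local_store[slot * LINE_SIZE + address % LINE_SIZE]


memory = bytes(range(256)) * 16  # 4096 bytes of fake main memory
cache = SoftwareCache(memory)
values = [cache.read_byte(a) for a in (0, 1, 130, 2)]
print(values, cache.misses)  # [0, 1, 130, 2] 2
```

Every lookup pays for the tag comparison in instructions, which is the cost of having no tag hardware.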

SPE LS Scheduling
 Software-controlled DMA
 DMA to and from main memory
 Scheduling is a major problem
• Done primarily in software
• IBM predicts 80-90% utilization in the ideal case
 Request queue handles 16 simultaneous requests
• Up to 16 KB per transfer
• Priority: DMA, load/store, fetch
 Fetch / execute parallelism
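
The queue limits above shape how software has to schedule transfers. The sketch below does two things under those limits: it splits a large transfer into requests of at most 16 KB, grouped into batches of at most 16, and it lays out a double-buffered schedule in which one chunk is fetched while the previous one is computed. The function names are hypothetical, and no real DMA is issued.

```python
MAX_TRANSFER = 16 * 1024  # 16 KB per DMA request (from the slide)
QUEUE_DEPTH = 16          # simultaneous requests (from the slide)


def plan_dma(total_bytes: int):
    """Split a transfer into <=16 KB (offset, size) requests, in batches of <=16."""
    requests = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_TRANSFER, total_bytes - offset)
        requests.append((offset, size))
        offset += size
    return [requests[i:i + QUEUE_DEPTH] for i in range(0, len(requests), QUEUE_DEPTH)]


def double_buffer_schedule(num_chunks: int):
    """Per step: (chunk being fetched, chunk being computed); None means idle."""
    steps = []
    for step in range(num_chunks + 1):
        fetching = step if step < num_chunks else None
        computing = step - 1 if step >= 1 else None
        steps.append((fetching, computing))
    return steps


# 300 KB streamed through the local store needs 19 requests: one full batch plus 3.
print([len(batch) for batch in plan_dma(300 * 1024)])  # [16, 3]
print(double_buffer_schedule(3))  # [(0, None), (1, 0), (2, 1), (None, 2)]
```

Only the first fetch and the last compute step run alone. Every other step overlaps transfer with computation, which is the fetch / execute parallelism the slide refers to.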

SPE Control Logic
 Very little in comparison with conventional cores
 Represents a shift in focus
 Complete lack of hardware branch prediction
• Software branch prediction
• Loop unrolling
• 18-cycle penalty for a mispredicted branch
 Software-controlled DMA
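
A toy cost model shows why loop unrolling and software branch hints matter with an 18-cycle penalty. The model is an assumption made for illustration, not a description of the real pipeline. Without a hint, every taken loop-closing branch is charged the penalty. With a correct hint, only the final loop exit is charged.

```python
MISPREDICT_PENALTY = 18  # cycles, from the slide


def loop_cycles(iterations: int, body_cycles: int, unroll: int = 1, hinted: bool = False) -> int:
    """Toy model: body work plus the penalty on each mispredicted loop branch."""
    branches = -(-iterations // unroll)  # loop-closing branches executed (ceiling division)
    if hinted:
        mispredicted = min(1, branches)      # only the final loop exit
    else:
        mispredicted = max(branches - 1, 0)  # every taken back-edge
    return iterations * body_cycles + mispredicted * MISPREDICT_PENALTY


print(loop_cycles(1000, 4))               # 21982
print(loop_cycles(1000, 4, unroll=4))     # 8482
print(loop_cycles(1000, 4, hinted=True))  # 4018
```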

SPE Pipeline
 Little ILP, and thus little control logic
 Dual issue
 Simple commit unit (no reorder buffer or other complexities)
 Same execution unit for FP/int

SPE Summary
 Essentially a small vector computer
 Based on the AltiVec/VMX ISA
• Extensions for DMA and LS management
• Extended for a 128 × 128-bit register file
 Uniquely suited for real-time applications
 Extremely fast for certain FP operations
 Offloads a large amount of work onto the compiler / software

Element Interconnect Bus (EIB)
 4 concentric rings connecting all Cell elements
 128-bit wide interconnects

EIB (cont.)
 Designed to minimize coupling noise
 Rings carry data in alternating directions
 Buffers and repeaters at each SPE boundary
 Architecture can be scaled up at the cost of increased bus latency
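
With rings running in both directions, a transfer can in principle take the shorter way around, and adding more stops lengthens the worst-case path. The sketch below illustrates both points on an idealized ring. The stop count of 12 and the function `ring_hops` are assumptions for illustration, and the real arbitration logic is not modelled.

```python
def ring_hops(src: int, dst: int, num_stops: int):
    """Return (direction, hops) for the shorter way around a ring."""
    clockwise = (dst - src) % num_stops
    counter = (src - dst) % num_stops
    if clockwise <= counter:
        return "clockwise", clockwise
    return "counterclockwise", counter


NUM_STOPS = 12  # assumed number of ring stops; the slides give no count
print(ring_hops(0, 3, NUM_STOPS))   # ('clockwise', 3)
print(ring_hops(0, 10, NUM_STOPS))  # ('counterclockwise', 2)
print(ring_hops(0, 6, NUM_STOPS))   # ('clockwise', 6)
```

In this model the worst case is `num_stops // 2` hops, so the latency cost of scaling up grows with the number of elements on the ring.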

EIB (cont.)
 Total bandwidth of ~200 GB/s
 EIB controller is physically located in the center of the chip, between the SPEs
 Controller reserves channels for each individual data transfer request
 Implementation allows SPEs to be added horizontally

Memory Interface
 Rambus XDR memory to keep Cell at full utilization
 3.2 Gbps data rate per pin for each device connected to the XDR interface
 Cell uses dual-channel XDR with four devices and 16-bit wide buses to achieve 25.6 GB/s total memory bandwidth
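
The total memory bandwidth follows from the device count, bus width, and data rate. The sketch below assumes the 3.2 Gbps figure applies per pin, in which case four devices × 16 bits × 3.2 Gbps comes to 25.6 GB/s. The function name is illustrative only.

```python
def xdr_bandwidth_gb_per_s(devices: int, bus_bits: int, gbps_per_pin: float) -> float:
    """Aggregate bandwidth in GB/s: total pins x per-pin rate, bits to bytes."""
    total_gbps = devices * bus_bits * gbps_per_pin
    return total_gbps / 8


print(xdr_bandwidth_gb_per_s(devices=4, bus_bits=16, gbps_per_pin=3.2))  # 25.6
```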

Input / Output Bus
 Rambus FlexIO bus
 I/O interface consists of 12 unidirectional byte lanes
 Each lane supports 6.4 GB/s of bandwidth
 7 outbound lanes and 5 inbound lanes
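
The lane counts and per-lane rate above imply the following aggregate peak figures. This is plain arithmetic on the slide's numbers. The function name is illustrative, and rounding to one decimal only hides floating-point noise.

```python
def flexio_bandwidth(outbound_lanes: int = 7, inbound_lanes: int = 5,
                     lane_gb_per_s: float = 6.4):
    """Return (outbound, inbound, total) peak bandwidth in GB/s."""
    outbound = round(outbound_lanes * lane_gb_per_s, 1)
    inbound = round(inbound_lanes * lane_gb_per_s, 1)
    return outbound, inbound, round(outbound + inbound, 1)


print(flexio_bandwidth())  # (44.8, 32.0, 76.8)
```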

Design Choices
 In-order execution
• Abandoning ILP
• ILP yields only a 10-20% increase per generation
• Reducing control logic
• Real-time responsiveness
 Cache design
• Software configuration on the SPEs
• Standard L2 cache on the PPE

Cell Programming Issues
 No Cell compiler exists that can manage SPE utilization at compile time
 SPEs do not natively support context switching; it must be managed by the OS
 SPEs are vector processors and are not efficient for general-purpose computation
 The PPE and SPEs use different instruction sets

Cell Programming (cont.)
 Functional Offload Model
 Simplest model for Cell programming
 Optimize existing libraries for SPE computation
 Requires no rebuild of the main application logic, which runs on the PPE
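
The functional offload model can be sketched as a library routine that keeps its serial signature and farms the work out internally. In the sketch below a Python thread pool stands in for the SPEs. This is an analogy only: it is not how SPE code is actually loaded or run, and the names `dot` and `_dot_chunk` and the worker count are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 7  # stand-in for the SPEs available to the application (assumed)


def _dot_chunk(pair):
    """Work done on one worker: partial dot product of a chunk."""
    xs, ys = pair
    return sum(x * y for x, y in zip(xs, ys))


def dot(xs, ys, chunk=4):
    """Library routine: same signature as a serial dot product, offloaded inside."""
    pieces = [(xs[i:i + chunk], ys[i:i + chunk]) for i in range(0, len(xs), chunk)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return sum(pool.map(_dot_chunk, pieces))


# Main application logic (the "PPE side") is unchanged: it just calls dot().
print(dot(list(range(10)), [2] * 10))  # 90
```

Only the library changes. Callers keep using the same function, which is why the main application needs no rebuild in this model.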

