directCell: Cell/B.E. tightly coupled via PCI Express
Heiko J Schick – IBM Deutschland R&D GmbH, November 2010
Agenda: Section 1: directCell; Section 2: Building Blocks; Section 3: Summary; Section 4: PCI Express Gen 3.
Terminology: An inline accelerator is an accelerator that runs sequentially with the main compute engine. A core accelerator is a mechanism that accelerates the performance of a single core; a core may run multiple hardware threads, as in an SMT implementation. A chip accelerator is an off-chip mechanism that boosts the performance of the primary compute chip; graphics accelerators are typically of this type. A system accelerator is a network-attached appliance that boosts the performance of a primary multinode system; Azul is an example of a system accelerator. Section 1: directCell
Remote Control: Our goal is to remotely control a chip accelerator via a device driver running on the primary compute chip. The chip accelerator does not run an operating system, merely a firmware-based bare-metal support library that facilitates the host-based device driver. Requirements: operation (e.g. start and stop acceleration), memory-mapped I/O (e.g. Cell Broadband Architecture), special instructions, interrupts, memory compatibility, and a bus/interconnect (e.g. PCI Express, PCI Express endpoint). Section 1: directCell
What is tightly coupled? Distributed systems are state of the art. Tightly coupled means: used as a device rather than a system; completely integrated into the host's global address space; I/O attached; commonly referred to as a “hybrid”; OS-less and controlled by the host; driven by interactive workloads (for example, a button is pressed); pluggable into existing form factors. Section 1: directCell
Why tightly coupled? Customers want to purchase applied acceleration; the classic appliance box will be deprecated by modular and hybrid approaches. Deployment and serviceability matter: a system needs to be installed and administered, and nobody is happy with accelerators that have to be programmed. Ship working appliance kernels, with only the software involvement that is required. Section 1: directCell
PCI Express Features: Computer expansion card interface format; replacement for PCI, PCI-X and AGP as the industry standard for PCs (workstation and server). Serial interconnect based on differential signals with 4 wires per lane; each lane transmits 250 MB/s per direction, and up to 32 lanes per link provide 8 GB/s per direction. Low latency. Memory-mapped I/O (MMIO) and direct memory access (DMA) are key concepts. Section 1: directCell
Cell/B.E. Accelerator via PCI Express: Connect the Cell/B.E. system as a PCI Express device to a host system. The operating system runs only on the host system (e.g. Linux, Windows), and the main application runs on the host system; compute-intensive tasks run as threads on the SPEs, using the same Cell/B.E. programming models as for non-hybrid systems. The result is a three-level memory hierarchy instead of a two-level one. The Cell/B.E. processor does not run any operating system; MMIO and DMA are used as access methods in both directions. Section 1: directCell
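To make the programming-model point concrete, here is a minimal sketch of the conventional libspe2 flow a non-hybrid Cell/B.E. application uses; under directCell the same calls would be served by a host-side library over PCI Express rather than by a local kernel. The embedded SPU program name spu_kernel is a placeholder, and error handling is reduced to the essentials.

    /* Host-side view of running an SPE task (build against libspe2, e.g. gcc main.c -lspe2). */
    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t spu_kernel;   /* hypothetical embedded SPU ELF image */

    int main(void)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (ctx == NULL) { perror("spe_context_create"); return 1; }

        if (spe_program_load(ctx, &spu_kernel) != 0) {
            perror("spe_program_load");
            return 1;
        }

        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop_info;
        /* Blocks until the SPU program stops; argp/envp would pass effective
           addresses that the SPE fetches into its local store via MFC DMA. */
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, &stop_info) < 0) {
            perror("spe_context_run");
            return 1;
        }

        spe_context_destroy(ctx);
        return 0;
    }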
PCI Express Cabling Products. Section 1: directCell
Cell/B.E. Accelerator System. Block diagram of the standalone Cell/B.E. system: the application main thread, SPU threads and operating system on the PPE; SPU tasks on the SPEs, each with an SPU core, local store and MFC (DMA and MMIO registers); all connected via the EIB to the L2 cache, Cell/B.E. memory and the southbridge with its DMA engine. Section 1: directCell
Cell/B.E. Accelerator System. Block diagram of the tightly coupled configuration: the application main thread and operating system now run on the host processor with host memory, while SPU threads and SPU tasks run on the Cell/B.E. side (PPE, SPEs with local stores, MFCs and DMA/MMIO registers, EIB, Cell/B.E. memory); the two southbridges are connected by a PCI Express link with a DMA engine. Section 1: directCell
Building Block #1: Interconnect. PCI Express support is currently included in many front-office systems, hence most accelerator innovation will take place via PCI Express. Intel's QPI and PCI Express convergence (Core i5/i7) drives a strong movement to make I/O a native subset of the front-side bus. PCI Express endpoint (EP) support in modern processors is the only real option for tightly coupled interconnects. PCI Express offers bifurcation and hot-plug support. Current ECNs (ATS, TLP Hints, Atomic Ops) must be included in those designs! Section 2: Building Blocks
Building Block #2: Addressing (1). Integration on the bus level: the host BIOS or firmware maps accelerators via PCI Express BARs (increase the BAR size in EP designs; Resizable BAR ECN). Bus-level integration scales well: 2^64 bytes = 16 exabytes = 16K petabytes, so entire clusters of SoCs can be mapped into the host address space. Section 2: Building Blocks
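As a hedged illustration of what "mapping an accelerator via a PCI Express BAR" looks like from host userspace on Linux, the sketch below mmaps the sysfs resource file that corresponds to BAR0. The PCI address 0000:01:00.0 and the 64 KiB window size are assumptions, not values from the slides.

    /* Map BAR0 of a PCIe device into userspace via sysfs and read a register. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BAR_SIZE 0x10000UL   /* assumed size of the BAR0 window */

    int main(void)
    {
        /* resource0 corresponds to BAR0; the device address is a placeholder. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open resource0"); return 1; }

        volatile uint32_t *bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        printf("register at offset 0 reads 0x%08x\n", bar[0]);  /* MMIO load */

        munmap((void *)bar, BAR_SIZE);
        close(fd);
        return 0;
    }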
Building Block #2: Addressing (2). Inbound address translation: PIM/POM, IOMMUs, etc.; switch-based; PCIe ATS specification. PCIe Address Translation Services allow EP virtual-to-real address translation for DMA: the application provides a VA pointer to the EP, and the host uses the EP's VA pointer to program it. Userspace DMA problem: buffers on the accelerator and the host need to be pinned for asynchronous DMA transfers, while kernel involvement should be minimal. On Linux, the UIO framework and HugeTLBfs are needed; on Windows, UMDF and large pages are needed. Section 2: Building Blocks
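A minimal sketch of the pinned-buffer side of the userspace-DMA problem, assuming a Linux kernel with MAP_HUGETLB support (2.6.32 or later) and pre-allocated 2 MiB huge pages; on older kernels the same effect requires a file in a mounted hugetlbfs. Handing the buffer's bus address to the accelerator's DMA engine is left to a hypothetical host driver.

    /* Allocate a huge-page-backed buffer suitable for asynchronous DMA. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* assuming 2 MiB huge pages */

    int main(void)
    {
        /* Huge pages are not swapped out, so the buffer stays resident while
           the accelerator writes into it; the driver still has to translate
           the virtual address into a bus address for the DMA engine. */
        void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

        memset(buf, 0, HUGE_PAGE_SIZE);   /* touch the page so it is backed */
        /* ... pass the buffer to the host driver, which programs the DMA engine ... */

        munmap(buf, HUGE_PAGE_SIZE);
        return 0;
    }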
Building Block #3: Run-time Control. Keep software on the accelerator minimal; the device driver runs on the host system; include DMA engine(s) on the accelerator. Control mechanisms: MMIO can easily be mapped as a virtual file system (UIO, sketched below), and the PCIe core of the accelerator should be able to map the entire MMIO range. Special instructions are clumsy to map as a virtual file system; expose them to userspace as a system call or IOCTL, and make the fixed-length parameter area user accessible; the PCI Express core of the accelerator should be able to dispatch a special instruction to every unit in the accelerator. Include helper registers, scratchpads, doorbells and ring buffers. Section 2: Building Blocks
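For the MMIO-via-UIO control path named above, here is a hedged sketch of how a host userspace process could program an accelerator register and wait for an interrupt through the Linux UIO framework; /dev/uio0, the 4 KiB map size and the doorbell offset are assumptions standing in for whatever the real directCell driver exposes.

    /* Touch accelerator MMIO registers and wait for interrupts via UIO. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MMIO_SIZE     0x1000UL   /* assumed size of UIO map 0 */
    #define DOORBELL_OFF  0x0        /* hypothetical doorbell register offset */

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);   /* device node created by the UIO driver */
        if (fd < 0) { perror("open /dev/uio0"); return 1; }

        /* UIO exposes memory region N at mmap offset N * page size; region 0 here. */
        volatile uint32_t *regs = mmap(NULL, MMIO_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        regs[DOORBELL_OFF / 4] = 1;   /* MMIO store: ring the accelerator's doorbell */

        uint32_t irq_count;
        /* A blocking 4-byte read on the UIO fd returns after the next interrupt. */
        if (read(fd, &irq_count, sizeof(irq_count)) == (ssize_t)sizeof(irq_count))
            printf("interrupt received, total count %u\n", irq_count);

        munmap((void *)regs, MMIO_SIZE);
        close(fd);
        return 0;
    }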
directCell Operation. Block diagram showing the numbered steps of a directCell operation across the host processor, host memory, the PCI Express link and the Cell/B.E. system (PPE, SPEs with local stores, MFC DMA and MMIO registers, EIB, Cell/B.E. memory, southbridge DMA engine). Section 2: Building Blocks
Prototype: concept validation. HS21 Intel Xeon blade connected to a QS2x Cell/B.E. blade via PCI Express x4. Special firmware on the QS2x Cell/B.E. blade configures the PCI connector as an endpoint. Microsoft Windows as the OS on the HS21 blade, with a Windows device driver enabling user-space access to the QS2x. Working and verified: DMA transfers from and to Cell/B.E. memory from a Windows application; DMA transfers from and to the local store from a Windows application; access to Cell/B.E. MMIO registers; start of an SPE thread from Windows (thread context is not preserved); SPE DMA to host memory via PCI Express. Memory management code; user libraries on Windows to abstract Cell/B.E. usage (compatible with libspe); SPE context save and restore (needed for proper multi-thread execution). Section 3: Summary
Project Review: A technology study proposed to target new application domains and markets. Use Cell as an acceleration device: all system management is done from the host system (GPGPU-like accelerator), which enables Cell on Wintel platforms; the Cell/B.E. system has no dependency on an OS; compute-intensive tasks run as threads on SPEs; MMIO and DMA operations via PCI Express reach any memory-mapped resource of the Cell/B.E. system from the host, and vice versa. Exhibits a new runtime model for processors: it shows that a processor designed for standalone operation can be fully integrated into another host system. Section 3: Summary
New Features: Atomic Operations, TLP Processing Hints, TLP Prefix, Resizable BAR, Dynamic Power Allocation, Latency Tolerance Reporting, Multicast, Internal Error Reporting, Alternative Routing-ID Interpretation, Extended Tag Enable Default, Single Root I/O Virtualization, Multi Root I/O Virtualization, Address Translation Services. Section 4: PCI Express Gen 3
Thank you very much for your attention.
Atomic Operations: This optional normative ECN defines three new PCIe transactions, each of which carries out a specific Atomic Operation (“AtomicOp”) on a target location in Memory Space. The three AtomicOps are FetchAdd (Fetch and Add), Swap (Unconditional Swap) and CAS (Compare and Swap). Direct support for these three AtomicOps over PCIe enables easier migration of existing high-performance SMP applications to systems that use PCIe as the interconnect to tightly coupled accelerators, co-processors, or GP-GPUs. Section 4: PCI Express Gen 3. Source: PCI-SIG, Atomic Operations ECN
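To illustrate the semantics only (not the PCIe transactions themselves), the C11 snippet below performs the same three operations on an ordinary memory location; a PCIe AtomicOp Requester obtains exactly this read-modify-write behaviour on a Completer's Memory Space.

    /* Host-language illustration of FetchAdd, Swap and CAS semantics (C11). */
    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
        _Atomic unsigned int target = 10;   /* stand-in for a location in Memory Space */

        /* FetchAdd: return the old value and add the operand. */
        unsigned int old = atomic_fetch_add(&target, 5);

        /* Swap: unconditionally exchange the value. */
        unsigned int prev = atomic_exchange(&target, 42);

        /* CAS: write the new value only if the current value equals 'expected'. */
        unsigned int expected = 42;
        int swapped = atomic_compare_exchange_strong(&target, &expected, 7);

        printf("old=%u prev=%u swapped=%d final=%u\n",
               old, prev, swapped, (unsigned int)atomic_load(&target));
        return 0;
    }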
TLP Processing Hints: This optional normative ECR defines a mechanism by which a Requester can provide hints on a per-transaction basis to facilitate optimized processing of transactions that target Memory Space. The architected mechanisms may be used to associate system processing resources (e.g. caches) with the processing of Requests from specific Functions, or to enable optimized system-specific (e.g. system interconnect and memory) processing of Requests. Providing such information enables the Root Complex and Endpoint to optimize the handling of Requests by differentiating data likely to be reused soon from bulk flows that could monopolize system resources. Section 4: PCI Express Gen 3. Source: PCI-SIG, Processing Hints ECN
TLP Prefix: Emerging usage model trends indicate a requirement to increase header sizes to carry more information than can be accommodated in the currently defined TLP header sizes. The TLP Prefix mechanism extends the header by adding DWORDs carrying additional information to the front of the header, and provides architectural headroom for PCIe headers to grow in the future. Switches and Switch-related software can be built that are transparent to the encoding of future End-End TLPs. The End-End TLP Prefix mechanism defines rules for routing elements to route TLPs containing End-End TLP Prefixes without requiring the routing element logic to explicitly support any specific End-End TLP Prefix encoding(s). Section 4: PCI Express Gen 3. Source: PCI-SIG, TLP Prefix ECN
Resizable BAR: This optional ECN adds a capability for Functions with BARs to report the various sizes of their memory-mapped resources at which they will operate properly, and an ability for software to program the size that the BAR is configured to. The Resizable BAR Capability allows system software to allocate all resources in systems where the total amount of resources requesting allocation, plus the amount of installed system memory, is larger than the supported address space. Section 4: PCI Express Gen 3. Source: PCI-SIG, Resizable BAR ECN
Dynamic Power Allocation: DPA (Dynamic Power Allocation) extends existing PCIe device power management to provide active (D0) device power management substates for appropriate devices, while comprehending existing PCIe PM capabilities including PCI-PM and Power Budgeting. Section 4: PCI Express Gen 3. Source: PCI-SIG, Dynamic Power Allocation ECN
Latency Tolerance Reporting: This ECR proposes to add a new mechanism for Endpoints to report their service latency requirements for Memory Reads and Writes to the Root Complex, such that central platform resources (such as main memory, RC internal interconnects, snoop resources, and other resources associated with the RC) can be power managed without impacting Endpoint functionality and performance. Current platform Power Management (PM) policies guesstimate when devices are idle (e.g. using inactivity timers). Guessing wrong can cause performance issues, or even hardware failures; in the worst case, users/admins will disable PM to allow functionality at the cost of increased platform power consumption. This ECR impacts Endpoint devices, RCs and Switches that choose to implement the new optional feature. Section 4: PCI Express Gen 3. Source: PCI-SIG, Latency Tolerance Reporting ECN
Multicast: This optional normative ECN adds Multicast functionality to PCI Express by means of an Extended Capability structure for applicable Functions in Root Complexes, Switches, and components with Endpoints. The Capability structure defines how Multicast TLPs are identified and routed, and provides means for checking and enforcing send permission with Function-level granularity. The ECN identifies Multicast errors and adds an MC Blocked TLP error to AER for reporting those errors. Multicast allows a single Posted Request TLP sent from a source to be distributed to multiple recipients, resulting in a very high performance gain when applicable. Section 4: PCI Express Gen 3. Source: PCI-SIG, Multicast ECN
Internal Error Reporting: PCI Express (PCIe) defines error signaling and logging mechanisms for errors that occur on a PCIe interface and for errors that occur on behalf of transactions initiated on PCIe. It does not define error signaling and logging mechanisms for errors that occur within a component or that are unrelated to a particular PCIe transaction. This ECN defines optional error signaling and logging mechanisms for all components except PCIe to PCI/PCI-X Bridges (i.e., Switches, Root Complexes, and Endpoints) to report internal errors that are associated with a PCI Express interface. Errors that occur within components but are not associated with PCI Express remain outside the scope of the specification. Section 4: PCI Express Gen 3. Source: PCI-SIG, Internal Error Reporting ECN
Alternative Routing-ID Interpretation: For virtualized and non-virtualized environments, a number of PCI-SIG member companies have requested that the current constraints on the number of Functions allowed per multi-Function Device be increased to accommodate the needs of next-generation I/O implementations. This ECR specifies a new method to interpret the Device Number and Function Number fields within Routing IDs, Requester IDs, and Completer IDs, thereby increasing the number of Functions that can be supported by a single Device. Alternative Routing-ID Interpretation (ARI) enables next-generation I/O implementations to support an increased number of concurrent users of a multi-Function device while providing the same level of isolation and controls found in existing implementations. Section 4: PCI Express Gen 3. Source: PCI-SIG, Alternative Routing-ID Interpretation ECN
Extended Tag Enable Default: The change allows a Function to use Extended Tag fields (256 unique tag values) by default; this is done by allowing the Extended Tag Enable control field to be set by default. The obligatory 32 tags provided by PCIe per Function are not sufficient to meet the throughput requirements of emerging applications; extended tags allow up to 256 concurrent requests, but that capability is not enabled by default in PCIe. Section 4: PCI Express Gen 3. Source: PCI-SIG, Extended Tag Enable Default ECN
Single Root I/O Virtualization: The specification is focused on single-root topologies, e.g. a single computer that supports virtualization technology. Within the industry, significant effort has been expended to increase effective hardware resource utilization (i.e., application execution) through the use of virtualization technology. The Single Root I/O Virtualization and Sharing Specification (SR-IOV) defines extensions to the PCI Express (PCIe) specification suite to enable multiple System Images (SI) to share PCI hardware resources. Section 4: PCI Express Gen 3. Source: PCI-SIG, Single Root I/O Virtualization Specification
Multi Root I/O Virtualization: The specification is focused on multi-root topologies, e.g. a server blade enclosure that uses a PCI Express® Switch-based topology to connect server blades to PCI Express Devices or PCI Express-to-PCI Bridges and enable the leaf Devices to be serially or simultaneously shared by one or more System Images (SI). Unlike the Single Root IOV environment, independent SIs may execute on disparate processing components such as independent server blades. The Multi-Root I/O Virtualization (MR-IOV) specification defines extensions to the PCI Express (PCIe) specification suite to enable multiple non-coherent Root Complexes (RCs) to share PCI hardware resources. Section 4: PCI Express Gen 3. Source: PCI-SIG, Multi Root I/O Virtualization Specification
Address Translation Services: This specification describes the extensions required to allow PCI Express Devices to interact with an address translation agent (TA) in or above a Root Complex (RC) so that translations of DMA addresses can be cached in the Device. The purpose of having an Address Translation Cache (ATC) in a Device is to minimize latency and to provide a scalable, distributed caching solution that improves I/O performance while alleviating TA resource pressure. Section 4: PCI Express Gen 3. Source: PCI-SIG, Address Translation Services Specification
Disclaimer: IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic Server™, IBM, BladeCenter and POWER and others are trademarks of the IBM Corporation in the US and/or other countries. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.
