Agenda


X86 PROCESSOR EVOLUTION



THE GPU AS AN ACCELERATOR



ACCELERATED PROCESSING UNITS



INTRODUCTION TO OpenCL
Evolving x86 Processors
AMD architecture
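“Istanbul” six-core diagram
[Diagram: native six-core processor. Cores 1 to 6 each have their own L2 cache and share an L3 cache; a crossbar connects them to the HyperTransport links and the memory controller, and HyperTransport links the processor to the chipset and PCI-e. Callouts: balanced caches, lower memory latency, fast full-duplex bus.]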
4P/24-core system example
very good scalability



[Diagram: four processors, each with its own attached memory]

One memory controller for every processor

Full-duplex HyperTransport links (up to 5.2GHz)

Bus optimization: HT Assist (cache probe filtering)

Still the only available 4P system with Direct Connect Architecture
Direct Connect Architecture 1.0
Balanced and Scalable Design to Support up to 6 Cores




[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]

    No front side bus
    HyperTransport™ technology
    Integrated memory controller
    NUMA memory architecture
Direct Connect Architecture 2.0
Balanced and Scalable Design to Support up to 16 Cores* per CPU




[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]

    • 1-hop between processors
    • Four memory channels
    • Up to 50% more DIMMs
    • Up to 33% increase in CPU to CPU communication speed±
What is next for x86 CPUs

• More processor cores to come
(12, 16, 16 double cores)


• More memory channels
(improves memory bandwidth per
core)


• Improved IPC
(8 per cycle is a target)
Top500 list - beyond the petaflop




Datacenters in the USA will spend more than $3 billion on energy in 2009
1997: Garry Kasparov vs. IBM Deep Blue
The World’s Most Powerful GPU




                    =
2011 GPU Architecture
    AMD Radeon™ HD 6900 Series
Dual graphics engines
New VLIW4 core architecture
Up to 24 SIMD engines
Up to 96 Texture Units
Upgraded render back-ends
    Improved anti-aliasing performance

Fast 256-bit GDDR5 memory interface
    Up to 5.5 Gbps

New GPU compute features
Designing very efficient GPUs
Full load: 180W; Idle:27W


[Chart: GFLOPS/W and GFLOPS/mm2 of successive ATI Radeon™ GPUs]

  GPU                         Date      GFLOPS/W   GFLOPS/mm2
  ATI Radeon™ X1800 XT        Nov-05    1.07       0.42
  ATI Radeon™ X1900 XTX       Jan-06    2.01       1.06
  ATI Radeon™ HD 2900 PRO     Sep-07    2.21       0.92
  ATI Radeon™ HD 3870         Nov-07    4.50       2.24
  ATI Radeon™ HD 4870         Jun-08    7.50       4.56
  ATI Radeon™ HD 5870         Oct-09    14.47      7.90
Old and New in High Performance Computing

Old: Power is free, Transistors are expensive
New: Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)


Old: Multiplies are slow, Memory access is fast
New: Multiplies fast, Memory slow
(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)


Old: Increasing Instruction Level Parallelism via compiler innovation
New: Explicit thread and data parallelism must be exploited
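
The latency figures quoted above (up to 200 clocks for a DRAM access, 4 clocks for an FP multiply) suggest a rough break-even point: how many multiplies fit in the time of one DRAM access. The sketch below is only a back-of-the-envelope model. It ignores caches, pipelining, and latency hiding, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Figures taken from the slide, in clock cycles. */
#define DRAM_ACCESS_CLOCKS 200
#define FP_MULTIPLY_CLOCKS 4

/* Hypothetical helper: number of back-to-back multiplies that take
   as long as one DRAM access under this simplified model. */
int multiplies_per_dram_access(int dram_clocks, int multiply_clocks)
{
    assert(multiply_clocks > 0);
    return dram_clocks / multiply_clocks;
}

/* Prints the break-even point for the slide's figures. */
void print_break_even(void)
{
    printf("%d multiplies per DRAM access\n",
           multiplies_per_dram_access(DRAM_ACCESS_CLOCKS, FP_MULTIPLY_CLOCKS));
}
```

Under these assumptions, code that performs fewer than about 50 multiplies per value fetched from DRAM spends most of its time waiting on memory.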
GPUs: more than just gaming

    Processing power – millions of operations per second
    Single Core       12
    Dual Core         24
    Quad Core         48
    Hexa Core         72
    12 Cores          144
    Radeon HD 5970    2700




                                        Both use GPUs

         Wii Sports - Golf                              Oil exploration platform - 2010

DirectX® 11 Multi-Threading

 Application, DirectX runtime, and DirectX driver can each run in separate
  threads
 Tasks like loading a texture or compiling a shader can execute in parallel
  with main rendering thread

                   DirectX® 10                   DirectX® 11




Today’s GPUs focused on


GAMING




ENTERTAINMENT




PRODUCTIVITY
DirectX® 11 Tessellation


                     DirectX® 10     DirectX® 11




                   No Tessellation   Tessellation

Images courtesy of Unigine Corp.




5/26/2011
Research companies already using




Oil exploration   Weather forecast   Fluid Dynamics   Nature simulation

AMD Balanced Platform
CPU is excellent for running some algorithms
       Ideal place to process if GPU is fully loaded
       Great use for additional CPU cores

GPU is ideal for data parallel algorithms like image processing, CAE, etc.
       Great use for ATI Stream technology
       Great use for additional GPUs

[Diagram: Serial/Task-Parallel Workloads, Graphics Workloads, Other Highly Parallel Workloads]

Delivers optimal performance for a wide range of platform configurations
ATI Stream Technology is…

Heterogeneous: Developers leverage AMD GPUs and x86
CPUs for optimal application performance and user experience

High performance: Massively parallel, programmable GPU
architecture delivers unprecedented performance and power
efficiency

Industry Standards: OpenCL™ and DirectCompute 11 enable
cross-platform development




  Sciences   Government   Engineering   Gaming   Digital Content Creation   Productivity
Improvements already reached consumers



[Chart: processor utilization, axis from 0% to 80%; label: ATI Stream]

 Adobe Flash plugin used by Youtube.com
  Better image quality and video smoothness
  Lower processor usage
GPU-accelerated video transcoding




       HD Video                                iPod Video




           Up to 6x faster when using an AMD graphics card
Video Transcoding Sample
No GPU Acceleration
Using four CPU cores

  CPU usage: 100%     Time to finish: 1h 52m     Total energy: 0.23 kWh
  GPU usage: 1%       Peak power: 145W           Energy price: $0.15
Video Transcoding Sample
ATI GPU Acceleration
Using hundreds of Stream Processors (CPU-only values in parentheses)

  CPU usage: 45% (100%)   Time to finish: 26m (1h 52m)   Total energy: 0.11 kWh (0.23 kWh)
  GPU usage: 35% (1%)     Peak power: 198W (145W)        Energy price: $0.07 ($0.15)
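
The two transcoding samples can be compared with a little arithmetic. The sketch below uses only the figures from these two slides (1h 52m against 26m, 0.23 against 0.11 for total energy). The function names are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the two "Video Transcoding Sample" slides. */
#define CPU_ONLY_MINUTES 112.0  /* 1h 52m without GPU acceleration */
#define GPU_MINUTES       26.0  /* with ATI GPU acceleration */
#define CPU_ONLY_KWH       0.23
#define GPU_KWH            0.11

/* Speedup is the baseline time divided by the accelerated time. */
double speedup(double baseline_minutes, double accelerated_minutes)
{
    assert(accelerated_minutes > 0.0);
    return baseline_minutes / accelerated_minutes;
}

/* Fraction of the baseline energy that the accelerated run saves. */
double energy_saved_fraction(double baseline_kwh, double accelerated_kwh)
{
    assert(baseline_kwh > 0.0);
    return 1.0 - accelerated_kwh / baseline_kwh;
}

/* Prints both ratios for the slide's figures. */
void print_transcoding_summary(void)
{
    printf("speedup: %.2fx\n", speedup(CPU_ONLY_MINUTES, GPU_MINUTES));
    printf("energy saved: %.1f%%\n",
           100.0 * energy_saved_fraction(CPU_ONLY_KWH, GPU_KWH));
}
```

With these inputs the GPU-accelerated run is roughly 4.3 times faster and uses about half the energy, even though its peak power is higher.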
FUSION TECHNOLOGY
Today




  Multi-core CPU                  TeraFLOPS-class GPU
  ~800 million transistors        Up to 2 billion transistors
  Multi-tasking                   Gaming on multiple monitors
                                  Full HD video and audio
A new era in performance evolution

Single-Core era
   Challenge: power consumption, complexity
Multi-Core era
   Challenge: power consumption, software
Heterogeneous computing era
   Pros: performance, power efficient
   Cons: software availability

[Charts: one performance curve per era, with x-axes Time, Time x Cores, and Time. The first y-axis is labeled Single-thread Performance. Each chart carries a "We are here" marker, and the charts include a "?" annotation.]
A new era in performance evolution

[Diagram: the CPU track runs from Single-Core to Multi-Core (core efficiency); the GPU track covers Gaming, Multimedia, and Software Acceleration.]
Putting it all together – The Future is Fusion
[Diagram, left: AMD “Istanbul” six-core processor, with cores 1 to 6 and their L2 caches, the L3 cache, the crossbar, HyperTransport, the memory controller, the chipset, and PCI-e. Right: RV500 GPU core (2006), a central memory controller surrounded by four ring stops and eight client interfaces.]
Putting it all together – The Future is Fusion
[Diagram, left: AMD “Istanbul” six-core processor, with cores 1 to 6 and their L2 caches, the L3 cache, the crossbar, HyperTransport, the memory controller, the chipset, and PCI-e. Right: RV700 GPU core (2008-2009).]
Putting it all together – The Future is Fusion
[Diagram: the AMD “Istanbul” six-core processor and the RV700 GPU core side by side, each labeled with a crossbar.]
2011: welcome to the APU era!




CPU                    APU                      GPU

 “Supercomputing power in a notebook platform whose
            battery lasts for a full day”
One Design, Fewer Watts, Massive Capability

Northbridge + Dual-Core CPU + Discrete-level DirectX® 11 GPU = “Zacate” AMD Fusion APU

  Component                          Area          Power
  Northbridge                        66 sq. mm     13 watts
  Dual-Core CPU                      117 sq. mm    25 watts
  Discrete-level DirectX® 11 GPU     59 sq. mm     8 watts
  “Zacate” AMD Fusion APU            75 sq. mm     18 watts
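
The comparison on this slide comes down to simple sums. The sketch below adds up the area and power of the three separate parts (66, 117, and 59 sq. mm; 13, 25, and 8 watts) and compares the totals with the 75 sq. mm, 18 watt “Zacate” APU. The `Chip` struct and the function names are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one part on the slide. */
typedef struct {
    const char *name;
    double area_mm2;
    double watts;
} Chip;

/* Sum of the areas of `count` chips. */
double total_area(const Chip *chips, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += chips[i].area_mm2;
    return sum;
}

/* Sum of the power figures of `count` chips. */
double total_watts(const Chip *chips, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += chips[i].watts;
    return sum;
}

/* Percentage by which `after` is smaller than `before`. */
double reduction_percent(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Prints the totals and reductions for the slide's figures. */
void print_zacate_comparison(void)
{
    const Chip parts[] = {
        {"Northbridge", 66.0, 13.0},
        {"Dual-Core CPU", 117.0, 25.0},
        {"DirectX 11 GPU", 59.0, 8.0},
    };
    const Chip zacate = {"Zacate APU", 75.0, 18.0};
    double area = total_area(parts, 3);
    double watts = total_watts(parts, 3);

    printf("separate parts: %.0f sq. mm, %.0f watts\n", area, watts);
    printf("area reduction: %.1f%%\n", reduction_percent(area, zacate.area_mm2));
    printf("power reduction: %.1f%%\n", reduction_percent(watts, zacate.watts));
}
```

On the slide's figures, the three separate parts total 242 sq. mm and 46 watts, so the APU is roughly 69% smaller and draws about 61% less power.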
Graphics and Media Processing Efficiency Improvements

2010 IGP-based Platform
[Diagram: a CPU chip (CPU cores, UNB, MC) with DDR3 DIMM memory, and a separate chip holding the GPU, UVD, SB functions, and PCIe. Bandwidth labels: ~17 GB/sec (shown twice) and ~7 GB/sec.]
   Graphics requires memory bandwidth to bring full capabilities to life
   Bandwidth pinch points and latency hold back the GPU capabilities

2011 APU-based Platform
[Diagram: a single APU chip (CPU cores, UNB / MC, UVD, GPU, PCIe) with DDR3 DIMM memory. Bandwidth label: ~27 GB/sec (shown twice).]
   3X bandwidth between GPU and memory
   Even the same sized GPU is substantially more effective in this configuration
   Eliminate latency and power associated with the extra chip crossing
   Substantially smaller physical footprint
“Ontario” & “Zacate” Architecture
 APU
 >2 x86 CPU Cores (40nm “Bobcat” core – 1 MB
  L2, 64-bit FPU)
 >C6 and power gating
 >Array of SIMD Engines
   • DX11 graphics performance
   • Industry leading 3D and graphics processing
 >3rd Generation Unified Video Decoder
       >H.264, VC1, DivX/Xvid format
 >DDR3 800-1066, 2 DIMMs, 64 bit channel
 >BGA package




 Display and I/O
 >Two dedicated digital display interfaces
   • Configurable externally as HDMI, DVI, and/or
     Display Port
   • Also supports a single link LVDS for internal
     panels
 >Integrated VGA
 >5x8 PCIe®
 > “Hudson” Fusion Controller Hub
OpenCL
Working together
ATI Stream SDK:
OpenCL™ For Multicore x86 CPUs and GPUs
http://developer.amd.com/

 The Power of Fusion: Developers leverage heterogeneous
    architecture to deliver superior user experience
 • First complete OpenCL™ development platform
 • Certified OpenCL 1.0 compliant by the Khronos Group
 •   Write code that can scale well on multi-core CPUs and GPUs
 •   AMD delivers on the promise of OpenCL™, with both high-
     performance CPU and GPU technologies
 •   Available for download now as part of ATI Stream SDK beta
     program – includes documentation, samples, and developer
     support
OpenCL™: Game-Changing Development
Enabling Broad Adoption of GP-GPU Capabilities



    Industry standard API: Open, multiplatform development
     platform for heterogeneous architectures
    The power of Fusion: Leverages CPUs and GPUs for
     balanced system approach
    Broad industry support: Created by architects from AMD,
     Apple, IBM, Intel, Nvidia, Sony, etc.
    Fast track development: Ratified in December; AMD is the
     first company to provide a complete OpenCL solution
    Momentum: Enormous interest from mainstream
     developers and application ISVs


              More stream-enabled applications across
                all markets
Open Standards:
Maximize Developer Freedom and Addressable Market

  Vendor specific: cross-platform limiters
  • Apple Display Connector
  • 3dfx Glide
  • Nvidia CUDA
  • Nvidia Cg
  • Rambus
  • Unified Display Interface

  Vendor neutral: cross-platform enablers
  • Digital Visual Interface
  • OpenCL™
  • DirectX®
  • Certified DP
  • JEDEC
  • OpenGL®
Comparing OpenCL™ and DirectX® 11 DirectCompute


How will developers choose between OpenCL™ and DirectX® 11
DirectCompute?
 Feature set is similar in both APIs
DirectX® 11 DirectCompute
 Easiest path to add compute capabilities to existing DirectX
  applications
 Windows Vista® and Windows® 7 only
OpenCL™
 Ideal path for new applications porting to the GPU for the first
  time
 True multiplatform: Windows®, Linux®, MacOS
 Natural programming without dealing with a graphics API
Anatomy of OpenCL™


                             Language Specification
  • C-based cross-platform programming interface
  • Subset of ISO C99 with language extensions - familiar to developers
  • Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
  • Online or offline compilation and build of compute kernel executables
  • Includes a rich set of built-in functions



                                 Platform Layer API

  • A hardware abstraction layer over diverse computational resources
  • Query, select and initialize compute devices
  • Create compute contexts and work-queues



                                     Runtime API
  • Execute compute kernels
  • Manage scheduling, compute, and memory resources
OpenCL Example

                                       Scalar

   void square(int n, const float *a, float *result)
   {
      int i;
      for (i=0; i<n; i++)
         result[i] = a[i] * a[i];
   }



                                  Data-Parallel

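   // Kernels return void, and pointer arguments need an address space qualifier
   kernel void dp_square(global const float *a, global float *result)
   {
     int id = get_global_id(0);   // index of this work-item in dimension 0
     result[id] = a[id] * a[id];
   }

   // dp_square executes over "n" work-items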
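
To make the "n work-items" idea concrete, the sketch below emulates the data-parallel kernel in plain C. This is not OpenCL host code. It swaps the runtime for a sequential loop, and `emu_get_global_id` and `run_dp_square` are hypothetical stand-ins introduced here. A real OpenCL runtime may execute the work-items in parallel and in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the state behind OpenCL's get_global_id(0):
   the emulator sets it before each kernel invocation. */
static size_t current_global_id;

static size_t emu_get_global_id(void)
{
    return current_global_id;
}

/* Body of dp_square written as an ordinary C function. */
static void dp_square(const float *a, float *result)
{
    size_t id = emu_get_global_id();
    result[id] = a[id] * a[id];
}

/* Emulates launching dp_square over n work-items, one after another. */
void run_dp_square(size_t n, const float *a, float *result)
{
    assert(a != NULL && result != NULL);
    for (size_t id = 0; id < n; id++) {
        current_global_id = id;   /* each work-item sees its own id */
        dp_square(a, result);
    }
}
```

Calling `run_dp_square(n, a, result)` produces the same values as the scalar `square` function, which shows that the kernel body is just the loop body with the loop index replaced by the work-item id.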
Summary


X86 PROCESSOR EVOLUTION



THE GPU AS AN ACCELERATOR



ACCELERATED PROCESSING UNITS


INTRODUCTION TO OpenCL
http://developer.amd.com

Thank you!
roberto.brandao@amd.com

Weitere ähnliche Inhalte

Was ist angesagt?

Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiAnkit Raj
 
Leveraging Open Source Integration with WSO2 Enterprise Service Bus
Leveraging Open Source Integration with WSO2 Enterprise Service BusLeveraging Open Source Integration with WSO2 Enterprise Service Bus
Leveraging Open Source Integration with WSO2 Enterprise Service BusWSO2
 
Trend - HPC-29mai2012
Trend - HPC-29mai2012Trend - HPC-29mai2012
Trend - HPC-29mai2012Agora Group
 
Realtime scheduling for virtual machines in SKT
Realtime scheduling for virtual machines in SKTRealtime scheduling for virtual machines in SKT
Realtime scheduling for virtual machines in SKTThe Linux Foundation
 
16 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 201216 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 2012Daniel Mar
 
PV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream QemuPV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream QemuThe Linux Foundation
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack eurobsdcon
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaumeurobsdcon
 
Storage Class Memory: Technology Overview & System Impacts
Storage Class Memory: Technology Overview & System ImpactsStorage Class Memory: Technology Overview & System Impacts
Storage Class Memory: Technology Overview & System ImpactsZhichao Liang
 
Vcpfaq
VcpfaqVcpfaq
Vcpfaqpeddin
 
Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi coremukul bhardwaj
 
Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1
Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1
Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1Damir Bersinic
 

Was ist angesagt? (18)

San & Virutualisation
San & VirutualisationSan & Virutualisation
San & Virutualisation
 
Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash Prajapati
 
Leveraging Open Source Integration with WSO2 Enterprise Service Bus
Leveraging Open Source Integration with WSO2 Enterprise Service BusLeveraging Open Source Integration with WSO2 Enterprise Service Bus
Leveraging Open Source Integration with WSO2 Enterprise Service Bus
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
 
Trend - HPC-29mai2012
Trend - HPC-29mai2012Trend - HPC-29mai2012
Trend - HPC-29mai2012
 
Realtime scheduling for virtual machines in SKT
Realtime scheduling for virtual machines in SKTRealtime scheduling for virtual machines in SKT
Realtime scheduling for virtual machines in SKT
 
16 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 201216 August 2012 - SWUG - Hyper-V in Windows 2012
16 August 2012 - SWUG - Hyper-V in Windows 2012
 
PV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream QemuPV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream Qemu
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
 
Storage Class Memory: Technology Overview & System Impacts
Storage Class Memory: Technology Overview & System ImpactsStorage Class Memory: Technology Overview & System Impacts
Storage Class Memory: Technology Overview & System Impacts
 
Linux on System z – disk I/O performance
Linux on System z – disk I/O performanceLinux on System z – disk I/O performance
Linux on System z – disk I/O performance
 
Vcpfaq
VcpfaqVcpfaq
Vcpfaq
 
Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi core
 
Multicore computers
Multicore computersMulticore computers
Multicore computers
 
Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1
Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1
Prairie DevCon-What's New in Hyper-V in Windows Server "8" Beta - Part 1
 
9P Overview
9P Overview9P Overview
9P Overview
 
Hp Integrity Servers
Hp Integrity ServersHp Integrity Servers
Hp Integrity Servers
 

Ähnlich wie Evolving x86 Processors and GPU Accelerators

HP - HPC-29mai2012
HP - HPC-29mai2012HP - HPC-29mai2012
HP - HPC-29mai2012Agora Group
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 HardwareJacob Wu
 
Sandy bridge platform from ttec
Sandy bridge platform from ttecSandy bridge platform from ttec
Sandy bridge platform from ttecTTEC
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureAllan Cantle
 
Computação acelerada – a era das ap us roberto brandão, ciência
Computação acelerada – a era das ap us   roberto brandão,  ciênciaComputação acelerada – a era das ap us   roberto brandão,  ciência
Computação acelerada – a era das ap us roberto brandão, ciênciaCampus Party Brasil
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrjRoberto Brandao
 
POWER9 AC922 Newell System - HPC & AI
POWER9 AC922 Newell System - HPC & AI POWER9 AC922 Newell System - HPC & AI
POWER9 AC922 Newell System - HPC & AI Anand Haridass
 
Argonne's Theta Supercomputer Architecture
Argonne's Theta Supercomputer ArchitectureArgonne's Theta Supercomputer Architecture
Argonne's Theta Supercomputer Architectureinside-BigData.com
 
QsNetIII, An HPC Interconnect For Peta Scale Systems
QsNetIII, An HPC Interconnect For Peta Scale SystemsQsNetIII, An HPC Interconnect For Peta Scale Systems
QsNetIII, An HPC Interconnect For Peta Scale SystemsFederica Pisani
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programminginside-BigData.com
 
04536342
0453634204536342
04536342fidan78
 
Multi-core architectures
Multi-core architecturesMulti-core architectures
Multi-core architecturesnextlib
 
Shared Memory Centric Computing with CXL & OMI
Shared Memory Centric Computing with CXL & OMIShared Memory Centric Computing with CXL & OMI
Shared Memory Centric Computing with CXL & OMIAllan Cantle
 
IBM System x3850 X5 Technical Presentation
IBM System x3850 X5 Technical PresentationIBM System x3850 X5 Technical Presentation
IBM System x3850 X5 Technical PresentationCliff Kinard
 
Compute Blades
Compute BladesCompute Blades
Compute Bladesjpaugh
 
Amd Barcelona Presentation Slideshare
Amd Barcelona Presentation SlideshareAmd Barcelona Presentation Slideshare
Amd Barcelona Presentation SlideshareDon Scansen
 

Evolving x86 Processors and GPU Accelerators

  • 2. Agenda: x86 processor evolution; the GPU as an accelerator; accelerated processing units; introduction to OpenCL.
  • 4. AMD architecture: “Istanbul” six-core diagram. [Diagram: native six-core processor with cores 1 to 6, one L2 cache per core, a shared L3 cache, a crossbar, a memory controller, and HyperTransport links to the chipset and PCI-e.] Callouts: balanced caches; lower memory latency; fast full-duplex bus.
  • 5. 4P/24-core system example: very good scalability. One memory controller for every processor. Full-duplex HyperTransport links (up to 5.2GHz). Bus optimization: HT Assist (cache probe filtering). Still the only available 4P system with Direct Connect Architecture.
  • 6. Direct Connect Architecture 1.0: balanced and scalable design to support up to 6 cores. [Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU.] No front-side bus; HyperTransport™ technology; integrated memory controller; NUMA memory architecture.
  • 7. Direct Connect Architecture 2.0: balanced and scalable design to support up to 16 cores* per CPU. [Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU.] • 1 hop between processors • Four memory channels • Up to 50% more DIMMs • Up to 33% increase in CPU-to-CPU communication speed±
  • 8. What is next for x86 CPUs: • More processor cores to come (12, 16, 16 double cores) • More memory channels (improves memory bandwidth per core) • Improved IPC (8 per cycle is a target)
  • 9. Top500 list: beyond the petaflop. Datacenters in the USA will spend more than $3 billion on energy in 2009.
  • 10. 1997: Garry Kasparov vs. IBM Deep Blue
  • 11. The World’s Most Powerful GPU =
  • 12. 2011 GPU Architecture: AMD Radeon™ HD 6900 Series. Dual graphics engines; new VLIW4 core architecture; up to 24 SIMD engines; up to 96 texture units; upgraded render back-ends (improved anti-aliasing performance); fast 256-bit GDDR5 memory interface (up to 5.5 Gbps); new GPU compute features.
  • 13. Designing very efficient GPUs. Full load: 180 W; idle: 27 W. [Chart: GFLOPS/W and GFLOPS/mm2 by GPU generation: ATI Radeon™ X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08), HD 5870 (Oct-09).]
  • 14. Old and New in High Performance Computing. Old: power is free, transistors are expensive. New: power is expensive, transistors are free (can put more transistors on a chip than one can afford to turn on). Old: multiplies are slow, memory access is fast. New: multiplies are fast, memory is slow (up to 200 clocks to DRAM memory, 4 clocks for an FP multiply). Old: increasing instruction-level parallelism via compiler innovation. New: explicit thread and data parallelism must be exploited.
  • 15. GPUs: more than just gaming. Processing power, in millions of operations per second: single core 12; dual core 24; quad core 48; hexa core 72; 12 cores 144; Radeon HD 5970 2700. Both use GPUs: Wii Sports (Golf) and an oil exploration platform (2010).
  • 16. DirectX® 11 Multi-Threading: • Application, DirectX runtime, and DirectX driver can each run in separate threads • Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread. [Diagram: DirectX® 10 vs. DirectX® 11.]
  • 17. Today’s GPUs focused on: gaming, entertainment, productivity.
  • 18. DirectX® 11 Tessellation. [Images: DirectX® 10 (no tessellation) vs. DirectX® 11 (tessellation). Images courtesy of Unigine Corp.]
  • 21. Research companies already using GPUs: oil exploration; weather forecasting; fluid dynamics; nature simulation.
  • 22. AMD Balanced Platform. GPU is ideal for data-parallel algorithms like image processing, CAE, etc.: • Great use for ATI Stream technology • Great use for additional GPUs. CPU is excellent for running some algorithms: • Ideal place to process if the GPU is fully loaded • Great use for additional CPU cores. Workload classes: graphics workloads; serial/task-parallel workloads; other highly parallel workloads. Delivers optimal performance for a wide range of platform configurations.
  • 23. ATI Stream Technology is… Heterogeneous: developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience. High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency. Industry standards: OpenCL™ and DirectCompute 11 enable cross-platform development. Markets: sciences; government; engineering; gaming; digital content creation; productivity.
  • 24. Improvements already reached consumers. [Chart: processor utilization, 0% to 80%, labeled ATI Stream; Adobe Flash plugin used by Youtube.com.] • Better image quality and video smoothness • Lower processor usage
  • 25. GPU-accelerated video transcoding: iPod video; HD video. Up to 6x faster when using an AMD graphics card.
  • 26. Video transcoding sample: no GPU acceleration, using four CPU cores. CPU usage: 100%. GPU usage: 1%. Time to finish: 1h 52m. Total energy: 0.23 kWh. Peak power: 145 W. Energy price: $0.15.
  • 27. Video transcoding sample: ATI GPU acceleration, using hundreds of Stream Processors (figures without GPU acceleration in parentheses). CPU usage: 45% (100%). GPU usage: 35% (1%). Time to finish: 26m (1h 52m). Total energy: 0.11 kWh (0.23). Peak power: 198 W (145 W). Energy price: $0.07 ($0.15).
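
The two transcoding slides (26 and 27) give enough figures to work out the gain directly. The following is a minimal sketch in plain C, not part of the original deck; the helper names `speedup` and `energy_saving_fraction` are hypothetical and introduced only for illustration.

```c
/* Hypothetical helpers for comparing the two transcoding runs. */

/* How many times faster the accelerated run finishes. */
static double speedup(double baseline_minutes, double accelerated_minutes)
{
    return baseline_minutes / accelerated_minutes;
}

/* Fraction of the baseline energy saved by the accelerated run. */
static double energy_saving_fraction(double baseline_energy,
                                     double accelerated_energy)
{
    return 1.0 - accelerated_energy / baseline_energy;
}
```

With the slide figures, `speedup(112.0, 26.0)` is about 4.3 (1h 52m is 112 minutes) and `energy_saving_fraction(0.23, 0.11)` is about 0.52, so the accelerated run draws a higher peak power (198 W against 145 W) yet uses roughly half the energy because it finishes much sooner.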
  • 29. Today: multi-core CPU (~800 million transistors); TeraFLOPS-class GPU (up to 2 billion transistors). Multi-tasking; gaming across multiple monitors; Full HD video and audio.
  • 30. A new era in performance evolution. Single-core: challenges are power consumption and complexity. Multi-core: challenges are power consumption and software availability. Heterogeneous computing: pros are performance and power efficiency; cons: software. [Charts: single-thread performance vs. time; performance vs. time x cores; performance (?) vs. time; each marked "We are here".]
  • 31. A new era in performance evolution. [Diagram labels: Single-Core; Multi-Core; CPU; Core efficiency; Software; Acceleration; Multimedia; Gaming; GPU.]
  • 32. Putting it all together: The Future is Fusion. [Diagram: AMD “Istanbul” six-core processor (cores 1 to 6 with L2 caches, L3 cache, crossbar, memory controller, HyperTransport, PCI-e, chipset) beside the RV500 GPU core (2006), a ring bus with ring stops, client interfaces, and a memory controller.]
  • 33. Putting it all together: The Future is Fusion. [Diagram: AMD “Istanbul” six-core processor beside the RV700 GPU core (2008-2009).]
  • 34. Putting it all together: The Future is Fusion. [Diagram: AMD “Istanbul” six-core processor and RV700 GPU core, each shown with its crossbar.]
  • 35. 2011: welcome to the APU time! CPU APU GPU “Supercomputing power in a notebook platform whose battery lasts for a full day”
  • 36. One Design, Fewer Watts, Massive Capability. Northbridge (66 sq. mm, 13 watts) + AMD dual-core CPU (117 sq. mm, 25 watts) + discrete-level DirectX® 11 GPU (59 sq. mm, 8 watts) = “Zacate” Fusion APU (75 sq. mm, 18 watts).
  • 37. Graphics and Media Processing Efficiency Improvements: 2010 IGP-based platform vs. 2011 APU-based platform. [Diagrams: CPU chip with CPU cores, UNB/MC, DDR3 DIMM memory, and a separate GPU/UVD/SB chip reached over PCIe, compared with a single APU chip holding CPU cores, UNB, MC, GPU, and UVD; bandwidth labels ~17 GB/sec, ~7 GB/sec, and ~27 GB/sec.] Graphics requires memory bandwidth to bring full capabilities to life. In the 2010 platform, bandwidth pinch points and latency hold back the GPU capabilities. In the 2011 platform: • 3X bandwidth between GPU and memory • Even the same sized GPU is substantially more effective in this configuration • Eliminates latency and power associated with the extra chip crossing • Substantially smaller physical footprint
  • 38. “Ontario” & “Zacate” Architecture. APU: > 2 x86 CPU cores (40nm “Bobcat” core, 1 MB L2, 64-bit FPU) > C6 and power gating > Array of SIMD engines • DX11 graphics performance • Industry-leading 3D and graphics processing > 3rd-generation Unified Video Decoder > H.264, VC1, DivX/Xvid formats > DDR3 800-1066, 2 DIMMs, 64-bit channel > BGA package. Display and I/O: > Two dedicated digital display interfaces • Configurable externally as HDMI, DVI, and/or DisplayPort • Also supports a single-link LVDS for internal panels > Integrated VGA > 5x8 PCIe® > “Hudson” Fusion Controller Hub
  • 40. ATI Stream SDK: OpenCL™ for multicore x86 CPUs and GPUs (http://developer.amd.com/). The Power of Fusion: developers leverage heterogeneous architecture to deliver superior user experience. • First complete OpenCL™ development platform • Certified OpenCL 1.0 compliant by the Khronos Group • Write code that can scale well on multi-core CPUs and GPUs • AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies • Available for download now as part of the ATI Stream SDK beta program; includes documentation, samples, and developer support
  • 41. OpenCL™: Game-Changing Development Enabling Broad Adoption of GP-GPU Capabilities. • Industry standard API: open, multiplatform development platform for heterogeneous architectures • The power of Fusion: leverages CPUs and GPUs for a balanced system approach • Broad industry support: created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc. • Fast track development: ratified in December; AMD is the first company to provide a complete OpenCL solution • Momentum: enormous interest from mainstream developers and application ISVs. Result: more stream-enabled applications across all markets.
  • 42. Open Standards: Maximize Developer Freedom and Addressable Market. Vendor specific (cross-platform limiters): • Apple Display Connector • 3dfx Glide • Nvidia CUDA • Nvidia Cg • Rambus • Unified Display Interface. Vendor neutral (cross-platform enablers): Digital Visual Interface; OpenCL™; DirectX®; Certified DP; JEDEC; OpenGL®.
  • 43. Comparing OpenCL™ and DirectX® 11 DirectCompute. How will developers choose between OpenCL™ and DirectX® 11 DirectCompute? • Feature set is similar in both APIs. DirectX® 11 DirectCompute: • Easiest path to add compute capabilities to existing DirectX applications • Windows Vista® and Windows® 7 only. OpenCL™: • Ideal path for new applications porting to the GPU for the first time • True multiplatform: Windows®, Linux®, MacOS • Natural programming without dealing with a graphics API
  • 44. Anatomy of OpenCL™. Language Specification: • C-based cross-platform programming interface • Subset of ISO C99 with language extensions, familiar to developers • Well-defined numerical accuracy: IEEE 754 rounding behavior with defined maximum error • Online or offline compilation and build of compute kernel executables • Includes a rich set of built-in functions. Platform Layer API: • A hardware abstraction layer over diverse computational resources • Query, select and initialize compute devices • Create compute contexts and work-queues. Runtime API: • Execute compute kernels • Manage scheduling, compute, and memory resources
  • 45. OpenCL Example
      Scalar:
        void square(int n, const float *a, float *result)
        {
            int i;
            for (i = 0; i < n; i++)
                result[i] = a[i] * a[i];
        }
      Data-parallel:
        kernel void dp_square(global const float *a, global float *result)
        {
            int id = get_global_id(0);   /* index of this work-item */
            result[id] = a[id] * a[id];
        }
        // dp_square executes over "n" work-items
  • 46. Summary: x86 processor evolution; the GPU as an accelerator; accelerated processing units; introduction to OpenCL. http://developer.amd.com
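
The data-parallel kernel on slide 45 can be read as the body of the scalar loop, with the loop itself handed over to the runtime. The following is a minimal plain-C sketch of that idea, not AMD's implementation: `dp_square_item` and `launch_1d` are hypothetical names introduced here, and the sequential loop only stands in for an OpenCL runtime, which is free to run work-items in parallel and in any order.

```c
#include <stddef.h>

/* Body of one work-item. It mirrors dp_square, but the index is passed
   in explicitly because plain C has no get_global_id(). */
static void dp_square_item(size_t id, const float *a, float *result)
{
    result[id] = a[id] * a[id];
}

/* Hypothetical stand-in for the runtime: invoke the work-item body once
   for every index of a one-dimensional range of n work-items. */
static void launch_1d(size_t n, const float *a, float *result)
{
    for (size_t id = 0; id < n; id++)
        dp_square_item(id, a, result);
}
```

Calling `launch_1d(4, a, result)` with `a = {1, 2, 3, -4}` fills `result` with the squares 1, 4, 9, 16. Because each work-item writes only its own element, the iterations are independent, which is what lets a real runtime spread them across CPU cores or GPU SIMD engines.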