Introduction to Multicore Architecture

Tao Zhang, Oct. 21, 2010
Overview
• Part 1: General multicore architecture
• Part 2: GPU architecture
Part 1:
General Multicore Architecture
Uniprocessor Performance (SPECint)

[Figure: SPECint performance relative to the VAX-11/780, 1978–2006, on a log scale. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present (roughly 3X below the earlier trend line)

⇒ Sea change in chip design: multiple “cores” or processors per chip
Conventional Wisdom (CW) in Computer Architecture

• Old CW: Chips reliable internally, errors at pins
• New CW: ≤65 nm ⇒ high soft & hard error rates
• Old CW: Demonstrate new ideas by building chips
• New CW: Mask costs, ECAD costs, GHz clock rates ⇒ researchers can’t build believable prototypes
• Old CW: Innovate via compiler optimizations + architecture
• New CW: Takes > 10 years before a new optimization at a leading conference gets into production compilers
• Old CW: Hardware is hard to change, SW is flexible
• New CW: Hardware is flexible, SW is hard to change
Conventional Wisdom (CW) in Computer Architecture

• Old CW: Power is free, transistors expensive
• New CW: “Power wall”: power expensive, transistors free (can put more on chip than you can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: “Memory wall”: memory slow, multiplies fast (200 clocks to DRAM, 4 clocks for an FP multiply)
• Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall”: diminishing returns on more ILP
• New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Uniprocessor performance only 2X / 5 yrs?
The Memory Wall

• On-die caches are both area-intensive and power-intensive
  – The StrongARM dissipates more than 43% of its power in caches
  – Caches incur huge area costs

ECE 4100/6100 (21)




The Power Wall

P ≈ C·Vdd²·f + Vdd·Ist + Vdd·Ileak
(dynamic switching power + short-circuit power + leakage power)

• Power per transistor scales with frequency but also scales with Vdd
  – Lower Vdd can be compensated for with increased pipelining to keep throughput constant
  – Power per transistor is not the same as power per area; power density is the problem!
  – Multiple units can be run at lower frequencies to keep throughput constant, while saving power

ECE 4100/6100 (22)
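To make the scaling concrete, here is a minimal numeric sketch of the power equation above; the constants are illustrative units, not measured silicon data.

```python
def power(c, vdd, f, i_st, i_leak):
    """P = C*Vdd^2*f (dynamic) + Vdd*I_st (short-circuit) + Vdd*I_leak (leakage)."""
    return c * vdd**2 * f + vdd * i_st + vdd * i_leak

# Baseline core, arbitrary units.
base = power(c=1.0, vdd=1.0, f=1.0, i_st=0.05, i_leak=0.10)

# Scale Vdd and f down by 30% each: the dynamic term falls cubically
# (0.7^3 ~ 0.34), while the other terms fall only linearly with Vdd.
scaled = power(c=1.0, vdd=0.7, f=0.7, i_st=0.05, i_leak=0.10)

print(base, scaled)
```

This is why voltage/frequency scaling buys so much: the dominant dynamic term shrinks with the cube of a combined Vdd and f reduction.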
The Current Power Trend

[Figure: power density (W/cm²) versus year, 1970–2010, on a log scale. Source: Intel Corp. The 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 sit in the 1–10 W/cm² range; the Pentium and P6 approach hot-plate territory; extrapolating the trend passes a nuclear reactor (~100 W/cm²) and a rocket nozzle (~1000 W/cm²) on the way toward the sun’s surface (~10000 W/cm²).]

ECE 4100/6100 (23)




Improving Power/Performance

P ≈ C·Vdd²·f + Vdd·Ist + Vdd·Ileak

• Consider constant die size and decreasing core area each generation = more cores/chip
  – Lowering voltage and frequency ⇒ power reduction
  – Increasing cores/chip ⇒ performance increase

Better power/performance!

ECE 4100/6100 (24)
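The slide's argument can be sketched numerically. Assuming a perfectly parallel workload and only the dynamic term of the power equation (all numbers hypothetical):

```python
def dynamic_power(n_cores, vdd, f, c=1.0):
    # Dynamic term of P = C*Vdd^2*f, summed over identical cores.
    return n_cores * c * vdd**2 * f

def throughput(n_cores, f):
    # Assume throughput scales with cores x frequency (perfectly parallel workload).
    return n_cores * f

# One fast core vs. four cores at 80% voltage and 80% frequency:
single_perf, single_power = throughput(1, 1.0), dynamic_power(1, 1.0, 1.0)
quad_perf, quad_power = throughput(4, 0.8), dynamic_power(4, 0.8, 0.8)

print(quad_perf / single_perf)    # ~3.2x the throughput...
print(quad_power / single_power)  # ...for ~2.05x the power
```

More, slower, lower-voltage cores deliver better performance per watt than one fast core, provided the workload actually parallelizes.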
The Memory Wall

[Figure: relative performance versus time, log scale. CPU performance (“Moore’s Law”) grows ~60%/year while DRAM performance grows ~7%/year, so the processor-memory performance gap grows ~50% per year.]

ECE 4100/6100 (19)




The Memory Wall

[Figure: average memory access time rising year over year as the processor-memory gap widens.]

• Increasing the number of cores increases the demanded memory bandwidth
• What architectural techniques can meet this demand?

ECE 4100/6100 (20)
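A back-of-the-envelope calculation shows why bandwidth demand grows with core count; every number below is hypothetical.

```python
cores = 8
freq_hz = 2.0e9          # 2 GHz clock
mem_ops_per_cycle = 0.3  # fraction of issue slots that are memory accesses
miss_rate = 0.02         # fraction of accesses that miss all on-chip caches
line_bytes = 64          # bytes fetched from DRAM per miss

demand_gb_s = cores * freq_hz * mem_ops_per_cycle * miss_rate * line_bytes / 1e9
print(f"demanded DRAM bandwidth: {demand_gb_s:.1f} GB/s")
# Doubling the core count doubles the demand; DRAM bandwidth does not double for free.
```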
The ILP Wall

• Limiting phenomena for ILP extraction:
  – Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
  – Instruction fetch and decode: at the wall, more instructions cannot be fetched and decoded per clock cycle
  – Cache hit rate: poor locality can limit ILP, and it adversely affects memory bandwidth
  – ILP in applications: the serial fraction of applications
• Reality:
  – Limit studies cap IPC at 100-400 (using an ideal processor)
  – Current processors achieve an IPC of only 2-8/thread

ECE 4100/6100 (17)
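The "serial fraction" limit above is Amdahl's law. A quick illustration with a hypothetical 10% serial fraction:

```python
def amdahl_speedup(serial_fraction, n):
    # Speedup from n-way parallelism when a fixed fraction of the work stays serial.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# With 10% serial work, even unbounded parallelism (or ILP) caps speedup at 10x:
for n in (2, 8, 100, 10**9):
    print(n, round(amdahl_speedup(0.10, n), 2))
```

The same arithmetic caps ILP extraction: no amount of issue width helps the serial portion of the instruction stream.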




The ILP Wall: Options

• Increase granularity of parallelism
  – Simultaneous multithreading to exploit TLP
    o TLP has to exist; otherwise poor utilization results
  – Coarse-grain multithreading
  – Throughput computing

• New languages/applications
  – Data-intensive computing in the enterprise
  – Media-rich applications

ECE 4100/6100 (18)
Part 2:
GPU Architecture
GPU Evolution - Hardware

1995:       NV1              1 million transistors
1999:       GeForce 256      22 million transistors
2002:       GeForce4         63 million transistors
2003:       GeForce FX       130 million transistors
2004:       GeForce 6        222 million transistors
2005:       GeForce 7        302 million transistors
2006-2007:  GeForce 8        754 million transistors
2008:       GeForce GTX 200  1.4 billion transistors

Beyond Programmable Shading: In Action
GPU Architectures: Past/Present/Future

• 1995: Z-buffered triangles
• Riva 128: 1998: textured triangles
• NV10: 1999: fixed-function transformed, shaded triangles
• NV20: 2001: FFX triangles with combiners at pixels
• NV30: 2002: programmable vertex and pixel shaders (!)
• NV50: 2006: unified shaders, CUDA
  – Global illumination, physics, ray tracing, AI
• future???: extrapolate trajectory
  – Trajectory == Extension + Unification

© NVIDIA Corporation 2007
[Figure: the same scene rendered with no lighting, per-vertex lighting, and per-pixel lighting. Unreal © Epic. Copyright © NVIDIA Corporation 2006]
The Classic Graphics Hardware

Vertex Shader (programmable): transform and project vertices
Triangle Setup (fixed function): combine vertices into triangles, convert to fragments
Fragment Shader (programmable): texture-map fragments from the texture maps, light them
Fragment Blender (configurable): Z-cull, alpha blend
Frame-Buffer(s): store the final image

The classic GPU pipeline mixes programmable, configurable, and fixed-function stages.
Modern Graphics Hardware

• Pipelining: number of stages (work passes through stages 1, 2, 3 in sequence)
• Parallelism: number of parallel processes (stages replicated side by side)
• Parallelism + pipelining: number of parallel pipelines
Modern GPUs: Unified Design




     Vertex shaders, pixel shaders, etc. become threads
        running different programs on a flexible core
Why unify?

[Figure: a non-unified pipeline with separate vertex-shader and pixel-shader hardware. Under a heavy geometry workload the pixel-shader hardware sits idle (Perf = 4); under a heavy pixel workload the vertex-shader hardware sits idle (Perf = 8).]

© NVIDIA Corporation 2007
Why unify?

[Figure: a unified shader array divides itself dynamically between vertex and pixel work. Both the heavy-geometry and the heavy-pixel workload achieve Perf = 11.]

© NVIDIA Corporation 2007
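The effect behind these slides can be reproduced with a toy load-balancing model; the unit counts and workload ratios here are hypothetical, chosen only to show why the unified design wins.

```python
def fixed_partition_perf(vertex_work, pixel_work, vertex_units, pixel_units):
    # Throughput is limited by the more overloaded of the two fixed pools;
    # the other pool's spare units sit idle.
    return min(vertex_units / vertex_work, pixel_units / pixel_work)

def unified_perf(vertex_work, pixel_work, total_units):
    # A unified pool splits itself in proportion to demand, so nothing idles.
    return total_units / (vertex_work + pixel_work)

# Heavy-geometry frame: 3 units of vertex work per 1 unit of pixel work.
print(fixed_partition_perf(3, 1, 4, 4))  # vertex pool is the bottleneck
print(unified_perf(3, 1, 8))             # unified pool does strictly better
```

Whenever the workload mix departs from the hardware's fixed split, the unified pool wins; it can only tie, never lose.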
GeForce 8: Modern GPU Architecture

[Figure: GeForce 8 block diagram. Work arrives from the host through the input assembler and the setup/rasterizer; vertex, geometry, and pixel thread-issue units feed a thread processor that schedules work across an array of streaming processors (SP). The SPs are grouped into clusters, each with a texture fetch (TF) unit and an L1 cache; multiple L2 cache slices front the framebuffer partitions.]

Beyond Programmable Shading: In Action
Hardware Implementation: A Set of SIMD Multiprocessors

• The device is a set of multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture: processors 1 through M share a single instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size

© NVIDIA Corporation 2007
Goal: Performance per millimeter

• For GPUs, performance == throughput

• Strategy: hide latency with computation, not cache
  – Heavy multithreading!

• Implication: need many threads to hide latency
  – Occupancy: typically prefer 128 or more threads/TPA
  – Multiple thread blocks/TPA help minimize the effect of barriers

• Strategy: Single Instruction Multiple Thread (SIMT)
  – Support the SPMD programming model
  – Balance performance with ease of programming

Beyond Programmable Shading: In Action
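Why "many threads"? A rough latency-hiding calculation, using the deck's ~200-cycle DRAM figure and an assumed per-warp issue interval (both numbers illustrative):

```python
memory_latency = 200   # cycles for a DRAM access (ballpark from the memory-wall slides)
warp_size = 32
issue_interval = 4     # assumed cycles between successive instructions of one warp

# Warps needed so the scheduler always has runnable work while other warps
# wait on memory, and the thread count that implies:
warps_needed = memory_latency // issue_interval
threads_needed = warps_needed * warp_size
print(warps_needed, threads_needed)
```

The exact numbers depend on the architecture, but the shape of the argument is fixed: the longer the latency to cover, the more resident threads the scheduler needs.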
SIMT Thread Execution

  • High-level description of SIMT:

        – Launch zillions of threads

        – When they do the same thing, hardware makes
          them go fast

        – When they do different things, hardware handles
          it gracefully

Beyond Programmable Shading: In Action
SIMT Thread Execution
• Groups of 32 threads formed into warps
     – always executing same instruction
     – some become inactive when code path diverges
     – hardware automatically handles divergence


• Warps are the primitive unit of scheduling
     – pick 1 of 32 warps for each instruction slot
     – Note warps may be running different programs/shaders!


• SIMT execution is an implementation choice
     – sharing control logic leaves more space for ALUs
     – largely invisible to programmer
     – must understand for performance, not correctness

 Beyond Programmable Shading: In Action
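The divergence cost is easy to model. This sketch (warp_passes is a made-up helper, not a CUDA API) only counts how many times a warp must pass over a branch; real hardware tracks this with a reconvergence stack and per-lane masks.

```python
def warp_passes(lane_conditions):
    # A warp serializes over the distinct code paths its lanes take,
    # masking off inactive lanes on each pass; cost = number of passes.
    return len(set(lane_conditions))

uniform = [True] * 32                    # all 32 lanes agree: one pass
split = [i % 2 == 0 for i in range(32)]  # lanes diverge: both paths execute

print(warp_passes(uniform), warp_passes(split))
```

This is the performance (not correctness) concern the slide mentions: divergent warps run every taken path, so half the lanes idle on each pass in the split case.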
GPU Architecture: Trends
• Long history of ever-increasing programmability
     – Culminating today in CUDA: program GPU directly in C


• Graphics pipeline, APIs are abstractions
     – CUDA + graphics enable “replumbing” the pipeline


• Future: continue adding expressiveness, flexibility
     – CUDA, OpenCL, DX11 Compute Shader, ...
     – Lower barrier further between compute and graphics




 Beyond Programmable Shading: In Action
CPU/GPU Parallelism

Moore’s Law gives you more and more transistors. What do you want to do with them?

• CPU strategy: make the workload (one compute thread) run as fast as possible
  Tactics:
  – Cache (area limiting)
  – Instruction/data prefetch
  – Speculative execution
  Limited by “perimeter”: communication bandwidth
  …then add task parallelism: multi-core

• GPU strategy: make the workload (as many threads as possible) run as fast as possible
  Tactics:
  – Parallelism (1000s of threads)
  – Pipelining
  Limited by “area”: compute capability

© NVIDIA Corporation 2007
GPU Architecture

• Massively Parallel
  – 1000s of processors (today)
• Power Efficient
  – Fixed-function hardware = area- & power-efficient
  – Lack of speculation; more processing, less leaky cache
• Latency Tolerant from Day 1
• Memory Bandwidth
  – Saturates 512 bits of exotic DRAMs all day long (140 GB/sec today)
  – No end in sight for effective memory bandwidth
• Commercially Viable Parallelism
  – Largest installed base of massively parallel (N>4) processors
  – Using CUDA, not just as graphics
• Not dependent on large caches for performance
  – Computing power = Freq × Transistors
  – Moore’s law ^2

© NVIDIA Corporation 2007
GPU Architecture: Summary

• From fixed function to configurable to programmable
  – architecture now centers on a flexible processor core

• Goal: performance / mm² (perf == throughput)
  – architecture uses heavy multithreading

• Goal: balance performance with ease of use
  – SIMT: hardware-managed parallel thread execution

Beyond Programmable Shading: In Action

 

Kürzlich hochgeladen

fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 

Kürzlich hochgeladen (20)

fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 

Tao zhang

  • 1. Introduction to Multicore Architecture. Tao Zhang, Oct. 21, 2010
  • 2. Overview  Part 1: General multicore architecture  Part 2: GPU architecture
  • 4. Uniprocessor Performance (SPECint)
    [Chart: performance vs. VAX-11/780 (log scale), 1978-2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. A ~3X gap opens after 2002.]
    • VAX: 25%/year, 1978 to 1986
    • RISC + x86: 52%/year, 1986 to 2002
    • RISC + x86: ??%/year, 2002 to present
    ⇒ Sea change in chip design: multiple “cores” or processors per chip
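The growth rates on this chart compound dramatically; a quick Python check (the rates are from the slide, the arithmetic is mine):

```python
# Compound the annual performance growth rates quoted on the slide.
def growth(rate_per_year, years):
    """Cumulative performance factor after `years` of compound growth."""
    return (1.0 + rate_per_year) ** years

vax_era  = growth(0.25, 1986 - 1978)   # 25%/yr, 1978-1986
risc_era = growth(0.52, 2002 - 1986)   # 52%/yr, 1986-2002

print(f"1978-1986 at 25%/yr: {vax_era:.1f}x")   # ~6x over the VAX era
print(f"1986-2002 at 52%/yr: {risc_era:.0f}x")  # ~800x over the RISC/x86 era
```

The ~800x figure for 1986-2002 is why the flattening after 2002 was such a shock to the industry.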
  • 5. Conventional Wisdom (CW) in Computer Architecture
     Old CW: Chips reliable internally, errors at pins
     New CW: ≤65 nm ⇒ high soft & hard error rates
     Old CW: Demonstrate new ideas by building chips
     New CW: Mask costs, ECAD costs, GHz clock rates ⇒ researchers can’t build believable prototypes
     Old CW: Innovate via compiler optimizations + architecture
     New CW: Takes > 10 years before a new optimization at a leading conference gets into production compilers
     Old CW: Hardware is hard to change, software is flexible
     New CW: Hardware is flexible, software is hard to change
  • 6. Conventional Wisdom (CW) in Computer Architecture
     Old CW: Power is free, transistors expensive
     New CW: “Power wall”: power expensive, transistors free (can put more on a chip than can afford to turn on)
     Old CW: Multiplies are slow, memory access is fast
     New CW: “Memory wall”: memory slow, multiplies fast (200 clocks to DRAM, 4 clocks for an FP multiply)
     Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
     New CW: “ILP wall”: diminishing returns on more ILP
     New CW: Power wall + memory wall + ILP wall = brick wall
     Old CW: Uniprocessor performance 2X / 1.5 yrs
     New CW: Uniprocessor performance only 2X / 5 yrs?
  • 7. The Memory Wall / The Power Wall
     On-die caches are both area-intensive and power-intensive: StrongARM dissipates more than 43% of its power in caches, and caches incur huge area costs (ECE 4100/6100)
     The power equation: P = C·Vdd²·f + Vdd·Ist + Vdd·Ileak
     Power per transistor scales with frequency but also scales with Vdd; lower Vdd can be compensated for with increased pipelining to keep throughput constant
     Power per transistor is not the same as power per area: power density is the problem!
     Multiple units can be run at lower frequencies to keep throughput constant, while saving power
  • 8. The Current Power Trend / Improving Power/Performance
    [Chart: power density (W/cm²) vs. year, 1970-2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6; the trend passes “hot plate” and heads toward “nuclear reactor”, “rocket nozzle”, and “Sun’s surface”. Source: Intel Corp.]
     P = C·Vdd²·f + Vdd·Ist + Vdd·Ileak
     Consider a constant die size and decreasing core area each generation = more cores/chip
     Lowering voltage and frequency ⇒ power reduction; increasing cores/chip ⇒ performance increase. Better power/performance!
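The slide's argument can be checked numerically with the dynamic term of the power equation, P ≈ C·Vdd²·f. The voltage and frequency values below are illustrative assumptions, not figures from the slide:

```python
# Dynamic-power term of the slide's equation: P ~ C * Vdd^2 * f.
def dynamic_power(c, vdd, f):
    return c * vdd**2 * f

base = dynamic_power(c=1.0, vdd=1.0, f=1.0)   # one core at full speed

# Two cores at half frequency deliver the same aggregate throughput.
# Assume (illustratively) that Vdd can drop ~30% at the lower frequency:
multi = 2 * dynamic_power(c=1.0, vdd=0.7, f=0.5)

print(f"single-core power: {base:.2f}")   # 1.00
print(f"dual-core power:   {multi:.2f}")  # ~0.49: same throughput, half power
```

The quadratic dependence on Vdd is what makes many slow cores a better power/performance deal than one fast core.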
  • 9. The Memory Wall
    [Chart: processor performance grows 60%/yr (“Moore’s Law”) while DRAM latency improves 7%/yr; the processor-memory performance gap grows ~50%/year. (ECE 4100/6100)]
     What is the average access time? In what year does it become intolerable?
     Increasing the number of cores increases the demanded memory bandwidth
     What architectural techniques can meet this demand?
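The "gap grows 50%/year" annotation follows directly from the two trend lines; a quick check:

```python
# Processor performance grows 60%/yr, DRAM improves 7%/yr (from the slide).
cpu_rate, dram_rate = 0.60, 0.07

# The ratio of the two trends is the annual growth of the gap:
gap_per_year = (1 + cpu_rate) / (1 + dram_rate) - 1
print(f"gap growth: {gap_per_year:.1%} per year")   # ~49.5%/year

# Compounded over a decade, the gap becomes enormous:
print(f"10-year gap: {(1 + gap_per_year) ** 10:.0f}x")
```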
  • 12. The ILP Wall
     Limiting phenomena for ILP extraction:
      • Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
      • Instruction fetch and decode: at the wall, no more instructions can be fetched and decoded per clock cycle
      • Cache hit rate: poor locality can limit ILP, and it adversely affects memory bandwidth
      • ILP in applications: the serial fraction of applications
     Reality: limit studies cap IPC at 100-400 (using an ideal processor), yet current processors achieve an IPC of only 2-8 per thread
     Options: increase the granularity of parallelism
      • Simultaneous multithreading to exploit TLP (TLP has to exist, otherwise poor utilization results)
      • Coarse-grain multithreading; throughput computing
     New languages/applications: data-intensive computing in the enterprise; media-rich applications
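The "serial fraction" limit on the slide is Amdahl's law; a short sketch (the 5% serial fraction is an illustrative assumption):

```python
# Amdahl's law: the serial fraction of a program caps achievable speedup,
# no matter how much ILP (or how many cores) the hardware can extract.
def amdahl_speedup(serial_fraction, n):
    """Speedup with n-way parallelism on the parallelizable part."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Even with effectively infinite parallel resources,
# 5% serial code caps the speedup at 20x:
for n in (2, 8, 64, 10**9):
    print(f"n = {n:>10}: speedup = {amdahl_speedup(0.05, n):.1f}x")
```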
  • 14. GPU Evolution - Hardware
    1995: NV1 (1 million transistors); 1999: GeForce 256 (22M); 2002: GeForce4 (63M); 2003: GeForce FX (130M); 2004: GeForce 6 (222M); 2005: GeForce 7 (302M); 2006-2007: GeForce 8 (754M); 2008: GeForce GTX 200 (1.4 billion transistors)
    (Beyond Programmable Shading: In Action)
  • 15. GPU Architectures: Past/Present/Future
     1995: Z-buffered triangles
     Riva 128, 1998: textured triangles
     NV10, 1999: fixed-function transformed, shaded triangles
     NV20, 2001: FFX triangles with combiners at pixels
     NV30, 2002: programmable vertex and pixel shaders (!)
     NV50, 2006: unified shaders, CUDA
     Future???: global illumination, physics, ray tracing, AI. Extrapolate the trajectory: trajectory == extension + unification
    (© NVIDIA Corporation 2007)
  • 16. [Images: the same scene with no lighting, per-vertex lighting, and per-pixel lighting. Unreal © Epic]
  • 17. The Classic Graphics Hardware
    Pipeline: Vertex Shader (transform, project, light; programmable) → Triangle Setup (combine vertices into a triangle, convert to fragments; fixed) → Fragment Shader (texture-map fragments, Z-cull; programmable, reads Texture Maps) → Fragment Blender (alpha blend; configurable) → Frame-Buffer(s)
  • 18. Modern Graphics Hardware
     Pipelining: the number of stages
     Parallelism: the number of parallel processes
     Parallelism + pipelining: the number of parallel pipelines
  • 19. Modern GPUs: Unified Design Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core
  • 20. Why unify? With separate shader stages, one side idles: a heavy geometry workload leaves pixel-shader hardware idle (perf = 4); a heavy pixel workload leaves vertex-shader hardware idle (perf = 8)
  • 21. Why unify? With a unified shader, the same hardware runs both vertex and pixel work, so a heavy geometry workload and a heavy pixel workload each achieve perf = 11
  • 22. GeForce 8: Modern GPU Architecture
    [Diagram: Host → Input Assembler → Setup & Rasterize; vertex, geometry, and pixel thread issue feed an array of streaming processors (SP) grouped with texture filters (TF) and L1 caches, backed by L2 caches and framebuffer partitions; a Thread Processor manages the threads]
  • 23. Hardware Implementation: A Set of SIMD Multiprocessors
     The device is a set of multiprocessors
     Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture: one instruction unit drives Processor 1 ... Processor M
     At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
     The number of threads in a warp is the warp size
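The lockstep execution model on this slide can be sketched in a few lines of Python (the function names here are illustrative, not a real GPU API):

```python
# Toy model of SIMD warp execution: one instruction per cycle is broadcast
# to all threads of a 32-thread warp, each operating on its own registers.
WARP_SIZE = 32

def execute_warp(instruction, registers):
    """Apply the same instruction to every thread's register in lockstep."""
    return [instruction(r) for r in registers]

# 32 threads, each with its own register value; one shared instruction stream:
regs = list(range(WARP_SIZE))
regs = execute_warp(lambda r: r * 2, regs)   # cycle 1: every thread doubles
regs = execute_warp(lambda r: r + 1, regs)   # cycle 2: every thread adds 1

print(regs[:4])  # [1, 3, 5, 7]
```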
  • 24. Goal: Performance per Millimeter
     For GPUs, performance == throughput
     Strategy: hide latency with computation, not cache ⇒ heavy multithreading!
     Implication: need many threads to hide latency (occupancy); typically prefer 128 or more threads/TPA; multiple thread blocks/TPA help minimize the effect of barriers
     Strategy: Single Instruction Multiple Thread (SIMT); supports the SPMD programming model; balances performance with ease of programming
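The "need many threads to hide latency" rule is Little's law applied to memory; the numbers below are illustrative assumptions, not from the slide:

```python
# Little's law: concurrency = latency * throughput. To keep the ALUs busy
# while loads are in flight, the machine needs that many runnable threads.
memory_latency_cycles = 400   # assumed round-trip latency to DRAM
ops_issued_per_cycle  = 8     # assumed ALU issue rate to keep saturated

# In-flight operations (roughly, threads) needed so the ALUs never stall:
threads_needed = memory_latency_cycles * ops_issued_per_cycle
print(threads_needed)  # 3200
```

This is why a GPU wants thousands of resident threads where a CPU wants one fast one.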
  • 25. SIMT Thread Execution
     High-level description of SIMT: launch zillions of threads; when they do the same thing, the hardware makes them go fast; when they do different things, the hardware handles it gracefully
    (Beyond Programmable Shading: In Action)
  • 26. SIMT Thread Execution
     Groups of 32 threads are formed into warps: always executing the same instruction; some become inactive when code paths diverge; the hardware automatically handles divergence
     Warps are the primitive unit of scheduling: pick 1 of 32 warps for each instruction slot; note that warps may be running different programs/shaders!
     SIMT execution is an implementation choice: sharing control logic leaves more space for ALUs; it is largely invisible to the programmer, but must be understood for performance, not correctness
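The divergence cost described here can be modeled as serializing the taken paths of a branch (a toy model of the behavior, not the real scheduler):

```python
# Toy model of SIMT divergence: if any thread of a warp takes a path,
# the whole warp spends cycles on that path (inactive threads are masked).
def warp_branch_cycles(conditions, then_cost, else_cost):
    """Cycles for one warp to execute an if/else with per-thread conditions."""
    runs_then = any(conditions)           # at least one thread takes 'then'
    runs_else = not all(conditions)       # at least one thread takes 'else'
    return then_cost * runs_then + else_cost * runs_else

uniform   = warp_branch_cycles([True] * 32, 10, 10)              # all agree
divergent = warp_branch_cycles([True] * 16 + [False] * 16, 10, 10)

print(uniform, divergent)  # 10 20: a divergent warp pays for both paths
```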
  • 27. GPU Architecture: Trends
     Long history of ever-increasing programmability, culminating today in CUDA: program the GPU directly in C
     The graphics pipeline and APIs are abstractions; CUDA + graphics enable “replumbing” the pipeline
     Future: continue adding expressiveness and flexibility (CUDA, OpenCL, DX11 Compute Shader, ...); lower the barrier further between compute and graphics
  • 28. CPU/GPU Parallelism
    Moore’s Law gives you more and more transistors. What do you want to do with them?
     CPU strategy: make the workload (one compute thread) run as fast as possible. Tactics: cache (area-limiting), instruction/data prefetch, speculative execution. Limited by “perimeter”: communication bandwidth. ...then add task parallelism: multi-core
     GPU strategy: make the workload (as many threads as possible) run as fast as possible. Tactics: parallelism (1000s of threads), pipelining. Limited by “area”: compute capability
  • 29. GPU Architecture
     Massively parallel: 1000s of processors (today)
     Power efficient: fixed-function hardware = area- and power-efficient; lack of speculation; more processing, less leaky cache
     Latency tolerant from day 1
     Memory bandwidth: saturate 512 bits of exotic DRAMs all day long (140 GB/sec today); no end in sight for effective memory bandwidth
     Commercially viable parallelism: largest installed base of massively parallel (N>4) processors, using CUDA!!! Not just for graphics; not dependent on large caches for performance
     Computing power = Freq * Transistors ⇒ Moore’s law ^2
  • 30. GPU Architecture: Summary
     From fixed-function to configurable to programmable: the architecture now centers on a flexible processor core
     Goal: performance / mm² (perf == throughput): the architecture uses heavy multithreading
     Goal: balance performance with ease of use: SIMT, hardware-managed parallel thread execution