NVIDIA GPU Architecture:
From Fermi to Kepler
Ofer Rosenberg
Jan 21st 2013
Scope
   This presentation covers the main features of
    Fermi, Fermi refresh & Kepler architectures



   The overview is from a compute perspective, so graphics
    features are not discussed
      PolyMorph Engine, Raster, ROPs, etc.
Quick Numbers
                  GTX 480       GTX 580       GTX 680
Architecture      GF100         GF110         GK104
SM / SMX          15            16            8
CUDA cores        480           512           1536
Core frequency    700 MHz       772 MHz       1006 MHz
Compute power     1345 GFLOPS   1581 GFLOPS   3090 GFLOPS
Memory BW         177.4 GB/s    192.2 GB/s    192.2 GB/s
Transistors       3.2B          3.0B          3.5B
Process           40 nm         40 nm         28 nm
Power (TDP)       250W          244W          195W
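
As a sanity check (ours, not the deck's), the compute-power row follows from the standard peak-FLOPS formula, counting one FMA as 2 FLOPs per core per ALU clock; on Fermi the ALU clock is twice the listed core clock, on Kepler it equals it:

```latex
\mathrm{GFLOPS}_{\text{peak}} = 2 \times N_{\text{cores}} \times f_{\text{ALU}}
% GTX 480: 2 * 480  * 1.401 GHz ~ 1345   (f_ALU = 2 x 700 MHz core clock)
% GTX 580: 2 * 512  * 1.544 GHz ~ 1581   (f_ALU = 2 x 772 MHz core clock)
% GTX 680: 2 * 1536 * 1.006 GHz ~ 3090   (f_ALU = 1006 MHz, unified clock)
```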
GF100 SM
   SM – Streaming Multiprocessor

   32 “CUDA cores”, organized into two clusters, 16 cores each

   A warp is 32 threads – a 16-core cluster takes two cycles to complete a warp
       NVIDIA’s solution: the ALU clock is double the core clock

   4 SFUs (accelerate transcendental functions)

   16 Load / Store units

   Dual Warp scheduler – execute two warps concurrently
       Note bottlenecks on LD/ST & SFU – architecture decision

   Each SM can hold up to 48 warps, divided into up to 8 blocks
       Holds “in-flight” warps to hide latency

       Typically the number of resident blocks is lower.

       For example, 24 warps per block (768 threads) = 2 blocks per SM – see the sketch below
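
A minimal sketch (ours, not the deck's; the kernel and sizes are hypothetical) of how the block-size choice drives this arithmetic:

```cuda
#include <cuda_runtime.h>

// Trivial placeholder kernel; only the launch shape matters here.
__global__ void dummyKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 64 * 768 * sizeof(float));

    // 768 threads per block = 24 warps per block.
    // GF100 limits: 48 resident warps and 8 resident blocks per SM.
    // 48 / 24 = 2 resident blocks per SM, so the warp limit (not the
    // 8-block limit) is what caps occupancy for this configuration.
    dummyKernel<<<64, 768>>>(d_out);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```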
Packing it all together
   GPC – Graphic Processing Cluster
     Four SMs
      Transparent to compute workloads
Packing it all together
   Four GPCs
   768KB of L2 shared between the SMs
       Supports L2-only or L1&L2 caching

   384-bit GDDR5
   GigaThread Scheduler
       Schedules thread blocks to SMs
       Concurrent Kernel Execution – separate kernels per SM
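
A minimal sketch of how concurrent kernel execution is exposed to the programmer: kernels launched into distinct CUDA streams are eligible to run at the same time on different SMs (the kernels and buffer names here are hypothetical placeholders, not from the deck):

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; the deck names none.
__global__ void kernelA(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *b) { b[threadIdx.x] *= 2.0f; }

int main() {
    float *dA, *dB;
    cudaMalloc(&dA, 256 * sizeof(float));
    cudaMalloc(&dB, 256 * sizeof(float));
    cudaMemset(dA, 0, 256 * sizeof(float));
    cudaMemset(dB, 0, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels in different streams are eligible to run concurrently;
    // the GigaThread scheduler can place them on separate SMs.
    kernelA<<<1, 256, 0, s0>>>(dA);
    kernelB<<<1, 256, 0, s1>>>(dB);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```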
Fermi GF104 SM
Changes from GF100 SM:

   48 “CUDA cores”, organized into three clusters of 16 cores
    each

   8 SFUs instead of 4

   Rest remains the same (32K 32-bit registers, 64K L1/Shared,
    etc.)



   Wait a sec… three clusters, but still scheduling only two warps?

   An under-utilization study of GF100 led to a scheduling redesign –
    next slide…
Instruction Level Parallelism (ILP)




GF100
   Two warp schedulers feed two clusters of cores
   Memory access or SFU access leads to under-utilization of a core cluster

GF104
   Adopts the ILP idea from the CPU world – issue two instructions per clock
   Adds a third cluster for balanced utilization
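
A contrived sketch (ours) of the kind of code that benefits: the two multiplies below are independent, so GF104's dual dispatch units can issue them from the same warp in one clock instead of leaving the third cluster idle:

```cuda
// Each thread handles two elements whose computations are independent,
// so the scheduler can dual-issue the two multiplies in one clock.
__global__ void ilpExample(const float *in, float *out) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;

    float x = in[i]     * 2.0f;  // independent of y
    float y = in[i + 1] * 3.0f;  // independent of x

    out[i]     = x;
    out[i + 1] = y;
}
```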
Meet GK104 SMX
   192 “CUDA Cores”

   Organized into 6 clusters of 32 cores each
       No more “dual clocked ALU”

   16 Load/Store units

   16 SFUs

   64K 32-bit registers

   Same 64K L1/Shared

   Same dual-issued Warp scheduling:
       Execute 4 warps concurrently

       Issue two instructions per cycle

   Each SMX can hold up to 64 warps, divided into up to 16 blocks
From GF104 to GK104
   Looking at half of an SMX (diagram: SM and SMX side by side)

   Same:
       Two warp schedulers
       Two dispatch units per scheduler
       32K register file
       6 rows of cores
       1 row of load/store
       1 row of SFU

   Different:
       On SMX, a row of cores is 16 wide vs. 8 on SM
       On SMX, a row of SFUs is 16 wide vs. 8 on SM
Packing it all together
   Four GPCs, each has two SMXs

   512KB of L2 shared between the SMXs
     L1 is no longer used for CUDA

   256-bit GDDR5

   GigaThread Scheduler
     Dynamic Parallelism
GK104 vs. GF104
   Kepler has fewer “multiprocessors”
     8 vs. 16
     Less flexible at executing different kernels concurrently

   Each “multiprocessor” is stronger
     Issue twice the warps (6 vs. 3)

     Twice the register file
     Execute warp in a single cycle

     More SFUs
     10x Faster atomic operations

   But:
     SMX holds 64 warps vs. 48 for an SM – less latency hiding per core cluster

     L1/Shared Memory stayed the same size – and L1 is bypassed entirely in CUDA/OpenCL (see the sketch below)
     Memory BW did not scale as compute/cores did (192 GB/s, same as GF110)
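
Because GK104 serves global loads from L2 only, on-chip reuse has to be staged by hand. A minimal sketch of the usual workaround, assuming a 1-D averaging stencil and a block size of 256 (our example, not the deck's):

```cuda
// Launch as: stencil<<<(n + 255) / 256, 256>>>(in, out, n);
__global__ void stencil(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];   // block size 256 plus a one-element halo

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index, offset past halo

    // Stage through shared memory: on GK104 this is the only on-chip
    // storage that CUDA global loads can exploit for reuse.
    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```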
GK110 SMX
   Tesla only (no GeForce
    version)
   Very similar to GK104 SMX
   Additional Double-Precision
    units, otherwise the same
GK110




   Production versions ship with 13 or 14 SMXs (not the full 15)
   Improved device-level scheduling (next slides):
     HyperQ
     Dynamic Parallelism
Improved scheduling 1 - HyperQ
   Scenario: multiple CPU processes send work to the GPU

   On Fermi, time division between processes

   On Kepler, simultaneous processing from multiple processes
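
A hedged sketch of the pattern HyperQ helps: many streams launched from one process. On Fermi all streams funnel through a single hardware work queue and may falsely serialize; GK110 provides 32 queues (kernel and buffer names are ours):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *p) { p[threadIdx.x] += 1.0f; }

int main() {
    const int NSTREAMS = 8;
    cudaStream_t streams[NSTREAMS];
    float *buf[NSTREAMS];

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 64 * sizeof(float));
        cudaMemsetAsync(buf[i], 0, 64 * sizeof(float), streams[i]);
        // On Fermi these launches share one hardware work queue and can
        // falsely serialize; with HyperQ each stream gets its own queue.
        work<<<1, 64, 0, streams[i]>>>(buf[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(buf[i]);
    }
    return 0;
}
```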
Improved scheduling 2
   A new age in GPU programmability:

       moving from a master-slave pattern to self-feeding
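
The "self-feeding" mechanism is CUDA Dynamic Parallelism: a running kernel launches further kernels without a round trip to the host. A minimal sketch (our example; requires compute capability 3.5, i.e. GK110, and relocatable device code):

```cuda
// Compile with: nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt
#include <cuda_runtime.h>

__global__ void child(float *data, int n) {
    if (threadIdx.x < n) data[threadIdx.x] *= 2.0f;
}

__global__ void parent(float *data, int n) {
    // The GPU feeds itself: a running kernel launches another kernel,
    // with no round trip through the host in between.
    if (threadIdx.x == 0)
        child<<<1, n>>>(data, n);
}

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));
    parent<<<1, 32>>>(d, 256);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```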
Questions?
