Graphics Processing Unit (GPU) Architecture and Programming TU/e 5kk73 Zhenyu Ye Bart Mesman Henk Corporaal 2010-11-08
Today's Topics: GPU architecture; CUDA programming; GPU micro-architecture; performance optimization; trends.
System Architecture
GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)
What Can It Do? Render triangles. NVIDIA GTX480 can render 1.6 billion triangles per second!
General-Purpose Computing ref:  http://www.nvidia.com/object/tesla_computing_solutions.html
The Vision of NVIDIA
Single-Chip GPU vs. Fastest Supercomputers ref:  http://www.llnl.gov/str/JanFeb05/Seager.html
Top500 Super Computer in June 2010
GPU Will Top the List in Nov 2010
The Gap Between CPU and GPU ref:  Tesla GPU Computing Brochure
GPU Has 10x Comp Density Given the  same chip area , the  achievable performance  of a GPU is 10x higher than that of a CPU.
Evolution of Intel Pentium Pentium I Pentium II Pentium III Pentium IV Chip area breakdown Q: What can you observe? Why?
Extrapolation of Single Core CPU If we extrapolate the trend, in a few generations, Pentium will look like: Of course, we know it did not happen.  Q: What happened instead? Why?
Evolution of Multi-core CPUs Penryn Bloomfield Gulftown Beckton Chip area breakdown Q: What can you observe? Why?
Let's Take a Closer Look Less than  10%  of total chip area is used for the real execution. Q: Why?
The Memory Hierarchy Notes on energy at 45nm:  a 64-bit integer ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ. It seems we cannot further increase the computational density.
The Brick Wall -- UC Berkeley's View
Power Wall : power expensive, transistors free
Memory Wall : memory slow, multiplies fast
ILP Wall : diminishing returns on more ILP HW
Power Wall  +  Memory Wall  +  ILP Wall  =  Brick Wall
David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007,  link
How to Break the Brick Wall? Hint: how to exploit the parallelism inside the application?
Step 1: Trade Latency for Throughput Hide the memory latency through fine-grained interleaved threading.
Interleaved Multi-threading
Fine-Grained Interleaved Threading Pros:  smaller cache, no branch predictor,  no out-of-order scheduler Cons:  register pressure, thread scheduler, requires huge parallelism Without and with fine-grained interleaved threading
HW Support Register file supports  zero overhead  context switch between interleaved threads.
Can We Make Further Improvement? Hint: We have only utilized thread-level parallelism (TLP) so far.
Step 2: Single Instruction Multiple Data CPU uses short SIMD: SSE has 4 data lanes (vector width of 4). GPU uses wide SIMD: 8/16/24/... data lanes (processing elements, PEs).
Hardware Support Supporting interleaved threading + SIMD execution
Single Instruction Multiple Thread (SIMT) Hide vector width using scalar threads.
Example of SIMT Execution Assume 32 threads are grouped into one warp.
Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply-Add (FMA) SFU: Special Function Unit
NVIDIA's Motivation of Simple Core "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)
Throughput-Oriented Architectures ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. ( link )
Today's Topics: GPU architecture; CUDA programming; GPU micro-architecture; performance optimization; trends.
CUDA Programming Massive number (>10000) of  light-weight  threads.
Express Data Parallelism in Threads
Vector Program Vector width is exposed to programmers.
CUDA Program
Two Levels of Thread Hierarchy Threads are grouped into thread blocks, and thread blocks are grouped into a grid.
Multi-dimensional Thread and Block ID Both the grid and the thread block can have a two-dimensional index.
Scheduling Thread Blocks on SM Example: Scheduling 4 thread blocks on 3 SMs.
Executing Thread Block on SM
Executed on a machine with SIMD width of 4:
Executed on a machine with SIMD width of 8:
Note: the number of Processing Elements (PEs) is transparent to the programmer.
Multiple Levels of Memory Hierarchy

Name      Cached?   Latency (cycles)          Access
Global    L1/L2     200~400 (on cache miss)   R/W
Shared    No        1~3                       R/W
Constant  Yes       1~3                       Read-only
Texture   Yes       ~100                      Read-only
Local     L1/L2     200~400 (on cache miss)   R/W
Explicit Management of Shared Mem Shared memory is frequently used to exploit locality.
Shared Memory and Synchronization
Example: average filter with a 3x3 window. Each thread stages one pixel of the image from DRAM into shared memory, all threads wait at the barrier until the whole tile is loaded, then the computation starts.

kernelF<<<(1,1),(16,16)>>>(A);

__global__ void kernelF(float A[16][16]){
    __shared__ float smem[16][16]; // allocate shared memory
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i][j];          // load tile into shared memory
    __syncthreads();               // every thread waits at the barrier
    A[i][j] = ( smem[i-1][j-1]     // computation starts: all loads finished
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

(Threads on the tile boundary index outside smem; boundary handling is omitted here.)
Programmers Think in Threads Q: Why make this hassle?
Why Use Threads instead of Vectors?
Features of CUDA
Today's Topics: GPU architecture; CUDA programming; GPU micro-architecture; performance optimization; trends.
Micro-architecture GF100 micro-architecture
HW Groups Threads Into Warps Example: 32 threads per warp
Example of Implementation Note: NVIDIA may use a more complicated implementation.
Example Assume  warp 0  and  warp 1  are scheduled for execution.
Read Src Op: read source operands,  r1  for warp 0 and  r4  for warp 1.
Buffer Src Op: push operands to the operand collector,  r1  for warp 0 and  r4  for warp 1.
Read Src Op: read source operands,  r2  for warp 0 and  r5  for warp 1.
Buffer Src Op: push operands to the operand collector,  r2  for warp 0 and  r5  for warp 1.
Execute: compute the  first 16 threads  in the warp.
Execute: compute the  last 16 threads  in the warp.
Write Back: write back  r0  for warp 0 and  r3  for warp 1.
Other High Performance GPUs
ATI Radeon 5000 Series Architecture
Radeon SIMD Engine
VLIW Stream Core (SC)
Local Data Share (LDS)
Today's Topics: GPU architecture; CUDA programming; GPU micro-architecture; performance optimization; trends.
Performance Optimization
Shared Mem Contains Multiple Banks
Compute Capability Need arch info to perform optimization. ref: NVIDIA, "CUDA C Programming Guide", ( link )
Shared Memory (compute capability 2.x)
Without bank conflict:
With bank conflict:
Performance Optimization
Global Memory in Off-Chip DRAM
Global Memory
Roofline Model Identify performance bottleneck:  computation bound  v.s.  bandwidth bound
Optimization Is Key for Attainable Gflops/s
Computation, Bandwidth, Latency
Today's Topics: GPU architecture; CUDA programming; GPU micro-architecture; performance optimization; trends.
Trends
Intel Many Integrated Core (MIC) 32 core version of MIC:
Intel Sandy Bridge
Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech, ( link )
AMD Llano Fusion APU (expected Q3 2011)
GPU Research in ES Group

Weitere ähnliche Inhalte

Was ist angesagt?

Graphic Processing Unit
Graphic Processing UnitGraphic Processing Unit
Graphic Processing Unit
Kamran Ashraf
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentation
Vishal Singh
 
memory Interleaving and low order interleaving and high interleaving
memory Interleaving and low order interleaving and high interleavingmemory Interleaving and low order interleaving and high interleaving
memory Interleaving and low order interleaving and high interleaving
Jawwad Rafiq
 
Production System in AI
Production System in AIProduction System in AI
Production System in AI
Bharat Bhushan
 

Was ist angesagt? (20)

Lecture 9 Markov decision process
Lecture 9 Markov decision processLecture 9 Markov decision process
Lecture 9 Markov decision process
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
Graphic Processing Unit
Graphic Processing UnitGraphic Processing Unit
Graphic Processing Unit
 
Graphic Processing Unit (GPU)
Graphic Processing Unit (GPU)Graphic Processing Unit (GPU)
Graphic Processing Unit (GPU)
 
Travelling salesman problem using genetic algorithms
Travelling salesman problem using genetic algorithms Travelling salesman problem using genetic algorithms
Travelling salesman problem using genetic algorithms
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Virtual Mouse
Virtual MouseVirtual Mouse
Virtual Mouse
 
parallel processing
parallel processingparallel processing
parallel processing
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Introduction and Application of Computer Graphics.
Introduction and Application of Computer Graphics.Introduction and Application of Computer Graphics.
Introduction and Application of Computer Graphics.
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentation
 
Embedded systems
Embedded systemsEmbedded systems
Embedded systems
 
Hopfield Networks
Hopfield NetworksHopfield Networks
Hopfield Networks
 
GPU
GPUGPU
GPU
 
Computer science seminar topics
Computer science seminar topicsComputer science seminar topics
Computer science seminar topics
 
Multi core processors
Multi core processorsMulti core processors
Multi core processors
 
memory Interleaving and low order interleaving and high interleaving
memory Interleaving and low order interleaving and high interleavingmemory Interleaving and low order interleaving and high interleaving
memory Interleaving and low order interleaving and high interleaving
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
Production System in AI
Production System in AIProduction System in AI
Production System in AI
 
AI Hardware
AI HardwareAI Hardware
AI Hardware
 

Andere mochten auch

Andere mochten auch (20)

Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel Processing
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in india
 
network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computing
 
Graphics Processing Unit - GPU
Graphics Processing Unit - GPUGraphics Processing Unit - GPU
Graphics Processing Unit - GPU
 
GRAPHICS PROCESSING UNIT (GPU)
GRAPHICS PROCESSING UNIT (GPU)GRAPHICS PROCESSING UNIT (GPU)
GRAPHICS PROCESSING UNIT (GPU)
 
NVIDIA – Inventor of the GPU
NVIDIA – Inventor of the GPUNVIDIA – Inventor of the GPU
NVIDIA – Inventor of the GPU
 
tesla home battery power wall by braj mohan
tesla home battery power wall by braj mohantesla home battery power wall by braj mohan
tesla home battery power wall by braj mohan
 
Tesla Powerwall
Tesla PowerwallTesla Powerwall
Tesla Powerwall
 
Surface Computer
Surface ComputerSurface Computer
Surface Computer
 
Racetrack
RacetrackRacetrack
Racetrack
 
surface computer ppt
surface computer pptsurface computer ppt
surface computer ppt
 
Microsoft surface by NIRAV RANA
Microsoft surface by NIRAV RANAMicrosoft surface by NIRAV RANA
Microsoft surface by NIRAV RANA
 
Solar battery storage for your home battery
Solar battery storage for your home batterySolar battery storage for your home battery
Solar battery storage for your home battery
 
Riscv 20160507-patterson
Riscv 20160507-pattersonRiscv 20160507-patterson
Riscv 20160507-patterson
 
Surface computer
Surface computerSurface computer
Surface computer
 
Surface computer
Surface computerSurface computer
Surface computer
 
Powerwall installation and user's manual online-b
Powerwall installation and user's manual online-bPowerwall installation and user's manual online-b
Powerwall installation and user's manual online-b
 
Gpu Systems
Gpu SystemsGpu Systems
Gpu Systems
 
Surface computer
Surface computerSurface computer
Surface computer
 
microsoft Surface computer
microsoft Surface computer microsoft Surface computer
microsoft Surface computer
 

Ähnlich wie Gpu and The Brick Wall

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
Sri Prasanna
 
Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...
IndicThreads
 
SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3
UniFabric
 
Lllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzj
LllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzjLllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzj
Lllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzj
ManhHoangVan
 

Ähnlich wie Gpu and The Brick Wall (20)

Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
No[1][1]
No[1][1]No[1][1]
No[1][1]
 
processors
processorsprocessors
processors
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
Nt1310 Unit 3 Computer Components
Nt1310 Unit 3 Computer ComponentsNt1310 Unit 3 Computer Components
Nt1310 Unit 3 Computer Components
 
Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™ Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™
 
Coa presentation3
Coa presentation3Coa presentation3
Coa presentation3
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Java Memory Model
Java Memory ModelJava Memory Model
Java Memory Model
 
Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...
 
SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3
 
Lllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzj
LllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzjLllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzj
Lllsjjsjsjjshshjshjsjjsjjsjjzjsjjzjjzjjzj
 
Memory model
Memory modelMemory model
Memory model
 

Mehr von ugur candan

Sap innovation forum istanbul 2012
Sap innovation forum istanbul 2012Sap innovation forum istanbul 2012
Sap innovation forum istanbul 2012
ugur candan
 
The End of an Architectural Era Michael Stonebraker
The End of an Architectural Era Michael StonebrakerThe End of an Architectural Era Michael Stonebraker
The End of an Architectural Era Michael Stonebraker
ugur candan
 
Hana Intel SAP Whitepaper
Hana Intel SAP WhitepaperHana Intel SAP Whitepaper
Hana Intel SAP Whitepaper
ugur candan
 

Mehr von ugur candan (20)

SAP AI What are examples Oct2022
SAP AI  What are examples Oct2022SAP AI  What are examples Oct2022
SAP AI What are examples Oct2022
 
CEO Agenda 2019 by Ugur Candan
CEO Agenda 2019 by Ugur CandanCEO Agenda 2019 by Ugur Candan
CEO Agenda 2019 by Ugur Candan
 
Digital transformation and SAP
Digital transformation and SAPDigital transformation and SAP
Digital transformation and SAP
 
Digital Enterprise Transformsation and SAP
Digital Enterprise Transformsation and SAPDigital Enterprise Transformsation and SAP
Digital Enterprise Transformsation and SAP
 
MOONSHOTS for in-memory computing
MOONSHOTS for in-memory computingMOONSHOTS for in-memory computing
MOONSHOTS for in-memory computing
 
WHY SAP Real Time Data Platform - RTDP
WHY SAP Real Time Data Platform - RTDPWHY SAP Real Time Data Platform - RTDP
WHY SAP Real Time Data Platform - RTDP
 
Opening Analytics Networking Event
Opening Analytics Networking EventOpening Analytics Networking Event
Opening Analytics Networking Event
 
Sap innovation forum istanbul 2012
Sap innovation forum istanbul 2012Sap innovation forum istanbul 2012
Sap innovation forum istanbul 2012
 
İş Zekasının Değişen Kuralları
İş Zekasının Değişen Kurallarıİş Zekasının Değişen Kuralları
İş Zekasının Değişen Kuralları
 
Gamification of eEducation
Gamification of eEducationGamification of eEducation
Gamification of eEducation
 
Why sap hana
Why sap hanaWhy sap hana
Why sap hana
 
The End of an Architectural Era Michael Stonebraker
The End of an Architectural Era Michael StonebrakerThe End of an Architectural Era Michael Stonebraker
The End of an Architectural Era Michael Stonebraker
 
Ramcloud
RamcloudRamcloud
Ramcloud
 
Hana Intel SAP Whitepaper
Hana Intel SAP WhitepaperHana Intel SAP Whitepaper
Hana Intel SAP Whitepaper
 
The Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing LandscapeThe Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing Landscape
 
Exadata is still oracle
Exadata is still oracleExadata is still oracle
Exadata is still oracle
 
Gerçek Gerçek Zamanlı Mimari
Gerçek Gerçek Zamanlı MimariGerçek Gerçek Zamanlı Mimari
Gerçek Gerçek Zamanlı Mimari
 
Michael stonebraker mit session
Michael stonebraker mit sessionMichael stonebraker mit session
Michael stonebraker mit session
 
Introduction to HANA in-memory from SAP
Introduction to HANA in-memory from SAPIntroduction to HANA in-memory from SAP
Introduction to HANA in-memory from SAP
 
Complex Event Prosessing
Complex Event ProsessingComplex Event Prosessing
Complex Event Prosessing
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Gpu and The Brick Wall

  • 1. Graphics Processing Unit (GPU) Architecture and Programming TU/e 5kk73 Zhenyu Ye Bart Mesman Henk Corporaal 2010-11-08
  • 2.
  • 3.
  • 5. GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)
  • 6. What Can It Do? Render triangles. NVIDIA GTX480 can render 1.6 billion triangles per second!
  • 7. General Purposed Computing ref:  http://www.nvidia.com/object/tesla_computing_solutions.html
  • 8.
  • 9. Single-Chip GPU v.s. Fastest Super Computers ref:  http://www.llnl.gov/str/JanFeb05/Seager.html
  • 10. Top500 Super Computer in June 2010
  • 11. GPU Will Top the List in Nov 2010
  • 12. The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure
  • 13. GPU Has 10x Comp Density Given the same chip area , the achievable performance of GPU is 10x higher than that of CPU.
  • 14. Evolution of Intel Pentium Pentium I Pentium II Pentium III Pentium IV Chip area breakdown Q: What can you observe? Why?
  • 15. Extrapolation of Single Core CPU If we extrapolate the trend, in a few generations, Pentium will look like: Of course, we know it did not happen.  Q: What happened instead? Why?
  • 16. Evolution of Multi-core CPUs Penryn Bloomfield Gulftown Beckton Chip area breakdown Q: What can you observe? Why?
  • 17. Let's Take a Closer Look Less than 10% of total chip area is used for the real execution. Q: Why?
  • 18. The Memory Hierarchy Notes on energy at 45nm: a 64-bit int ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ. It seems we cannot further increase the computational density.
  • 19. The Brick Wall -- UC Berkeley's View Power Wall: power expensive, transistors free Memory Wall: memory slow, multiplies fast ILP Wall: diminishing returns on more ILP HW David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
  • 20. The Brick Wall -- UC Berkeley's View Power Wall: power expensive, transistors free Memory Wall: memory slow, multiplies fast ILP Wall: diminishing returns on more ILP HW Power Wall + Memory Wall + ILP Wall = Brick Wall David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
  • 21. How to Break the Brick Wall? Hint: exploit the parallelism inside the application.
  • 22. Step 1: Trade Latency with Throughput Hide the memory latency through fine-grained interleaved threading.
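As a rough back-of-the-envelope model (not from the slides; the function name and numbers are illustrative), the thread count needed to keep one PE busy follows from covering the memory latency with other threads' compute: if each thread computes for C cycles between memory requests of latency L, about 1 + ceil(L/C) interleaved threads are needed.

```c
/* Back-of-the-envelope model (illustrative, not from the slides):
 * each thread alternates compute_cycles of work with a memory access
 * of latency_cycles; while one thread waits, the others run.  The PE
 * stays busy once the other threads' combined compute time covers
 * the latency, i.e. threads >= 1 + ceil(latency / compute). */
int threads_to_hide_latency(int latency_cycles, int compute_cycles) {
    return 1 + (latency_cycles + compute_cycles - 1) / compute_cycles;
}
```

With the ballpark numbers from the memory-hierarchy slide (a few hundred cycles of DRAM latency versus a few cycles of compute), this lands at around a hundred threads per PE, which is why this approach requires huge parallelism.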
  • 27. Fine-Grained Interleaved Threading Pros: reduced cache size, no branch predictor, no OOO scheduler Cons: register pressure, thread scheduler, requires huge parallelism Without and with fine-grained interleaved threading
  • 28. HW Support Register file supports zero overhead context switch between interleaved threads.
  • 30. Step 2: Single Instruction Multiple Data CPU uses short SIMD, usually with a vector width of 4 (SSE has 4 data lanes). GPU uses wide SIMD, with 8/16/24/... processing elements (PEs) as data lanes.
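To make the contrast concrete, here is a sketch in plain C (function names hypothetical): the inner lane loop stands in for a single 4-wide SIMD instruction such as SSE's addps, while a GPU would run the same step over 8/16/24/... lanes.

```c
/* Scalar vs. 4-wide SIMD addition, sketched in plain C.  The inner
 * lane loop models ONE SIMD instruction operating on 4 floats at
 * once; it assumes n is a multiple of LANES. */
#define LANES 4

void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)          /* one element per iteration */
        c[i] = a[i] + b[i];
}

void add_simd4(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += LANES)   /* one "instruction" per iteration */
        for (int l = 0; l < LANES; l++)  /* all 4 lanes in lockstep */
            c[i + l] = a[i + l] + b[i + l];
}
```

Both produce identical results; the SIMD version simply retires 4 elements per instruction, which is where the compute-density win comes from.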
  • 31. Hardware Support Supporting interleaved threading + SIMD execution
  • 32. Single Instruction Multiple Thread (SIMT) Hide vector width using scalar threads.
  • 33. Example of SIMT Execution Assume 32 threads are grouped into one warp.
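A minimal sketch in plain C (a simplification, not NVIDIA's actual hardware mechanism) of how a warp executes a branch: all 32 threads share one program counter, so both branch paths are executed in turn while an active mask disables the threads that did not take the current path.

```c
#include <stdint.h>

/* SIMT branch divergence, sketched (simplified; real hardware uses a
 * reconvergence stack).  Each "thread" of the warp computes
 * out[tid] = (tid < 16) ? tid*2 : tid+100, but the warp executes
 * BOTH paths, masking off the inactive threads each time. */
enum { WARP = 32 };

void warp_branch(int out[WARP]) {
    uint32_t mask = 0;
    /* evaluate the condition per thread, building the active mask */
    for (int tid = 0; tid < WARP; tid++)
        if (tid < 16) mask |= 1u << tid;
    /* "then" path: only threads with their mask bit set write */
    for (int tid = 0; tid < WARP; tid++)
        if (mask & (1u << tid)) out[tid] = tid * 2;
    /* "else" path: the mask is inverted, the other threads execute */
    for (int tid = 0; tid < WARP; tid++)
        if (~mask & (1u << tid)) out[tid] = tid + 100;
}
```

The cost is visible in the sketch: a divergent warp pays for both paths, which is why branch divergence inside a warp hurts throughput.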
  • 34. Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply Add (FMA) SFU: Special Function Unit
  • 35. NVIDIA's Motivation of Simple Core "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
  • 36. Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)
  • 39. CUDA Programming Massive number (>10000) of light-weight threads.
  • 45. Scheduling Thread Blocks on SM Example: Scheduling 4 thread blocks on 3 SMs.
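The slide's example can be sketched as a greedy scheduler that places each block on the least-loaded SM (a simplification; the real hardware block scheduler is undocumented and more involved, and the function name is hypothetical).

```c
/* Greedy sketch of thread-block scheduling (simplified): each block
 * goes to the SM with the fewest blocks so far.  With 4 blocks on
 * 3 SMs, the resulting loads are 2, 1 and 1. */
void schedule_blocks(int num_blocks, int num_sms, int load[]) {
    for (int s = 0; s < num_sms; s++) load[s] = 0;
    for (int b = 0; b < num_blocks; b++) {
        int best = 0;                     /* find the least-loaded SM */
        for (int s = 1; s < num_sms; s++)
            if (load[s] < load[best]) best = s;
        load[best]++;                     /* place the block there */
    }
}
```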
  • 47. Multiple Levels of Memory Hierarchy
     Name     | Cached? | Latency (cycles)     | Access
     ---------+---------+----------------------+----------
     Global   | L1/L2   | 200~400 (cache miss) | R/W
     Shared   | No      | 1~3                  | R/W
     Constant | Yes     | 1~3                  | Read-only
     Texture  | Yes     | ~100                 | Read-only
     Local    | L1/L2   | 200~400 (cache miss) | R/W
  • 48. Explicit Management of Shared Mem Shared memory is frequently used to exploit locality.
  • 49. Shared Memory and Synchronization Example: average filter with a 3x3 window (3x3 window on image; image data in DRAM)
     kernelF<<<dim3(1,1), dim3(16,16)>>>(A);
     __global__ void kernelF(float A[16][16]){
         __shared__ float smem[16][16]; // allocate smem
         int i = threadIdx.y;
         int j = threadIdx.x;
         smem[i][j] = A[i][j];
         __syncthreads();
         A[i][j] = ( smem[i-1][j-1]
                   + smem[i-1][j]
                   ...
                   + smem[i+1][j+1] ) / 9;
     }
  • 50. Shared Memory and Synchronization Example: average filter over a 3x3 window (3x3 window on image; stage data in shared mem)
     kernelF<<<dim3(1,1), dim3(16,16)>>>(A);
     __global__ void kernelF(float A[16][16]){
         __shared__ float smem[16][16];
         int i = threadIdx.y;
         int j = threadIdx.x;
         smem[i][j] = A[i][j]; // load to smem
         __syncthreads(); // threads wait at barrier
         A[i][j] = ( smem[i-1][j-1]
                   + smem[i-1][j]
                   ...
                   + smem[i+1][j+1] ) / 9;
     }
  • 51. Shared Memory and Synchronization Example: average filter over a 3x3 window (3x3 window on image; all threads finish the load)
     kernelF<<<dim3(1,1), dim3(16,16)>>>(A);
     __global__ void kernelF(float A[16][16]){
         __shared__ float smem[16][16];
         int i = threadIdx.y;
         int j = threadIdx.x;
         smem[i][j] = A[i][j];
         __syncthreads(); // every thread is ready
         A[i][j] = ( smem[i-1][j-1]
                   + smem[i-1][j]
                   ...
                   + smem[i+1][j+1] ) / 9;
     }
  • 52. Shared Memory and Synchronization Example: average filter over a 3x3 window (3x3 window on image; start computation)
     kernelF<<<dim3(1,1), dim3(16,16)>>>(A);
     __global__ void kernelF(float A[16][16]){
         __shared__ float smem[16][16];
         int i = threadIdx.y;
         int j = threadIdx.x;
         smem[i][j] = A[i][j];
         __syncthreads();
         A[i][j] = ( smem[i-1][j-1]
                   + smem[i-1][j]
                   ...
                   + smem[i+1][j+1] ) / 9;
     }
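For reference, a sequential C version of the same 3x3 average filter (a hypothetical helper, not from the slides; like the kernel on slides 49-52, it only computes interior pixels and leaves the image border alone).

```c
/* Sequential C version of the 3x3 average filter, for reference.
 * Interior pixels only: the slide's kernel likewise ignores the
 * out-of-bounds accesses at the border. */
enum { N = 16 };

void average3x3(float in[N][N], float out[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++) {
            float sum = 0.0f;
            for (int di = -1; di <= 1; di++)      /* 3x3 window */
                for (int dj = -1; dj <= 1; dj++)
                    sum += in[i + di][j + dj];
            out[i][j] = sum / 9.0f;
        }
}
```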
  • 53. Programmers Think in Threads Q: Why go through all this hassle?
  • 58. HW Groups Threads Into Warps Example: 32 threads per warp
  • 59. Example of Implementation Note: NVIDIA may use a more complicated implementation.
  • 69. ATI Radeon 5000 Series Architecture
  • 76. Shared Mem Contains Multiple Banks
  • 77. Compute Capability Need arch info to perform optimization. ref: NVIDIA, "CUDA C Programming Guide", ( link )
  • 78. Shared Memory (compute capability 2.x) without bank conflict: with bank conflict:
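The bank rule can be sketched in C: on compute capability 2.x, shared memory has 32 banks of 4-byte words, successive words map to successive banks, and a warp's access is serialized by the worst-hit bank. This is a simplification (the broadcast of several threads reading the same word is not modeled), and the function names are hypothetical.

```c
/* Shared-memory bank mapping sketch for compute capability 2.x:
 * 32 banks, successive 32-bit words in successive banks.  Simplified:
 * same-word broadcast is counted as a conflict here, which real
 * hardware would not do. */
enum { BANKS = 32 };

int bank_of(int word_index) { return word_index % BANKS; }

/* worst-case number of accesses landing on one bank for a warp's
 * 32 word indices; 1 means the access is conflict-free */
int max_conflict_degree(const int word_idx[32]) {
    int count[BANKS] = {0};
    for (int t = 0; t < 32; t++)
        count[bank_of(word_idx[t])]++;
    int worst = 0;
    for (int b = 0; b < BANKS; b++)
        if (count[b] > worst) worst = count[b];
    return worst;
}
```

Stride-1 access (thread t reads word t) is conflict-free; stride-2 access folds threads t and t+16 onto the same bank, halving effective bandwidth.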
  • 84. Roofline Model Identify the performance bottleneck: compute-bound vs. bandwidth-bound
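The roofline bound itself is one line of arithmetic: attainable performance is the minimum of the machine's peak compute rate and its memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). A minimal sketch, with a hypothetical function name:

```c
/* Roofline model: a kernel is capped either by peak compute rate or
 * by bandwidth * arithmetic intensity, whichever is lower. */
double roofline_gflops(double peak_gflops, double bw_gbytes_per_s,
                       double flops_per_byte) {
    double bw_bound = bw_gbytes_per_s * flops_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For example, a kernel with 2 flops/byte on a machine with 100 GB/s and 1000 Gflop/s peak is bandwidth-bound at 200 Gflop/s; raising its arithmetic intensity (e.g. via shared-memory reuse) moves it toward the compute roof.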
  • 85. Optimization Is Key for Attainable Gflops/s
  • 89. Intel Many Integrated Core (MIC) 32 core version of MIC:
  • 91. Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech, ( link )
  • 92. Sandy Bridge's New CPU-GPU Interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech, ( link )

Editor's notes

  1. NVIDIA planned to put 512 PEs into a single GPU, but the GTX480 turns out to have 480 PEs.
  2. GPU can achieve 10x performance over CPU. 
  3. Notice the third place is PowerXCell. Rmax is the performance of Linpack benchmark. Rpeak is the raw performance of the machine.
  4. This gap is narrowed by multi-core CPUs.
  5. Comparing raw performance is less interesting.
  6. The area breakdown is an approximation, but it is good enough to see the trend.
  7. The size of L3 in high end and low end CPUs are quite different.
  8. This break down is also an approximation.
  9. Numbers are based on Intel Nehalem at 45nm and the presentation of Bill Dally.
  10. More registers are required to store the contexts of threads.
  11. Hiding memory latency by multi-threading. The Cell uses a relatively static approach. The overlapping of computation and DMA transfer is explicitly specified by programmer.
  12. Fine-grained multi-threading can keep the PEs busy even if the program has little ILP.
  13. The cache can still help.
  14. The address assignment and translation is done dynamically by hardware.
  15. The vector core should be larger than scalar core.
  16. From scalar to vector.
  17. From vector to threads.
  18. Warp can be grouped at run time by hardware. In this case it will be transparent to the programmer.
  19. The NVIDIA Fermi PE can do int and fp.
  20. We have ignored some architectural features of Fermi.  Noticeably the interconnection network is not discussed here. 
  21. These features are summarized by the paper of Michael Garland and David Kirk.
  22. The vector program uses SSE as an example. However, "incps" is not a real SSE instruction; it is used here to represent incrementing the vector.
  23. Each thread uses its ID to locate its working data set.
  24. The scheduler tries to maintain load balancing among SMs.
  25. Numbers taken from an old paper on G80 architecture, but it should be similar to the GF100 architecture.
  26. The old architecture has 16 banks.
  27. It is a trend to use threads to hide vector width. OpenCL applies the same programming model.
  28. It is arguable whether working on threads is more productive.
  29. This example assumes the two warp schedulers are decoupled. It is possible that they are coupled together, at the cost of hardware complexity.
  30. Assume the register file has one read port. The register file may need two read ports to support instructions with 3 source operands, e.g. the Fused Multiply Add (FMA).
  31. 5 issue VLIW.
  32. The atomic unit is helpful in voting operations, e.g. histograms. 
  33. The figure is taken from 8800 GPU. See the paper of Samuel Williams for more detail.
  34. The number is obtained in 8800 GPU.
  35. The latency hiding is addressed in the PhD thesis of Samuel Williams.