Naga Vydyanathan
GPU Computing with CUDA
Accelerated Computing
GPU Teaching Kit

3 Ways to Accelerate Applications
– Libraries: easy to use, most performance
– Programming languages (CUDA): most performance, most flexibility
– Compiler directives: easy to use, portable code

Objective
– To understand the CUDA memory hierarchy
  – Registers, shared memory, global memory
  – Scope and lifetime
– To learn the basic memory API functions in CUDA host code
  – Device memory allocation
  – Host-device data transfer
– To learn about CUDA threads, the main mechanism for exploiting data parallelism
  – Hierarchical thread organization
  – Launching parallel execution
  – Thread index to data index mapping

Hardware View of CUDA Memories
[Figure: a processor (SM) with shared memory, a register file, a processing unit (ALU) and a control unit (PC, IR), connected to global memory and I/O.]

Programmer View of CUDA Memories
[Figure: a grid containing Block (0, 0) and Block (1, 0). Each block has its own shared memory, and each thread, Thread (0, 0) and Thread (1, 0), has its own registers. Global memory and constant memory belong to the grid and are reachable from the host.]

Declaring CUDA Variables
– __device__ is optional when used with __shared__ or __constant__
– Automatic variables reside in a register
  – Except per-thread arrays, which are typically placed in local memory (physically located in global memory)

Variable declaration                      Memory    Scope   Lifetime
int LocalVar;                             register  thread  thread
__device__ __shared__ int SharedVar;      shared    block   block
__device__ int GlobalVar;                 global    grid    application
__device__ __constant__ int ConstantVar;  constant  grid    application
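
The four kinds of declaration in the table can be seen side by side in one small kernel. This is a minimal sketch and not part of the original slides: the kernel scopeDemo, the host helper runScopeDemo and the values written are hypothetical, chosen only to show where each variable lives and who can see it.

```cuda
#include <cuda_runtime.h>

__device__ int GlobalVar;        // global memory: visible to the whole grid, application lifetime
__constant__ int ConstantVar;    // constant memory: read-only inside kernels

// Hypothetical kernel: every thread combines one variable of each kind.
__global__ void scopeDemo(int *out)
{
    __shared__ int SharedVar;    // shared memory: one copy per thread block
    int LocalVar = threadIdx.x;  // automatic variable: normally held in a register

    if (threadIdx.x == 0) SharedVar = blockIdx.x;  // one thread initializes the block's copy
    __syncthreads();                               // make that write visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = GlobalVar + ConstantVar + SharedVar + LocalVar;
}

// Hypothetical host helper: sets the two grid-scope variables, runs the kernel
// and copies blocks * threads results into h_out. Returns the first error seen.
cudaError_t runScopeDemo(int *h_out, int blocks, int threads, int globalVal, int constVal)
{
    int *d_out = NULL;
    size_t size = (size_t)blocks * threads * sizeof(int);

    cudaError_t err = cudaMalloc((void **)&d_out, size);
    if (err == cudaSuccess) err = cudaMemcpyToSymbol(GlobalVar, &globalVal, sizeof(int));
    if (err == cudaSuccess) err = cudaMemcpyToSymbol(ConstantVar, &constVal, sizeof(int));
    if (err == cudaSuccess) {
        scopeDemo<<<blocks, threads>>>(d_out);
        err = cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);   // cudaFree(NULL) performs no operation
    return err;
}
```

With 2 blocks of 4 threads, GlobalVar set to 100 and ConstantVar set to 10, thread 3 of block 1 would write 100 + 10 + 1 + 3.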

Global Memory in CUDA
– GPU device memory (DRAM)
– Large in size (~16/32 GB)
– Bandwidth ~700-900 GB/s
– High latency (~1000 cycles)
– Scope of access and sharing: kernel (all threads in the grid)
– Lifetime: application
– Main means of communicating data between host and GPU

Shared Memory in CUDA
– A special type of memory whose contents are explicitly defined and used in the kernel source code
– One in each SM
– Accessed at much higher speed (in both latency and throughput) than global memory
– Scope of access and sharing: thread block
– Lifetime: thread block; the contents disappear after the corresponding thread block finishes execution
– A form of scratchpad memory in computer architecture
We will revisit shared memory later.

Data Parallelism - Vector Addition Example
[Figure: vectors A, B and C with elements 0 to N-1; each C[i] is computed independently as A[i] + B[i].]

Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    …
    vecAdd(h_A, h_B, h_C, N);
}

Heterogeneous Computing vecAdd CUDA Host Code
[Figure: Part 1 moves data from host memory (CPU) to device memory (GPU), Part 2 runs on the GPU, Part 3 moves the result back to host memory.]
#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    // Part 1
    // Allocate device memory for A, B, and C
    // copy A and B to device memory

    // Part 2
    // Kernel launch code – the device performs the actual vector addition

    // Part 3
    // copy C from the device memory
    // Free device vectors
}

Partial Overview of CUDA Memories
– Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory
– Host code can:
  – Transfer data to/from per-grid global memory
[Figure: the host connected to the global memory of the (device) grid; Block (0, 0) and Block (0, 1) each contain threads with their own registers.]

CUDA Device Memory Management API functions
– cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters
    – Address of a pointer to the allocated object
    – Size of the allocated object in bytes
– cudaFree()
  – Frees an object from device global memory
  – One parameter
    – Pointer to the freed object
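
The two calls above can be wrapped as follows. This is a minimal sketch under stated assumptions: allocDeviceFloats and freeDeviceFloats are hypothetical helper names introduced here for illustration, not functions from the slides or from the CUDA runtime.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocates n floats in device global memory.
// Returns NULL if the allocation fails.
float *allocDeviceFloats(int n)
{
    float *d_ptr = NULL;
    // First parameter: address of the pointer that receives the device address.
    // Second parameter: size of the allocation in bytes.
    cudaError_t err = cudaMalloc((void **)&d_ptr, (size_t)n * sizeof(float));
    return (err == cudaSuccess) ? d_ptr : NULL;
}

// Hypothetical helper: releases memory obtained from allocDeviceFloats().
// Returns true when the runtime reports success.
bool freeDeviceFloats(float *d_ptr)
{
    return cudaFree(d_ptr) == cudaSuccess;   // single parameter: the pointer to free
}
```

The pointer returned by cudaMalloc() is a device address, so host code should pass it to API functions and kernels rather than dereference it.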

Host-Device Data Transfer API functions
– cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    – Pointer to destination
    – Pointer to source
    – Number of bytes copied
    – Type/direction of transfer
  – Transfer to device is synchronous with respect to the host (the call blocks until the source buffer has been consumed)
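
A round trip through device memory shows the four cudaMemcpy() parameters in both directions. The sketch below is illustrative only; roundTrip is a hypothetical helper name, and the only work done on the device is holding the data.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats host -> device -> host.
// Returns true when every runtime call succeeds.
bool roundTrip(const float *h_src, float *h_dst, int n)
{
    size_t size = (size_t)n * sizeof(float);
    float *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, size) != cudaSuccess) return false;

    // Arguments: destination, source, number of bytes, direction of transfer.
    cudaError_t up   = cudaMemcpy(d_buf, h_src, size, cudaMemcpyHostToDevice);
    cudaError_t down = cudaMemcpy(h_dst, d_buf, size, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    return up == cudaSuccess && down == cudaSuccess;
}
```

Note that the destination always comes first, as in the C library's memcpy(), and the direction constant must match which of the two pointers is the device pointer.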

Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code – to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
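
Repeating this check after every call gets verbose, so it is commonly factored into a helper. The following is one possible sketch, not the slides' own code: cudaOk and CUDA_OK are hypothetical names, and returning a flag instead of exiting is a design choice made here so the caller decides how to react.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a message for a failed runtime call and
// reports whether the call succeeded.
static bool cudaOk(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        return false;
    }
    return true;
}

// Wraps a runtime call so that the file and line of the call site are recorded.
#define CUDA_OK(call) cudaOk((call), __FILE__, __LINE__)
```

A call site could then read: if (!CUDA_OK(cudaMalloc((void **) &d_A, size))) exit(EXIT_FAILURE);. A kernel launch returns no status itself, so its errors are usually picked up by passing cudaGetLastError() to the same check right after the launch.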

CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
  – Serial parts in host C code
  – Parallel parts in device SPMD kernel code

Serial Code (host)
Parallel Kernel (device)
    KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device)
    KernelB<<< nBlk, nTid >>>(args);

Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (Single Program Multiple Data)
  – Each thread has indexes that it uses to compute memory addresses and make control decisions
[Figure: threads 0, 1, 2, …, 254, 255, each executing:]
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

Thread Blocks: Scalable Cooperation
– Divide thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  – Threads in different blocks do not interact
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0, 1, 2, …, 254, 255, every thread executing:]
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
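
To make the index formula concrete: with 256 threads per block, thread 5 of block 2 gets i = 2 * 256 + 5 = 517. The sketch below lets each thread record its own block and thread index at position i so the mapping can be inspected on the host. The names recordIndices and mapIndices are hypothetical and exist only for this illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its block index and its index
// within the block at its global position i.
__global__ void recordIndices(int *blockOf, int *threadOf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // the last block may have threads past the end
        blockOf[i]  = blockIdx.x;
        threadOf[i] = threadIdx.x;
    }
}

// Hypothetical host helper: covers n elements (n > 0) with blocks of
// blockSize threads and copies the recorded indices back to the host.
cudaError_t mapIndices(int *h_blockOf, int *h_threadOf, int n, int blockSize)
{
    size_t size = (size_t)n * sizeof(int);
    int *d_blockOf = NULL, *d_threadOf = NULL;

    cudaError_t err = cudaMalloc((void **)&d_blockOf, size);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_threadOf, size);
    if (err == cudaSuccess) {
        int numBlocks = (n - 1) / blockSize + 1;   // ceiling of n / blockSize
        recordIndices<<<numBlocks, blockSize>>>(d_blockOf, d_threadOf, n);
        err = cudaMemcpy(h_blockOf, d_blockOf, size, cudaMemcpyDeviceToHost);
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_threadOf, d_threadOf, size, cudaMemcpyDeviceToHost);

    cudaFree(d_blockOf);         // cudaFree(NULL) performs no operation
    cudaFree(d_threadOf);
    return err;
}
```

For n = 1000 and blockSize = 256, four blocks are launched and the last 24 threads of block 3 fail the i < n test and do nothing.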

blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (3D since CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
[Figure: a device grid of 2 x 2 blocks, Block (0, 0) to Block (1, 1). Block (1, 1) is expanded into a 3D arrangement of threads labelled from Thread (0,0,0) to Thread (1,0,3).]
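
For image-like data, 2D block and thread indices map directly onto columns and rows. The sketch below is an assumption-laden illustration rather than material from the slides: scaleImage and scaleOnDevice are hypothetical names, the image is assumed to be stored row by row as floats, and the 16 x 16 block shape is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every pixel of a width x height image
// (stored row by row) by factor, one thread per pixel.
__global__ void scaleImage(float *img, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x dimension -> column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y dimension -> row
    if (col < width && row < height)
        img[row * width + col] *= factor;
}

// Hypothetical host helper: runs scaleImage on a host image in place.
// Assumes width > 0 and height > 0. Returns the first error seen.
cudaError_t scaleOnDevice(float *h_img, int width, int height, float factor)
{
    size_t size = (size_t)width * height * sizeof(float);
    float *d_img = NULL;

    cudaError_t err = cudaMalloc((void **)&d_img, size);
    if (err == cudaSuccess) err = cudaMemcpy(d_img, h_img, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16, 1);                          // 256 threads per block
        dim3 grid((width - 1) / 16 + 1, (height - 1) / 16 + 1, 1);
        scaleImage<<<grid, block>>>(d_img, width, height, factor);
        err = cudaMemcpy(h_img, d_img, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_img);
    return err;
}
```

Each dimension gets its own bounds check because the grid is rounded up independently in x and y.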

Example: Vector Addition Kernel (Device Code)
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

Example: Vector Addition Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0),256>>>(d_A, d_B, d_C, n);
}
The ceiling function makes sure that there are enough threads to cover all elements.

More on Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}
For n > 0, (n-1)/256 + 1 is an equivalent, integer-only way to express ceil(n/256.0).
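
Putting the three parts together (allocate and copy in, launch, copy out and free) gives a complete vecAdd. This is a sketch assembled from the fragments on the preceding slides; the error chaining, the const qualifiers and the cudaGetLastError() check after the launch are additions made here, not part of the original code, and n > 0 is assumed.

```cuda
#include <cuda_runtime.h>

// Each thread performs one pair-wise addition.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Computes h_C = h_A + h_B on the device. Returns the first error seen.
cudaError_t vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    size_t size = (size_t)n * sizeof(float);
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;

    // Part 1: allocate device memory and copy the inputs to the device.
    cudaError_t err = cudaMalloc((void **)&d_A, size);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_B, size);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_C, size);
    if (err == cudaSuccess) err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Part 2: launch enough 256-thread blocks to cover all n elements.
    if (err == cudaSuccess) {
        dim3 DimGrid((n - 1) / 256 + 1, 1, 1);
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
        err = cudaGetLastError();   // a launch reports its errors separately
    }

    // Part 3: copy the result back and free the device vectors.
    if (err == cudaSuccess) err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);                  // cudaFree(NULL) performs no operation
    cudaFree(d_B);
    cudaFree(d_C);
    return err;
}
```

The host arrays keep the h_ prefix and the device arrays the d_ prefix, following the slides' naming, which makes it easier to spot a pointer passed to the wrong side.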

Kernel execution in a nutshell
__host__
void vecAdd(…)
{
    dim3 DimGrid(ceil(n/256.0),1,1);
    dim3 DimBlock(256,1,1);
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A,d_B,d_C,n);
}

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
[Figure: the blocks of the grid, Blk 0 … Blk N-1, are distributed over the GPU's multiprocessors M0 … Mk, which share the device RAM.]

More on CUDA Function Declarations

Declaration                      Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void KernelFunc()     device             host
__host__ float HostFunc()        host               host

– __global__ defines a kernel function
  – Each "__" consists of two underscore characters
  – A kernel function must return void
– __device__ and __host__ can be used together
– __host__ is optional if used alone
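
The qualifiers can be combined in one small program. The sketch below is illustrative and uses hypothetical names throughout (square, addOne, squarePlusOne, squarePlusOneOnDevice); it shows a function compiled for both sides, a device-only function, a kernel, and a plain host function.

```cuda
#include <cuda_runtime.h>

// Compiled for both host and device, so either side can call it.
__host__ __device__ float square(float x) { return x * x; }

// Device-only: callable from kernels and other device functions.
__device__ float addOne(float x) { return x + 1.0f; }

// Kernel: launched from the host, executed on the device, returns void.
__global__ void squarePlusOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = addOne(square(data[i]));
}

// Host function (__host__ is implied when no qualifier is given).
// Replaces each h_data[i] with h_data[i]^2 + 1; assumes n > 0.
cudaError_t squarePlusOneOnDevice(float *h_data, int n)
{
    size_t size = (size_t)n * sizeof(float);
    float *d_data = NULL;

    cudaError_t err = cudaMalloc((void **)&d_data, size);
    if (err == cudaSuccess) err = cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        squarePlusOne<<<(n - 1) / 256 + 1, 256>>>(d_data, n);
        err = cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    return err;
}
```

Calling addOne() from host code would be rejected by the compiler, whereas square() can be called directly from the host as well as from the kernel.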

CUDA Device Query
– Enumerates properties of CUDA devices in your system
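
A much reduced device query can be written with two runtime calls, cudaGetDeviceCount() and cudaGetDeviceProperties(). The sketch below is not the full query tool referred to above; listCudaDevices is a hypothetical name, and the handful of properties printed is a selection made here to match the memory and thread topics of these slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few properties of every CUDA device.
// Returns the number of devices, or -1 if the runtime reports an error.
int listCudaDevices(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:           %zu bytes\n", prop.totalGlobalMem);
        printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
    }
    return count;
}
```

The reported maximum number of threads per block is the upper bound for the block size chosen in a kernel launch.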

Next Webinar
– Learn to write, compile and run a multi-dimensional CUDA program
– Learn some CUDA optimization techniques
  – Memory optimization
  – Compute optimization
  – Transfer optimization

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
Naga Vydyanathan: nvydyanathan@nvidia.com
NVIDIA Developer Zone: https://developer.nvidia.com/


GPU Computing with CUDA

  • 1. Naga Vydyanathan GPU Computing with CUDA Accelerated Computing GPU Teaching Kit
  • 2. 2 3 Ways to Accelerate Applications Applications Libraries Easy to use Most Performance Programming Languages Most Performance Most Flexibility Easy to use Portable code Compiler Directives CUDA
  • 3. 3 Objective – To understand the CUDA memory hierarchy – Registers, shared memory, global memory – Scope and lifetime – To learn the basic memory API functions in CUDA host code – Device Memory Allocation – Host-Device Data Transfer – To learn about CUDA threads, the main mechanism for exploiting of data parallelism – Hierarchical thread organization – Launching parallel execution – Thread index to data index mapping
  • 4. 4 Global Memory Processing Unit I/O ALU Processor (SM) Shared Memory Register File Control Unit PC IR Hardware View of CUDA Memories
  • 5. 5 Programmer View of CUDA Memories Grid Global Memory Block (0, 0) Shared Memory Thread (0, 0) Registers Thread (1, 0) Registers Block (1, 0) Shared Memory Thread (0, 0) Registers Thread (1, 0) Registers Host Constant Memory
  • 6. 6 Declaring CUDA Variables – __device__ is optional when used with __shared__, or __constant__ – Automatic variables reside in a register – Except per-thread arrays that reside in global memory Variable declaration Memory Scope Lifetime int LocalVar; register thread thread __device__ __shared__ int SharedVar; shared block block __device__ int GlobalVar; global grid application __device__ __constant__ int ConstantVar; constant grid application
  • 7. 7 Global Memory in CUDA – GPU Device memory (DRAM) – Large in size (~16/32 GB) – Bandwidth ~700-900 GB/s – High latency (~1000 cycles) – Scope of access and sharing - kernel – Lifetime – application – main means of communicating data between host and GPU.
  • 8. 8 Shared Memory in CUDA – A special type of memory whose contents are explicitly defined and used in the kernel source code – One in each SM – Accessed at much higher speed (in both latency and throughput) than global memory – Scope of access and sharing - thread blocks – Lifetime – thread block, contents will disappear after the corresponding thread block finishes and terminates execution – A form of scratchpad memory in computer architecture We will re-visit shared memory later.
  • 9. 9 A[0] vector A vector B vector C A[1] A[2] A[N-1] B[0] B[1] B[2] … … B[N-1] C[0] C[1] C[2] C[N-1] … + + + + Data Parallelism - Vector Addition Example
  • 10. 10 Vector Addition – Traditional C Code // Compute vector sum C = A + B void vecAdd(float *h_A, float *h_B, float *h_C, int n) { int i; for (i = 0; i<n; i++) h_C[i] = h_A[i] + h_B[i]; } int main() { // Memory allocation for h_A, h_B, and h_C // I/O to read h_A and h_B, N elements … vecAdd(h_A, h_B, h_C, N); } 10
  • 11. 11 CPU Host Memory GPU Device Memory Part 1 Part 3 Heterogeneous Computing vecAdd CUDA Host Code #include <cuda.h> void vecAdd(float *h_A, float *h_B, float *h_C, int n) { int size = n* sizeof(float); float *d_A, *d_B, *d_C; // Part 1 // Allocate device memory for A, B, and C // copy A and B to device memory // Part 2 // Kernel launch code – the device performs the actual vector addition // Part 3 // copy C from the device memory // Free device vectors } 11 Part 2
  • 12. Partial Overview of CUDA Memories
      – Device code can:
          – R/W per-thread registers
          – R/W all-shared global memory
      – Host code can:
          – Transfer data to/from per-grid global memory
      [Figure: host connected to the device grid's global memory; Blocks (0, 0) and (0, 1), each containing Threads (0, 0) and (0, 1) with their own registers]
  • 13. CUDA Device Memory Management API functions
      – cudaMalloc()
          – Allocates an object in the device global memory
          – Two parameters
              – Address of a pointer to the allocated object
              – Size of the allocated object in bytes
      – cudaFree()
          – Frees an object from device global memory
          – One parameter
              – Pointer to the freed object
      [Figure: same memory diagram as slide 12]
  • 14. Host-Device Data Transfer API functions
      – cudaMemcpy()
          – Memory data transfer
          – Requires four parameters
              – Pointer to destination
              – Pointer to source
              – Number of bytes copied
              – Type/direction of transfer
          – Transfer to device is synchronous
      [Figure: same memory diagram as slide 12]
  • 15. Vector Addition Host Code
      void vecAdd(float *h_A, float *h_B, float *h_C, int n)
      {
          int size = n * sizeof(float);
          float *d_A, *d_B, *d_C;

          cudaMalloc((void **) &d_A, size);
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMalloc((void **) &d_B, size);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
          cudaMalloc((void **) &d_C, size);

          // Kernel invocation code – to be shown later

          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
      }
  • 16. In Practice, Check for API Errors in Host Code
      cudaError_t err = cudaMalloc((void **) &d_A, size);
      if (err != cudaSuccess) {
          printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
          exit(EXIT_FAILURE);
      }
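Repeating that check after every call gets verbose, so it is commonly wrapped in a macro. The sketch below is one possible way to do this; `CUDA_CHECK` and `roundTrip` are hypothetical names introduced here, not part of the CUDA API or the slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: evaluates a CUDA runtime call once and aborts
// with the same message format as the slide if the call did not succeed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s in %s at line %d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Copies n floats to the device and back; returns the number of mismatches.
int roundTrip(int n)
{
    size_t size = n * sizeof(float);
    float *h_in = (float *)malloc(size);
    float *h_out = (float *)malloc(size);
    for (int i = 0; i < n; i++) { h_in[i] = (float)i; h_out[i] = -1.0f; }

    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, size));
    CUDA_CHECK(cudaMemcpy(d_buf, h_in, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(h_out, d_buf, size, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_buf));

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != h_in[i]) bad++;
    free(h_in);
    free(h_out);
    return bad;
}

int main()
{
    int bad = roundTrip(1024);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

A kernel launch returns no error code itself; a launch failure can be picked up by passing `cudaGetLastError()` to the same macro right after the launch.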
  • 17. CUDA Execution Model
      – Heterogeneous host (CPU) + device (GPU) application C program
          – Serial parts in host C code
          – Parallel parts in device SPMD kernel code

      Serial Code (host)
      Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
      Serial Code (host)
      Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
  • 18. Arrays of Parallel Threads
      – A CUDA kernel is executed by a grid (array) of threads
          – All threads in a grid run the same kernel code (Single Program Multiple Data)
          – Each thread has indexes that it uses to compute memory addresses and make control decisions

      i = blockIdx.x * blockDim.x + threadIdx.x;
      C[i] = A[i] + B[i];
      [Figure: threads 0, 1, 2, …, 254, 255, each executing the two statements above]
  • 19. Thread Blocks: Scalable Cooperation
      – Divide thread array into multiple blocks
          – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
          – Threads in different blocks do not interact

      i = blockIdx.x * blockDim.x + threadIdx.x;
      C[i] = A[i] + B[i];
      [Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0 … 255 executing the two statements above]
  • 20. blockIdx and threadIdx
      – Each thread uses indices to decide what data to work on
          – blockIdx: 1D, 2D, or 3D (CUDA 4.0)
          – threadIdx: 1D, 2D, or 3D
      – Simplifies memory addressing when processing multidimensional data
          – Image processing
          – Solving PDEs on volumes
          – …
      [Figure: a device grid of Blocks (0, 0), (0, 1), (1, 0), (1, 1); Block (1, 1) expanded into a 3D array of threads, Thread (0,0,0) through Thread (1,0,3)]
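As a small illustration of multidimensional indexing, the following sketch maps a 2D grid of 2D blocks onto a row-major image. The kernel `scaleImage`, the driver `runScale`, the 16 x 16 block shape, and the image size are all assumptions made for this example; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one pixel of a row-major image.
__global__ void scaleImage(float *img, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x index selects the column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y index selects the row
    if (row < height && col < width)                   // guard threads past the edges
        img[row * width + col] *= factor;
}

// Scales a width x height image by 2 on the GPU; returns the number of mismatches.
int runScale(int width, int height)
{
    int n = width * height;
    size_t size = n * sizeof(float);
    float *h = (float *)malloc(size);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d = NULL;
    cudaMalloc((void **)&d, size);
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);

    dim3 block(16, 16, 1);                                   // 256 threads per block
    dim3 grid((width - 1) / 16 + 1, (height - 1) / 16 + 1, 1);
    scaleImage<<<grid, block>>>(d, width, height, 2.0f);

    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h[i] != 2.0f * (float)i) bad++;
    free(h);
    return bad;
}

int main()
{
    int bad = runScale(100, 60);   // neither dimension is a multiple of 16
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Because neither 100 nor 60 is a multiple of 16, some threads in the last column and row of blocks fall outside the image, which is what the bounds check in the kernel handles.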
  • 21. Example: Vector Addition Kernel (Device Code)
      // Compute vector sum C = A + B
      // Each thread performs one pair-wise addition
      __global__
      void vecAddKernel(float* A, float* B, float* C, int n)
      {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          if (i < n) C[i] = A[i] + B[i];
      }
  • 22. Example: Vector Addition Kernel Launch (Host Code)
      void vecAdd(float* h_A, float* h_B, float* h_C, int n)
      {
          // d_A, d_B, d_C allocations and copies omitted
          // Run ceil(n/256.0) blocks of 256 threads each
          vecAddKernel<<<ceil(n/256.0),256>>>(d_A, d_B, d_C, n);
      }
      The ceiling function makes sure that there are enough threads to cover all elements.
  • 23. More on Kernel Launch (Host Code)
      void vecAdd(float* h_A, float* h_B, float* h_C, int n)
      {
          // d_A, d_B, d_C allocations and copies omitted
          dim3 DimGrid((n-1)/256 + 1, 1, 1);
          dim3 DimBlock(256, 1, 1);
          vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
      }
      The integer expression (n-1)/256 + 1 is an equivalent way to express the ceiling function for n ≥ 1.
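The equivalence of the two block-count formulas can be checked on the host with plain arithmetic. In the sketch below, `blocksInt`, `blocksCeil`, and `countDisagreements` are hypothetical helper names; the code needs no GPU.

```cuda
#include <cmath>
#include <cstdio>

// Integer form from the slide: (n-1)/blockSize + 1.
int blocksInt(int n, int blockSize) { return (n - 1) / blockSize + 1; }

// Floating-point form from the previous slide: ceil(n / blockSize).
int blocksCeil(int n, int blockSize) { return (int)ceil(n / (double)blockSize); }

// Counts the values of n in [1, maxN] for which the two forms disagree.
int countDisagreements(int maxN, int blockSize)
{
    int bad = 0;
    for (int n = 1; n <= maxN; n++)
        if (blocksInt(n, blockSize) != blocksCeil(n, blockSize)) bad++;
    return bad;
}

int main()
{
    printf("n = 1000: %d blocks of 256 threads\n", blocksInt(1000, 256));
    printf("disagreements for n in [1, 100000]: %d\n",
           countDisagreements(100000, 256));
    return 0;
}
```

The two forms differ only at n = 0, where integer division truncates toward zero and `(n-1)/256 + 1` yields 1 block while the ceiling form yields 0. Code that may be called with an empty vector should handle that case before launching.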
  • 24. Kernel execution in a nutshell
      __host__
      void vecAdd(…)
      {
          dim3 DimGrid(ceil(n/256.0), 1, 1);
          dim3 DimBlock(256, 1, 1);
          vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
      }

      __global__
      void vecAddKernel(float *A, float *B, float *C, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) C[i] = A[i] + B[i];
      }
      [Figure: grid of blocks Blk 0 … Blk N-1 mapped onto a GPU with multiprocessors M0 … Mk and RAM]
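Putting the pieces from the preceding slides together, a complete program might look like the sketch below. The kernel and `vecAdd` follow the slides; the driver `runVecAdd`, the test data, and the vector length are assumptions added for illustration, and API return codes are not checked for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread performs one pair-wise addition.
__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Part 2: launch enough 256-thread blocks to cover all n elements
    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);

    // Part 3: copy the result back and free the device vectors
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

// Hypothetical driver: adds two n-element vectors; returns the number of mismatches.
int runVecAdd(int n)
{
    float *h_A = (float *)malloc(n * sizeof(float));
    float *h_B = (float *)malloc(n * sizeof(float));
    float *h_C = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_A[i] = (float)i; h_B[i] = 2.0f * i; h_C[i] = -1.0f; }

    vecAdd(h_A, h_B, h_C, n);

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h_C[i] != 3.0f * i) bad++;
    free(h_A); free(h_B); free(h_C);
    return bad;
}

int main()
{
    int bad = runVecAdd(1000);   // 1000 is not a multiple of 256
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

With n = 1000 the launch creates 4 blocks (1024 threads), so the last 24 threads fail the `i < n` test and do nothing. Saved as a `.cu` file, a program like this is typically compiled with `nvcc`.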
  • 25. More on CUDA Function Declarations
      Declaration                       Executed on the:   Only callable from the:
      __device__ float DeviceFunc()     device             device
      __global__ void KernelFunc()      device             host
      __host__ float HostFunc()         host               host

      – __global__ defines a kernel function
          – Each "__" consists of two underscore characters
          – A kernel function must return void
      – __device__ and __host__ can be used together
      – __host__ is optional if used alone
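Using `__host__` and `__device__` together asks the compiler to build a function for both sides, so the same source can be called from a kernel and from host code. A minimal sketch, with the hypothetical names `square`, `squareKernel`, and `runSquare` and with error checking omitted:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled twice: once for the host and once for the device.
__host__ __device__ float square(float x) { return x * x; }

// Kernel: executed on the device, launched from the host, returns void.
__global__ void squareKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(in[i]);   // device-side call
}

// Squares n values (n <= 256) on the GPU and compares against the host version.
// Returns the number of mismatches, or -1 if n is out of range.
int runSquare(int n)
{
    if (n < 1 || n > 256) return -1;
    float h_in[256], h_out[256];
    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    squareKernel<<<1, 256>>>(d_out, d_in, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != square(h_in[i])) bad++;   // host-side call
    return bad;
}

int main()
{
    int bad = runSquare(200);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```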
  • 26. CUDA Device Query
      – Enumerates properties of CUDA devices in your system
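The slide's screenshot is not preserved here, but a similar listing can be produced with the runtime API. The sketch below is not the deviceQuery sample itself; `printDevices` is a hypothetical helper that prints a handful of fields from `cudaDeviceProp` for each device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints selected properties of every CUDA device.
// Returns the device count, or -1 if the runtime reports an error.
int printDevices()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:           %zu MiB\n", prop.totalGlobalMem >> 20);
        printf("  Shared memory per block: %zu KiB\n", prop.sharedMemPerBlock >> 10);
        printf("  Registers per block:     %d\n", prop.regsPerBlock);
        printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
        printf("  Warp size:               %d\n", prop.warpSize);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:    %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:     %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return count;
}

int main()
{
    return printDevices() >= 0 ? 0 : 1;
}
```

The reported limits (threads per block, grid dimensions, shared memory per block) are the values that constrain the launch configurations used on the earlier slides.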
  • 27. Next Webinar
      – Learn to write, compile and run a multi-dimensional CUDA program
      – Learn some CUDA optimization techniques
          – Memory optimization
          – Compute optimization
          – Transfer optimization
  • 28. GPU Teaching Kit (Accelerated Computing)
      The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
      Naga Vydyanathan: nvydyanathan@nvidia.com
      NVIDIA Developer Zone: https://developer.nvidia.com/