3. CUDA APIs
API allows the host to manage the devices
Allocate memory & transfer data
Launch kernels
CUDA C “Runtime” API
High level of abstraction - start here!
CUDA C “Driver” API
More control, more verbose
(OpenCL: Similar to CUDA C Driver API)
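To make the "more control, more verbose" contrast concrete, a hedged sketch of launching a kernel through the Driver API (the module file `saxpy.ptx` and kernel name `saxpy_parallel` are illustrative; error checking omitted):

```cuda
#include <cuda.h>
#include <stddef.h>

/* Driver API: explicit initialization, context, and module management.
   The Runtime API collapses all of this into kernel<<<nblocks, 256>>>(...). */
void launch_via_driver_api(int n, float a, CUdeviceptr d_x, CUdeviceptr d_y)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "saxpy.ptx");              /* load compiled kernel code */
    cuModuleGetFunction(&fn, mod, "saxpy_parallel");

    void *args[] = { &n, &a, &d_x, &d_y };
    int nblocks = (n + 255) / 256;
    cuLaunchKernel(fn, nblocks, 1, 1,             /* grid dimensions  */
                       256, 1, 1,                 /* block dimensions */
                       0, NULL, args, NULL);      /* shared mem, stream, params */
    cuCtxSynchronize();
}
```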
4. CUDA C and OpenCL
CUDA C: entry point for developers who prefer high-level C
OpenCL / Driver API: entry point for developers who want a low-level API
Shared back-end compiler and optimization technology
7. Processing Flow
1. Copy input data from CPU memory to GPU memory over the PCI bus
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory over the PCI bus
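The three steps above map directly onto Runtime API calls; a minimal fragment (the kernel name, buffers, and launch configuration are illustrative):

```cuda
// 1. Copy input data from CPU memory to GPU memory
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

// 2. Load GPU program and execute, caching data on chip for performance
my_kernel<<<nblocks, threads_per_block>>>(d_in, d_out, n);

// 3. Copy results from GPU memory to CPU memory
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
```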
8. CUDA Parallel Computing Architecture
Parallel computing architecture
and programming model
Includes a CUDA C compiler,
support for OpenCL and
DirectCompute
Architected to natively support
multiple computational
interfaces (standard languages
and APIs)
9. C for CUDA: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
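For completeness, the host code needed to drive `saxpy_parallel` end to end might look like the sketch below (array size and initial values are illustrative; error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float*)malloc(bytes), *h_y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);                       // allocate device memory
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                 // round up to cover all n elements
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);  // h_y[i] == 2*1 + 2 == 4
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```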
10. CUDA Parallel Computing Architecture
CUDA defines:
Programming model
Memory model
Execution model
CUDA uses the GPU, but is for general-purpose computing
Facilitate heterogeneous computing: CPU + GPU
CUDA is scalable
Scale to run on 100s of cores/1000s of parallel threads
11. Compiling CUDA C Applications (Runtime API)
Integrated source combines the rest of the C application with key CUDA kernels:

    void serial_function(...) {
        ...
    }
    void other_function(int ...) {
        ...
    }
    void saxpy_serial(float ...) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];    // modify into parallel CUDA code
    }
    void main() {
        float x;
        saxpy_serial(...);
        ...
    }

NVCC (built on Open64) splits the source: CUDA kernels are compiled into CUDA object files, while the rest of the application goes to the CPU compiler and becomes CPU object files. The linker then combines both into a single CPU-GPU executable.
12. PROGRAMMING MODEL
CUDA Kernels
Parallel portion of application: execute as a kernel
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU (host): executes functions
GPU (device): executes kernels
13. CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads, in parallel
All threads execute the same code, but can take different paths
Each thread has an ID, used to:
Select input/output data
Make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
18. CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
19. Communication Within a Block
Threads may need to cooperate
Memory accesses
Share results
Cooperate using shared memory
Accessible by all threads within a block
Restriction to “within a block” permits scalability
Fast communication between N threads is not feasible when N is large
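Block-level cooperation through shared memory can be sketched with a classic per-block sum reduction (the kernel name is illustrative; this version assumes a power-of-two block size of 256 and an input length that is a multiple of 256):

```cuda
__global__ void block_sum(const float *in, float *block_out)
{
    __shared__ float partial[256];           // visible to all threads in this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = in[i];
    __syncthreads();                         // all loads finish before sharing results

    // Tree reduction within the block: halve the active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        block_out[blockIdx.x] = partial[0];  // one result per block
}
```

Restricting the cooperation to one block is what lets each block run on any available core without global coordination.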
23. CUDA Programming Model - Summary
A kernel executes as a grid of thread blocks
A block is a batch of threads that communicate through shared memory
Each block has a block ID; each thread has a thread ID
[Diagram: the host launches Kernel 1 on the device as a 1D grid of blocks 0-3, and Kernel 2 as a 2D grid of blocks (0,0) through (1,3)]
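The 1D and 2D launches in the summary can be written as follows (kernel names and block size are illustrative):

```cuda
__global__ void kernel1(float *data)
{
    data[blockIdx.x] = (float)blockIdx.x;             // 1D block ID
}

__global__ void kernel2(float *data)
{
    int bid = blockIdx.y * gridDim.x + blockIdx.x;    // flatten the 2D block ID
    data[bid] = (float)bid;
}

void launch_examples(float *d_data)
{
    kernel1<<<4, 64>>>(d_data);       // 1D grid: blockIdx.x runs 0..3

    dim3 grid2(4, 2);                 // 2D grid of 2 rows x 4 columns
    kernel2<<<grid2, 64>>>(d_data);   // blockIdx.y runs 0..1, blockIdx.x runs 0..3
}
```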
28. Memory hierarchy
Thread: registers and local memory
Block of threads: shared memory
All blocks: global memory
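The hierarchy maps onto CUDA C declarations as in this sketch (the kernel name and sizes are illustrative):

```cuda
__device__ float d_global[4096];       // global memory: visible to all blocks

__global__ void hierarchy_demo(float *out)
{
    int i = threadIdx.x;               // scalars normally live in registers (per thread)
    __shared__ float s_tile[256];      // shared memory: one copy per block
    float scratch[32];                 // per-thread array; may spill to local memory

    s_tile[i] = d_global[blockIdx.x * blockDim.x + i];
    __syncthreads();
    scratch[0] = s_tile[i];
    out[blockIdx.x * blockDim.x + i] = scratch[0];
}
```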
30. Additional Memories
Host can also allocate textures and arrays of constants
Textures and constants have dedicated caches
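A constant array, for example, is declared on the device and filled from the host before any kernel reads it; a minimal sketch (the symbol and helper names are illustrative):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];     // constant memory: read-only in kernels, cached

// Host side: copy values into the constant-memory symbol before kernel launch
void set_coeffs(const float *h_coeffs)
{
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}

__global__ void apply_coeffs(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];   // every thread reads through the constant cache
}
```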