3. CUDA APIs
API allows the host to manage the devices
Allocate memory & transfer data
Launch kernels
CUDA C “Runtime” API
High level of abstraction - start here!
CUDA C “Driver” API
More control, more verbose
(OpenCL: Similar to CUDA C Driver API)
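To make the "more control, more verbose" contrast concrete, a hedged sketch of launching a kernel through the Driver API (the module file `saxpy.ptx` and kernel name `saxpy_parallel` are illustrative; error checking omitted):

```cuda
#include <cuda.h>
#include <stddef.h>

/* Driver API: explicit initialization, context, and module management.
   The Runtime API collapses all of this into kernel<<<nblocks, 256>>>(...). */
void launch_via_driver_api(int n, float a, CUdeviceptr d_x, CUdeviceptr d_y)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "saxpy.ptx");              /* load compiled kernel code */
    cuModuleGetFunction(&fn, mod, "saxpy_parallel");

    void *args[] = { &n, &a, &d_x, &d_y };
    int nblocks = (n + 255) / 256;
    cuLaunchKernel(fn, nblocks, 1, 1,             /* grid dimensions  */
                       256, 1, 1,                 /* block dimensions */
                       0, NULL, args, NULL);      /* shared mem, stream, params */
    cuCtxSynchronize();
}
```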
4. CUDA C and OpenCL
CUDA C: entry point for developers who prefer high-level C
OpenCL / Driver API: entry point for developers who want a low-level API
Shared back-end compiler and optimization technology
7. Processing Flow
1. Copy input data from CPU memory to GPU memory over the PCI bus
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory over the PCI bus
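The three steps above map directly onto Runtime API calls; a minimal fragment (the kernel name, buffers, and launch configuration are illustrative):

```cuda
// 1. Copy input data from CPU memory to GPU memory
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

// 2. Load GPU program and execute, caching data on chip for performance
my_kernel<<<nblocks, threads_per_block>>>(d_in, d_out, n);

// 3. Copy results from GPU memory to CPU memory
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
```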
8. CUDA Parallel Computing Architecture
Parallel computing architecture
and programming model
Includes a CUDA C compiler,
support for OpenCL and
DirectCompute
Architected to natively support
multiple computational
interfaces (standard languages
and APIs)
9. C for CUDA: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
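For completeness, the host code needed to drive `saxpy_parallel` end to end might look like the sketch below (array size and initial values are illustrative; error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float*)malloc(bytes), *h_y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);                       // allocate device memory
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                 // round up to cover all n elements
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);  // h_y[i] == 2*1 + 2 == 4
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```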
10. CUDA Parallel Computing Architecture
CUDA defines:
Programming model
Memory model
Execution model
CUDA uses the GPU, but is for general-purpose computing
Facilitate heterogeneous computing: CPU + GPU
CUDA is scalable
Scale to run on 100s of cores/1000s of parallel threads
11. Compiling CUDA C Applications (Runtime API)
Integrated source combines the rest of the C application with key CUDA kernels:

    void serial_function(...) {
        ...
    }
    void other_function(int ...) {
        ...
    }
    void saxpy_serial(float ...) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];    // modify into parallel CUDA code
    }
    void main() {
        float x;
        saxpy_serial(...);
        ...
    }

NVCC (built on Open64) splits the source: CUDA kernels are compiled into CUDA object files, while the rest of the application goes to the CPU compiler and becomes CPU object files. The linker then combines both into a single CPU-GPU executable.
12. PROGRAMMING MODEL
CUDA Kernels
Parallel portion of application: execute as a kernel
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU (host): executes functions
GPU (device): executes kernels
13. CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads, in parallel
All threads execute the same code, but can take different paths
Each thread has an ID, used to:
Select input/output data
Make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
18. CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
19. Communication Within a Block
Threads may need to cooperate
Memory accesses
Share results
Cooperate using shared memory
Accessible by all threads within a block
Restriction to “within a block” permits scalability
Fast communication between N threads is not feasible when N is large
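Block-level cooperation through shared memory can be sketched with a classic per-block sum reduction (the kernel name is illustrative; this version assumes a power-of-two block size of 256 and an input length that is a multiple of 256):

```cuda
__global__ void block_sum(const float *in, float *block_out)
{
    __shared__ float partial[256];           // visible to all threads in this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = in[i];
    __syncthreads();                         // all loads finish before sharing results

    // Tree reduction within the block: halve the active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        block_out[blockIdx.x] = partial[0];  // one result per block
}
```

Restricting the cooperation to one block is what lets each block run on any available core without global coordination.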
23. CUDA Programming Model - Summary
A kernel executes as a grid of thread blocks
A block is a batch of threads that communicate through shared memory
Each block has a block ID; each thread has a thread ID
[Diagram: the host launches Kernel 1 on the device as a 1D grid of blocks 0-3, and Kernel 2 as a 2D grid of blocks (0,0) through (1,3)]
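The 1D and 2D launches in the summary can be written as follows (kernel names and block size are illustrative):

```cuda
__global__ void kernel1(float *data)
{
    data[blockIdx.x] = (float)blockIdx.x;             // 1D block ID
}

__global__ void kernel2(float *data)
{
    int bid = blockIdx.y * gridDim.x + blockIdx.x;    // flatten the 2D block ID
    data[bid] = (float)bid;
}

void launch_examples(float *d_data)
{
    kernel1<<<4, 64>>>(d_data);       // 1D grid: blockIdx.x runs 0..3

    dim3 grid2(4, 2);                 // 2D grid of 2 rows x 4 columns
    kernel2<<<grid2, 64>>>(d_data);   // blockIdx.y runs 0..1, blockIdx.x runs 0..3
}
```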
28. Memory hierarchy
Thread: registers and local memory
Block of threads: shared memory
All blocks: global memory
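The hierarchy maps onto CUDA C declarations as in this sketch (the kernel name and sizes are illustrative):

```cuda
__device__ float d_global[4096];       // global memory: visible to all blocks

__global__ void hierarchy_demo(float *out)
{
    int i = threadIdx.x;               // scalars normally live in registers (per thread)
    __shared__ float s_tile[256];      // shared memory: one copy per block
    float scratch[32];                 // per-thread array; may spill to local memory

    s_tile[i] = d_global[blockIdx.x * blockDim.x + i];
    __syncthreads();
    scratch[0] = s_tile[i];
    out[blockIdx.x * blockDim.x + i] = scratch[0];
}
```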
30. Additional Memories
Host can also allocate textures and arrays of constants
Textures and constants have dedicated caches
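A constant array, for example, is declared on the device and filled from the host before any kernel reads it; a minimal sketch (the symbol and helper names are illustrative):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];     // constant memory: read-only in kernels, cached

// Host side: copy values into the constant-memory symbol before kernel launch
void set_coeffs(const float *h_coeffs)
{
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}

__global__ void apply_coeffs(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];   // every thread reads through the constant cache
}
```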