2. Graphics Processing Units (GPUs)
Over the last few years, GPUs have developed from graphics cards into a platform for HPC.
There is now great interest in using GPUs for scientific high performance computing, and GPUs are being designed with that application in mind.
CUDA programming -- C/C++ with a few additional features and routines to support GPU programming.
Uses a data-parallel paradigm.
4. Graphics Processing Units (GPUs)
Brief History
Timeline (1970-2010), in chronological order:
• Atari 8-bit computer text/graphics chip
• IBM PC Professional Graphics Controller card
• S3 graphics cards -- single-chip 2D accelerator
• OpenGL graphics API
• Hardware-accelerated 3D graphics
• DirectX graphics API
• PlayStation
• GPUs with programmable shading -- NVIDIA GeForce 3 (2001)
• General-purpose computing on graphics processing units (GPGPU)
• GPU computing
Source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit
5. NVIDIA products
NVIDIA Corp. is the leader in GPUs for high performance computing:
Timeline (1993-2013), in chronological order:
• 1993: company established by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem
• 1995: NV1
• 1999: GeForce 1
• GeForce 2 series; GeForce FX series
• 2006: GeForce 8 series -- the GeForce 8800 (G80 chip) was NVIDIA's first GPU with general-purpose processors; Tesla and Quadro product lines (Tesla: C870, S870, C1060, S1070, C2050, ...)
• GeForce 200 series (GTX 260/275/280/285/295)
• 2010: GeForce 400 series (GTX 460/465/470/475/480/485) -- the Fermi architecture; the Tesla 2050 GPU has 448 thread processors
• Kepler (2011)
• Maxwell (2013)
7. CPU-GPU architecture evolution
Co-processors -- a very old idea that appeared in the 1970s and 1980s with floating-point co-processors attached to microprocessors that did not then have floating-point capability.
These co-processors simply executed floating-point instructions that were fetched from memory.
Around the same time, there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games.
This led to graphics processing units (GPUs) attached to the CPU to create the video display.
[Early design: CPU with its memory, and an attached graphics card driving the display]
8. Birth of the general-purpose programmable GPU
Dedicated pipeline (late 1990s - early 2000s)
By the late 1990s, graphics chips needed to support 3-D graphics, especially for games and for graphics APIs such as DirectX and OpenGL.
Graphics chips generally had a pipeline structure, with individual stages performing specialized operations and finally loading the frame buffer for display.
Individual stages may have access to graphics memory for storing intermediate computed data.
[Pipeline: input stage -> vertex shader stage -> geometry shader stage -> rasterizer stage -> pixel shading stage -> frame buffer, with stages having access to graphics memory]
10. General-purpose GPU designs
High-performance pipelines call for high-speed (IEEE) floating-point operations.
People had been trying to use GPU cards to speed up scientific computations, known as GPGPU (general-purpose computing on graphics processing units) -- difficult to do with specialized graphics pipelines, but possible.
By the mid 2000s, it was recognized that the individual stages of the graphics pipeline could be implemented by more general-purpose processor cores (although with a data-parallel paradigm).
11. GPU design for general high performance computing
2006 -- the first GPU for general high performance computing as well as graphics processing: the NVIDIA G80 chip/GeForce 8800 card.
Unified processors could perform vertex, geometry, pixel, and general computing operations.
Programs could now be written in C rather than through graphics APIs.
Single-instruction multiple-thread (SIMT) programming model.
15. CUDA (Compute Unified Device Architecture)
Architecture and programming model introduced by NVIDIA in 2007.
Enables GPUs to execute programs written in C.
Within C programs, call SIMT "kernel" routines that are executed on the GPU.
A CUDA syntax extension to C identifies a routine as a kernel.
Very easy to learn, although getting the highest possible execution performance requires an understanding of the hardware architecture.
16. Programming Model
• A program, once compiled, has code executed on the CPU and code executed on the GPU.
• The CPU and GPU have separate memories. One therefore needs to:
• Explicitly transfer data from CPU to GPU for the GPU computation, and
• Explicitly transfer the results in GPU memory back to CPU memory.
[Diagram: CPU with its main memory and GPU with its global memory; data is copied from CPU to GPU before the computation and from GPU back to CPU afterwards]
17. Basic CUDA program structure
int main (int argc, char **argv) {
1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call "kernel" routine to execute on GPU
(with CUDA syntax that defines the number of threads and their physical structure)
5. Transfer results from GPU to CPU
6. Free memory space in device (GPU)
7. Free memory space in host (CPU)
return 0;
}
18. 1. Allocating memory space in "device" (GPU) for data
Use the CUDA cudaMalloc routine:
int size = N * sizeof(int); // space for N integers
int *devA, *devB, *devC; // devA, devB, devC are device pointers
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
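Each CUDA runtime call returns a status of type cudaError_t, and checking it is good practice. A minimal sketch (the checkError helper below is illustrative, not part of CUDA):
#include <stdio.h>
#include <stdlib.h>
// Hypothetical helper: abort with a message if a CUDA runtime call failed.
void checkError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
// usage: checkError( cudaMalloc( (void**)&devA, size ), "cudaMalloc devA" );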
19. 2. Allocating memory space in "host" (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256
…
int a[N], b[N], c[N];
20. 3. Transferring data from host (CPU) to device (GPU)
Use the CUDA routine cudaMemcpy:
cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where devA and devB are pointers to destinations in the device and A and B are pointers to the host data.
21. 4. Declaring "kernel" routine to execute on device (GPU)
CUDA introduces a syntax addition to C:
Triple angle brackets mark a call from host code to device code. They contain the organization and number of threads in two parameters:
myKernel<<< n, m >>>(arg1, ... );
n and m define the organization of thread blocks and of threads in a block.
For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
arg1, ... -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
22. Declaring a Kernel Routine
A kernel is defined using the CUDA specifier __global__.
Example -- adding two vectors A and B:
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) { // kernel definition
int i = threadIdx.x; // built-in CUDA structure that provides the thread ID in the block
C[i] = A[i] + B[i];
}
int main() {
// allocate device memory &
// copy data to device
// device mem. ptrs devA, devB, devC
vecAdd<<<1, N>>>(devA, devB, devC); // grid of one block; block has N threads
...
}
Each of the N threads performs one pairwise addition:
Thread 0: devC[0] = devA[0] + devB[0];
Thread 1: devC[1] = devA[1] + devB[1];
...
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
Loosely derived from CUDA C programming guide, v 3.2, 2010, NVIDIA
23. 5. Transferring data from device (GPU) to host (CPU)
Use the CUDA routine cudaMemcpy:
cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
where devC is a pointer in the device and C is a pointer in the host.
24. 6. Free memory space in "device" (GPU)
Use the CUDA cudaFree routine:
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
25. 7. Free memory space in host (CPU)
(if CPU memory was allocated with malloc)
Use the regular C free routine to deallocate memory previously allocated with malloc:
free( a );
free( b );
free( c );
26. Complete CUDA program
Adding two vectors A and B: N elements in A and B, and N threads (without code to load the arrays with data).
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main (int argc, char **argv) {
int size = N * sizeof(int);
int *a, *b, *c, *devA, *devB, *devC;
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
a = (int*)malloc(size); b = (int*)malloc(size); c = (int*)malloc(size);
cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
vecAdd<<<1, N>>>(devA, devB, devC);
cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
free( a ); free( b ); free( c );
return 0;
}
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
27. CUDA SIMT Thread Structure
Threads are organized as a grid of thread blocks: the grid can have 1 or 2 dimensions, and each block can have 1, 2, or 3 dimensions.
Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.
Linked to the internal organization of the GPU: threads in one block execute together.
CUDA C programming guide, v 3.2, 2010, NVIDIA
28. Defining Grid/Block Structure
Each kernel call needs values for two key structures:
• Number of blocks in each dimension
• Threads per block in each dimension
myKernel<<< numBlocks, threadsPerBlock >>>(arg1, ... );
numBlocks -- number of blocks in the grid in each dimension (1-D or 2-D). An integer defines a 1-D grid of that size; otherwise use the CUDA dim3 structure, see next.
threadsPerBlock -- number of threads in a block in each dimension (1-D, 2-D, or 3-D). An integer defines a 1-D block of that size; otherwise use the CUDA dim3 structure, see next.
Notes: The number of blocks is not limited by the specific GPU, but the number of threads per block is; it can be queried at run time, as shown in the sketch below.
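Since the per-block thread limit depends on the specific GPU, it can be queried at run time with cudaGetDeviceProperties. A minimal sketch, assuming device 0:
#include <stdio.h>
int main() {
    cudaDeviceProp prop;                   // filled in by the CUDA runtime
    cudaGetDeviceProperties(&prop, 0);     // query device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions: %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1]);
    return 0;
}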
29. Built-in CUDA data types and structures
CUDA provides built-in variables and structures to define the number of blocks of threads in the grid in each dimension and the number of threads in a block in each dimension.
CUDA Vector Types/Structures
uint3 and dim3 can be considered essentially as CUDA-defined structures of unsigned integers x, y, z, i.e.
struct uint3 { unsigned int x, y, z; };
struct dim3 { unsigned int x, y, z; };
They are used to define the grid of blocks and the threads, see next.
For dim3, unassigned structure components are automatically set to 1.
There are other CUDA vector types.
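A small sketch illustrating the defaulting behaviour (myKernel here is a placeholder empty kernel, not from the slides):
__global__ void myKernel(void) { }      // placeholder kernel for illustration
int main() {
    dim3 grid(16, 4);     // grid.x = 16, grid.y = 4, grid.z defaults to 1
    dim3 block(256);      // block.x = 256; block.y and block.z default to 1
    myKernel<<<grid, block>>>();        // 16 x 4 = 64 blocks of 256 threads each
    cudaDeviceSynchronize();            // wait for the kernel to complete
    return 0;
}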
30. Built-in Variables for Grid/Block Sizes
dim3 gridDim -- size of grid: gridDim.x * gridDim.y (z not used)
dim3 blockDim -- size of block: blockDim.x * blockDim.y * blockDim.z
Example:
dim3 grid(16, 16); // grid -- 16 x 16 blocks
dim3 block(32, 32); // block -- 32 x 32 threads
myKernel<<<grid, block>>>(...);
31. Built-in Variables for Grid/Block Indices
uint3 blockIdx -- block index within the grid: blockIdx.x, blockIdx.y (z not used)
uint3 threadIdx -- thread index within the block: threadIdx.x, threadIdx.y, threadIdx.z
The full global thread ID in the x and y dimensions can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
y = blockIdx.y * blockDim.y + threadIdx.y;
32. Example -- x direction
A 1-D grid and 1-D blocks: 4 blocks, each having 8 threads.
gridDim = 4 x 1, blockDim = 8 x 1
[Diagram: blocks blockIdx.x = 0, 1, 2, 3, each containing threads threadIdx.x = 0 to 7]
Global thread ID = blockIdx.x * blockDim.x + threadIdx.x
For example, thread 2 of block 3 has global ID = 3 * 8 + 2 = 26 with linear global addressing.
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
33. Code example with a 1-D grid and 1-D blocks
The number of blocks maps each vector across the grid, one element of each vector per thread.
#define N 2048 // size of vectors
#define T 256 // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
}
int main (int argc, char **argv) {
...
vecAdd<<<N/T, T>>>(devA, devB, devC); // assumes N/T is an integer
...
return 0;
}
34. If N/T is not necessarily an integer:
#define N 2048 // size of vectors
#define T 240 // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < N) C[i] = A[i] + B[i]; // allows for more threads than vector elements; some threads unused
}
int main (int argc, char **argv) {
int blocks = (N + T - 1) / T; // efficient way of rounding up to the next integer
...
vecAdd<<<blocks, T>>>(devA, devB, devC);
...
return 0;
}
35. Example using a 2-D grid and 2-D blocks
Adding two arrays:
#define N 2048 // size of arrays
__global__ void addMatrix (int *a, int *b, int *c) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
int j = blockIdx.y*blockDim.y + threadIdx.y;
int index = i + j * N;
if ( i < N && j < N) c[index] = a[index] + b[index];
}
int main() {
...
dim3 dimBlock(16, 16);
dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
...
}
36. Memory Structure within the GPU
• Local private memory -- per thread
• Shared memory -- per block
• Global memory -- per application
The GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; CUDA cores and other execution units in the SM execute the threads.
The SM executes threads in groups of 32 threads called a warp.*
* Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
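As an illustration of per-block shared memory (a sketch, not from the original slides): a kernel in which each block stages its segment of the input in __shared__ memory, synchronizes with __syncthreads(), and writes the segment out reversed.
#define BLOCK 256
// Each block reverses its own BLOCK-element segment of d_in into d_out.
__global__ void reverseBlock(int *d_in, int *d_out) {
    __shared__ int tmp[BLOCK];               // per-block shared memory
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;      // start of this block's segment
    tmp[t] = d_in[base + t];                 // stage one element per thread
    __syncthreads();                         // wait until the whole block has written
    d_out[base + t] = tmp[BLOCK - 1 - t];    // read in reversed order
}
A launch such as reverseBlock<<<N/BLOCK, BLOCK>>>(devIn, devOut) would reverse each BLOCK-element segment; __syncthreads() is needed because each thread reads an element written by a different thread of the same block.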
37. Compiling code
Linux
Command line: CUDA provides nvcc (an NVIDIA "compiler-driver"). Use it instead of gcc:
nvcc -O3 -o <exe> <input> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart
nvcc separates out the code for the CPU and the code for the GPU and compiles both. A regular C compiler must be installed for the CPU code.
Makefiles are also provided.
Windows
NVIDIA suggests using Microsoft Visual Studio.
39. GPU Clusters
GPU systems for HPC:
• GPU clusters
• GPU grids
• GPU clouds
With the advent of GPUs for scientific high performance computing, compute clusters can now incorporate GPUs, greatly increasing their compute capability.
42. Hybrid Programming Model for Clusters having Multicore Shared Memory Processors
Combine MPI between nodes and OpenMP within nodes (or other thread libraries such as Pthreads), as sketched below.
MPI/OpenMP compilation:
mpicc -o mpi_out mpi_test.c -fopenmp
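A minimal sketch of this hybrid model (assuming the file is named mpi_test.c, as in the compile line above):
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);                  // MPI: typically one process per node
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel                     // OpenMP: threads within the node
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());
    MPI_Finalize();
    return 0;
}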
43. Hybrid Programming Model for Clusters having Multicore Shared Memory Processors and GPUs
Combine OpenMP and CUDA on one node, for the CPU and GPU respectively, with MPI between nodes.
Note -- all three are C-based, so they can be compiled together.
NVIDIA does provide a sample OpenMP/CUDA program.
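A minimal sketch of the structure, with MPI between nodes and a CUDA kernel on each node (the file name and the MPI include/library paths in the comment are assumptions that vary by system; OpenMP pragmas could be added on the host side as in the previous slide):
// mpi_cuda.cu -- compile with nvcc and link MPI, e.g. nvcc -o mpi_cuda mpi_cuda.cu -I<mpi include> -L<mpi lib> -lmpi
#include <stdio.h>
#include <mpi.h>
__global__ void scale(int *a, int factor) {   // trivial GPU kernel
    a[threadIdx.x] *= factor;
}
int main(int argc, char **argv) {
    int rank, h[256], *d;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 256; i++) h[i] = i;   // each rank prepares its own data
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    scale<<<1, 256>>>(d, rank + 1);           // GPU work differs per rank
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    // partial results could now be combined across nodes, e.g. with MPI_Reduce
    MPI_Finalize();
    return 0;
}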