SlideShare ist ein Scribd-Unternehmen logo
1 von 46
Emergence of GPU systems
and clusters for general
purpose High Performance
Computing
ITCS 4145/5145 Nov 8, 2010 © Barry Wilkinson
2
Last few years GPUs have developed from graphics
cards into a platform from HPC
There is now great interest in using GPUs for scientific
high performance computing and GPUs are being
designed with that application in mind
CUDA programming – C/C++ with a few additional
features and routines to support GPU programming.
Uses data parallel paradigm
Graphics Processing Units
(GPUs)
3
http://www.hpcwire.com/blogs/New-China-GPGPU-Super-Outruns-Jaguar-105987389.html
Graphics Processing Units (GPUs)
Brief History
1970 2010200019901980
Atari 8-bit
computer
text/graphics chip
Source of information http://en.wikipedia.org/wiki/Graphics_Processing_Unit
IBM PC Professional
Graphics Controller
card
S3 graphics cards-
single chip 2D
accelerator
OpenGL graphics API
Hardware-accelerated
3D graphics
DirectX graphics API
Playstation
GPUs with
programmable shading
Nvidia GeForce
GE 3 (2001) with
programmable shading
General-purpose computing
on graphics processing units
(GPGPUs)
GPU Computing
NVIDIA products
NVIDIA Corp. is the leader in GPUs for high performance
computing:
1993 201019991995 20092007 20082000 2001 2002 2003 2004 2005 2006
Established by Jen-
Hsun Huang, Chris
Malachowsky,
Curtis Priem
NV1 GeForce 1
GeForce 2 series GeForce FX series
GeForce 8 series
GeForce 200 series
GeForce 400 series
GTX460/465/470/475/
480/485
GTX260/275/280/285/295
GeForce
8800
GT 80
Tesla
Quadro
NVIDIA's first
GPU with
general purpose
processors
C870, S870, C1060, S1070, C2050, …
Tesla 2050 GPU
has 448 thread
processors
Fermi
Kepler
(2011)
Maxwell
(2013)
6
GPU performance gains over CPUs
0
200
400
600
800
1000
1200
1400
9/22/2002 2/4/2004 6/18/2005 10/31/2006 3/14/2008 7/27/2009
GFLOPs
NVIDIAGPU
IntelCPU
T12
Westmere
NV30
NV40
G70
G80
GT200
3GHz Dual
Core P4
3GHz Core2
Duo
3GHz Xeon
Quad
Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
7
CPU-GPU architecture evolution
Co-processors -- very old idea that appeared in
1970s and 1980s with floating point co-
processors attached to microprocessors that did
not then have floating point capability.
These coprocessors simply executed floating
point instructions that were fetched from
memory.
Around same time, interest to provide hardware
support for displays, especially with increasing
use of graphics and PC games.
Led to graphics processing units (GPUs)
attached to CPU to create video display.
CPU
Graphics
card
Display
Memory
Early design
8
Birth of general purpose
programmable GPU
Dedicated pipeline (late1990s-early 2000s)
By late1990’s, graphics chips
needed to support 3-D graphics,
especially for games and graphics
APIs such as DirectX and
OpenGL.
Graphics chips generally had a
pipeline structure with individual
stages performing specialized
operations, finally leading to
loading frame buffer for display.
Individual stages may have access
to graphics memory for storing
intermediate computed data.
Input stage
Vertex shader
stage
Geometry
shader stage
Rasterizer stage
Frame
buffer
Pixel shading
stage
Graphics
memory
9
GeForce 6 Series
Architecture
(2004-5)
From GPU Gems 2, Copyright
2005 by NVIDIA Corporation
10
General-Purpose GPU designs
High performance pipelines call for high-speed (IEEE) floating point
operations.
People had been trying to use GPU cards to speed up scientific
computations
Known as GPGPU (General-purpose computing on graphics
processing units) -- Difficult to do with specialized graphics pipelines,
but possible.)
By mid 2000’s, recognized that individual stages of graphics pipeline
could be implemented by a more general purpose processor core
(although with a data-parallel paradym)
11
2006 -- First GPU for general high performance computing as well
as graphics processing, NVIDIA GT 80 chip/GeForce 8800 card.
Unified processors that could perform vertex, geometry, pixel, and
general computing operations
Could now write programs in C rather than graphics APIs.
Single-instruction multiple thread (SIMT) programming model
GPU design for general high
performance computing
12
13
Evolving GPU design
NVIDIA Fermi architecture
(announced Sept 2009)
•
512 stream processing engines (SPEs)
•
Organized as 16 SPEs, each having 32 cores
•
3GB or 6 GB GDDR5 memory
•
Many innovations including L1/L2 caches, unified device memory
addressing, ECC memory, …
First implementation: Tesla 20 series
(single chip C2050/2070, 4 chip S2050/2070)
3 billion transistor chip?
New Fermi chips planned (GT 300, GeForce 400 series)
14
Fermi Streaming
Multiprocessor (SM)
* Whitepaper
NVIDIA’s Next
Generation CUDA
Compute
Architecture: Fermi,
NVIDIA, 2008
15
CUDA
(Compute Unified Device Architecture)
Architecture and programming model, introduced in NVIDIA in 2007
Enables GPUs to execute programs written in C.
Within C programs, call SIMT “kernel” routines that are executed on
GPU.
CUDA syntax extension to C identify routine as a Kernel.
Very easy to learn although to get highest possible execution
performance requires understanding of hardware architecture
16
Programming Model
•
Program once compiled has code
executed on CPU and code
executed on GPU
•
Separate memories on CPU and
GPU
Need to
•
Explicitly transfer data from CPU to
GPU for GPU computation, and
•
Explicitly transfer results in GPU
memory copied back to CPU
memory
Copy from
CPU to
GPU
Copy from
GPU to
CPU
GPU
CPU
CPU main memory
GPU global memory
17
Basic CUDA program structure
int main (int argc, char **argv ) {
1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call “kernel” routine to execute on GPU
(with CUDA syntax that defines no of threads and their physical structure)
5. Transfer results from GPU to CPU
6. Free memory space in device (GPU)
7. Free memory space in host (CPU)
return;
}
18
1. Allocating memory space in
“device” (GPU) for data
Use CUDA malloc routines:
int size = N *sizeof( int); // space for N integers
int *devA, *devB, *devC; // devA, devB, devC ptrs
cudaMalloc( (void**)&devA, size) );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
19
2. Allocating memory space in
“host” (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256
…
int a[N], b[N], c[N];
20
3. Transferring data from host
(CPU) to device (GPU)
Use CUDA routine cudaMemcpy
cudaMemcpy( devA, &A, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_B, &B, size, cudaMemcpyHostToDevice);
where devA and devB are pointers to destination in
device and A and B are pointers to host data
21
4. Declaring “kernel” routine to
execute on device (GPU)
CUDA introduces a syntax addition to C:
Triple angle brackets mark call from host code to device code.
Contains organization and number of threads in two parameters:
myKernel<<< n, m >>>(arg1, … );
n and m will define organization of thread blocks and threads in a
block.
For now, we will set n = 1, which say one block and m = N, which
says N threads in this block.
arg1, … , -- arguments to routine myKernel typically pointers to
device memory obtained previously from cudaMallac.
22
A kernel defined using CUDA specifier __global__
Example – Adding to vectors A and B
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) { // Kernel definition
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main() {
// allocate device memory &
// copy data to device
// device mem. ptrs devA,devB,devC
vecAdd<<<1, N>>>(devA,devB,devC);
…
}
Loosely derived from CUDA C programming guide, v 3.2 , 2010, NVIDIA
Declaring a Kernel Routine
Each of the N threads performs one pair-
wise addition:
Thread 0: devC[0] = devA[0] + devB[0];
Thread 1: devC[1] = devA[1] + devB[1];
Thread N-1: devC[N-1] = devA[N-1]+devB[N-1];
Grid of one block, block has N threads
CUDA structure that provides thread ID in block
23
5. Transferring data from device
(GPU) to host (CPU)
Use CUDA routine cudaMemcpy
cudaMemcpy( &C, devC, size,
cudaMemcpyDeviceToHost);
where devC is a pointer in device and C is a pointer in
host.
24
6. Free memory space in “device”
(GPU)
Use CUDA cudaFree routine:
cudaFree( dev_a);
cudaFree( dev_b);
cudaFree( dev_c);
25
7. Free memory space in (CPU) host
(if CPU memory allocated with malloc)
Use regular C free routine to deallocate memory if
previously allocated with malloc:
free( a );
free( b );
free( c );
26
Complete
CUDA
program
Adding two
vectors, A and B
N elements in A and
B, and N threads
(without code to load
arrays with data)
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
int size = N *sizeof( int);
int a[N], b[N], c[N], *devA, *devB, *devC;
cudaMalloc( (void**)&devA, size) );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
a = (int*)malloc(size); b = (int*)malloc(size);c =
(int*)malloc(size);
cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_B, b size, cudaMemcpyHostToDevice);
vecAdd<<<1, N>>>(devA, devB, devC);
cudaMemcpy( &c, devC size, cudaMemcpyDeviceToHost);
cudaFree( dev_a);
cudaFree( dev_b);
cudaFree( dev_c);
free( a ); free( b ); free( c );
return (0);
}
Derived from Jason Sanders,
"Introduction to CUDA C" GPU
technology conference, Sept. 20,
27
Can be 1 or 2
dimensions
Can be 1, 2 or
3 dimensions
CUDA C programming guide, v 3.2, 2010,
NVIDIA
CUDA SIMT
Thread Structure
Allows
flexibility and
efficiency in
processing
1D, 2-D, and
3-D data on
GPU.
Linked to
internal
organization
Threads in
one block
execute
together.
28
Need to provide each kernel call with values for two key structures:
•
Number of blocks in each dimension
•
Threads per block in each dimension
myKernel<<< numBlocks, threadsperBlock >>>(arg1, … );
numBlocks – number of blocks in grid in each dimension (1D or
2D). An integer would define a 1D grid of that size, otherwise use
CUDA structure, see next.
threadsperBlock – number of threads in a block in each dimension
(1D, 2D, or 3D). An integer would define a 1D block of that size,
otherwise use CUDA structure, see next.
Notes: Number of blocks not limited by specific GPU.
Number of threads/block is limited by specific GPU.
Defining Grid/Block Structure
29
CUDA provided with built-in variables and structures to define
number of blocks of threads in grid in each dimension and number
of threads in a block in each dimension.
CUDA Vector Types/Structures
unit3 and dim3 – can be considered essentially as CUDA-defined
structures of unsigned integers: x, y, z, i.e.
struct unit3 { x; y; z; };
struct dim3 { x; y; z; };
Used to define grid of blocks and threads, see next.
Unassigned structure components automatically set to 1.
There are other CUDA vector types.
Built-in CUDA data types and
structures
30
Built-in Variables for Grid/Block
Sizes
dim3 gridDim -- Size of grid:
gridDim.x * gridDim.y
(z not used)
dim3 blockDim -- Size of block:
blockDim.x * blockDim.y * blockDim.z
Example
dim3 grid(16, 16); // Grid -- 16 x 16 blocks
dim3 block(32, 32); // Block -- 32 x 32 threads
myKernel<<<grid, block>>>(...);
31
Built-in Variables for Grid/Block
Indices
uinit3 blockIdx -- block index within grid:
blockIdx.x, blockIdx.y
(z not used)
uint3 threadIdx -- thread index within block:
blockIdx.x, blockIdx.y, blockId.z
Full global thread ID in x and y dimensions can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
y = blockIdx.y * blockDim.y + threadIdx.y;
32
Example -- x direction
A 1-D grid and 1-D block
4 blocks, each having 8 threads
0 1 2 3 4 765 0 1 2 3 4 7650 1 2 3 4 765 0 1 2 3 4 765
threadIdx.x threadIdx.x threadIdx.x
blockIdx.x = 3
threadIdx.x
blockIdx.x = 1blockIdx.x = 0
Derived from Jason Sanders, "Introduction to CUDA
C" GPU technology conference, Sept. 20, 2010.
blockIdx.x = 2
gridDim = 4 x 1
blockDim = 8 x 1
Global thread ID = blockIdx.x * blockDim.x + threadIdx.x
= 3 * 8 + 2 = thread 26 with linear global addressing
Global ID 26
33
#define N 2048 // size of vectors
#define T 256 // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
…
vecAdd<<<N/T, T>>>(devA, devB, devC); // assumes N/T is an integer
…
return (0);
}
Code example with a 1-D grid
and 1-D blocks
Number of blocks to map each vector across grid,
one element of each vector per thread
34
#define N 2048 // size of vectors
#define T 240 // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < N) C[i] = A[i] + B[i]; // allows for more threads than vector elements
// some unused
}
int main (int argc, char **argv ) {
int blocks = (N + T - 1) / T; // efficient way of rounding to next integer
…
vecAdd<<<blocks, T>>>(devA, devB, devC);
…
return (0);
}
If T/N not necessarily an integer:
35
Example using 1-D grid and 2-D blocks
Adding two arrays
#define N 2048 // size of arrays
__global__void addMatrix (int *a, int *b, int *c) {
int i = blockIdx.x*blockDim.x+threadIdx.x;
int j =blockIdx.y*blockDim.y+threadIdx.y;
int index = i + j * N;
if ( i < N && j < N) c[index]= a[index] + b[index];
}
Void main() {
...
dim3 dimBlock (16,16);
dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
…
}
36
Memory Structure within GPU
Local private memory -- per thread
Shared memory -- per block
Global memory -- per application
GPU executes one or more kernel grids.
Streaming multiprocessor (SM) executes
one or more thread blocks
CUDA cores and other execution units in
the SM execute threads.
SM executes threads in groups of 32
threads called a warp.*
* Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
37
Compiling code
Linux
Command line. CUDA provides nvcc (a NVIDIA “compiler-driver”.
Use instead of gcc
nvcc –O3 –o <exe> <input> -I/usr/local/cuda/include
–L/usr/local/cuda/lib –lcudart
Separates compiled code for CPU and for GPU and compiles code.
Need regular C compiler installed for CPU.
Make files also provided.
Windows
NVIDIA suggests using Microsoft Visual Studio
38
Debugging
NVIDIA has recently
develped a debugging
tool called Parallel
Nsight
Available for use with
Visual Studio
39
GPU Clusters

GPU systems for HPC

GPU clusters

GPU Grids

GPU Clouds
With advent of GPUs for scientific high performance
computing, compute cluster now can incorporate, greatly
increasing their compute capability.
40
41
Maryland CPU-GPU Cluster Infrastructure
http://www.umiacs.umd.edu/res
earch/GPU/facilities.html
42
Hybrid Programming Model for Clusters having
Multicore Shared Memory Processors
Combine MPI between nodes and OpenMP with nodes (or other
thread libraries such as Pthreads):
MPI/OpenMP compilation:
mpicc -o mpi_out mpi_test.c -fopenmp
43
Hybrid Programming Model for Clusters having
Multicore Shared Memory Processors and
GPUs
Combine OpenMP and CUDA on one node for CPU
and GPU respectively, with MPI between nodes
Note – All three as C-based so can be compiled
together.
NVIDIA does provide a sample OpenMP/CUDA
44
http://www.gpugrid.net/
45
Intel’s
response to
Nvidia and
GPUs
Questions

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009Randall Hand
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012DefCamp
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesSubhajit Sahu
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 

Was ist angesagt? (18)

Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
CUDA
CUDACUDA
CUDA
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDATech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
CUDA Architecture
CUDA ArchitectureCUDA Architecture
CUDA Architecture
 

Andere mochten auch

Android and Smartphones
Android and SmartphonesAndroid and Smartphones
Android and SmartphonesPhilip David
 
Exposed_ The Secret Life of Roots _ United States Botanic Garden
Exposed_ The Secret Life of Roots _ United States Botanic GardenExposed_ The Secret Life of Roots _ United States Botanic Garden
Exposed_ The Secret Life of Roots _ United States Botanic Gardenjerry_d_glover
 

Andere mochten auch (20)

Announcements for Feb 19 2012
Announcements for Feb 19 2012Announcements for Feb 19 2012
Announcements for Feb 19 2012
 
赤字決算の対処法
赤字決算の対処法赤字決算の対処法
赤字決算の対処法
 
Command GM
Command GMCommand GM
Command GM
 
GCC
GCCGCC
GCC
 
November 2012 announcements
November 2012 announcementsNovember 2012 announcements
November 2012 announcements
 
Announcements for july 2013
Announcements for july 2013Announcements for july 2013
Announcements for july 2013
 
December 2012 announcements
December 2012 announcementsDecember 2012 announcements
December 2012 announcements
 
Hoarding
HoardingHoarding
Hoarding
 
Announcements for june 2013
Announcements for june 2013Announcements for june 2013
Announcements for june 2013
 
Horror film trailer analysis
Horror film trailer analysisHorror film trailer analysis
Horror film trailer analysis
 
July 2012 announcements
July 2012 announcementsJuly 2012 announcements
July 2012 announcements
 
Announcements for March 11 2012
Announcements for March 11 2012Announcements for March 11 2012
Announcements for March 11 2012
 
Android and Smartphones
Android and SmartphonesAndroid and Smartphones
Android and Smartphones
 
September 2012 announcements
September 2012 announcementsSeptember 2012 announcements
September 2012 announcements
 
Announcements for june 2014
Announcements for june 2014Announcements for june 2014
Announcements for june 2014
 
Announcements for june 2013
Announcements for june 2013Announcements for june 2013
Announcements for june 2013
 
Announcements for May 2014
Announcements for May 2014Announcements for May 2014
Announcements for May 2014
 
Announcements for june 2014
Announcements for june 2014Announcements for june 2014
Announcements for june 2014
 
September 2012 announcements
September 2012 announcementsSeptember 2012 announcements
September 2012 announcements
 
Exposed_ The Secret Life of Roots _ United States Botanic Garden
Exposed_ The Secret Life of Roots _ United States Botanic GardenExposed_ The Secret Life of Roots _ United States Botanic Garden
Exposed_ The Secret Life of Roots _ United States Botanic Garden
 

Ähnlich wie Cuda intro

Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...mouhouioui
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsnARUNACHALAM468781
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedHimanshu577858
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An IntroductionDhan V Sagar
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda MoayadMoayadhn
 
S0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudaS0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudamistercteam
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 

Ähnlich wie Cuda intro (20)

Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely used
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An Introduction
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda Moayad
 
S0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudaS0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cuda
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 

Mehr von Anshul Sharma

Mehr von Anshul Sharma (11)

Understanding concurrency
Understanding concurrencyUnderstanding concurrency
Understanding concurrency
 
Interm codegen
Interm codegenInterm codegen
Interm codegen
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open Mp
 
Open MPI 2
Open MPI 2Open MPI 2
Open MPI 2
 
Open MPI
Open MPIOpen MPI
Open MPI
 
Paralle programming 2
Paralle programming 2Paralle programming 2
Paralle programming 2
 
Parallel programming
Parallel programmingParallel programming
Parallel programming
 
Cuda 3
Cuda 3Cuda 3
Cuda 3
 
Cuda 2
Cuda 2Cuda 2
Cuda 2
 
Des
DesDes
Des
 
Intoduction to Linux
Intoduction to LinuxIntoduction to Linux
Intoduction to Linux
 

Kürzlich hochgeladen

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Kürzlich hochgeladen (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Cuda intro

  • 1. Emergence of GPU systems and clusters for general purpose High Performance Computing ITCS 4145/5145 Nov 8, 2010 © Barry Wilkinson
  • 2. 2 Last few years GPUs have developed from graphics cards into a platform from HPC There is now great interest in using GPUs for scientific high performance computing and GPUs are being designed with that application in mind CUDA programming – C/C++ with a few additional features and routines to support GPU programming. Uses data parallel paradigm Graphics Processing Units (GPUs)
  • 4. Graphics Processing Units (GPUs) Brief History 1970 2010200019901980 Atari 8-bit computer text/graphics chip Source of information http://en.wikipedia.org/wiki/Graphics_Processing_Unit IBM PC Professional Graphics Controller card S3 graphics cards- single chip 2D accelerator OpenGL graphics API Hardware-accelerated 3D graphics DirectX graphics API Playstation GPUs with programmable shading Nvidia GeForce GE 3 (2001) with programmable shading General-purpose computing on graphics processing units (GPGPUs) GPU Computing
  • 5. NVIDIA products NVIDIA Corp. is the leader in GPUs for high performance computing: 1993 201019991995 20092007 20082000 2001 2002 2003 2004 2005 2006 Established by Jen- Hsun Huang, Chris Malachowsky, Curtis Priem NV1 GeForce 1 GeForce 2 series GeForce FX series GeForce 8 series GeForce 200 series GeForce 400 series GTX460/465/470/475/ 480/485 GTX260/275/280/285/295 GeForce 8800 GT 80 Tesla Quadro NVIDIA's first GPU with general purpose processors C870, S870, C1060, S1070, C2050, … Tesla 2050 GPU has 448 thread processors Fermi Kepler (2011) Maxwell (2013)
  • 6. 6 GPU performance gains over CPUs 0 200 400 600 800 1000 1200 1400 9/22/2002 2/4/2004 6/18/2005 10/31/2006 3/14/2008 7/27/2009 GFLOPs NVIDIAGPU IntelCPU T12 Westmere NV30 NV40 G70 G80 GT200 3GHz Dual Core P4 3GHz Core2 Duo 3GHz Xeon Quad Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
  • 7. 7 CPU-GPU architecture evolution Co-processors -- very old idea that appeared in 1970s and 1980s with floating point co- processors attached to microprocessors that did not then have floating point capability. These coprocessors simply executed floating point instructions that were fetched from memory. Around same time, interest to provide hardware support for displays, especially with increasing use of graphics and PC games. Led to graphics processing units (GPUs) attached to CPU to create video display. CPU Graphics card Display Memory Early design
  • 8. 8 Birth of general purpose programmable GPU Dedicated pipeline (late1990s-early 2000s) By late1990’s, graphics chips needed to support 3-D graphics, especially for games and graphics APIs such as DirectX and OpenGL. Graphics chips generally had a pipeline structure with individual stages performing specialized operations, finally leading to loading frame buffer for display. Individual stages may have access to graphics memory for storing intermediate computed data. Input stage Vertex shader stage Geometry shader stage Rasterizer stage Frame buffer Pixel shading stage Graphics memory
  • 9. 9 GeForce 6 Series Architecture (2004-5) From GPU Gems 2, Copyright 2005 by NVIDIA Corporation
  • 10. 10 General-Purpose GPU designs High performance pipelines call for high-speed (IEEE) floating point operations. People had been trying to use GPU cards to speed up scientific computations Known as GPGPU (General-purpose computing on graphics processing units) -- Difficult to do with specialized graphics pipelines, but possible.) By mid 2000’s, recognized that individual stages of graphics pipeline could be implemented by a more general purpose processor core (although with a data-parallel paradym)
  • 11. 11 2006 -- First GPU for general high performance computing as well as graphics processing, NVIDIA GT 80 chip/GeForce 8800 card. Unified processors that could perform vertex, geometry, pixel, and general computing operations Could now write programs in C rather than graphics APIs. Single-instruction multiple thread (SIMT) programming model GPU design for general high performance computing
  • 12. 12
  • 13. 13 Evolving GPU design NVIDIA Fermi architecture (announced Sept 2009) • 512 stream processing engines (SPEs) • Organized as 16 SPEs, each having 32 cores • 3GB or 6 GB GDDR5 memory • Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, … First implementation: Tesla 20 series (single chip C2050/2070, 4 chip S2050/2070) 3 billion transistor chip? New Fermi chips planned (GT 300, GeForce 400 series)
  • 14. 14 Fermi Streaming Multiprocessor (SM) * Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
  • 15. 15 CUDA (Compute Unified Device Architecture) Architecture and programming model, introduced in NVIDIA in 2007 Enables GPUs to execute programs written in C. Within C programs, call SIMT “kernel” routines that are executed on GPU. CUDA syntax extension to C identify routine as a Kernel. Very easy to learn although to get highest possible execution performance requires understanding of hardware architecture
  • 16. 16 Programming Model • Program once compiled has code executed on CPU and code executed on GPU • Separate memories on CPU and GPU Need to • Explicitly transfer data from CPU to GPU for GPU computation, and • Explicitly transfer results in GPU memory copied back to CPU memory Copy from CPU to GPU Copy from GPU to CPU GPU CPU CPU main memory GPU global memory
  • 17. 17 Basic CUDA program structure int main (int argc, char **argv ) { 1. Allocate memory space in device (GPU) for data 2. Allocate memory space in host (CPU) for data 3. Copy data to GPU 4. Call “kernel” routine to execute on GPU (with CUDA syntax that defines no of threads and their physical structure) 5. Transfer results from GPU to CPU 6. Free memory space in device (GPU) 7. Free memory space in host (CPU) return; }
  • 18. 18 1. Allocating memory space in “device” (GPU) for data Use CUDA malloc routines: int size = N *sizeof( int); // space for N integers int *devA, *devB, *devC; // devA, devB, devC ptrs cudaMalloc( (void**)&devA, size) ); cudaMalloc( (void**)&devB, size ); cudaMalloc( (void**)&devC, size ); Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
  • 19. 19 2. Allocating memory space in “host” (CPU) for data Use regular C malloc routines: int *a, *b, *c; … a = (int*)malloc(size); b = (int*)malloc(size); c = (int*)malloc(size); or statically declare variables: #define N 256 … int a[N], b[N], c[N];
  • 20. 20 3. Transferring data from host (CPU) to device (GPU) Use CUDA routine cudaMemcpy cudaMemcpy( devA, &A, size, cudaMemcpyHostToDevice); cudaMemcpy( dev_B, &B, size, cudaMemcpyHostToDevice); where devA and devB are pointers to destination in device and A and B are pointers to host data
  • 21. 21 4. Declaring “kernel” routine to execute on device (GPU) CUDA introduces a syntax addition to C: Triple angle brackets mark call from host code to device code. Contains organization and number of threads in two parameters: myKernel<<< n, m >>>(arg1, … ); n and m will define organization of thread blocks and threads in a block. For now, we will set n = 1, which say one block and m = N, which says N threads in this block. arg1, … , -- arguments to routine myKernel typically pointers to device memory obtained previously from cudaMallac.
  • 22. 22 A kernel defined using CUDA specifier __global__ Example – Adding to vectors A and B #define N 256 __global__ void vecAdd(int *A, int *B, int *C) { // Kernel definition int i = threadIdx.x; C[i] = A[i] + B[i]; } int main() { // allocate device memory & // copy data to device // device mem. ptrs devA,devB,devC vecAdd<<<1, N>>>(devA,devB,devC); … } Loosely derived from CUDA C programming guide, v 3.2 , 2010, NVIDIA Declaring a Kernel Routine Each of the N threads performs one pair- wise addition: Thread 0: devC[0] = devA[0] + devB[0]; Thread 1: devC[1] = devA[1] + devB[1]; Thread N-1: devC[N-1] = devA[N-1]+devB[N-1]; Grid of one block, block has N threads CUDA structure that provides thread ID in block
  • 23. 23 5. Transferring data from device (GPU) to host (CPU) Use CUDA routine cudaMemcpy cudaMemcpy( &C, devC, size, cudaMemcpyDeviceToHost); where devC is a pointer in device and C is a pointer in host.
  • 24. 24 6. Free memory space in “device” (GPU) Use CUDA cudaFree routine: cudaFree( dev_a); cudaFree( dev_b); cudaFree( dev_c);
  • 25. 25 7. Free memory space in (CPU) host (if CPU memory allocated with malloc) Use regular C free routine to deallocate memory if previously allocated with malloc: free( a ); free( b ); free( c );
  • 26. 26 Complete CUDA program Adding two vectors, A and B N elements in A and B, and N threads (without code to load arrays with data) #define N 256 __global__ void vecAdd(int *A, int *B, int *C) { int i = threadIdx.x; C[i] = A[i] + B[i]; } int main (int argc, char **argv ) { int size = N *sizeof( int); int a[N], b[N], c[N], *devA, *devB, *devC; cudaMalloc( (void**)&devA, size) ); cudaMalloc( (void**)&devB, size ); cudaMalloc( (void**)&devC, size ); a = (int*)malloc(size); b = (int*)malloc(size);c = (int*)malloc(size); cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice); cudaMemcpy( dev_B, b size, cudaMemcpyHostToDevice); vecAdd<<<1, N>>>(devA, devB, devC); cudaMemcpy( &c, devC size, cudaMemcpyDeviceToHost); cudaFree( dev_a); cudaFree( dev_b); cudaFree( dev_c); free( a ); free( b ); free( c ); return (0); } Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20,
  • 27. 27 Can be 1 or 2 dimensions Can be 1, 2 or 3 dimensions CUDA C programming guide, v 3.2, 2010, NVIDIA CUDA SIMT Thread Structure Allows flexibility and efficiency in processing 1D, 2-D, and 3-D data on GPU. Linked to internal organization Threads in one block execute together.
  • 28. 28 Need to provide each kernel call with values for two key structures: • Number of blocks in each dimension • Threads per block in each dimension myKernel<<< numBlocks, threadsperBlock >>>(arg1, … ); numBlocks – number of blocks in grid in each dimension (1D or 2D). An integer would define a 1D grid of that size, otherwise use CUDA structure, see next. threadsperBlock – number of threads in a block in each dimension (1D, 2D, or 3D). An integer would define a 1D block of that size, otherwise use CUDA structure, see next. Notes: Number of blocks not limited by specific GPU. Number of threads/block is limited by specific GPU. Defining Grid/Block Structure
  • 29. 29 CUDA provided with built-in variables and structures to define number of blocks of threads in grid in each dimension and number of threads in a block in each dimension. CUDA Vector Types/Structures unit3 and dim3 – can be considered essentially as CUDA-defined structures of unsigned integers: x, y, z, i.e. struct unit3 { x; y; z; }; struct dim3 { x; y; z; }; Used to define grid of blocks and threads, see next. Unassigned structure components automatically set to 1. There are other CUDA vector types. Built-in CUDA data types and structures
  • 30. 30 Built-in Variables for Grid/Block Sizes dim3 gridDim -- Size of grid: gridDim.x * gridDim.y (z not used) dim3 blockDim -- Size of block: blockDim.x * blockDim.y * blockDim.z Example dim3 grid(16, 16); // Grid -- 16 x 16 blocks dim3 block(32, 32); // Block -- 32 x 32 threads myKernel<<<grid, block>>>(...);
  • 31. 31 Built-in Variables for Grid/Block Indices uinit3 blockIdx -- block index within grid: blockIdx.x, blockIdx.y (z not used) uint3 threadIdx -- thread index within block: blockIdx.x, blockIdx.y, blockId.z Full global thread ID in x and y dimensions can be computed by: x = blockIdx.x * blockDim.x + threadIdx.x; y = blockIdx.y * blockDim.y + threadIdx.y;
  • 32. 32 Example -- x direction A 1-D grid and 1-D block 4 blocks, each having 8 threads 0 1 2 3 4 765 0 1 2 3 4 7650 1 2 3 4 765 0 1 2 3 4 765 threadIdx.x threadIdx.x threadIdx.x blockIdx.x = 3 threadIdx.x blockIdx.x = 1blockIdx.x = 0 Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010. blockIdx.x = 2 gridDim = 4 x 1 blockDim = 8 x 1 Global thread ID = blockIdx.x * blockDim.x + threadIdx.x = 3 * 8 + 2 = thread 26 with linear global addressing Global ID 26
  • 33. 33 #define N 2048 // size of vectors #define T 256 // number of threads per block __global__ void vecAdd(int *A, int *B, int *C) { int i = blockIdx.x*blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; } int main (int argc, char **argv ) { … vecAdd<<<N/T, T>>>(devA, devB, devC); // assumes N/T is an integer … return (0); } Code example with a 1-D grid and 1-D blocks Number of blocks to map each vector across grid, one element of each vector per thread
  • 34. 34 #define N 2048 // size of vectors #define T 240 // number of threads per block __global__ void vecAdd(int *A, int *B, int *C) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < N) C[i] = A[i] + B[i]; // allows for more threads than vector elements // some unused } int main (int argc, char **argv ) { int blocks = (N + T - 1) / T; // efficient way of rounding to next integer … vecAdd<<<blocks, T>>>(devA, devB, devC); … return (0); } If T/N not necessarily an integer:
  • 35. 35 Example using 1-D grid and 2-D blocks Adding two arrays #define N 2048 // size of arrays __global__void addMatrix (int *a, int *b, int *c) { int i = blockIdx.x*blockDim.x+threadIdx.x; int j =blockIdx.y*blockDim.y+threadIdx.y; int index = i + j * N; if ( i < N && j < N) c[index]= a[index] + b[index]; } Void main() { ... dim3 dimBlock (16,16); dim3 dimGrid (N/dimBlock.x, N/dimBlock.y); addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC); … }
  • 36. 36 Memory Structure within GPU Local private memory -- per thread Shared memory -- per block Global memory -- per application GPU executes one or more kernel grids. Streaming multiprocessor (SM) executes one or more thread blocks CUDA cores and other execution units in the SM execute threads. SM executes threads in groups of 32 threads called a warp.* * Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
  • 37. 37 Compiling code Linux Command line. CUDA provides nvcc (a NVIDIA “compiler-driver”. Use instead of gcc nvcc –O3 –o <exe> <input> -I/usr/local/cuda/include –L/usr/local/cuda/lib –lcudart Separates compiled code for CPU and for GPU and compiles code. Need regular C compiler installed for CPU. Make files also provided. Windows NVIDIA suggests using Microsoft Visual Studio
  • 38. 38 Debugging NVIDIA has recently develped a debugging tool called Parallel Nsight Available for use with Visual Studio
  • 39. 39 GPU Clusters  GPU systems for HPC  GPU clusters  GPU Grids  GPU Clouds With advent of GPUs for scientific high performance computing, compute cluster now can incorporate, greatly increasing their compute capability.
  • 40. 40
  • 41. 41 Maryland CPU-GPU Cluster Infrastructure http://www.umiacs.umd.edu/res earch/GPU/facilities.html
  • 42. 42 Hybrid Programming Model for Clusters having Multicore Shared Memory Processors Combine MPI between nodes and OpenMP with nodes (or other thread libraries such as Pthreads): MPI/OpenMP compilation: mpicc -o mpi_out mpi_test.c -fopenmp
  • 43. 43 Hybrid Programming Model for Clusters having Multicore Shared Memory Processors and GPUs Combine OpenMP and CUDA on one node for CPU and GPU respectively, with MPI between nodes Note – All three as C-based so can be compiled together. NVIDIA does provide a sample OpenMP/CUDA