IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)
- 2. The "New" Moore's Law
• Computers no longer get faster, just wider
• You must rethink your algorithms to be parallel!
• Data-parallel computing is the most scalable solution
© NVIDIA Corporation 2008
- 3. Enter the GPU
• Massive economies of scale
• Massively parallel
- 4. Enter CUDA
• Scalable parallel programming model
• Minimal extensions to the familiar C/C++ environment
• Heterogeneous serial-parallel computing
- 5. Sound Bite
GPUs + CUDA = The Democratization of Parallel Computing
Massively parallel computing has become a commodity technology.
© NVIDIA Corporation 2007
- 8. CUDA: "C" FOR PARALLELISM
Standard C code:
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
- 9. Hierarchy of concurrent threads
• Parallel kernels are composed of many threads (Thread t)
  – all threads execute the same sequential program
• Threads are grouped into thread blocks (Block b: t0 t1 … tB)
  – threads in the same block can cooperate
• Threads and blocks have unique IDs within a kernel (Kernel foo())
- 10. Hierarchical organization
• Per-thread local memory (Thread)
• Per-block shared memory, with a local barrier (Block)
• Per-device global memory, with a global barrier between kernels (Kernel 0, Kernel 1, …)
- 11. Heterogeneous Programming
• CUDA = serial program with parallel kernels, all in C
  – Serial C code executes in a CPU thread
  – Parallel kernel C code executes in thread blocks across multiple processing elements

Serial Code
foo<<< nBlk, nTid >>>(args);   // Parallel Kernel
Serial Code
bar<<< nBlk, nTid >>>(args);   // Parallel Kernel
- 12. Thread = virtualized scalar processor
• Independent thread of execution
  – has its own PC, variables (registers), processor state, etc.
  – no implication about how threads are scheduled
• CUDA threads might be physical threads
  – as on NVIDIA GPUs
• CUDA threads might be virtual threads
  – might pick 1 block = 1 physical thread on a multicore CPU
- 13. Block = virtualized multiprocessor
• Provides programmer flexibility
  – freely choose processors to fit the data
  – freely customize for each kernel launch
• Thread block = a (data-)parallel task
  – all blocks in a kernel have the same entry point
  – but may execute any code they want
• Thread blocks of a kernel must be independent tasks
  – the program must be valid for any interleaving of block executions
- 14. Blocks must be independent
• Any possible interleaving of blocks should be valid
  – presumed to run to completion without pre-emption
  – can run in any order
  – can run concurrently OR sequentially
• Blocks may coordinate but not synchronize
  – shared queue pointer: OK
  – shared lock: BAD … can easily deadlock
• The independence requirement gives scalability
- 15. Scalable Execution Model
Kernel launched by host; blocks run on multiprocessors.
[Diagram: each multiprocessor contains scalar processors (SP), multithreaded instruction units (MT IU), and per-multiprocessor shared memory; all multiprocessors access device memory]
- 16. Synchronization & Cooperation
• Threads within a block may synchronize with barriers
    … Step 1 …
    __syncthreads();
    … Step 2 …
• Blocks coordinate via atomic memory operations
  – e.g., increment a shared queue pointer with atomicInc()
• Implicit barrier between dependent kernels
    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
- 17. Using per-block shared memory
• Variables shared across a block
    __shared__ int *begin, *end;
• Scratchpad memory
    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // … compute on scratch values …
    begin[threadIdx.x] = scratch[threadIdx.x];
• Communicating values between threads
    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];
- 18. Example: Parallel Reduction
• Summing up a sequence with 1 thread:
    int sum = 0;
    for (int i = 0; i < N; ++i) sum += x[i];
• Parallel reduction builds a summation tree
  – each thread holds 1 element
  – stepwise partial sums
  – N threads need log N steps
  – one possible approach: butterfly pattern
- 20. Parallel Reduction for 1 Block
// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];
// One thread per element
sum[i] = x_i;  __syncthreads();
for (int bit = blocksize/2; bit > 0; bit /= 2)
{
    int t = sum[i] + sum[i^bit];  __syncthreads();
    sum[i] = t;  __syncthreads();
}
// OUTPUT: Every thread now holds the sum in sum[i]
- 21. Summing Up CUDA
• CUDA = C + a few simple extensions
  – makes it easy to start writing basic parallel programs
• Three key abstractions:
  1. hierarchy of parallel threads
  2. corresponding levels of synchronization
  3. corresponding memory spaces
• Supports the massive parallelism of manycore GPUs
- 23. More efficient reduction tree
template<class T> __device__ T reduce(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;
    for (unsigned int offset = n/2; offset > 0; offset /= 2)
    {
        if (i < offset) x[i] += x[i + offset];
        __syncthreads();
    }
    // Note that only thread 0 has the full sum
    return x[i];
}
- 24. Reduction tree execution example
Input (shared memory):                 10  1  8 -1  0 -2  3  5  -2 -3  2  7  0 11  0  2
x[i] += x[i+8]  (threads 0-7 active):   8 -2 10  6  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+4]  (threads 0-3 active):   8  7 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+2]  (threads 0-1 active):  21 20 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
x[i] += x[i+1]  (thread 0 active):     41 20 13 13  0  9  3  7  -2 -3  2  7  0 11  0  2
Final result in x[0]: 41
- 25. Pattern 1: Fit kernel to the data
• The code lets a B-thread block reduce a B-element array
• For larger sequences:
  – launch N/B blocks to reduce each B-element subsequence
  – write the N/B partial sums to a temporary array
  – repeat until done
• We'll be done in log_B N steps
- 26. Pattern 2: Fit kernel to the machine
• For a block size of B:
  1) Launch B blocks to sum N/B elements
  2) Launch 1 block to combine the B partial sums
Stage 1: many blocks
[Diagram: each block reduces its subsequence pairwise, e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25, leaving one partial sum per block]