Course Outline
ƒ GPU Architecture Overview (this module)
ƒ Tools of the Trade
ƒ Introduction to CUDA C
ƒ Patterns of Parallel Computing
ƒ Thread Cooperation and Synchronization
ƒ The Many Types of Memory
ƒ Atomic Operations
ƒ Events and Streams
ƒ CUDA in Advanced Scenarios
Note: to follow this course, you require a CUDA-capable GPU.
History of GPU Computation
No GPU
CPU handled graphic output
Dedicated GPU
Separate graphics card (PCI, AGP)
Programmable GPU
Shaders
Shaders
ƒ Small program that runs on the GPU
ƒ Types
† Vertex (used to calculate location of a vertex)
† Pixel (used to calculate the color components of a single pixel)
ƒ Shader languages
† High Level Shader Language (HLSL, Microsoft DirectX)
† OpenGL Shading Language (GLSL, OpenGL)
† Both are C-like
† Both are intended for graphics (i.e., not general-purpose)
ƒ Pixel shaders used for math
† Convert data to texture
† Run texture through pixel shader
† Get result texture and convert to data
Why GPGPU?
ƒ General-Purpose Computation on GPUs
ƒ Highly parallel architecture
† Lots of concurrent threads of execution (SIMT)
† Higher throughput compared to CPUs
† Even taking into account many cores, hyperthreading, SIMD
† Thus more FLOPS (floating-point operations per second)
ƒ Commodity hardware
† Commonly available (mainly used by gamers)
† Relatively cheap compared to custom solutions (e.g., FPGAs)
ƒ Sensible programming model
† Manufacturers realized GPGPU market potential
† Graphics made optional
† NVIDIA offers dedicated GPU platform “Tesla”
† No output connections
GPGPU Frameworks
ƒ Compute Unified Device Architecture (CUDA)
† Developed by NVIDIA Corporation
† Extensions to programming languages (C/C++)
† Wrappers for other languages/platforms
(e.g., FORTRAN, PyCUDA, MATLAB)
ƒ Open Computing Language (OpenCL)
† Supported by many manufacturers (inc. NVIDIA)
† The high-level way to perform computation on ATI devices
ƒ C++ Accelerated Massive Parallelism (C++ AMP)
† C++ superset
† A standard by Microsoft, part of MSVC++
† Supports both ATI and NVIDIA
ƒ Other frameworks and cross-compilers
† Alea, Aparapi, Brook, etc.
Graphics Processor Architecture
ƒ Warning: NVIDIA terminology ahead
ƒ Streaming Multiprocessor (SM)
† Contains several CUDA cores
† Can have >1 SM on card
ƒ CUDA Core (a.k.a. Streaming Processor, Shader Unit)
† # of cores per SM tied to compute capability
ƒ Different types of memory
† Means of access
† Performance characteristics
Graphics Processor Architecture
(Diagram: the card contains multiple SMs (SM1, SM2, ...); each SM contains several SPs (SP1...SP4), registers, and shared memory; device memory, constant memory, and texture memory are shared across SMs.)
Compute Capability
ƒ A number indicating what the card can do
† Current range: 1.0, 1.x, 2.x, 3.0, 3.5
ƒ Affects both hardware and API support
† Number of CUDA cores per SM
† Max # of 32-bit registers per SM
† Max # of instructions per kernel
† Support for double-precision ops
† … and many other parameters
ƒ Higher is better
† See http://en.wikipedia.org/wiki/CUDA
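As a minimal sketch of device introspection (using the standard cudaGetDeviceCount()/cudaGetDeviceProperties() runtime calls; error checking omitted for brevity), the compute capability can be queried at runtime:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);             // number of CUDA-capable devices
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);  // fill in the device descriptor
            printf("Device %d: %s, compute capability %d.%d, %d SMs\n",
                   i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        }
        return 0;
    }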
Choosing a Graphics Card
ƒ Look at performance specs (peak flops)
† Pay attention to single vs. double
† E.g. http://www.nvidia.com/object/tesla-servers.html
ƒ Number of GPUs
ƒ Compute capability/architecture
† COTS cheaper than dedicated
ƒ Memory size
ƒ Ensure PSU/motherboard is good enough to handle the card(s)
ƒ Can have >1 graphics card
† YMMV (PCI saturation)
ƒ Can mix architectures (e.g. NVIDIA+ATI)
Overview
ƒ Windows, Mac OS, Linux
† This course uses Windows/Visual Studio
† This course focuses on desktop GPU development
ƒ CUDA Toolkit
† LLVM-based compiler
† Headers & libraries
† Documentation
† Samples
ƒ NSight
† Visual Studio plugin that allows GPU debugging
† Eclipse-based IDE (Linux/Mac OS)
ƒ http://developer.nvidia.com/cuda
CUDA Tools
What is NSight?
ƒ Many developers, few GPUs
† No need for each dev to have GPU
ƒ Client-server architecture
ƒ Server with GPU (target)
† Runs Nsight service
† Accepts connections from developers
ƒ Clients with Visual Studio + NSight plugin (host)
† Use NSight to connect to the server to run application
and debug GPU code
ƒ Caveats
† Need to use VS Remote Debugging to debug CPU code
† Need your own mechanism for syncing output
(Diagram: Visual Studio with the NSight plugin on the developer's machine (host) connects over the network to the NSight service running on the machine with the target GPU(s).)
Overview
ƒ Compilation Process
ƒ Obligatory “Hello Cuda” demo
ƒ Location Qualifiers
ƒ Execution Model
ƒ Grid and Block Dimensions
ƒ Error Handling
ƒ Device Introspection
NVIDIA CUDA Compiler (nvcc)
ƒ nvcc is used to compile CUDA-specific code
† Not a compiler itself, but a compiler driver
† Uses the host C or C++ compiler (MSVC, GCC)
† Some aspects are C++ specific
ƒ Splits code into GPU and non-GPU parts
† Host code passed to native compiler
ƒ Accepts project-defined GPU-specific settings
† E.g., compute capability
ƒ Translates code written in CUDA C into PTX
† Graphics driver turns PTX into binary code
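As an illustration, a typical invocation might look like this (the file name is a placeholder; the -arch flag selects the target compute capability, e.g. sm_20 for 2.0):

    nvcc -arch=sm_20 kernel.cu -o app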
NVCC Compilation Process
(Diagram: a .cu source file containing both __global__ void a() { } and void b() { } goes into nvcc; nvcc compiles the device code a() itself and hands the host code b() to the host compiler, e.g. MSVC's cl.exe; the outputs are combined into a single executable.)
Parallel Thread Execution (PTX)
ƒ PTX is the ‘assembly language’ of CUDA
† Similar to .NET IL or Java bytecode
† Low-level GPU instructions
ƒ Can be generated from a project
ƒ Typically useful to compiler writers
† E.g., GPU Ocelot https://code.google.com/p/gpuocelot/
ƒ Inline PTX (asm)
Location Qualifiers
ƒ __global__
Defines a kernel.
Runs on the GPU, called from the CPU.
Executed with <<<dim3>>> arguments.
ƒ __device__
Runs on the GPU, called from the GPU.
† Can be used for variables too
ƒ __host__
Runs on the CPU, called from the CPU.
ƒ Qualifiers can be mixed
† E.g. __host__ __device__ foo()
† Code compiled for both CPU and GPU
† Useful for testing
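A minimal sketch illustrating the qualifiers (the function names are made up for illustration):

    __device__ float square(float x) { return x * x; }            // GPU only, callable from GPU code

    __host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both CPU and GPU

    __global__ void kernel(float* data)                           // kernel: runs on GPU, launched from CPU
    {
        int i = threadIdx.x;
        data[i] = twice(square(data[i]));
    }

    // launched from host code with execution configuration arguments:
    // kernel<<<1, 64>>>(d_data);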
Execution Model
doSomething<<<2,3>>>(): a grid of 2 thread blocks, 3 threads per block
doSomethingElse<<<4,2>>>(): a grid of 4 thread blocks, 2 threads per block
(Diagram: threads are grouped into thread blocks, and thread blocks are grouped into a grid.)
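As a sketch, the obligatory "Hello CUDA"-style launch matching doSomething<<<2,3>>>() could look like this (device-side printf requires compute capability 2.0 or higher):

    #include <cstdio>

    __global__ void doSomething()
    {
        // each of the 2 x 3 = 6 threads prints its position in the grid
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main() {
        doSomething<<<2, 3>>>();     // grid of 2 thread blocks, 3 threads per block
        cudaDeviceSynchronize();     // wait for the kernel (and its output) to finish
        return 0;
    }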
Execution Model
ƒ Thread blocks are scheduled to run on available SMs
ƒ Each SM executes one block at a time
ƒ Thread block is divided into warps
† Number of threads per warp depends on compute capability
ƒ All warps are handled in parallel
ƒ CUDA Warp Watch
Dimensions
ƒ We defined execution as
† <<<a,b>>>
† A grid of a blocks of b threads each
† The grid and each block are 1D structures
ƒ In reality, these constructs are 3D
† A 3-dimensional grid of 3-dimensional blocks
† You can define (a×b×c) blocks of (x×y×z) threads
† Can have 2D or 1D by setting extra dimensions to 1
ƒ Defined in dim3 structure
† Simple container with x, y and z values.
† Some constructors defined for C++
† Automatic conversion: <<<a,b>>> → grid (a,1,1) of blocks (b,1,1)
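A small sketch of a 2D launch using dim3 (the sizes and kernel name are arbitrary examples):

    dim3 grid(4, 4);                     // 4 x 4 x 1 = 16 blocks
    dim3 block(16, 16);                  // 16 x 16 x 1 = 256 threads per block
    myKernel<<<grid, block>>>(d_data);   // myKernel is a hypothetical kernel

    // inside the kernel, a 2D position can be recovered as:
    // int x = blockIdx.x * blockDim.x + threadIdx.x;
    // int y = blockIdx.y * blockDim.y + threadIdx.y;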
Thread Variables
ƒ Execution parameters & current position
ƒ blockIdx
† Where we are in the grid
ƒ gridDim
† The size of the grid
ƒ threadIdx
† Position of current thread in thread block
ƒ blockDim
† Size of thread block
ƒ Limitations
† Grid & block sizes
† # of threads
Error Handling
ƒ CUDA does not throw
† Silent failure
ƒ Core functions return cudaError_t
† Can check against cudaSuccess
† Get description with cudaGetErrorString()
ƒ Libraries may have different error types
† E.g. cuRAND has curandStatus_t
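A common pattern (a sketch, not part of the CUDA API itself) is to wrap calls in a checking helper:

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical helper macro: abort with a message if a CUDA call fails.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // usage: CUDA_CHECK(cudaMalloc(&d_ptr, bytes));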
Rules of the Game
ƒ Different types of memory
† Shared vs. private
† Access speeds
ƒ Data is in arrays
† No parallel data structures
† No other data structures
† No auto-parallelization/vectorization compiler support
† No CPU-type SIMD equivalent
ƒ Compiler constraint
† No C++11 support
Overview
ƒ Data Access
ƒ Map
ƒ Gather
ƒ Reduce
ƒ Scan
Data Access
ƒ A real problem!
ƒ Thread space can be up to 6D
† 3D grid of 3D thread blocks
ƒ Input space typically 1D
† 2D arrays are possible
ƒ Need to map threads to inputs
ƒ Some examples
† 1 block, N threads → threadIdx.x
† 1 block, M×N threads → threadIdx.y * blockDim.x + threadIdx.x
† N blocks, M threads → blockIdx.x * blockDim.x + threadIdx.x
† … and so on
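For instance, a sketch of the common N-blocks-of-M-threads mapping over a 1D array of length n:

    __global__ void scale(float* data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
        if (i < n)                                       // guard: the last block may be only partly used
            data[i] *= factor;
    }

    // launch with enough blocks to cover n elements, e.g.:
    // int threads = 256;
    // int blocks  = (n + threads - 1) / threads;
    // scale<<<blocks, threads>>>(d_data, n, 2.0f);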
Map
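(The original slide is a diagram.) In the map pattern each thread applies the same function independently to one element; a minimal sketch:

    __global__ void mapSquare(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * in[i];   // one thread, one input element, one output element
    }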
Gather
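(Diagram in the original.) In a gather, each thread reads its input from a location chosen by an index array and writes one output; a sketch:

    __global__ void gather(const float* in, const int* indices, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[indices[i]];  // each output pulls its input from an arbitrary position
    }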
Black-Scholes Formula
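(The formula appeared as an image in the original slide.) For reference, the standard Black-Scholes price of a European call, which a kernel would evaluate once per option (a classic map-style workload), is C = S·N(d1) − K·e^(−rT)·N(d2), with d1 = (ln(S/K) + (r + σ²/2)·T) / (σ·√T) and d2 = d1 − σ·√T, where S is the spot price, K the strike, r the risk-free rate, σ the volatility, T the time to expiry, and N the standard normal CDF.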
Reduce
(Diagram: a tree of pairwise + operations combining all input elements into a single sum.)
Reduce in Practice
ƒ Adding up N data elements
ƒ Use 1 block of N/2 threads
ƒ Each thread does
x[i] += x[j];
ƒ At each step
† # of threads halved
† Distance (j-i) doubled
ƒ x[0] is the result
Example (N = 8): 1 2 3 4 5 6 7 8 → 3 · 7 · 11 · 15 · → 10 · · · 26 · · · → 36
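A sketch of this single-block reduction (assumes n is a power of two, one block of at least n/2 threads, and data already in device memory):

    __global__ void reduceSum(float* x, int n)
    {
        int tid = threadIdx.x;
        for (int step = 1; step < n; step *= 2) {
            int i = 2 * step * tid;            // position this thread accumulates into
            if (i + step < n)
                x[i] += x[i + step];           // distance (j - i) doubles every pass
            __syncthreads();                   // wait until every thread finished this pass
        }
        // x[0] now holds the sum
    }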
Scan
Example: input 4 2 5 3 6 → running sums 4 6 11 14 20
Scan in Practice
ƒ Similar to reduce
ƒ Require N-1 threads
ƒ Step size keeps doubling
ƒ Number of threads reduced by step size
ƒ Each thread n does
x[n+step] += x[n];
Example (N = 4): 1 2 3 4 → 1 3 5 7 → 1 3 6 10
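A sketch of this naive single-block inclusive scan (one thread per element, n not larger than the block size; the read happens before the barrier so no thread overwrites a value another thread still needs):

    __global__ void scanInclusive(float* x, int n)
    {
        int tid = threadIdx.x;
        for (int step = 1; step < n; step *= 2) {
            float v = 0.0f;
            if (tid >= step && tid < n)
                v = x[tid - step];    // value produced by the partner thread in the previous pass
            __syncthreads();          // everyone has read before anyone writes
            if (tid >= step && tid < n)
                x[tid] += v;
            __syncthreads();          // everyone has written before the next pass
        }
    }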
Graphics Processor Architecture
(Diagram: as before, each SM (SM1, SM2, ...) contains SPs (SP1...SP4), registers, and shared memory; reads from device memory can go through the constant cache and texture cache.)
Device Memory
ƒ Grid scope (i.e., available to all threads in all blocks in the grid)
ƒ Application lifetime (exists until app exits or explicitly deallocated)
ƒ Dynamic
† cudaMalloc() to allocate
† Pass pointer to kernel
† cudaMemcpy() to copy to/from host memory
† cudaFree() to deallocate
ƒ Static
† Declare a global variable as __device__
__device__ int sum = 0;
† Use freely within the kernel
† Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
† No need to explicitly deallocate
ƒ Slowest and most inefficient
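A sketch of the dynamic path (kernel name and sizes are illustrative; error checking omitted):

    int n = 1024;
    size_t bytes = n * sizeof(float);
    float* h_data = (float*)malloc(bytes);                     // host copy
    float* d_data = nullptr;

    cudaMalloc(&d_data, bytes);                                // allocate device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device
    someKernel<<<n / 256, 256>>>(d_data, n);                   // someKernel is a hypothetical kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_data);                                          // deallocate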
Constant & Texture Memory
ƒ Read-only: useful for lookup tables, model parameters, etc.
ƒ Grid scope, Application lifetime
ƒ Resides in device memory, but
ƒ Cached in a constant memory cache
ƒ Constrained by MAX_CONSTANT_MEMORY
† Expect 64 KB
ƒ Similar operation to statically-defined device memory
† Declare as __constant__
† Use freely within the kernel
† Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
ƒ Very fast provided all threads read from the same location
ƒ Used for kernel arguments
ƒ Texture memory: similar to Constant, optimized for 2D access
patterns
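A sketch of statically-declared constant memory (symbol name and sizes are illustrative):

    __constant__ float coeffs[16];          // lives in constant memory, grid scope

    __global__ void applyCoeffs(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[i % 16];      // reads are served from the constant cache
    }

    // on the host, before the launch:
    // float h_coeffs[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));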
Shared Memory
ƒ Block scope
† Shared only within a thread block
† Not shared between blocks
ƒ Kernel lifetime
ƒ Must be declared within the kernel function body
ƒ Very fast
Register & Local Memory
ƒ Memory can be allocated right within the kernel
† Thread scope, Kernel lifetime
ƒ Non-array memory
† int tid = …
† Stored in a register
† Very fast
ƒ Array memory
† Stored in ‘local memory’
† Local memory is an abstraction, actually put in global memory
† Thus, as slow as global memory
Summary
Declaration             Memory     Scope    Lifetime      Slowdown
int foo;                Register   Thread   Kernel        1x
int foo[10];            Local      Thread   Kernel        100x
__shared__ int foo;     Shared     Block    Kernel        1x
__device__ int foo;     Global     Grid     Application   100x
__constant__ int foo;   Constant   Grid     Application   1x
Thread Synchronization
ƒ Threads can take different amounts
of time to complete a part of a
computation
ƒ Sometimes, you want all threads to reach
a particular point before continuing
their work
ƒ CUDA provides a thread barrier function
__syncthreads()
ƒ A thread that calls __syncthreads() waits
for other threads to reach this line
ƒ Then, all threads can continue executing
(Diagram: threads A, B, and C reach the barrier at different times; each waits there until all have arrived, then all continue.)
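A typical use, as a sketch: stage data in shared memory, then make every thread wait at the barrier before reading values loaded by its neighbours (assumes blocks of 256 threads and a sufficiently large input):

    __global__ void shiftLeft(const float* in, float* out)
    {
        __shared__ float tile[256];             // block-scope scratch space
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];              // each thread loads one element
        __syncthreads();                        // barrier: the whole tile is now loaded

        if (threadIdx.x < blockDim.x - 1)
            out[i] = tile[threadIdx.x + 1];     // safe to read a value loaded by another thread
    }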
Restrictions
ƒ __syncthreads() only synchronizes threads within a block
ƒ A thread that calls __syncthreads() waits for other threads to reach
this location
ƒ All threads must reach __syncthreads()
† if (x > 0.5)
{
__syncthreads(); // bad idea
}
ƒ Each call to __syncthreads() is unique
† if (x > 0.5)
{
__syncthreads();
} else {
__syncthreads(); // also a bad idea
}
Branch Divergence
ƒ Once a block is assigned to an
SM, it is split into several
warps
ƒ All threads within a warp
must execute the same
instruction at the same time
ƒ I.e., if and else branches
cannot be executed
concurrently
ƒ Avoid branching if possible
ƒ Ensure if statements cut on
warp boundaries
(Diagram: a branch inside a warp forces the two paths to execute one after the other, a performance loss.)
Summary
ƒ Threads are synchronized with __syncthreads()
† Block scope
† All threads must reach the barrier
† Each __syncthreads() creates a unique barrier
ƒ A block is split into warps
† All threads in a warp execute same instruction
† Branching leads to warp divergence (performance loss)
† Avoid branching or ensure branch cases fall in different warps
Overview
ƒ Why atomics?
ƒ Atomic Functions
ƒ Atomic Sum
ƒ Monte Carlo Pi
Why Atomics?
ƒ x++ is a read-modify-write operation
† Read x into a register
† Increment register value
† Write register back into x
† Effectively { temp = x; temp = temp+1; x = temp; }
ƒ If two threads do x++
† Each thread has its own temp (say t1 and t2)
† { t1 = x; t1 = t1+1; x = t1;}
{ t2 = x; t2 = t2+1; x = t2;}
† Race condition: both threads read the same original value of x
† Whichever thread's write lands, x gets incremented only once
Atomic Functions
ƒ Problem: many threads accessing the same memory location
ƒ Atomic operations ensure that only one thread can access the location
ƒ Grid scope!
ƒ atomicOP(x,y) performs, as one indivisible operation:
† t1 = *x; // read
† t2 = t1 OP y; // modify
† *x = t2; // write
ƒ Atomics need to be configured
† #include "sm_20_atomic_functions.h"
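A sketch of an atomic sum over an array (each thread adds its element into a single accumulator; atomicAdd() on float requires compute capability 2.0 or higher):

    __global__ void atomicSum(const float* data, float* result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(result, data[i]);   // read-modify-write done as one indivisible operation
    }

    // host side: set *result to 0 (e.g. with cudaMemset) before the launch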
Monte Carlo Pi
(Diagram: random points in a square with an inscribed quarter circle of radius r; the fraction of points inside the circle estimates π/4.)
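A sketch of the counting kernel, assuming the host pre-generates random points in [0,1) and passes them in:

    __global__ void countInside(const float* xs, const float* ys, int* hits, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && xs[i] * xs[i] + ys[i] * ys[i] <= 1.0f)
            atomicAdd(hits, 1);           // point falls inside the quarter circle
    }

    // host side: pi is then estimated as 4.0 * hits / n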
Summary
ƒ Atomic operations ensure operations on a variable cannot be
interrupted by a different thread
ƒ CUDA supports several atomic operations
† atomicAdd()
† atomicOr()
† atomicMin()
† … and others
ƒ Atomics incur a heavy performance penalty
Overview
ƒ Events
ƒ Event API
ƒ Event example
ƒ Pinned memory
ƒ Streams
ƒ Stream API
ƒ Example (single stream)
ƒ Example (multiple streams)
Events
ƒ How to measure performance?
ƒ Use OS timers
† Too much noise
ƒ Use profiler
† Times only kernel duration + other invocations
ƒ CUDA Events
† Event = timestamp
† Timestamp recorded on the GPU
† Invoked from the CPU side
Event API
ƒ cudaEvent_t
† The event handle
ƒ cudaEventCreate(&e)
† Creates the event
ƒ cudaEventRecord(e, 0)
† Records the event, i.e. timestamp
† Second param is the stream to which to record
ƒ cudaEventSynchronize(e)
† CPU and GPU are async, can be doing things in parallel
† cudaEventSynchronize() blocks the calling CPU thread until the GPU
has reached the event
ƒ cudaEventElapsedTime(&f, start, stop)
† Computes elapsed time (msec) between start and stop, stored as float
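Putting the calls together, a timing sketch (kernel name and launch configuration are illustrative; error checking omitted):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // timestamp before the work
    myKernel<<<blocks, threads>>>(d_data);    // the work being measured
    cudaEventRecord(stop, 0);                 // timestamp after the work
    cudaEventSynchronize(stop);               // wait until the GPU has passed 'stop'

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);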
Pinned Memory
ƒ CPU memory is pageable
† Can be swapped to disk
ƒ Pinned (page-locked) stays in place
ƒ Performance advantage when copying to/from GPU
ƒ Use cudaHostAlloc() instead of malloc() or new
ƒ Use cudaFreeHost() to deallocate
ƒ Cannot be swapped out
† Must have enough physical memory
† Proactively deallocate
Streams
ƒ Remember cudaEventRecord(event, stream)?
ƒ A CUDA stream is a queue of GPU operations
† Kernel launch
† Memory copy
ƒ Streams allow a form of task-based parallelism
† Performance improvement
ƒ To leverage streams you need device overlap support
† GPU_OVERLAP
Stream API
ƒ cudaStream_t
ƒ cudaStreamCreate(&stream)
ƒ kernel<<<blocks,threads,shared,stream>>>
ƒ cudaMemcpyAsync()
† Must use pinned memory!
† Takes the stream as a parameter
ƒ cudaStreamSynchronize(stream)
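A single-stream sketch combining these calls (buffer names, sizes and the kernel are illustrative; the host buffer must be pinned for cudaMemcpyAsync()):

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* h_buf = nullptr;                                     // pinned host buffer
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);

    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream); // queued, returns at once
    myKernel<<<blocks, threads, 0, stream>>>(d_buf);                      // runs after the copy in this stream
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);            // block the CPU until the queued work has finished
    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);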
Summary
ƒ CUDA events let you time your code on the GPU
ƒ Pinned memory speeds up data transfers to/from device
ƒ CUDA streams allow you to queue up operations asynchronously
† Lets you do different things in parallel on the GPU
† Use of pinned memory is required