3. Rohit Khatana
What is Parallel Computing?
• Performing or executing a task/program
on more than one machine or processor.
• In simple terms: dividing a job among a
group of workers.
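The "dividing a job among a group" idea can be sketched in plain C: summing an array by splitting it into per-worker chunks. This is a minimal illustrative sketch (the chunking scheme and function names are assumptions, not from the slides); here the chunks run sequentially, whereas a parallel runtime would hand each chunk to its own processor.

```c
#include <stddef.h>

/* Sum one worker's contiguous chunk of the array. */
static double sum_chunk(const double *a, size_t begin, size_t end) {
    double s = 0.0;
    for (size_t i = begin; i < end; i++) s += a[i];
    return s;
}

/* Divide n elements among n_workers and combine the partial sums.
 * Each loop iteration stands in for one processor's share of the job. */
double parallel_style_sum(const double *a, size_t n, size_t n_workers) {
    double total = 0.0;
    for (size_t w = 0; w < n_workers; w++) {
        size_t begin = w * n / n_workers;       /* this worker's slice */
        size_t end   = (w + 1) * n / n_workers;
        total += sum_chunk(a, begin, end);
    }
    return total;
}
```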
12. What kind of processors will we
build?
(major design constraint: power)
CPU: complex control hardware
- Flexibility + performance
- Expensive in terms of power
GPU: simpler control hardware
- More hardware devoted to computation
- Potentially more power-efficient (ops/watt)
- More restrictive programming model
14. Graphics Logical Pipeline
• The GPU receives geometry information
from the CPU as input and produces
a picture as output
• Let's see how that happens
15. Host Interface
• The host interface is the communication bridge
between the CPU and the GPU
• It receives commands from the CPU and also
pulls geometry information from system
memory
• It outputs a stream of vertices in object space
with all their associated information (normals,
texture coordinates, per-vertex color, etc.)
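The vertex stream described above can be pictured as a C struct. The exact field set is an illustrative assumption; real APIs let the application choose which attributes each vertex carries.

```c
/* One vertex as emitted by the host interface: an object-space
 * position plus its associated attributes. */
typedef struct {
    float position[3];  /* object-space x, y, z     */
    float normal[3];    /* surface normal           */
    float texcoord[2];  /* u, v texture coordinates */
    float color[4];     /* per-vertex RGBA color    */
} Vertex;
```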
16. Vertex Processing
• The vertex processing stage receives vertices from the
host interface in object space and outputs them in screen
space
• This may be a simple linear transformation or a complex
operation involving morphing effects
• No new vertices are created in this stage, and no
vertices are discarded (the input/output mapping is 1:1)
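The "simple linear transformation" case can be sketched as a 4x4 matrix applied to an object-space point, followed by the perspective divide. This is a hedged sketch: the matrix contents (model-view-projection) are supplied by the application, and the function name is illustrative.

```c
/* Apply a 4x4 row-major transform to an object-space point and
 * perform the perspective divide by w, yielding normalized
 * device coordinates. */
void transform_vertex(const float m[16], const float in[3], float out[3]) {
    float v[4];
    for (int r = 0; r < 4; r++)   /* matrix * (x, y, z, 1) */
        v[r] = m[4*r+0]*in[0] + m[4*r+1]*in[1] + m[4*r+2]*in[2] + m[4*r+3];
    out[0] = v[0] / v[3];         /* perspective divide */
    out[1] = v[1] / v[3];
    out[2] = v[2] / v[3];
}
```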
17. Triangle Setup
• In this stage geometry information becomes raster
information (screen-space geometry is the input,
pixels are the output)
• Prior to rasterization, triangles that are backfacing
or lie outside the viewing frustum are
rejected
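The backface rejection mentioned above amounts to a signed-area test on the screen-space triangle. A minimal sketch, assuming a counter-clockwise front-face winding convention (real APIs let you configure the winding):

```c
/* Twice the signed area of triangle (a, b, c) in screen space.
 * With a counter-clockwise front-face convention, a non-positive
 * value means the triangle faces away (or is degenerate) and can
 * be rejected before rasterization. */
int is_backfacing(const float a[2], const float b[2], const float c[2]) {
    float area2 = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0]);
    return area2 <= 0.0f;
}
```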
18. Triangle Setup (cont.)
• A fragment is generated if and only if its center
is inside the triangle
• Every fragment generated has its attributes
computed as the perspective-correct
interpolation of the three vertices that make up
the triangle
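The center-inside-the-triangle rule can be sketched with edge functions: the center is inside when it lies on the same side of all three edges. This is a simplified assumption-laden sketch; real rasterizers add tie-breaking fill rules for centers that land exactly on an edge.

```c
/* Edge function: which side of directed edge a->b the point (px, py)
 * is on (sign of the 2-D cross product). */
static float edge(const float a[2], const float b[2], float px, float py) {
    return (b[0]-a[0])*(py-a[1]) - (b[1]-a[1])*(px-a[0]);
}

/* A fragment is generated iff its center lies on the same side of
 * all three triangle edges (either winding order accepted). */
int center_inside(const float v0[2], const float v1[2], const float v2[2],
                  float px, float py) {
    float e0 = edge(v0, v1, px, py);
    float e1 = edge(v1, v2, px, py);
    float e2 = edge(v2, v0, px, py);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
           (e0 <= 0 && e1 <= 0 && e2 <= 0);
}
```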
19. Fragment Processing
• Each fragment provided by triangle setup is fed
into fragment processing as a set of attributes
(position, normal, texcoord, etc.), which are used to
compute the final color for this pixel
• The computations taking place here include
texture mapping and math operations
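One of the texture-mapping operations above can be sketched as a nearest-neighbor texture lookup. The filtering mode (nearest) and edge-clamp behavior are simplifying assumptions; real fragment processing typically offers bilinear and other filters.

```c
/* Nearest-neighbor lookup: map (u, v) in [0, 1] to a texel in a
 * w x h single-channel texture, clamping to the texture edges. */
float sample_texture(const float *tex, int w, int h, float u, float v) {
    int x = (int)(u * (w - 1) + 0.5f);   /* round to nearest texel */
    int y = (int)(v * (h - 1) + 0.5f);
    if (x < 0) x = 0;                    /* clamp to edges */
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return tex[y * w + x];
}
```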
20. Memory Interface
• Fragments provided by the last stage are written to
the framebuffer
• Before the final write occurs, some fragments are
rejected by the z-buffer, stencil, and alpha tests
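The z-buffer test above can be sketched as a compare-and-update on a depth value. "Smaller z means nearer" is an assumed depth convention; the function name is illustrative.

```c
/* Depth (z-buffer) test: the fragment reaches the framebuffer only
 * if it is nearer than the depth already stored at its pixel. */
int depth_test_and_write(float *zbuffer, int index, float frag_z) {
    if (frag_z < zbuffer[index]) {
        zbuffer[index] = frag_z;   /* fragment passes: record new depth */
        return 1;                  /* caller may write the color */
    }
    return 0;                      /* fragment rejected */
}
```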
24. CUDA
• CUDA gives developers access to the
virtual instruction set and memory of the
parallel computational elements in CUDA
GPUs
• CUDA supports standard programming
languages, including C/C++, Python, and Fortran
25. Programming Model
• Threads are organized into blocks
• Blocks are organized into a grid
• A multiprocessor executes one block at a
time
• A warp is the group of threads executed
together in lockstep: 32 threads per warp
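The thread/block/warp hierarchy maps to flat element indices with simple arithmetic, shown here as host-side C for a 1-D grid (the helper names are illustrative; in a kernel the same formula uses the built-in `blockIdx`, `blockDim`, and `threadIdx` variables):

```c
/* Flat element index for a 1-D grid: each of the grid's blocks
 * holds blockDim threads. This is the standard indexing formula
 * a CUDA kernel computes per thread. */
int global_thread_index(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

/* Warps are consecutive groups of 32 threads within a block. */
int warp_index(int threadIdx) {
    return threadIdx / 32;
}
```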
26. Typical CUDA/GPU Program
1. CPU allocates storage on the GPU (cudaMalloc).
2. CPU copies input data from CPU to GPU
(cudaMemcpy).
3. CPU launches a kernel on the GPU to process the data:
kernel<<<number of blocks, threads per block>>>(arguments)
4. CPU copies results back from GPU to CPU
(cudaMemcpy).
27. Simply squaring the elements of an array
__global__ void square(float * d_out, float * d_in){
  int idx = threadIdx.x;   // index of this thread within its block
  float f = d_in[idx];
  d_out[idx] = f * f;      // each thread squares one element
}
threadIdx.x gives the index of the current thread within its block.
29. Main Program (cont.)
// transfer the input array to the GPU
cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel: 1 block of ARRAY_SIZE threads
square<<<1, ARRAY_SIZE>>>(d_out, d_in);
// copy the result array back to the CPU
cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
// print the resulting array
for (int i = 0; i < ARRAY_SIZE; i++) {
    printf("%f ", h_out[i]);
}
32. Conclusion
• GPU computing is a good choice for fine-grained
data-parallel programs with limited
communication
• GPU computing is not as good for coarse-grained
programs with a lot of communication
• The GPU has become a co-processor to the
CPU