2. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 2
What This Talk Is About
The Multicore Era
It’s time for Parallel Computing
GPGPU/CUDA
GPGPU Architecture
Parallel Programming by CUDA
Some Examples
Image Restoration (Retinex)
Feature Extraction (SIFT)
Video Cloud Computing
3.
1. The Multicore Era
for Computer Vision
Paradigm shift from Clock Speed Race
to Multicore Race
Some examples of Multicore
4.
Multicore Computing
What Is Multicore
Combines multiple processor cores
into a single chip
Multicore computing is inevitable
5.
Moore's Law
In 1965, Gordon Moore (Intel co-founder)
predicted that
the number of transistors on an IC would double
at a regular pace (popularly quoted as every 18 months)
The well-known law
• The performance of computers
doubles every 18 months
• More transistors,
more performance
Intel's CPUs kept up with
the prediction for 40 years
6.
Review of Moore's Law
Transistors in a chip did increase
7.
Problems
More transistors need high frequency
High frequency needs high power
consumption
This led to the Clock Speed Race
But clock speed has stalled at about 4 GHz
Moore's law breaks
8.
Paradigm Shift from 2000
General-purpose multicore
comes of age
Chip companies race to create
multicore processors
CPU: Intel Core Duo, Quad-core, ...
DSP: TI DaVinci
GPU: nVidia GeForce/Tesla
...
9.
The Multicore Evolution
From large mono-core to multiple lightweight cores
Pentium processor: optimized for single thread
Core Duo
In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution
10.
Moore’s Law Needs Multicore
Single core cannot fit Moore's law
Multicore can fit Moore's law if a
parallel programming model exists
[Figure: performance vs. time for single core and multi-core]
11.
Two Architectures
for Multicore
Symmetric multiprocessing (SMP)
Multicore CPU,
GPGPU,
multicore DSP
Homogeneous computing
Asymmetric multiprocessing (AMP)
CPU+GPGPU,
CPU+FPGA,
CPU+DSP
Heterogeneous computing
12.
Multicore CPU (1/2)
Two or more CPUs on a chip
Ex.: Intel Core i7
One processor with multiple execution cores
13.
Multicore CPU (2/2)
Windows Task Manager
[Figure: Task Manager CPU usage for two cores and for eight cores]
14.
GPGPU (1/2)
GPU (Graphics Processing Unit)
The processor on a graphics card that speeds
up 3D graphics
Game playing is a major application
GPGPU: General-Purpose GPU
General purpose computation using
GPU in applications other than 3D
graphics
15.
GPGPU (2/2)
GPGPU has more cores than CPU
120 ~ 512 cores
GPGPU has higher peak throughput than
multicore CPU for data-parallel work
Vendors:
nVidia
ATI
Intel
AMD
16.
Computer Vision Needs
High Performance Computing
A CV example: video processing
Intelligent video surveillance
Its complexity is high
One video: 10 Megapixels, 30 fps,
100 flops per pixel
10 M pixels × 30 fps × 100 flops = 30 GFLOPS per video
Massive data processing
Intensive computation
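The throughput estimate on this slide can be checked with a few lines of C. This is only a sketch: the function name `required_gflops` and its parameters are illustrative and do not come from the talk.

```c
#include <stdio.h>

/* Sustained throughput needed for one video stream, in GFLOPS.
   The three parameters mirror the figures quoted on the slide. */
double required_gflops(double pixels_per_frame, double frames_per_second,
                       double flops_per_pixel)
{
    return pixels_per_frame * frames_per_second * flops_per_pixel / 1e9;
}

/* Hypothetical helper: prints the budget for the slide's example
   (10 Megapixels, 30 fps, 100 flops per pixel). */
void print_video_budget(void)
{
    printf("%.1f GFLOPS\n", required_gflops(10e6, 30.0, 100.0));
}
```

With the slide's numbers, `required_gflops(10e6, 30.0, 100.0)` gives 30 GFLOPS for a single stream, so a multi-camera surveillance system multiplies that budget by the number of streams.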
17.
Approaches for HPC
Cluster/distributed computing
MapReduce (Google)
Supercomputer
(Cloud Computing)
MPI
Multi-processing
computing
Multicore CPU
Programming with multithreading
FPGA/DSP
GPGPU
Programming with CUDA
18.
However
Multicore is not a simple solution for
upgrading performance
The transition from single core to
multicore will be blocked by
software
We are not ready to face the
software programming challenges
19.
Multicore Demands Threading
20.
2. GPGPU and CUDA
GPGPU Hardware
Programming by CUDA
21.
Why GPGPU
GPGPU is many-core (> 100 cores)
Suitable for intensive parallel computing
GPGPU vs. CPU
Calculation: 367 GFLOPS vs. 32 GFLOPS
Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
22.
GPGPU Vendors
NVIDIA
ATI
Intel
AMD
…
23.
Hardware View
• PC-based
• GPGPU card as a coprocessor
From PC to PSC: Personal Super-Computer
24.
Applications of GPGPU
http://developer.nvidia.com/category/zone/cuda-zone
25.
Two New GPGPUs
from nVidia
GT200
GTX 260/280, Quadro FX 5800, Tesla C1060
Fermi
Tesla 2060
[Figure: CPU (host, multicore) with Control, ALUs, Cache, and DRAM vs. GPU (device, many-core) with its own DRAM]
26.
nVidia GPGPU Architecture
SM/SP (Streaming Multiprocessor / Streaming
Processor) + Shared memory + DRAM
27.
Memory Hierarchy
On-Chip Memory
Registers
Shared Memory
Constant Cache and Texture Cache
Off-Chip Memory (device DRAM)
Local Memory
Global Memory
Constant Memory and Texture Memory (read through the on-chip caches)
28.
Parallel Computing
[Figure: serial computing vs. parallel computing on GPGPU cores]
29.
Parallel Programming
Much code is written in C/C++/Java
Especially algorithmic programs
Can we write GPGPU parallel
programs in C/C++/Java?
However, C/C++ is sequential
Three control structures of C/C++/Java:
sequence, selection, repetition
30.
Multi-threading
Multi-threading is the most
important technique for parallel
programming
Established techniques
Pthread, Win32 thread, OpenMP,
MPI, Intel TBB (Threading Building
Blocks)...
New techniques
CUDA, OpenCL, ...
31.
Parallel Programming in
Sequential Language
Do we need to learn new languages for
multi-threading?
No
Write multi-threaded code in C/C++
Add functions/directives to C/C++ for
multi-threading
This is how current solutions work
pthread, Win32 thread, OpenMP,
MPI, CUDA, OpenCL, ...
32.
CUDA
CUDA: Compute Unified Device
Architecture
Parallel programming
for nVidia's GPGPU
Uses the C/C++ language
Java, Fortran, and MATLAB can also be used
When executing CUDA programs,
the GPU operates as a coprocessor to
the main CPU
33.
CUDA Hardware Environment:
CPU+GPU
CPU
Organizes, interprets, and
communicates information
GPU
Handles the core processing on large quantities
of parallel information
[Figure: CPU connected to GPU through PCI-E]
Compute-intensive portions of applications
that are executed many times, but on different
data, are extracted from the main application
and compiled to execute in parallel on the GPU
35.
Processing Flow on CUDA
1. Allocate device memory (memory for GPU)
2. Copy the processing data from main memory to the memory for GPU
3. The CPU instructs the processing
4. Execute in parallel in each core
5. Copy the result back to main memory
6. Release device memory
36.
Programming with
Memory Hierarchy
Locality
principle
Temporal
locality
Spatial
locality
37.
Example - Hello World(1/3)
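#include <stdio.h>

__global__ void hello(char* hello1, char* hello2);  // kernel, defined on the next slide

int main()
{
    char src[12] = "Hello World";   // host: source string (11 characters + terminator)
    char h_hello[12];               // host: buffer for the result
    char* d_hello1;                 // device: copy of src
    char* d_hello2;                 // device: buffer written by the kernel
    cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
    cudaMalloc((void**) &d_hello2, sizeof(char) * 12);
    cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);
    hello<<<1,1>>>(d_hello1, d_hello2);   // call the kernel function (1 block, 1 thread)
    // Remainder is an assumed completion (slide 3/3 is not in this transcript):
    // copy the result back, print it, and release device memory
    cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);
    printf("%s\n", h_hello);
    cudaFree(d_hello1);
    cudaFree(d_hello2);
    return 0;
}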
38.
Example - Hello World(2/3)
Kernel Function
__global__ void hello(char* hello1, char* hello2)
{
    int k;
    // Copy characters up to the string terminator
    for (k = 0; hello1[k] != '\0'; k++) {
        hello2[k] = hello1[k];
    }
    hello2[k] = '\0';   // terminate the copy so the host can print it
}
No parallel processing in this example
40.
Parallelization
Multicore/Multi-threading
Data Parallelization
Data distribution
Parallel convolution
Reduction algorithm
Amdahl’s law
Memory Hierarchy Management
Locality principle
Program accesses a relatively small portion
of the address space at any instant of time
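The locality principle can be made concrete with two traversals of the same image. This is a minimal sketch in plain C, assuming an 8-bit image stored row-major in one contiguous array; the function names are illustrative and not from the talk.

```c
#include <stddef.h>

/* Sum a rows x cols image stored row-major in one contiguous array.
   The inner loop walks adjacent addresses (spatial locality), and
   `sum` is reused on every iteration (temporal locality). */
long sum_row_major(const unsigned char *img, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += img[i * cols + j];
    return sum;
}

/* Same result, but consecutive accesses are `cols` bytes apart,
   so spatial locality is poor for large images. */
long sum_col_major(const unsigned char *img, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += img[i * cols + j];
    return sum;
}
```

Both functions return the same value; only the access order differs. On a cached CPU the row-major version usually runs faster on large images, and the same reasoning guides how threads should be mapped to memory addresses on a GPGPU.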
42.
3. Image Restoration
(Retinex) by CUDA
43.
Image Restoration
Restore and enhance an image
Its complexity is high for large images
Complexity: O(N^2) ~ O(N^3)
[Figure: original image vs. restored image]
44.
Algorithms for
Image Restoration
Wiener Filter
Histogram Based Approach
Histogram Equalization,
Histogram Modification, …
Retinex
Path-based Retinex
Recursive Retinex
Center/surround Retinex
No iterative process, so it is suitable for parallelization
Multi-Scale Retinex with Color Restoration
(MSRCR) [Rahman et al. 1997]
45.
MSRCR Algorithm
$$R_i(x,y) = r_i(x,y) \sum_{k=1}^{n} W_k \left[ \log I_i(x,y) - \log\big( F_k(x,y) * I_i(x,y) \big) \right], \quad i \in \{R, G, B\}$$
$R_i(x,y)$: the MSRCR output
$I_i(x,y)$: the original image distribution in the $i$th spectral band
$F_k(x,y)$: the $k$th Gaussian surround function
$*$: the convolution operation
$W_k$: the weight
$r_i(x,y)$: the color restoration factor in the $i$th spectral band
$$r_i(x,y) = \beta \log\left[ \frac{\alpha\, I_i(x,y)}{\sum_{j=1}^{N} I_j(x,y)} \right]$$
$N$: the number of spectral bands
$\beta$: the gain constant
$\alpha$: controls the strength of the nonlinearity
46.
Decompose the Problem
Two basic approaches to partition
computational work
Domain decomposition (GPGPU)
Partition the data used
in solving the problem
Function decomposition (CPU)
Partition the jobs (functions)
from the overall work (problem)
GPGPU and CPU cooperate
47.
Multi-Threading
A program running
In Serial
In Parallel
http://en.wikipedia.org/wiki/Thread_(computer_science)
48.
Domain Decomposition (1/3)
An image example
It is 2D data
Three popular partition ways
49.
Domain Decomposition (2/3)
Domain data are usually processed
by loops
for (i=0; i<height; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: the X-ray image of a circuit board; original image (img1) and enhanced image (img2), with row index i and column index j]
50.
Domain Decomposition (3/3)
A three-block partition example (OpenMP, CUDA (SPMD))
fork(threads)
// Thread 1 (subdomain 1)
for (i=0; i<height/3; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
// Thread 2 (subdomain 2)
for (i=height/3; i<height*2/3; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
// Thread 3 (subdomain 3)
for (i=height*2/3; i<height; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
join(barrier)
[Figure: image rows i=0..11 split into subdomain 1 (i=0..3), subdomain 2 (i=4..7), and subdomain 3 (i=8..11)]
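The three hand-written loops generalize to any number of threads once each thread computes its own row range. The sketch below is plain C under stated assumptions: `row_range` and `process_subdomain` are illustrative names, the image is stored as a flat row-major array, and `invert_pixel` is only a stand-in for the slide's `RemoveNoise()`.

```c
/* Half-open row range [*begin, *end) owned by thread t (0-based) out of
   nthreads, using the same integer arithmetic as the slide
   (height/3, height*2/3, height). */
void row_range(int height, int nthreads, int t, int *begin, int *end)
{
    *begin = height * t / nthreads;
    *end   = height * (t + 1) / nthreads;
}

/* Hypothetical stand-in for RemoveNoise(): inverts one 8-bit pixel. */
static unsigned char invert_pixel(unsigned char p)
{
    return (unsigned char)(255 - p);
}

/* Work done by thread t: it touches only the rows of its own subdomain,
   so the threads never write to the same pixel. */
void process_subdomain(const unsigned char *img1, unsigned char *img2,
                       int height, int width, int nthreads, int t)
{
    int begin, end;
    row_range(height, nthreads, t, &begin, &end);
    for (int i = begin; i < end; i++)
        for (int j = 0; j < width; j++)
            img2[i * width + j] = invert_pixel(img1[i * width + j]);
}
```

For `height = 12` and `nthreads = 3` the ranges are rows 0..3, 4..7, and 8..11, matching the three subdomains above. In a threaded program each call to `process_subdomain` would run on its own thread between the fork and the join.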
51.
The Method
CPU: Copy data from CPU to GPGPU
GPGPU: Gaussian Blur
GPGPU: Log-domain Processing
GPGPU: Normalization
GPGPU: Histogram Stretching
CPU: Copy data from GPGPU to CPU
CPU: Intel Core 2, 2 cores (3.0 GHz)
GPGPU: Tesla C1060, 240 SPs (1.296 GHz)
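The last GPGPU stage, histogram stretching, is sketched below on the CPU in plain C. The slides do not say which stretch is used, so this assumes a simple linear min-max stretch of 8-bit pixels; the function name is illustrative.

```c
#include <stddef.h>

/* Linear min-max stretch of 8-bit pixels to the full range 0..255, in place.
   Assumption: the talk does not specify the stretch; this is the simplest one. */
void stretch_histogram(unsigned char *px, size_t n)
{
    if (n == 0)
        return;

    /* Pass 1: find the darkest and brightest pixel (a reduction). */
    unsigned char lo = px[0], hi = px[0];
    for (size_t i = 1; i < n; i++) {
        if (px[i] < lo) lo = px[i];
        if (px[i] > hi) hi = px[i];
    }
    if (hi == lo)
        return;                 /* flat image: nothing to stretch */

    /* Pass 2: remap every pixel independently (data parallel). */
    for (size_t i = 0; i < n; i++)
        px[i] = (unsigned char)((px[i] - lo) * 255 / (hi - lo));
}
```

The two passes show the two patterns this talk relies on: the min/max search is a reduction, and the remapping loop has no dependencies between pixels, so each pixel could be handled by its own thread.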
52.
Parallelization by GPGPU
Multicore/Multi-threading
Tesla C1060: 240 SPs (Stream Processors)
CUDA: Thread, Block, Grid
Data Parallelization
Parallel convolution
[Figure: parallel convolution; the image is partitioned into blocks of pixels that are assigned to processing elements (PE)]
Parallel reduction
[Figure: parallel reduction of A(0)..A(7) across PEs 0..7 and time steps t0..t5: first A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7); then A(0)+A(1)+A(2)+A(3) and A(4)+A(5)+A(6)+A(7); then the sum]
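The reduction pattern in the figure can be written as a pairwise (tree) sum. The sketch below is sequential C that only mimics the schedule; the function name is illustrative. Each pass of the outer loop corresponds to one step of the tree, and the iterations of the inner loop are independent, so on a GPGPU they could run on separate cores.

```c
#include <stddef.h>

/* In-place pairwise (tree) sum of n >= 1 values; the total ends up in a[0].
   Step 1 adds neighbours (stride 1), step 2 adds partial sums two apart,
   and so on, so 8 values need 3 steps instead of 7 sequential additions. */
int tree_reduce_sum(int *a, size_t n)
{
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];   /* independent for each i within a step */
    return a[0];
}
```

For eight inputs the partial sums follow the figure: after the first step `a[0]` holds A(0)+A(1), after the second it holds A(0)+A(1)+A(2)+A(3), and after the third it holds the full sum.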
79.
Issues with Parallelization
Good parallel programs
Execute correctly
with good speedup
Ideal speedup by Amdahl's law (fully parallel program)
Speedup = N if you have N cores
However, ideal speedup is not reached in practice
because of parallel overhead, such as
Data communication
Data dependencies and synchronization
Other issues: design overhead
No free lunch for software development
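The slide quotes only the ideal case. The general form of Amdahl's law, speedup = 1 / ((1 - p) + p / N) for a parallelizable fraction p on N cores, shows how quickly the serial part limits the gain. A minimal sketch in C follows; the function name is illustrative.

```c
/* Amdahl's law: speedup on n cores when a fraction p (between 0 and 1)
   of the run time can be parallelized. p = 1 gives the ideal speedup n. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 1 and 8 cores the result is the ideal 8; with p = 0.9 it drops to about 4.7, before any communication or synchronization overhead is counted.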
80.
Parallel Computing on GPGPU
CUDA can only parallelize code for
nVidia's GPGPU
CUDA’s programming model:
Multithread
SPMD (Single Program Multiple Data)
Best-performance CUDA code needs
optimization
Native code can be improved by CUDA
2~3 times
Optimization can be achieved by
Data parallelism, Thread parallelism, Data
localization
81.
Programming Challenges
of CUDA
We have to manually parallelize the
algorithm
We need expertise in
Algorithms of image and signal processing
Filtering, frequency analysis, compression,
feature extraction, recognition, ...
Theory, tools and methodology of parallel
computing
Communication, synchronization, resource
management, load balancing, debugging, ...
82.
GPUs for Multimedia
3.5X: PowerDirector7 Ultra
10X: CUDA JPEG Decoder
10X: DivideFrame GPU Decoder, GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
26X: Hyperspectral Image Compression on NVIDIA GPUs
10X: Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
83.
GPUs for Computer Vision(1/2)
87X: CUDA SURF – A Real-time Implementation for SURF (TU Darmstadt)
26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
100X: Image Denoising with Bilateral Filter (Wrocław University of Technology)
85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
100X: Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
84.
GPUs for Computer Vision(2/2)
20X: GPU for Surveillance
13X: Fast Human Detection with Cascaded Ensembles
109X: Fast Sliding-Window Object Detection
263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
300X: Audience Measurement – Real-time Video Analysis for Counting People, Face Detection and Tracking
10X: Real-time Visual Tracker by Stream Processing
45X: A GPU Accelerated Evolutionary Computer Vision System
3X: Canny Edge Detection
85.
The ParLab in Berkeley
The Parallel Computing Lab at UC
Berkeley
http://parlab.eecs.berkeley.edu
The ParLab offers programmers a
practical introduction to parallel
programming techniques and tools on
current parallel computers,
emphasizing multicore and manycore
computers.
86.
Multicore Programming
Practice (MPP)
Goal: Write portable C/C++
programs to be "Multicore ready"
and platform compatible
Proposed by an
MPP working group
in the Multicore
Association
http://www.multicore-association.org/workgroup/mpp.php
87.
Special Conference
HPEC: High Performance Embedded
Computing,
MIT Lincoln Lab, 1997 ~