Parallel Vision by GPGPU/CUDA


Parallel Vision with GPGPU/CUDA
Yuan-Kai Wang (王元凱)
Department of Electrical Engineering, Fu Jen Catholic University
Email: ykwang@mail.fju.edu.tw
URL: http://www.ykwang.tw
2011/10/07

This work is licensed under the Creative Commons Attribution 3.0 Taiwan license

Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 2

What about this Talk
 The Multicore Era
 It’s time for Parallel Computing
 GPGPU/CUDA
 GPGPU Architecture
 Parallel Programming by CUDA
 Some Examples
 Image Restoration (Retinex)
 Feature Extraction (SIFT)
 Video Cloud Computing


1. The Multicore Era
for Computer Vision
 Paradigm shift from Clock Speed Race
to Multicore Race
 Some examples of Multicore


Multicore Computing
 What Is Multicore
 Combines multiple processor cores
into a single chip
 Multicore computing is inevitable


Moore's Law
 In 1965, Gordon Moore (Intel co-founder)
predicted
 The number of transistors on an IC would double
every year (revised in 1975 to every two years)
 The popular version of the law
• The performance of computers
doubles every 18 months
• More transistors
 More performance
 The prediction held
for Intel's CPUs for 40 years


Review of Moore's Law
 Transistors in a chip did increase


Problems
 More transistors need higher frequency
 Higher frequency needs higher power
consumption
 We entered the Clock Speed Race
 But about 4 GHz has been the limit
 The performance version of Moore's law breaks


Paradigm Shift from 2000
 General-purpose multicore
comes of age
 Chip companies race to create
multicore processors
 CPU: Intel Core Duo, Quad-core, ...
 DSP: TI DaVinci
 GPU: nVidia GeForce/Tesla
 ...


The Multicore Evolution
From large mono-core to multiple lightweight cores

Pentium processor: optimized for a single thread
Core Duo
In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution


Moore’s Law Needs Multicore
 Single core cannot fit Moore's law
 Multicore can fit Moore's law if a
parallel programming model exists
[Figure: performance versus time, with one curve for single core and one for multicore]


Two Architectures
for Multicore
 Symmetric multiprocessing (SMP)
 Multicore CPU,
GPGPU,
multicore DSP
 Homogeneous computing
 Asymmetric multiprocessing (AMP)
 CPU+GPGPU,
CPU+FPGA,
CPU+DSP
 Heterogeneous computing


Multicore CPU (1/2)
 Two or more CPUs on a chip
 Ex.: Intel Core i7

[Figure: one processor with multiple execution cores]


Multicore CPU (2/2)
 Windows Task Manager
[Screenshots: Task Manager on a two-core machine and on an eight-core machine]


GPGPU (1/2)
 GPU (Graphics Processing Unit)
 The processor in graphics card to speed
up 3D graphics
 Game playing
is a major
application
 GPGPU: General-Purpose GPU
 General purpose computation using
GPU in applications other than 3D
graphics


GPGPU (2/2)
 GPGPU has more cores than CPU
 120 ~ 512 cores
 GPGPU is more powerful than
multicore CPU
 Vendors:
 nVidia
 ATI
 Intel
 AMD


Computer Vision Needs
High Performance Computing
 A CV example: video processing
 Intelligent video surveillance,
 Its complexity is high
 One video: 10 Megapixels, 30fps,
 100 flops per pixel
  30 Gigaflops per video
 Massive data processing
 Intensive computation
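The 30 Gigaflops figure above is the product of the three quoted factors. A minimal C sketch of that arithmetic follows; the function name `required_gflops` is a hypothetical helper introduced here for illustration and is not part of the talk.

```c
#include <assert.h>

/* Hypothetical helper: sustained GFLOPS needed to process one video stream.
 * The three arguments are the factors quoted on the slide. */
double required_gflops(double pixels_per_frame, double frames_per_second,
                       double flops_per_pixel)
{
    assert(pixels_per_frame >= 0.0 && frames_per_second >= 0.0 &&
           flops_per_pixel >= 0.0);
    /* flops per second, scaled down to units of 10^9 */
    return pixels_per_frame * frames_per_second * flops_per_pixel / 1e9;
}
```

Calling `required_gflops(10e6, 30.0, 100.0)` reproduces the slide's estimate of 30 Gigaflops for a single 10-megapixel, 30 fps stream.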


Approaches for HPC
 Cluster/distributed computing (supercomputer, cloud computing)
 MapReduce (Google)
 MPI
 Multi-processing computing
 Multicore CPU
 Programming with multithreading
 FPGA/DSP
 GPGPU
 Programming with CUDA


However
 Multicore is not a simple solution for
upgrading performance
 The transition from single core to
multicore will be blocked by
software
 We are not ready to face the
software programming challenges


Multicore Demands Threading


2. GPGPU and CUDA
 GPGPU Hardware
 Programming by CUDA


Why GPGPU
 GPGPU is many-core (> 100 cores)
 Suitable for intensive parallel computing
 GPGPU vs. CPU
 Calculation: 367 GFLOPS vs. 32 GFLOPS
 Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s


GPGPU Vendors
 NVIDIA
 ATI
 Intel
 AMD
 …


Hardware View
• PC-based
• GPGPU card as a coprocessor

From PC to PSC : Personal Super-Computer


Applications of GPGPU

http://developer.nvidia.com/category/zone/cuda-zone


Two New GPGPUs
from nVidia
 GT200
 GTX 260/280, Quadro FX 5800, Tesla C1060
 Fermi
 Tesla 2060
[Figure: CPU (host), multicore, with control logic, a few ALUs, cache, and DRAM, next to GPU (device), many-core, with DRAM]


nVidia GPGPU Architecture
 SM/SP (Streaming multiprocessor/Streaming
processor) + Shared memory + DRAM


Memory Hierarchy
 On-Chip Memory
 Registers
 Shared Memory
 Constant Memory (cached on chip; the data itself resides off chip)
 Texture Memory (cached on chip; the data itself resides off chip)
 Off-Chip Memory
 Local Memory
 Global Memory


Parallel Computing
 Serial
Computing

GPGPU Cores

 Parallel
Computing


Parallel Programming
 Much code is written in C/C++/Java
 Especially algorithmic programs
 Can we write GPGPU parallel
programs in C/C++/Java?
 However, C/C++ is sequential
 Three control structures of C/C++/Java:
sequence, selection, repetition


Multi-threading
 Multi-threading is the most
important technique for parallel
programming
 Some techniques are ready
 Pthread, Win32 thread, OpenMP,
MPI, Intel TBB (Threading Building
Blocks)...
 New techniques
 CUDA, OpenCL, ...


Parallel Programming in
Sequential Language
 Do we need to learn new languages for
multi-threading?
 No
 Write multi-threaded code in C/C++
 Add functions/directives to C/C++ for
multi-threading
 That is how current solutions work
 pthread, Win32 thread, OpenMP,
MPI, CUDA, OpenCL, ...


CUDA
 CUDA: Compute Unified Device
Architecture
 Parallel programming
for nVidia's GPGPU
 Use C/C++ language
 Java, Fortran, and Matlab can also be used
 When executing CUDA programs,
the GPU operates as coprocessor to
the main CPU


CUDA Hardware Environment:
CPU+GPU
 CPU
 Organizes, interprets, and
communicates information
[Figure: CPU connected to GPU over PCI-E]
 GPU
 Handles the core processing on large quantities
of parallel information
 Compute-intensive portions of applications
that are executed many times, but on different
data, are extracted from the main application
and compiled to execute in parallel on the GPU


CUDA Software Stack


Processing Flow on CUDA
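[Figure: CPU with main memory on the host side, GPU with its own memory on the device side]
1. Allocate device memory
2. Copy the processing data from main memory to GPU memory
3. The CPU instructs the processing
4. Execute in parallel in each core
5. Copy the result from GPU memory back to main memory
6. Release device memory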


Programming with
Memory Hierarchy
 Locality
principle
 Temporal
locality
 Spatial
locality
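The two kinds of locality can be seen in a plain C loop over an image. The sketch below is an illustration added here, not code from the talk; the function names are hypothetical and the image is assumed to be stored row-major in one flat array. Both functions return the same sum; they differ only in access order. The accumulator `sum` is reused on every iteration (temporal locality), and the first function reads neighbouring addresses one after another (spatial locality), while the second jumps `width` bytes between reads.

```c
#include <assert.h>
#include <stddef.h>

/* Visits pixels in storage order (row-major): consecutive accesses touch
 * neighbouring addresses, which is the spatially local pattern. */
long sum_row_order(const unsigned char *img, size_t height, size_t width)
{
    long sum = 0; /* reused every iteration: temporal locality */
    assert(img != NULL);
    for (size_t i = 0; i < height; i++)
        for (size_t j = 0; j < width; j++)
            sum += img[i * width + j];
    return sum;
}

/* Same result, but each access is `width` bytes away from the previous
 * one, so spatial locality is poor on large images. */
long sum_column_order(const unsigned char *img, size_t height, size_t width)
{
    long sum = 0;
    assert(img != NULL);
    for (size_t j = 0; j < width; j++)
        for (size_t i = 0; i < height; i++)
            sum += img[i * width + j];
    return sum;
}
```

On a cached CPU the first order is usually faster for large images; how much depends on the hardware, so no timing is claimed here.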


Example - Hello World(1/3)
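#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// The kernel is defined on the next slide; declare it before use
__global__ void hello(char* hello1, char* hello2);

int main()
{
    char src[12] = "Hello World";
    char h_hello[12];
    char* d_hello1;
    char* d_hello2;

    // Allocate two 12-byte buffers in device memory
    cudaMalloc((void**) &d_hello1, sizeof(char)*12);
    cudaMalloc((void**) &d_hello2, sizeof(char)*12);
    // Copy the source string from host to device
    cudaMemcpy(d_hello1, src, sizeof(char)*12, cudaMemcpyHostToDevice);
    // Call the kernel function with one block of one thread
    hello<<<1,1>>>(d_hello1, d_hello2);
[Figure: host buffers src and h_hello; device buffers d_hello1 and d_hello2]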


 Kernel Function

__global__ void hello(char* hello1, char* hello2)
{
    int k;

    // Copy the string one character at a time, up to the terminator
    for (k = 0; hello1[k] != '\0'; k++) {
        hello2[k] = hello1[k];
    }
    hello2[k] = '\0';  // copy the terminator too, so the host can print the result
}
No parallel processing in this example


    // Copy the result from device back to host
    cudaMemcpy(h_hello, d_hello2, sizeof(char)*12, cudaMemcpyDeviceToHost);

    printf("%s\n", h_hello);
    // Release device memory
    cudaFree(d_hello1);
    cudaFree(d_hello2);
    system("pause");
    return 0;
}
Result:


Parallelization
 Multicore/Multi-threading
 Data Parallelization
 Data distribution
 Parallel convolution
 Reduction algorithm
 Amdahl’s law
 Memory Hierarchy Management
 Locality principle
 Program accesses a relatively small portion
of the address space at any instant of time
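Parallel convolution works because every output sample depends only on the input, never on another output. A minimal 1-D sketch in plain C follows; it is an illustration added here, not the talk's implementation, and the function name, the integer data type, and the clamped border handling are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* 1-D convolution with a (2*radius+1)-tap kernel; indices outside the
 * input are clamped to the nearest valid sample (an assumed border rule).
 * The iterations of the outer loop are independent of each other, so each
 * output sample could be computed by a different thread. */
void convolve1d(const int *in, int *out, size_t n,
                const int *kernel, size_t radius)
{
    assert(in != NULL && out != NULL && kernel != NULL && in != out);
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t k = 0; k <= 2 * radius; k++) {
            /* convolution flips the kernel: tap k reads in[i + radius - k] */
            long src = (long)i + (long)radius - (long)k;
            if (src < 0) src = 0;
            if (src >= (long)n) src = (long)n - 1;
            acc += kernel[k] * in[src];
        }
        out[i] = acc;
    }
}
```

For a symmetric kernel such as a sampled Gaussian the flip makes no difference. A 2-D blur can be built from two such passes, one along rows and one along columns, when the kernel is separable.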


Develop Multi-thread Program
 Identify parallelism: Analyze algorithm
 Express parallelism: Write parallel code
 Validate parallelism: Debug & verify
parallel code
 Optimize parallelism: enhance parallel
performance


3. Image Restoration
(Retinex) by CUDA


Image Restoration
 Restore and enhance an image
 Its complexity is high for large images

[Figure: original image and restored image]
Complexity: O(N^2) ~ O(N^3)


Algorithms for
Image Restoration
 Wiener Filter
 Histogram Based Approach
 Histogram Equalization,
Histogram Modification, …
 Retinex
 Path-based Retinex
 Recursive Retinex
 Center/surround Retinex
 Has no iterative process, so it is suitable for parallelization
 Multi-Scale Retinex with Color Restoration
(MSRCR) [Rahman et al. 1997]


MSRCR Algorithm

$$R_i(x,y) = r_i(x,y) \sum_{k=1}^{n} W_k \left[ \log I_i(x,y) - \log\big( F_k(x,y) * I_i(x,y) \big) \right], \quad i \in \{R, G, B\}$$

 $R_i(x,y)$: the MSRCR output
 $I_i(x,y)$: the original image distribution in the ith spectral band
 $F_k(x,y)$: the kth Gaussian surround function
 $*$: the convolution operation
 $W_k$: the weight
 $r_i(x,y)$: the color restoration factor in the ith spectral band

$$r_i(x,y) = \beta \log\left( \frac{\alpha\, I_i(x,y)}{\sum_{j=1}^{N} I_j(x,y)} \right)$$

 $N$: the number of spectral bands
 $\beta$: the gain constant
 $\alpha$: controls the strength of the nonlinearity


Decompose the Problem
 Two basic approaches to partition
computational work
 Domain decomposition (GPGPU)
 Partition the data used
in solving the problem
 Function decomposition (CPU)
 Partition the jobs (functions)
from the overall work (problem)
 The CPU and the GPGPU cooperate


Multi-Threading
 A program running
In Serial

In Parallel

http://en.wikipedia.org/wiki/Thread_(computer_science)


Domain Decomposition (1/3)
 An image example
 It is 2D data
 Three popular partition ways


 Domain data are usually processed
by a loop
 for (i=0; i<height; i++)
        for (j=0; j<width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);

The X-ray image
of a circuit board
Original image(img1) Enhanced image(img2)


A three-block partition example (OpenMP, CUDA (SPMD))
fork(threads)
 // Thread 1: subdomain 1 (i = 0..3 when height = 12)
for (i=0; i<height/3; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
 // Thread 2: subdomain 2 (i = 4..7 when height = 12)
for (i=height/3; i<height*2/3; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
 // Thread 3: subdomain 3 (i = 8..11 when height = 12)
for (i=height*2/3; i<height; i++)
    for (j=0; j<width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
join(barrier)
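The three row ranges on this slide follow one pattern that generalizes to any number of threads. A minimal C sketch, with the hypothetical helper name `block_range` introduced here for illustration:

```c
#include <assert.h>

/* Hypothetical helper: computes the half-open row range [*begin, *end)
 * handled by thread t (0-based) when `height` rows are split into
 * `nthreads` contiguous blocks. */
void block_range(int height, int nthreads, int t, int *begin, int *end)
{
    assert(height >= 0 && nthreads > 0 && t >= 0 && t < nthreads);
    assert(begin != 0 && end != 0);
    /* long arithmetic avoids overflow of height * t for large images */
    *begin = (int)((long)height * t / nthreads);
    *end   = (int)((long)height * (t + 1) / nthreads);
}
```

With `height = 12` and `nthreads = 3` this yields rows 0..3, 4..7 and 8..11, matching the three subdomains above. When `height` is not a multiple of `nthreads`, block sizes differ by at most one row and every row is covered exactly once.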


The Method
CPU: Intel Core 2, 2 cores (3.0 GHz)
GPGPU: Tesla C1060, 240 SPs (1.296 GHz)
1. Copy data from CPU to GPGPU
2. Gaussian blur (GPGPU)
3. Log-domain processing (GPGPU)
4. Normalization (GPGPU)
5. Histogram stretching (GPGPU)
6. Copy data from GPGPU to CPU


Parallelization by GPGPU
 Multicore/Multi-threading
 Tesla C1060 : 240 SP (Stream Processor)
 CUDA: Thread, Block, Grid
 Data Parallelization
[Figure: an image of M pixels distributed over processing elements (PEs), one pixel block per PE]
[Figure: tree reduction over time steps t0..t5 on PEs 0..7: first A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7); then A(0)+A(1)+A(2)+A(3) and A(4)+A(5)+A(6)+A(7); then the final sum]
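The reduction pattern in the figure can be sketched in plain C. This is a serial CPU emulation added for illustration, not the talk's CUDA code, and the function name is hypothetical. Within one step all additions are independent, which is what lets a GPGPU run them on separate threads.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a[0..n-1] with a pairwise (tree) reduction and returns the total.
 * In each step, element i absorbs element i + stride. The additions
 * inside one step touch disjoint elements, so they could run in parallel;
 * here they run one after another. The array is overwritten. */
long tree_reduce_sum(long *a, size_t n)
{
    assert(a != NULL && n > 0);
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            a[i] += a[i + stride];
        }
    }
    return a[0];
}
```

For eight elements the outer loop runs three times (stride 1, 2, 4), reproducing the partial sums A(0)+A(1), then A(0)+A(1)+A(2)+A(3), then the full sum. A serial sum needs seven dependent additions, while the tree needs only three parallel steps.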


Our Memory Hierarchy

[Figure: the four parallel stages (Gaussian blur, log-domain processing, normalization, histogram stretching) mapped onto texture, constant, global, and shared memory]


Experimental Results (1/2)

Original images CPU results GPGPU results


Experimental Results (2/2)

Original images CPU results GPGPU results


GPGPU Speedup over CPU
[Figure: log-log plot of speedup (10^1 to 10^2) versus M (10^2 to 10^4), with curves Speedup_N, Speedup_P, and Speedup_NPP; annotated values 74x and 2x]
• Ideal speedup: 240 * (1.296GHz / 3GHz) ≈ 103
• NPP: NVIDIA Performance Primitives
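The ideal-speedup estimate above scales the number of GPU cores by the ratio of clock rates. A small C sketch of that arithmetic follows; both function names are hypothetical helpers added here, and the efficiency ratio is an extra illustration rather than a figure from the talk.

```c
#include <assert.h>

/* The slide's estimate: GPU core count scaled by the GPU/CPU clock ratio. */
double ideal_speedup(int gpu_cores, double gpu_ghz, double cpu_ghz)
{
    assert(gpu_cores > 0 && gpu_ghz > 0.0 && cpu_ghz > 0.0);
    return gpu_cores * (gpu_ghz / cpu_ghz);
}

/* Fraction of the ideal speedup that a measured speedup reaches. */
double parallel_efficiency(double measured, double ideal)
{
    assert(measured >= 0.0 && ideal > 0.0);
    return measured / ideal;
}
```

`ideal_speedup(240, 1.296, 3.0)` evaluates to 103.68, which the slide rounds down to 103. Against that bound, the 74x annotated in the plot corresponds to an efficiency of about 0.71.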


4. Feature Extraction
(SIFT) by CUDA


What Is SIFT
 SIFT
 Scale Invariant Feature Transform
 Invariance of feature points
 Translation
 Rotation
 Scale


Applications of SIFT
Object recognition/tracking
Image retrieval
Autostitch


Parallelize SIFT by GPGPU

CPU: Intel Q9400, quad cores (2.66 GHz)
GPGPU: GeForce GTS 250, 128 SPs (1.836 GHz)


Experimental Results
CPU GPU


Execution Time

CPU: 10 seconds on average
GPGPU: 0.8 seconds on average
[Figure: execution-time chart, vertical axis in ms]


Speedup

13x speedup on average


5. Video
Cloud Computing
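Wide-area surveillance of outdoor sites and campuses
• A large number of cameras
• Challenges to system stability

Technical features
• Covers both cloud computing and embedded systems
• Central-control display that integrates electronic maps, events, and video synopsis
• Detection techniques that overcome the effects of outdoor weather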

A Campus Monitoring System
Control-room technology demonstration area
People-event technology demonstration area
Vehicle-event technology demonstration area


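I. People-Event Technology Demonstration

[Map labels: Electronics and Information Research Building; National Chiao Tung University campus; motorcycle ring road; Science Park]

 Wall-climbing and restricted-area intrusion detection
 Embedded PTZ camera tracking
 Camera anomaly detection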

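1.1 Wall-Climbing and Restricted-Area Intrusion Detection

Detects whether anyone climbs over the wall along the motorcycle ring road
behind the Electronics and Information Research Building, where the campus
borders the Science Park, and sends an alarm.
[Figure: campus map]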


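1.2 Embedded PTZ Camera Tracking
A front-end fixed surveillance system provides the initial position of the
object to be tracked.
An embedded platform tracks the moving object and controls the PTZ camera
lens.
[Figure: campus map]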


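1.3 Camera Anomaly Detection

A cloud platform runs camera anomaly detection simultaneously on multiple
cameras along the ring road and at the Electronics and Information Research
Building. (GPGPU)
When a camera at the building is deliberately sabotaged (simulated in the
demonstration), the system detects it and raises an alarm.
False alarms from ring-road cameras with heavy pedestrian traffic are
effectively rejected.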


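II. Vehicle-Event Technology Demonstration

 Embedded illegal-parking detection
(with person-feature detection in dynamic scenes)
 Outdoor parking-lot vacancy detection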


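2.1 Embedded Illegal-Parking Detection

 An embedded platform detects illegally parked vehicles and drives a PTZ
camera to capture close-up images of the event.
 Face detection on multi-resolution image sequences is used to stop the
PTZ camera's close-up tracking. (GPGPU)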


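2.2 Outdoor Parking-Lot Vacancy Detection
Detects the status of the spaces in a large parking lot and displays the
locations of vacant spaces.
When a vehicle has parked in a vacant space, that space is shown as
occupied.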


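III. Control-Room Technology Demonstration

Control room of the intelligent community event security surveillance
system

 Electronic-map-based control-room display
(central video and management system)
 Multi-resolution wide-area surveillance
 Efficient video event retrieval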


3.1 Electronic-Map-Based Control-Room Display
Integrates all heterogeneous surveillance information with Google Map.
Video
Event
Geography


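3.2 Multi-Resolution Wide-Area Surveillance

[Figure labels: steerable projector; fixed projector]
Foveated ("large eye, small eye") multi-resolution display
 Integrated with Google Earth
 GPGPU hardware acceleration of the image alignment computation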


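3.3 Efficient Video Event Retrieval
 Converts lengthy surveillance video into a concise synopsis video, so
that users can review a full day of events from a chosen camera in a short
time.
Browse the condensed video
[Figure: time axis (3:00, 5:00) over the campus map]

Uses space to compress time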


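System Architecture
[Figure: architecture diagram. Recoverable labels: ring road and Electronics and Information Research Building; parking lot (legal parking) x5; parking lot x8; person illegally climbing the wall; roadside illegal parking; wall climbing; CMS; HVR; CAD; 3D; 608]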

Issues with Parallelization
 Good parallel programs
 Execute correctly
 with good speedup
 Ideal speedup
 Speedup = N if you have N cores
(Amdahl's law with no serial part)
 However, no ideal speedup exists
 Because of parallel overhead, such as
• Data communication
• Data dependencies and synchronization
 Other issues: design overhead
 No free lunch for software development
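Amdahl's law makes the gap between ideal and real speedup concrete: if a fraction p of the run time can be parallelized, the speedup on N cores is S(N) = 1 / ((1 - p) + p / N). A minimal C sketch follows; the function name is a hypothetical helper added here, and the law itself ignores the communication and synchronization overheads listed above.

```c
#include <assert.h>

/* Amdahl's law: speedup on n cores when a fraction p (between 0 and 1) of
 * the run time can be parallelized and the remaining 1 - p stays serial. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 1 the result equals N, the ideal case on the slide. With p = 0.95 on 240 cores the result is only about 18.5, because the 5% serial part dominates.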


Parallel Computing on GPGPU
 CUDA can only parallelize codes for
nVidia's GPGPU
 CUDA’s programming model:
 Multithread
 SPMD (Single Program Multiple Data)
 Best-performance CUDA code needs
optimization
 Native code can be improved by CUDA
 2~3 times
 Optimization can be achieved by
 Data parallelism, Thread parallelism, Data
localization
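The SPMD model can be emulated serially in plain C to show how every thread runs the same program on its own data element. The sketch below is an illustration added here, not the talk's code: the function names are hypothetical and doubling a value is a placeholder operation. The index formula mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>

/* The "single program": every logical thread runs this same function and
 * uses its own global index to select the element it works on. */
static void spmd_body(int global_idx, const int *in, int *out, int n)
{
    if (global_idx < n)   /* guard: threads past the end of the data do nothing */
        out[global_idx] = 2 * in[global_idx];
}

/* Serial CPU emulation of a launch with num_blocks blocks of
 * threads_per_block threads each. On a GPU these iterations would run
 * concurrently; here they run one after another. */
void emulate_launch(int num_blocks, int threads_per_block,
                    const int *in, int *out, int n)
{
    assert(num_blocks > 0 && threads_per_block > 0 && n >= 0);
    for (int block = 0; block < num_blocks; block++)
        for (int thread = 0; thread < threads_per_block; thread++)
            spmd_body(block * threads_per_block + thread, in, out, n);
}
```

Launching 2 blocks of 4 threads over 5 elements processes indices 0..4 and leaves the three surplus threads idle, which is why the bounds guard is needed whenever the data size is not a multiple of the block size.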


Programming Challenges
of CUDA
 We have to manually parallelize the
algorithm
 We need expertise in
 Algorithms of image and signal processing
 Filtering, frequency analysis, compression,
feature extraction, recognition, ...
 Theory, tools and methodology of parallel
computing
 Communication, synchronization, resource
management, load balancing, debugging, ...


GPUs for Multimedia

3.5X: PowerDirector7 Ultra
10X: CUDA JPEG Decoder
10X: DivideFrame GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
26X: Hyperspectral Image Compression on NVIDIA GPUs
10X: Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA


GPUs for Computer Vision(1/2)

87X: CUDA SURF – A Real-time Implementation for SURF (TU Darmstadt)
26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
100X: Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
13X: Accelerating Advanced MRI Reconstructions (University of Illinois)


GPUs for Computer Vision(2/2)

20X: GPU for Surveillance
13X: Fast Human Detection with Cascaded Ensembles
109X: Fast Sliding-Window Object Detection
263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
300X: Audience Measurement – Real-time Video Analysis for Counting People, Face Detection and Tracking
10X: Real-time Visual Tracker by Stream Processing
45X: A GPU Accelerated Evolutionary Computer Vision System
3X: Canny Edge Detection


The ParLab in Berkeley
 The Parallel Computing Lab. in UC
Berkeley
http://parlab.eecs.berkeley.edu
 The ParLab. offers programmers a
practical introduction to parallel
programming techniques and tools on
current parallel computers,
emphasizing multicore and manycore
computers.


Multicore Programming
Practice (MPP)
 Goal: Write portable C/C++
programs to be "Multicore ready"
and platform compatible
 Proposed by an
MPP working group
in the Multicore
Association

http://www.multicore-association.org/workgroup/mpp.php


Special Conference
 HPEC: High Performance Embedded
Computing,
 MIT Lincoln Lab, 1997 ~


The End
Free for Questions
