



     Parallel Vision with GPGPU/CUDA
                  Wang, Yuan-Kai (王元凱)
   Department of Electrical Engineering, Fu Jen Catholic University
            Email: ykwang@mail.fju.edu.tw
              URL: http://www.ykwang.tw
                      2011/10/07




This work is licensed under the Creative Commons Attribution 3.0 Taiwan license
Wang, Yuan-Kai (王元凱)   Parallel Vision with GPGPU/CUDA   p. 2



                  What about this Talk
          The Multicore Era
            It’s time for Parallel Computing
          GPGPU/CUDA
            GPGPU Architecture
            Parallel Programming by CUDA
          Some Examples
            Image Restoration (Retinex)
            Feature Extraction (SIFT)
            Video Cloud Computing




    1. The Multicore Era
    for Computer Vision
   Paradigm shift from Clock Speed Race
    to Multicore Race
   Some examples of Multicore



                  Multicore Computing
          What Is Multicore
            Combine multiple processor cores
             into a single chip
          Multicore computing is inevitable



                       Moore's Law
          In 1965, Gordon Moore (Intel co-founder)
           predicted
            The number of transistors on an IC would keep
             doubling at a regular pace (popularly quoted
             as every 18 months)
          The well-known law
         • The performance of computers
           doubles every 18 months
           • More transistors,
              more performance
          The prediction held
           for Intel's CPUs
           for 40 years



               Review of Moore's Law
          Transistors in a chip did increase



                       Problems
          More transistors need high frequency
          High frequency needs high power
           consumption
            We come into the Clock Speed Race
            But 4GHz has been the limit
             Moore’s law breaks



           Paradigm Shift from 2000
          General-purpose multicore
           comes of age
          Chip companies race to create
           multicore processors
              CPU: Intel Core Duo, Quad-core, ...
              DSP: TI DaVinci
              GPU: nVidia GeForce/Tesla
              ...



              The Multicore Evolution
     From large mono-core to multiple lightweight cores




  Pentium processor: optimized for a single thread
  Core Duo
  In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution



     Moore’s Law Needs Multicore
          Single core cannot fit Moore's law
          Multicore can fit Moore's law if a
           parallel programming model exists
  [Plot: performance over time for single-core and multi-core processors]



                       Two Architectures
                         for Multicore
          Symmetric multiprocessing (SMP)
            Multicore CPU,
             GPGPU,
             multicore DSP
            Homogeneous computing
          Asymmetric multiprocessing (AMP)
            CPU+GPGPU,
             CPU+FPGA,
             CPU+DSP
            Heterogeneous computing



                       Multicore CPU (1/2)
          Two or more CPUs on a chip
          Ex.: Intel Core i7

  [Figure: one processor with multiple execution cores]



                       Multicore CPU (2/2)
        Windows Task Manager
  [Figure: Task Manager CPU usage panels for two cores and for eight cores]



                       GPGPU (1/2)
          GPU (Graphics Processing Unit)
            The processor on a graphics card that speeds
             up 3D graphics
            Game playing
             is a major
             application
          GPGPU: General-Purpose GPU
            General purpose computation using
             GPU in applications other than 3D
             graphics



                        GPGPU (2/2)
          GPGPU has more cores than CPU
            120 ~ 512 cores
          GPGPU is more powerful than
           multicore CPU
          Vendors:
              nVidia
              ATI
              Intel
              AMD



        Computer Vision Needs
     High Performance Computing
          A CV example: video processing
            Intelligent video surveillance,
          Its complexity is high
            One video: 10 Megapixels, 30fps,
            100 flops per pixel
             30 Gigaflops per video
          Massive data processing
            Intensive computation
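
The 30 Gigaflops figure is the product of the three numbers on the slide. A minimal C sketch of that arithmetic follows; the function name `required_gflops` is an assumption made for illustration and does not come from the slides.

```c
#include <assert.h>

/* Hypothetical helper: sustained GFLOPS needed to process one video stream.
 * pixels_per_frame * frames_per_second * flops_per_pixel, scaled to 10^9. */
double required_gflops(double pixels_per_frame, double frames_per_second,
                       double flops_per_pixel)
{
    assert(pixels_per_frame >= 0.0);
    assert(frames_per_second >= 0.0);
    assert(flops_per_pixel >= 0.0);
    return pixels_per_frame * frames_per_second * flops_per_pixel / 1e9;
}
```

Calling `required_gflops(10e6, 30.0, 100.0)` reproduces the slide's estimate for a 10-megapixel, 30 fps stream at 100 flops per pixel.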



                  Approaches for HPC
          Cluster/distributed computing
            MapReduce (Google)
             (Cloud Computing)
  [Figure: supercomputer]
            MPI
          Multi-processing
           computing
            Multicore CPU
              Programming with multithreading
            FPGA/DSP
            GPGPU
              Programming with CUDA



                       However
          Multicore is not a simple solution for
           upgrading performance
            The transition from single core to
             multicore will be blocked by
             software
            We are not ready to face the
             software programming challenges




        Multicore Demands Threading




           2. GPGPU and CUDA
            GPGPU Hardware
            Programming by CUDA



                       Why GPGPU
        GPGPU has many-core (> 100 cores)
          Suitable for intensive parallel computing
        GPGPU vs. CPU
          Calculation: 367 GFLOPS vs. 32 GFLOPS
          Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s



                       GPGPU Vendors
               NVIDIA
               ATI
               Intel
               AMD
               …



                       Hardware View
               • PC-based
               • GPGPU card as a coprocessor




        From PC to PSC : Personal Super-Computer



                 Applications of GPGPU




            http://developer.nvidia.com/category/zone/cuda-zone



                           Two New GPGPUs
                             from nVidia
             GT200
               GTX 260/280, Quadro FX 5800, Tesla C1060
             Fermi
               Tesla 2060
  [Figure: CPU (host, multicore) with control logic, four ALUs, cache, and DRAM,
   next to GPU (device, many-core) with its own DRAM]



           nVidia GPGPU Architecture
          SM/SP (Streaming Multiprocessor/Streaming
           Processor) + Shared memory + DRAM



                       Memory Hierarchy
       On-Chip Memory
           Registers
           Shared Memory
           Constant Memory (on-chip cache; the data resides off-chip)
           Texture Memory (on-chip cache; the data resides off-chip)
       Off-Chip Memory
         Local Memory
         Global Memory



                       Parallel Computing
          Serial
           Computing


                                                             GPGPU Cores

          Parallel
           Computing



                 Parallel Programming
        Much code is written in C/C++/Java
          Especially algorithmic programs
        Can we write GPGPU parallel
         programs in C/C++/Java?
        However, C/C++ is sequential
          Three control structures of C/C++/Java:
           sequence, selection, repetition



                       Multi-threading
          Multi-threading is the most
           important technique for parallel
           programming
            Some techniques are ready
              Pthread, Win32 thread, OpenMP,
               MPI, Intel TBB (Threading Building
               Block)...
            New techniques
              CUDA, OpenCL, ...



             Parallel Programming in
              Sequential Language
         Do we need to learn new languages for
          multi-threading?
           No
         Write multi-threading codes in C/C++
           Add functions/directives to C/C++ for
            multi-threading
           That is how current solutions work
             pthread, Win32 thread, OpenMP,
              MPI, CUDA, OpenCL, ...



                           CUDA
          CUDA: Compute Unified Device
           Architecture
          Parallel programming
           for nVidia's GPGPU
          Use C/C++ language
            Java, Fortran, Matlab are OK
          When executing CUDA programs,
           the GPU operates as coprocessor to
           the main CPU



      CUDA Hardware Environment:
             CPU+GPU
         [Diagram: CPU connected to GPU through PCI-E]
         CPU
            Organizes, interprets, and
             communicates information
         GPU
            Handles the core processing on large quantities
             of parallel information
            Compute-intensive portions of applications
             that are executed many times, but on different
             data, are extracted from the main application
             and compiled to execute in parallel on the GPU



                CUDA Software Stack



            Processing Flow on CUDA
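        [Diagram: main memory and CPU on the host side, memory for GPU on the device side]
        1. Allocate device memory
        2. Copy processing data (main memory to memory for GPU)
        3. CPU instructs the processing
        4. Execute in parallel in each core
        5. Copy the result (memory for GPU to main memory)
        6. Release device memory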



                       Programming with
                       Memory Hierarchy
           Locality
            principle
             Temporal
              locality
             Spatial
              locality



            Example - Hello World(1/3)
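    #include <stdio.h>
    #include <stdlib.h>

    // Kernel, defined on the next slide
    __global__ void hello(char* hello1, char* hello2);

    int main()
    {
        char src[12] = "Hello World";   // host: source string (11 chars + '\0')
        char h_hello[12];               // host: buffer for the result
        char* d_hello1;                 // device: copy of src
        char* d_hello2;                 // device: output buffer

        // Allocate device memory
        cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
        cudaMalloc((void**) &d_hello2, sizeof(char) * 12);
        // Copy the input from host to device
        cudaMemcpy(d_hello1, src, sizeof(char) * 12,
                   cudaMemcpyHostToDevice);
        // Call the kernel function with 1 block of 1 thread
        hello<<<1,1>>>(d_hello1, d_hello2);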



            Example - Hello World(2/3)
          Kernel Function

     __global__ void hello(char* hello1, char* hello2)
     {
         int k;

         // Copy characters until the terminating '\0'
         for (k = 0; hello1[k] != '\0'; k++) {
             hello2[k] = hello1[k];
         }
         hello2[k] = '\0';   // terminate the copy so the host can print it
     }
    No parallel processing in this example



            Example - Hello World(3/3)
           // Copy the result from device back to host
           cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12,
                      cudaMemcpyDeviceToHost);

           printf("%s\n", h_hello);

           // Release device memory
           cudaFree(d_hello1);
           cudaFree(d_hello2);
           system("pause");   // Windows only: keeps the console window open
           return 0;
     }
         Result:



                       Parallelization
          Multicore/Multi-threading
          Data Parallelization
              Data distribution
              Parallel convolution
              Reduction algorithm
              Amdahl’s law
          Memory Hierarchy Management
            Locality principle
                Program accesses a relatively small portion
                 of the address space at any instant of time
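
The locality principle can be illustrated with two traversals of the same image. This is a minimal C sketch, not code from the talk: the function names are hypothetical, and the image is assumed to be stored row by row in one flat array. Both functions return the same sum; only the order of memory accesses differs.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major traversal: the inner loop walks consecutive addresses,
 * so each fetched cache line is fully used (good spatial locality). */
long sum_row_major(const int *img, int width, int height)
{
    assert(img != NULL && width > 0 && height > 0);
    long sum = 0;
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            sum += img[(size_t)i * width + j];
    return sum;
}

/* Column-major traversal of the same row-major array: the inner loop
 * jumps by `width` elements each step (poor spatial locality). */
long sum_col_major(const int *img, int width, int height)
{
    assert(img != NULL && width > 0 && height > 0);
    long sum = 0;
    for (int j = 0; j < width; j++)
        for (int i = 0; i < height; i++)
            sum += img[(size_t)i * width + j];
    return sum;
}
```

On large images the row-major version is usually faster on a CPU because of cache behavior; the same idea motivates choosing access patterns carefully on a GPGPU.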



         Develop Multi-thread Program
        Identify parallelism: Analyze algorithm
        Express parallelism: Write parallel code
        Validate parallelism: Debug & verify
         parallel code
        Optimize parallelism: enhance parallel
         performance




           3. Image Restoration
            (Retinex) by CUDA



                       Image Restoration
       Restore and enhance an image
       Its complexity is high for large images




  [Figure: original image and restored image]
                              Complexity: O(N^2) ~ O(N^3)



                         Algorithms for
                       Image Restoration
          Wiener Filter
          Histogram Based Approach
            Histogram Equalization,
             Histogram Modification, …
          Retinex
            Path-based Retinex
            Recursive Retinex
            Center/surround Retinex
                Has no iterative process, so it is suitable for parallelization
                Multi-Scale Retinex with Color Restoration
                 (MSRCR) [Rahman et al. 1997]
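
Why a center/surround method parallelizes well can be seen in a small C sketch. It is an illustration under stated assumptions, not the author's implementation: the names `surround_at` and `surround_image` are hypothetical, the image is a flat row-major `float` array, borders are handled by clamping, and the OpenMP pragma is optional (the code also builds without OpenMP). Each output pixel reads only the input image, so no pixel waits on another.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index to [0, n-1] so border pixels reuse the nearest valid pixel. */
static int clamp_index(int v, int n)
{
    return v < 0 ? 0 : (v >= n ? n - 1 : v);
}

/* Surround value at (x, y): weighted sum over a (2r+1) x (2r+1) neighbourhood.
 * `kernel` holds (2r+1)*(2r+1) weights in row-major order. */
float surround_at(const float *img, int width, int height,
                  int x, int y, const float *kernel, int radius)
{
    int side = 2 * radius + 1;
    float acc = 0.0f;
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            int yy = clamp_index(y + dy, height);
            int xx = clamp_index(x + dx, width);
            acc += kernel[(dy + radius) * side + (dx + radius)]
                 * img[(size_t)yy * width + xx];
        }
    }
    return acc;
}

/* Compute the surround for every pixel. Iterations are independent,
 * so rows can be distributed across threads without synchronization. */
void surround_image(const float *img, float *out, int width, int height,
                    const float *kernel, int radius)
{
    assert(img != NULL && out != NULL && kernel != NULL);
    assert(width > 0 && height > 0 && radius >= 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            out[(size_t)y * width + x] =
                surround_at(img, width, height, x, y, kernel, radius);
}
```

On a GPGPU the two outer loops would typically be replaced by one thread per output pixel, which is the same independence property expressed in a different programming model.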



                          MSRCR Algorithm
                                                                                       
     R_i(x,y) = r_i(x,y) \sum_{k=1}^{n} W_k \left( \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right), \quad i \in \{R, G, B\}

           R_i(x,y) : the MSRCR output
           I_i(x,y) : the original image distribution in the ith spectral band
           F_k(x,y) : the kth Gaussian surround function
           *        : the convolution operation
           W_k      : the weight
           r_i(x,y) : the color restoration factor in the ith spectral band

     r_i(x,y) = \beta \log\left( \frac{\alpha I_i(x,y)}{\sum_{j=1}^{N} I_j(x,y)} \right)

           N      : the number of spectral bands
           \beta  : the gain constant
           \alpha : controls the strength of the nonlinearity



            Decompose the Problem
        Two basic approaches to partition
         computational work
          Domain decomposition (GPGPU)
            Partition the data used
             in solving the problem
          Function decomposition (CPU)
            Partition the jobs (functions)
             from the overall work (problem)
          GPGPU and CPU cooperate



                       Multi-Threading
        A program running
         In Serial


        In Parallel




            http://en.wikipedia.org/wiki/Thread_(computer_science)



       Domain Decomposition (1/3)
       An image example
            It is 2D data
            Three popular partition ways



       Domain Decomposition (2/3)
       Domain data are usually processed
         by loops
            for (i=0; i<height; i++)
              for (j=0; j<width; j++)
               img2[i][j] = RemoveNoise(img1[i][j]);

  [Figure: the X-ray image of a circuit board, with row index i and column index j;
   original image (img1) and enhanced image (img2)]



       Domain Decomposition (3/3)
       A three-block partition example (OpenMP, CUDA (SPMD))
         fork(threads)
           subdomain 1: i = 0, 1, 2, 3
           subdomain 2: i = 4, 5, 6, 7
           subdomain 3: i = 8, 9, 10, 11
         join(barrier)

            // Thread 1
            for (i=0; i<height/3; i++)
              for (j=0; j<width; j++)
                img2[i][j] = RemoveNoise(img1[i][j]);
            // Thread 2
            for (i=height/3; i<height*2/3; i++)
              for (j=0; j<width; j++)
                img2[i][j] = RemoveNoise(img1[i][j]);
            // Thread 3
            for (i=height*2/3; i<height; i++)
              for (j=0; j<width; j++)
                img2[i][j] = RemoveNoise(img1[i][j]);
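
The three-block partition can be written once for any number of blocks. The following is a minimal C sketch under stated assumptions: `remove_noise` is a hypothetical stand-in for `RemoveNoise` (the slides do not define it; here it simply clamps a pixel to the range 16..235), `enhance_by_row_blocks` is an illustrative name, and the image is a flat row-major array. The OpenMP pragma is guarded so the code also compiles and runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for RemoveNoise(): clamps a pixel to [16, 235]. */
unsigned char remove_noise(unsigned char p)
{
    if (p < 16) return 16;
    if (p > 235) return 235;
    return p;
}

/* Split the rows into `parts` contiguous blocks and process each block
 * independently: fork at the parallel loop, join (barrier) at its end. */
void enhance_by_row_blocks(const unsigned char *img1, unsigned char *img2,
                           int width, int height, int parts)
{
    assert(img1 != NULL && img2 != NULL);
    assert(width > 0 && height > 0 && parts > 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int p = 0; p < parts; p++) {
        /* Block p covers rows [row_begin, row_end). */
        int row_begin = (int)((long long)height * p / parts);
        int row_end   = (int)((long long)height * (p + 1) / parts);
        for (int i = row_begin; i < row_end; i++)
            for (int j = 0; j < width; j++)
                img2[(size_t)i * width + j] =
                    remove_noise(img1[(size_t)i * width + j]);
    }
}
```

With `height = 12` and `parts = 3` the blocks are rows 0 to 3, 4 to 7, and 8 to 11, matching the subdomains on the slide.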



                        The Method
        CPU:   Copy data from CPU to GPGPU
        GPGPU: Gaussian Blur
               Log-domain Processing
               Normalization
               Histogram Stretching
        CPU:   Copy data from GPGPU to CPU

        CPU:   Intel Core 2, 2 cores (3.0 GHz)
        GPGPU: Tesla C1060, 240 SPs (1.296 GHz)



                Parallelization by GPGPU
              Multicore/Multi-threading
                Tesla C1060 : 240 SP (Stream Processor)
                CUDA: Thread, Block, Grid
              Data Parallelization
                Parallel convolution
  [Figure: an M x M pixel image partitioned among processing elements (PE i),
   with 1-pixel borders around each partition]

  Reduction across processing elements (time axis: t0 to t5)
  PE   data   partial sums
  0    A(0)   A(0)+A(1)   A(0)+A(1)+A(2)+A(3)   sum
  1    A(1)
  2    A(2)   A(2)+A(3)
  3    A(3)
  4    A(4)   A(4)+A(5)   A(4)+A(5)+A(6)+A(7)
  5    A(5)
  6    A(6)   A(6)+A(7)
  7    A(7)
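
The pairwise scheme in the table above can be sketched in plain C. This is an illustration, not the talk's CUDA code: `tree_reduce_sum` is a hypothetical name, the reduction is done in place (the input array is overwritten), and the guarded OpenMP pragma only marks which additions are independent within one step.

```c
#include <assert.h>
#include <stddef.h>

/* In-place tree reduction: after the step with stride s, element i
 * (i a multiple of 2s) holds the sum of a[i .. i+2s-1], clipped to n.
 * Returns the total, which ends up in a[0]. */
float tree_reduce_sum(float *a, int n)
{
    assert(a != NULL && n > 0);
    for (int stride = 1; stride < n; stride *= 2) {
        /* All additions within one step touch disjoint elements. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
        for (int i = 0; i < n - stride; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}
```

For eight elements the steps are A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7), then two sums of four, then the final sum, so the number of steps grows with log2(n) rather than n.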



               Our Memory Hierarchy

  [Figure: the four parallel stages (Gaussian blur, log-domain processing,
   normalization, histogram stretching) and the memories they use:
   texture memory, constant memory, shared memory, and global memory]



          Experimental Results (1/2)




       Original images        CPU results                  GPGPU results



          Experimental Results (2/2)




       Original images        CPU results                  GPGPU results



                  GPGPU Speedup over CPU
  [Plot: speedup (log scale, 10^1 to 10^2) versus M (log scale, 10^2 to 10^4)
   for four curves: Speedup_N, Speedup, Speedup_P, and Speedup_NPP;
   annotated values: 74x and 2x]
            • Ideal speedup: 240 * (1.296 GHz / 3 GHz) ≈ 103.7
            • NPP: NVIDIA Performance Primitives




           4. Feature Extraction
              (SIFT) by CUDA



                       What Is SIFT
       SIFT
            Scale Invariant Feature Transform
       Invariance of feature points
            Translation
            Rotation
            Scale



                  Applications of SIFT
    Object recognition/tracking
    Image retrieval
    Autostitch



          Parallelize SIFT by GPGPU



CPU:   Intel Q9400, quad cores (2.66 GHz)
GPGPU: GeForce GTS 250, 128 SPs (1.836 GHz)


                 Experimental Results
                       CPU                                     GPU



                       Execution Time

  [Plot: execution time in ms]
  CPU: 10 seconds on average
  GPGPU: 0.8 seconds on average



                            Speedup




                       13x speedup on average




                     5. Video
                 Cloud Computing
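                       Large-area surveillance of outdoor areas and campuses
                          • Large number of cameras
                          • Challenges to system stability

                             Technical features
                    • Covers cloud computing and embedded systems
               • Control-room display integrating electronic maps, events, and video synopsis
                    • Detection techniques that overcome outdoor weather effects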



     A Campus Monitoring System
                       Control room technology demonstration area
       Human-event technology demonstration area
                       Vehicle-event technology demonstration area



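                       I. Human-Event Technology Demonstration

  [Map labels: Electronics and Information Research Building; NCTU (交大) campus;
   motorcycle ring road around the campus; Science Park]

                                    Wall-climbing and restricted-area intrusion detection
                                    Embedded PTZ camera tracking
                                    Camera anomaly detection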



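         1.1 Wall-Climbing and Restricted-Area Intrusion Detection

     Detects whether anyone climbs over the wall along the motorcycle
      ring road behind the Electronics and Information Research Building,
      where the campus borders the Science Park, and sends an alarm.

  [Map labels: Electronics and Information Research Building; NCTU (交大) campus;
   motorcycle ring road around the campus; Science Park]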



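          1.2 Embedded PTZ Camera Tracking
     Obtains the initial position of the tracked object
      from the front-end fixed surveillance system.
     Tracks moving objects on an embedded platform
      and controls the PTZ camera.

  [Map labels: Electronics and Information Research Building; NCTU (交大) campus;
   motorcycle ring road around the campus; Science Park]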



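               1.3 Camera Anomaly Detection

     A cloud platform runs camera anomaly detection simultaneously
      on multiple cameras along the ring road and at the Electronics
      and Information Research Building. (GPGPU)
     Simulated deliberate tampering with a camera at the building
      is detected and raises an alarm.
     False alarms from ring-road cameras with heavy pedestrian
      traffic are effectively suppressed.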



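                       II. Vehicle-Event Technology Demonstration

                                           Embedded illegal parking detection
                                           (with human feature detection in dynamic scenes)
                                           Outdoor parking-lot vacancy detection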



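         2.1 Embedded Illegal Parking Detection

  Detects illegally parked vehicles on an embedded platform
   and drives a PTZ camera to capture close-up images of the event.
  Face detection on multi-resolution image sequences is used
   to stop the PTZ camera's close-up tracking. (GPGPU)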



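         2.2 Outdoor Parking-Lot Vacancy Detection
  Detects the status of spaces in a large parking lot
   and displays the locations of vacant spaces.
  When a vehicle parks in any vacant space,
   that space is shown as occupied.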



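                       III. Control Room Technology Demonstration

                       Control room of the intelligent community
                       event security surveillance system

                                                 Electronic-map-based control room display
                                                  (central video and management system)
                                                 Multi-resolution wide-area surveillance
                                                 Efficient video event retrieval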


      3.1 Electronic-Map-Based Control Room Display
  Integrates all heterogeneous surveillance
   information using Google Maps.
   Video
   Event
   Geography


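         3.2 Multi-Resolution Wide-Area Surveillance

    "Big eye / small eye" multi-resolution display
    Integrates Google Earth
    GPGPU hardware acceleration of the
     image fitting computation
  [Figure labels: steerable projector; fixed projector]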



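       3.3 Efficient Video Event Retrieval
    Converts lengthy surveillance video into a concise
     synopsis video, so that users can review a full day
     of events from a chosen camera in a short time.
    Browse the condensed video
    Use space to compress time
  [Figure labels: time axis with marks at 3:00 and 5:00; Electronics and
   Information Research Building; NCTU (交大) campus; motorcycle ring road
   around the campus; Science Park]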


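                       System Architecture
  [Figure: system architecture diagram. Labels: ring road; Electronics and
   Information Research Building; parking lot; legal parking in the parking lot;
   illegal roadside parking; illegal wall climbing; wall climbing; camera groups
   x5 and x8; 608; 3D; CAD; CMS; HVR]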




6. Conclusions



          Issues with Parallelization
         Good parallel programs
           Execute correctly
           with good speedup
         Ideal speedup by Amdahl's law
           Speedup = N if you have N cores and
            the whole program runs in parallel
         However, no ideal speedup exists
           Because of parallel overhead, such as
            Data communication
            Data dependencies and synchronization
         Other issues: design overhead
           No free lunch for software development
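
The speedup bound on this slide can be checked numerically. The following is a minimal sketch of Amdahl's law, assuming a fixed problem size and ignoring the communication and synchronization overhead listed above; the function name `amdahl_speedup` and the sample fractions are illustrative and do not come from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed problem size.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    cores: number of cores N.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not accelerated; the parallel part is divided by N.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only for illustration.
    for p in (1.0, 0.95, 0.5):
        speedups = [round(amdahl_speedup(p, n), 2) for n in (2, 8, 512)]
        print(f"parallel fraction {p}: speedup on 2/8/512 cores = {speedups}")
```

Only a fully parallel program (fraction 1.0) reaches Speedup = N. Under this model, a program that is 95% parallel stays below 20x however many cores are added.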



       Parallel Computing on GPGPU
          CUDA can only parallelize code for
           nVidia GPGPUs
          CUDA’s programming model:
            Multithreading
            SPMD (Single Program Multiple Data)
          Best-performance CUDA code needs
           optimization
            Without optimization, CUDA improves
              native code by only about 2~3 times
            Optimization can be achieved by
              Data parallelism, Thread parallelism, Data
               localization
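
The SPMD idea named on this slide can be illustrated without a GPU. The sketch below is a plain-Python, sequential emulation and not CUDA code: every emulated thread runs the same function and uses its own index to pick the data element it works on. The names `scale_kernel` and `launch` are hypothetical helpers invented for this illustration.

```python
def scale_kernel(thread_id, gain, pixels, out):
    """Body executed by every emulated thread: same program, different data."""
    # Bounds guard: more threads may be launched than there are pixels,
    # which is also the usual situation in a real GPU kernel.
    if thread_id < len(pixels):
        out[thread_id] = min(255, int(pixels[thread_id] * gain))


def launch(kernel, num_threads, *args):
    """Sequential stand-in for a parallel launch (assumption: no real parallelism).

    On a GPU the iterations of this loop would run concurrently as threads.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


if __name__ == "__main__":
    pixels = [10, 100, 200, 250]
    out = [0] * len(pixels)
    launch(scale_kernel, 8, 1.5, pixels, out)  # 8 threads for 4 pixels
    print(out)
```

Each output element depends only on its own input element, so the threads need no communication or synchronization. This is the kind of data parallelism the slide lists as a route to optimization.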



           Programming Challenges
                  of CUDA
          We have to manually parallelize the
           algorithm
          We need expertise in
            Algorithms of image and signal processing
              Filtering, frequency analysis, compression,
               feature extraction, recognition, ...
            Theory, tools and methodology of parallel
             computing
              Communication, synchronization, resource
               management, load balancing, debugging, ...



                  GPUs for Multimedia


      Speedup      Application
      3.5X         PowerDirector7 Ultra
      10X          CUDA JPEG Decoder
      10X          DivideFrame GPU Decoder
      26X          Hyperspectral Image Compression on NVIDIA GPUs
      10X          GPU Decoder (Vegas/Premiere) - Using the Power of
                   NVIDIA Graphic Card to Decode H.264 Video Files
      (not given)  Motion Estimation for H.264/AVC on Multiple GPUs
                   Using NVIDIA CUDA



      GPUs for Computer Vision(1/2)


      Speedup   Application (institution)
      87X       CUDA SURF – A Real-time Implementation for SURF (TU Darmstadt)
      26X       Leukocyte Tracking: ImageJ Plugin (University of Virginia)
      200X      Real-time Spatiotemporal Stereo Matching Using the
                Dual-Cross-Bilateral Grid
      100X      Image Denoising with Bilateral Filter
                (Wroclaw University of Technology)
      85X       Digital Breast Tomosynthesis Reconstruction
                (Massachusetts General Hospital)
      100X      Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
      8X        A Framework for Efficient and Scalable Execution of
                Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
      13X       Accelerating Advanced MRI Reconstructions (University of Illinois)



      GPUs for Computer Vision(2/2)


      Speedup   Application
      20X       GPU for Surveillance
      13X       Fast Human Detection with Cascaded Ensembles
      109X      Fast Sliding-Window Object Detection
      263X      GPU Acceleration of Object Classification Algorithm
                Using NVIDIA CUDA
      300X      Audience Measurement – Real-time Video Analysis for
                Counting People, Face Detection and Tracking
      10X       Real-time Visual Tracker by Stream Processing
      45X       A GPU Accelerated Evolutionary Computer Vision System
      3X        Canny Edge Detection



              The ParLab in Berkeley
          The Parallel Computing Laboratory (Par Lab)
           at UC Berkeley
           http://parlab.eecs.berkeley.edu
            The Par Lab offers programmers a
             practical introduction to parallel
             programming techniques and tools on
             current parallel computers,
             emphasizing multicore and manycore
             computers.



              Multicore Programming
                  Practice (MPP)
        Goal: write portable C/C++
         programs that are "multicore ready"
         and platform compatible
          Proposed by the
           MPP working group
           of the Multicore
           Association

           http://www.multicore-association.org/workgroup/mpp.php



                       Special Conference
          HPEC: High Performance Embedded
           Computing,
            MIT Lincoln Lab, since 1997




 The End
Questions Welcome

Parallel Vision by GPGPU/CUDA

  • 1. 1 平行視覺與GPGPU/CUDA 王元凱 輔仁大學電機工程系 Email: ykwang@mail.fju.edu.tw URL: http://www.ykwang.tw 2011/10/07 本著作採用創用CC 「姓名標示」授權條款台灣3.0版
  • 2. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 2 What about this Talk  The Multicore Era  It’s time for Parallel Computing  GPGPU/CUDA  GUGPU Architecture  Parallel Programming by CUDA  Some Examples  Image Restoration (Retinex)  Feature Extraction (SIFT)  Video Cloud Computing
  • 3. 3 1. The Multicore Era for Computer Vision  Paradigm shift from Clock Speed Race to Multicore Race  Some examples of Multicore
  • 4. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 4 Multicore Computing  What Is Multicore  Combine multiple chips of processor into single chip  Multicore computing is inevitable
  • 5. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 5 Moore's Law  In 1965, Gordon Moore (Intel co-founder) predicted  The transistors no. on an IC would double every 18 months  The well-known law • The performance of computer doubles every 18 months • More transistors  More performance  The prediction was kept correctly by Intel's CPUs for 40 years
  • 6. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 6 Review of Moore's Law  Transistors in a chip did increase
  • 7. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 7 Problems  More transistors need high frequency  High frequency needs high power consumption  We come into the Clock Speed Race  But 4GHz has been the limit Moore’s law breaks
  • 8. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 8 Paradigm Shift from 2000  General-purpose multicore comes of age  Chip companies race to create multicore processors  CPU: Intel Core Duo, Quad-core, ...  DSP: TI DaVinci  GPU: nVidia GeForce/Tesla  ...
  • 9. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 9 The Multicore Evolution From large mono-core to multiple lightweight cores Pentium processor Core Duo 5~10 years Optimized for single 10~100 energy efficient thread cores optimized for parallel execution
  • 10. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 10 Moore’s Law Needs Multicore  Single core cannot fit Moore's law  Multicore can fit Moore's law if a parallel programming model exists Multi-Core Performance Single Core Time
  • 11. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 11 Two Architectures for Multicore  Symmetric multiprocessing (SMP)  Multicore CPU, GPGPU, multicore DSP  Homogeneous computing  Asymmetric multiprocessing (AMP)  CPU+GPGPU, CPU+FPGA, CPU+DSP  Heterogeneous computing
  • 12. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 12 Multicore CPU (1/2)  Two or more CPUs on a chip  Ex.: Intel Core i7 One Processor With multiple execution Cores
  • 13. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 13 Multicore CPU (2/2)  Windows Task Manager(工作管理員) Two cores Eight cores
  • 14. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 14 GPGPU (1/2)  GPU (Graphical Processing Unit)  The processor in graphics card to speed up 3D graphics  Game playing is a major application  GPGPU: General-Purpose GPU  General purpose computation using GPU in applications other than 3D graphics
  • 15. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 15 GPGPU (2/2)  GPGPU has more cores than CPU  120 ~ 512 cores  GPGPU is more powerful than multicore CPU  Vendors:  nVidia  ATI  Intel  AMD
  • 16. Wang, Yuan-Kai (王元凱) Parallel Vision with GPGPU/CUDA p. 16 Computer Vision Needs High Performance Computing  An CV example : video processing  Intelligent video surveillance,  Its complexity is high  One video: 10 Megapixels, 30fps,  100 flops per pixel   30 Gigaflops per video  Massive data processing  Intensive computation
  • 17. Approaches for HPC
      - Cluster/distributed computing (supercomputers, cloud computing)
          - MapReduce (Google)
          - MPI
      - Multi-processing computing
          - Multicore CPU: programming with multithreading
          - FPGA/DSP
          - GPGPU: programming with CUDA
  • 18. However
      - Multicore is not a simple solution for upgrading performance
      - The transition from single core to multicore will be blocked by software
      - We are not ready to face the software programming challenges
  • 19. Multicore Demands Threading
  • 20. 2. GPGPU and CUDA
      - GPGPU hardware
      - Programming by CUDA
  • 21. Why GPGPU
      - A GPGPU is many-core (> 100 cores)
      - Suitable for intensive parallel computing
      - GPGPU vs. CPU
          - Calculation: 367 GFLOPS vs. 32 GFLOPS
          - Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  • 22. GPGPU Vendors
      - NVIDIA, ATI, Intel, AMD, ...
  • 23. Hardware View
      - PC-based
      - GPGPU card as a coprocessor
      - From PC to PSC: Personal Super-Computer
  • 24. Applications of GPGPU
      - http://developer.nvidia.com/category/zone/cuda-zone
  • 25. Two New GPGPUs from NVIDIA
      - GT200: GTX 260/280, Quadro 5800, Tesla 1060
      - Fermi: Tesla 2060
      - [Figure: CPU (host, multicore: control unit, a few ALUs, cache, DRAM) compared with GPU (device, many-core, DRAM)]
  • 26. NVIDIA GPGPU Architecture
      - SM/SP (streaming multiprocessor / streaming processor) + shared memory + DRAM
  • 27. Memory Hierarchy
      - On-chip memory
          - Registers
          - Shared memory
          - Constant memory (read through an on-chip cache; the data itself resides in device DRAM)
          - Texture memory (read through an on-chip cache; the data itself resides in device DRAM)
      - Off-chip memory
          - Local memory
          - Global memory
  • 28. Parallel Computing
      - Serial computing
      - Parallel computing
      - [Figure: serial versus parallel execution on GPGPU cores]
  • 29. Parallel Programming
      - Much code is written in C/C++/Java
          - Especially algorithmic programs
      - Can we write GPGPU parallel programs in C/C++/Java?
      - However, C/C++ is sequential
          - Three control structures of C/C++/Java: sequence, selection, repetition
  • 30. Multi-threading
      - Multi-threading is the most important technique for parallel programming
      - Some techniques are ready
          - Pthreads, Win32 threads, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
      - New techniques
          - CUDA, OpenCL, ...
  • 31. Parallel Programming in Sequential Language
      - Do we need to learn new languages for multi-threading?
          - No
      - Write multi-threaded code in C/C++
          - Add functions/directives to C/C++ for multi-threading
      - That is how current solutions work
          - Pthreads, Win32 threads, OpenMP, MPI, CUDA, OpenCL, ...
  • 32. CUDA
      - CUDA: Compute Unified Device Architecture
      - Parallel programming for NVIDIA's GPGPUs
      - Uses the C/C++ language
          - Java, Fortran, and MATLAB can also be used
      - When executing CUDA programs, the GPU operates as a coprocessor to the main CPU
  • 33. CUDA Hardware Environment: CPU+GPU
      - CPU
          - Organizes, interprets, and communicates information
      - GPU
          - Handles the core processing on large quantities of parallel information
      - Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU
      - [Figure: CPU connected to GPU over PCI-E]
  • 34. CUDA Software Stack
  • 35. Processing Flow on CUDA
      1. Allocate device memory
      2. Copy the processing data from main memory to GPU memory
      3. The CPU instructs the processing
      4. The GPU executes in parallel in each core
      5. Copy the result from GPU memory back to main memory
      6. Release device memory
  • 36. Programming with Memory Hierarchy
      - Locality principle
          - Temporal locality
          - Spatial locality
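The two kinds of locality on slide 36 can be illustrated with a host-side C sketch. It is not taken from the talk: the function names are hypothetical, and the image is assumed to be stored row by row in a flat array.

```c
/* Row-major traversal: the inner loop touches consecutive addresses,
 * so every fetched cache line is fully used (spatial locality).
 * The accumulator `sum` is reused on every iteration (temporal locality). */
long sum_row_major(const int *img, int height, int width)
{
    long sum = 0;
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            sum += img[i * width + j];
    return sum;
}

/* Column-major traversal of the same row-major data: the result is
 * identical, but consecutive accesses are `width` elements apart,
 * which tends to use caches poorly on large images. */
long sum_col_major(const int *img, int height, int width)
{
    long sum = 0;
    for (int j = 0; j < width; j++)
        for (int i = 0; i < height; i++)
            sum += img[i * width + j];
    return sum;
}
```

Both functions return the same sum; only the memory access order differs, which is the property that memory-hierarchy-aware code tries to exploit.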
  • 37. Example - Hello World (1/3)
      - [Figure: host buffers src and h_hello; device buffers d_hello1 and d_hello2]

          #include <stdio.h>
          #include <stdlib.h>

          // The kernel hello() shown in part 2/3 must be defined before main().
          int main()
          {
              char src[12] = "Hello World";   // 11 characters plus the terminating '\0'
              char h_hello[12];               // host buffer for the result
              char* d_hello1;                 // device input buffer
              char* d_hello2;                 // device output buffer

              // Allocate device memory
              cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
              cudaMalloc((void**) &d_hello2, sizeof(char) * 12);

              // Copy the input string from host to device
              cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);

              // Call the kernel function with one block of one thread
              hello<<<1,1>>>(d_hello1, d_hello2);

  • 38. Example - Hello World (2/3)
      - Kernel function
      - No parallel processing in this example

          __global__ void hello(char* hello1, char* hello2)
          {
              int k;
              // Copy characters until the terminating '\0' is reached
              for (k = 0; hello1[k] != '\0'; k++) {
                  hello2[k] = hello1[k];
              }
              hello2[k] = '\0';   // terminate the copy so the host can print it
          }

  • 39. Example - Hello World (3/3)

              // Copy the result from device to host
              cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);
              printf("%s\n", h_hello);

              // Release device memory
              cudaFree(d_hello1);
              cudaFree(d_hello2);

              system("pause");   // Windows only: keeps the console window open
              return 0;
          }

      - Result: [screenshot of the console output]
  • 40. Parallelization
      - Multicore/multi-threading
      - Data parallelization
          - Data distribution
          - Parallel convolution
          - Reduction algorithm
      - Amdahl's law
      - Memory hierarchy management
          - Locality principle: a program accesses a relatively small portion of the address space at any instant of time
  • 41. Develop Multi-thread Program
      - Identify parallelism: analyze the algorithm
      - Express parallelism: write parallel code
      - Validate parallelism: debug and verify the parallel code
      - Optimize parallelism: enhance parallel performance
  • 42. 3. Image Restoration (Retinex) by CUDA
  • 43. Image Restoration
      - Restore and enhance an image
      - Its complexity is high for large images: O(N^2) ~ O(N^3)
      - [Figure: original image and restored image]
  • 44. Algorithms for Image Restoration
      - Wiener filter
      - Histogram-based approach
          - Histogram equalization, histogram modification, ...
      - Retinex
          - Path-based Retinex
          - Recursive Retinex
          - Center/surround Retinex
              - Has no iterative process and is suitable for parallelization
              - Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]
  • 45. MSRCR Algorithm
      $$R_i(x,y) = r_i(x,y) \sum_{k=1}^{n} W_k \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}, \quad i \in \{R, G, B\}$$
      $$r_i(x,y) = \beta \log\left[ \frac{\alpha\, I_i(x,y)}{\sum_{j=1}^{N} I_j(x,y)} \right]$$
      - $R_i(x,y)$: the MSRCR output
      - $I_i(x,y)$: the original image distribution in the $i$th spectral band
      - $F_k(x,y)$: the $k$th Gaussian surround function
      - $*$: the convolution operation
      - $W_k$: the weight of the $k$th scale
      - $r_i(x,y)$: the color restoration factor in the $i$th spectral band
      - $N$: the number of spectral bands
      - $\beta$: the gain constant
      - $\alpha$: controls the strength of the nonlinearity
  • 46. Decompose the Problem
      - Two basic approaches to partition computational work
          - Domain decomposition (GPGPU): partition the data used in solving the problem
          - Function decomposition (CPU): partition the jobs (functions) from the overall work (problem)
      - The CPU and the GPGPU cooperate
  • 47. Multi-Threading
      - A program running in serial and in parallel
      - [Figure from http://en.wikipedia.org/wiki/Thread_(computer_science)]
  • 48. Domain Decomposition (1/3)
      - An image example
          - It is 2D data
          - Three popular ways to partition it
  • 49. Domain Decomposition (2/3)
      - Domain data are usually processed by loops

          for (i = 0; i < height; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

      - [Figure: the X-ray image of a circuit board; original image (img1) and enhanced image (img2), with row index i and column index j]
  • 50. Domain Decomposition (3/3)
      - A three-block partition example (OpenMP, CUDA (SPMD))

          // Thread 1
          for (i = 0; i < height/3; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

          // Thread 2
          for (i = height/3; i < height*2/3; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

          // Thread 3
          for (i = height*2/3; i < height; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

      - [Figure: fork (threads); subdomain 1 handles i = 0..3, subdomain 2 handles i = 4..7, subdomain 3 handles i = 8..11; join (barrier)]
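The three hand-written loops on slide 50 differ only in their row bounds, so the partition can be expressed once and parameterized by the subdomain index. The following C sketch assumes a flat, row-major image; `row_range`, `process_subdomain` and `remove_noise` are hypothetical names, and `remove_noise` is only a placeholder (pixel inversion) standing in for the slide's `RemoveNoise`.

```c
/* Placeholder per-pixel operation standing in for RemoveNoise()
 * (hypothetical: it simply inverts the pixel value). */
static unsigned char remove_noise(unsigned char v)
{
    return (unsigned char)(255 - v);
}

/* Split `height` rows into `parts` contiguous subdomains and return the
 * half-open row range [*begin, *end) owned by subdomain p (0-based).
 * For parts = 3 this gives the bounds height/3 and height*2/3 used on the slide. */
void row_range(int height, int parts, int p, int *begin, int *end)
{
    *begin = height * p / parts;
    *end   = height * (p + 1) / parts;
}

/* Work done by one thread: process only the rows of subdomain p.
 * Subdomains do not overlap, so threads never write the same pixel. */
void process_subdomain(const unsigned char *img1, unsigned char *img2,
                       int height, int width, int parts, int p)
{
    int begin, end;
    row_range(height, parts, p, &begin, &end);
    for (int i = begin; i < end; i++)
        for (int j = 0; j < width; j++)
            img2[i * width + j] = remove_noise(img1[i * width + j]);
}
```

With `height = 12` and `parts = 3`, the ranges are rows 0..3, 4..7 and 8..11, as in the slide's figure. Each thread (or, in an SPMD setting, each program instance) would call `process_subdomain` with its own `p`, followed by a join or barrier.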
  • 51. The Method
      - Processing pipeline (CPU and GPGPU)
          1. Copy data from CPU to GPGPU
          2. Gaussian blur
          3. Log-domain processing
          4. Normalization
          5. Histogram stretching
          6. Copy data from GPGPU to CPU
      - CPU: Intel Core 2, 2 cores (3.0 GHz)
      - GPGPU: Tesla C1060, 240 SPs (1.296 GHz)
  • 52. Parallelization by GPGPU
      - Multicore/multi-threading
          - Tesla C1060: 240 SPs (streaming processors)
          - CUDA: thread, block, grid
      - Data parallelization
          - Parallel convolution
      - [Figure: parallel convolution, with the M-pixel image divided among processing elements (PE i)]
      - [Figure: tree reduction over processing elements 0 to 7 across time steps t0 to t5: A(0) ... A(7) are first summed in pairs (A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7)), then in groups of four (A(0)+A(1)+A(2)+A(3), A(4)+A(5)+A(6)+A(7)), and finally into the total sum]
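The reduction figure on slide 52 can be sketched in plain C. The steps are simulated serially here, which is an assumption made for illustration: on a GPGPU the additions inside one step are independent and could each be handled by a separate thread. The function name `tree_reduce_sum` is hypothetical.

```c
/* Tree (pairwise) reduction, performed in place on the array a.
 * Step with stride 1: a[0]+=a[1], a[2]+=a[3], ...
 * Step with stride 2: a[0]+=a[2], a[4]+=a[6], ...
 * and so on, until a[0] holds the sum of all n elements.
 * Additions within one step touch disjoint elements, so they could run
 * in parallel; log2(n) steps replace n-1 sequential additions. */
long tree_reduce_sum(long *a, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    return n > 0 ? a[0] : 0;
}
```

For eight inputs A(0) to A(7), the three steps produce exactly the partial sums shown in the figure: pairs, then groups of four, then the total.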
  • 53. Our Memory Hierarchy
      - Memory types used: texture memory, constant memory, global memory, shared memory
      - Parallel stages: parallel Gaussian blur, parallel log-domain processing, parallel normalization, parallel histogram stretching
  • 54. Experimental Results (1/2)
      - [Figure: original images, CPU results, GPGPU results]
  • 55. Experimental Results (2/2)
      - [Figure: original images, CPU results, GPGPU results]
  • 56. GPGPU Speedup over CPU
      - [Figure: log-log plot of speedup (10^1 to 10^2) versus M (10^2 to 10^4), with curves Speedup_N, Speedup_P and Speedup_NPP; annotations mark a 74x speedup and a 2x speedup]
      - Ideal speedup: 240 × (1.296 GHz / 3 GHz) ≈ 103
      - NPP: NVIDIA Performance Primitives
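The ideal-speedup estimate on slide 56 scales the number of GPGPU cores by the ratio of clock rates. A small C sketch of that formula follows; the function name `ideal_speedup` is hypothetical, and the model deliberately ignores memory transfers and the CPU's own core count, as the slide's formula does.

```c
/* Ideal speedup of a many-core GPGPU over a single CPU core, assuming
 * one operation per core per clock on both sides (a simplification). */
double ideal_speedup(int gpu_cores, double gpu_clock_ghz, double cpu_clock_ghz)
{
    return gpu_cores * (gpu_clock_ghz / cpu_clock_ghz);
}
```

For the Tesla C1060 and the 3.0 GHz CPU used in the experiment, `ideal_speedup(240, 1.296, 3.0)` evaluates to about 103.7, so the measured 74x corresponds to roughly 70% of this upper bound.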
  • 57. 4. Feature Extraction (SIFT) by CUDA
  • 58. What Is SIFT
      - SIFT: Scale Invariant Feature Transform
      - Invariance of feature points
          - Translation
          - Rotation
          - Scale
  • 59. Applications of SIFT
      - Object recognition/tracking
      - Image retrieval
      - Autostitch
  • 60. Parallelize SIFT by GPGPU
      - CPU: Intel Q9400, quad-core (2.66 GHz)
      - GPGPU: GeForce GTS 250, 128 SPs (1.836 GHz)
  • 61. Experimental Results
      - [Figure: CPU results and GPU results]
  • 62. Execution Time
      - CPU: 10 seconds on average
      - GPGPU: 0.8 seconds on average
      - [Figure: execution time in ms]
  • 63. Speedup
      - 13x speedup on average
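  • 64. 5. Video Cloud Computing
      - Large-area surveillance of outdoor sites and campuses
          - Large number of cameras
          - System stability is a challenge
      - Technical features
          - Covers cloud computing and embedded systems
          - Control-room display that integrates electronic maps, events, and video summaries
          - Detection techniques that overcome outdoor weather effects
  • 65. A Campus Monitoring System
      - Control-room technology demonstration area
      - Human-event technology demonstration area
      - Vehicle-event technology demonstration area
  • 66. I. Human-Event Technology Demonstration
      - Wall-climbing and restricted-area intrusion detection
      - Embedded PTZ camera tracking
      - Camera anomaly detection
      - [Map labels: Electronics and Information Research Building, NCTU campus, motorcycle ring road, Science Park]
  • 67. 1.1 Wall-Climbing and Restricted-Area Intrusion Detection
      - Detects whether anyone climbs over the wall along the motorcycle ring road behind the Electronics and Information Building, where the campus borders the Science Park, and sends an alarm
      - [Map labels: Electronics and Information Research Building, NCTU campus, motorcycle ring road, Science Park]
  • 68. 1.2 Embedded PTZ Camera Tracking
      - The initial position of the tracked object is obtained from the front-end fixed surveillance system
      - An embedded platform tracks the moving object and controls the PTZ camera lens
      - [Map labels: Electronics and Information Research Building, NCTU campus, motorcycle ring road, Science Park]
  • 69. 1.3 Camera Anomaly Detection
      - A cloud platform runs camera anomaly detection simultaneously on multiple cameras along the ring road and at the Electronics and Information Building (GPGPU)
      - Deliberate sabotage of a camera at the Electronics and Information Building is simulated; it is detected and an alarm is raised
      - False alarms from the ring-road cameras, which see constant pedestrian traffic, are effectively suppressed
  • 70. II. Vehicle-Event Technology Demonstration
      - Embedded illegal-parking detection (with human feature detection in dynamic scenes)
      - Outdoor parking-lot vacancy detection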
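  • 71. 2.1 Embedded Illegal-Parking Detection
      - An embedded platform detects illegally parked vehicles and drives a PTZ camera to capture close-up images of the event
      - Face detection on multi-resolution image sequences is used to stop the PTZ camera's close-up tracking (GPGPU)
  • 72. 2.2 Outdoor Parking-Lot Vacancy Detection
      - Detects the status of the spaces in a large parking lot and displays the locations of vacant spaces
      - When a vehicle has parked in any vacant space, that space is shown as occupied
  • 73. III. Control-Room Technology Demonstration
      - Control room of the intelligent community event security surveillance system
          - Electronic-map-based control-room display (central video and management system)
          - Multi-resolution wide-area surveillance
          - Efficient video event retrieval
  • 74. 3.1 Electronic-Map-Based Control-Room Display
      - Integrates all heterogeneous surveillance information with Google Maps
      - [Figure labels: Video, Event, Geography]
  • 75. 3.2 Multi-Resolution Wide-Area Surveillance
      - Multi-resolution display combining a large and a small view
          - Integrates Google Earth
          - GPGPU hardware acceleration of the image alignment computation
      - [Figure labels: rotatable projector, fixed projector]
  • 76. 3.3 Efficient Video Event Retrieval
      - Lengthy surveillance video is converted into a concise summary video, so that users can review a full day of events from a given camera in a short time
      - Space is used to compress time
      - [Figure: time axis from 3:00 to 5:00; browsing the condensed video; map labels: Electronics and Information Research Building, NCTU campus, motorcycle ring road, Science Park]
  • 77. System Architecture
      - [Diagram labels: campus ring road, Electronics and Information Building, parking lot (legal parking) x5, parking lot x8, CMS, 608, 3D, HVR, CAD, illegal wall climbing by people, illegal roadside parking, wall climbing]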
  • 79. Issues with Parallelization
      - Good parallel programs
          - Execute correctly
          - With good speedup
      - Ideal speedup by Amdahl's law
          - Speedup = N if you have N cores and the whole program can be parallelized
      - However, no ideal speedup exists
          - Because of parallel overhead, such as data communication, data dependencies and synchronization
      - Other issues: design overhead
          - No free lunch for software development
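Amdahl's law, named on slide 79, gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized. The C sketch below evaluates it; the function name `amdahl_speedup` is hypothetical, and the formula itself models none of the overheads listed on the slide.

```c
/* Amdahl's law: speedup on n cores when a fraction p (0..1) of the
 * run time is parallelizable and the rest stays serial. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

With `p = 1.0` the function returns n, the ideal speedup quoted on the slide. With `p = 0.9` and `n = 240` it returns only about 9.6, which shows how strongly a small serial fraction limits the gain from many cores.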
  • 80. Parallel Computing on GPGPU
      - CUDA can only parallelize code for NVIDIA's GPGPUs
      - CUDA's programming model
          - Multithread
          - SPMD (Single Program Multiple Data)
      - Best-performance CUDA code needs optimization
          - Native code can be improved by CUDA 2~3 times
          - Optimization can be achieved by data parallelism, thread parallelism, and data localization
  • 81. Programming Challenges of CUDA
      - We have to parallelize the algorithm manually
      - We need expertise in
          - Algorithms of image and signal processing: filtering, frequency analysis, compression, feature extraction, recognition, ...
          - Theory, tools and methodology of parallel computing: communication, synchronization, resource management, load balancing, debugging, ...
  • 82. GPUs for Multimedia
      - Speedups shown on the slide, in reading order: 3.5X, 10X, 10X, 26X, 10X
      - Applications shown on the slide
          - PowerDirector7 Ultra
          - CUDA JPEG Decoder
          - DivideFrame GPU Decoder
          - Hyperspectral Image Compression on NVIDIA GPUs
          - GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
          - Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
  • 83. GPUs for Computer Vision (1/2)
      - 87X: CUDA SURF: A Real-time Implementation for SURF (TU Darmstadt)
      - 26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
      - 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
      - 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
      - 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
      - 100X: Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
      - 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
      - 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
  • 84. GPUs for Computer Vision (2/2)
      - 20X: GPU for Surveillance
      - 13X: Fast Human Detection with Cascaded Ensembles
      - 109X: Fast Sliding-Window Object Detection
      - 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
      - 300X: Audience Measurement: Real-time Video Analysis for Counting People, Face Detection and Tracking
      - 10X: Real-time Visual Tracker by Stream Processing
      - 45X: A GPU Accelerated Evolutionary Computer Vision System
      - 3X: Canny Edge Detection
  • 85. The ParLab in Berkeley
      - The Parallel Computing Lab at UC Berkeley: http://parlab.eecs.berkeley.edu
      - The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers
  • 86. Multicore Programming Practice (MPP)
      - Goal: write portable C/C++ programs that are "multicore ready" and platform compatible
      - Proposed by an MPP working group in the Multicore Association: http://www.multicore-association.org/workgroup/mpp.php
  • 87. Special Conference
      - HPEC: High Performance Embedded Computing
      - MIT Lincoln Lab, 1997 ~
  • 88. The End
      - Questions are welcome