RaVioli: A Parallel Video Processing Library with Auto Resolution Adjustability
1. RaVioli: A Parallel Video Processing Library with Auto Resolution Adjustability. Hiroko SAKURAI†, Masaomi OHNO†, Shintaro OKADA‡, Tomoaki TSUMURA†, Hiroshi MATSUO† († Nagoya Institute of Technology, Japan; ‡ Toyota Motor Corp., Japan). IADIS International Conference APPLIED COMPUTING 2009, November 19 – 21, 2009, Rome, Italy.
2. Background (1/2): Portability of Video Applications. Real-time video processing applications should run on a great variety of platforms, such as cell phones, cars, and PCs. The principal goals of an application include long battery life, high throughput, and good accuracy. We must rewrite a video processing program when porting it to another platform.
3. Background (2/2): The Many-Core Era is Coming. Multi-core and many-core processors have come into wide use. Video processing applications have various kinds of parallelism: pixels in video frames have data parallelism, and multiple frames can be processed in parallel by pipelining. These promise good performance on such parallel systems. However, parallelizing programs is not so simple, so it becomes much more important to improve compilers and libraries.
4. A Video Processing Library: RaVioli. RaVioli provides easy writeability of pseudo real-time video processing, and interfaces for parallelization: detecting data dependencies and formulating reductions, and balancing the loads of pipeline stages.
5. Outline. Concept of RaVioli; RaVioli hides resolutions from programmers; easy writeability of video processing applications; pseudo real-time processing by adjusting loads; semi-automatic parallelization functions (automatic block decomposition, and a pipelining interface with an automatic load balance mechanism); evaluation results.
6. Traditional Image Processing Program. An image processing program written in traditional C, which converts the input image InImg into the grayscale image OutImg:

    int main(void) {
        // Input image
        int luma;
        for (int y = 0; y < 180; y++) {
            for (int x = 0; x < 200; x++) {
                luma = (int)(InImg[x][y].R * 0.299
                           + InImg[x][y].G * 0.587
                           + InImg[x][y].B * 0.114);
                OutImg[x][y].R = luma;
                OutImg[x][y].G = luma;
                OutImg[x][y].B = luma;
            }
        }
        return 0;
    }
7. Image Processing Program with RaVioli. The grayscale program written with RaVioli. GrayScale is the component function, and procPix is the higher-order method that applies it to the RV_Image InImg to produce the RV_Image OutImg:

    RV_Pixel GrayScale(RV_Pixel Pix) {
        int luma;
        luma = (int)(Pix.R() * 0.299
                   + Pix.G() * 0.587
                   + Pix.B() * 0.114);
        return (Pix.setRGB(luma, luma, luma));
    }

    int main() {
        RV_Image InImg, OutImg;
        // Input image
        OutImg = InImg.procPix(GrayScale);
    }
8. Video Processing Program with RaVioli. Figure: a higher-order method of an RV_Image object takes the component function RV_Pixel GrayScale(RV_Pixel p){ }, and a higher-order method of an RV_Video object takes the component function RV_Image GrayScale(RV_Image img){ }.
11. Priority Set. Which stride should be increased? Moving object detection needs temporal resolution, while pattern recognition needs spatial resolution. We can specify resolution priorities with a priority set (spatial resolution, temporal resolution). (7,3): keep the spatial stride and the temporal stride in the ratio 3:7. (1,0): keep the spatial stride at 1. Figure: sampling with temporal stride St = 1 and St = 2, and with spatial stride Ss = 1 and Ss = 2.
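The priority set rule above can be sketched as follows. This is a minimal illustration, assuming that one stride is raised by 1 each time an overload is detected and that the stride to raise is the one that keeps Ss:St closest to temporal:spatial. PrioritySet, Strides, and increaseStride are hypothetical names, not RaVioli's API.

```cpp
// Hypothetical types for illustration; not part of the RaVioli API.
struct PrioritySet { int spatial; int temporal; };  // e.g. (7,3)
struct Strides     { int ss = 1;  int st = 1;  };   // Ss and St, both start at 1

// Assumed policy: on each overload, raise exactly one stride so that Ss:St
// approaches temporal:spatial (a higher priority means a smaller stride).
inline void increaseStride(Strides& s, const PrioritySet& p) {
    if (s.ss * p.spatial <= s.st * p.temporal) {
        ++s.ss;  // spatial stride is at or below its share: skip more pixels
    } else {
        ++s.st;  // otherwise skip more frames
    }
}
```

Under this assumed policy, the priority set (7,3) reaches Ss = 3, St = 7 after eight adjustments, and (1,0) never raises the spatial stride above 1.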
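12. Detecting Overload. Figure: the RV_Video class holds frames as RV_Image instances in a ring buffer, and a higher-order method passes each of them to the image processing program. When the frame interval is shorter than the processing time (frame interval < processing time), the system is overloaded.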
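The overload condition (frame interval < processing time) can be sketched with std::chrono. This is only an illustration of the comparison, assuming the frame interval is derived from a fixed frame rate; frameInterval, timeFrame, and isOverloaded are hypothetical helpers, and the ring buffer is left out.

```cpp
#include <chrono>

using Millis = std::chrono::duration<double, std::milli>;

// Hypothetical helper: time between two incoming frames at a given frame rate.
inline Millis frameInterval(double fps) {
    return Millis(1000.0 / fps);
}

// Measures how long one call of the per-frame work takes.
template <class Work>
Millis timeFrame(Work&& processFrame) {
    auto start = std::chrono::steady_clock::now();
    processFrame();
    return std::chrono::steady_clock::now() - start;
}

// The condition from the slide: overloaded when frame interval < processing time.
inline bool isOverloaded(Millis processingTime, Millis interval) {
    return interval < processingTime;
}
```

For example, at 30 fps the interval is about 33.3 ms, so a frame that takes 40 ms to process would count as an overload and one that takes 20 ms would not.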
16. Translator for Block Decomposition. The translator parallelizes a program by adding a second argument (4 here) to the procPix call. Reduction operations may be required. Program before translation:

    RV_Pix GrayScale(RV_Pix Pix) {
        int Y;
        Y = (int)(Pix.R() * 0.299
                + Pix.G() * 0.587
                + Pix.B() * 0.114);
        return (Pix.setRGB(Y, Y, Y));
    }

    int main() {
        RV_Img InImg, OutImg;
        OutImg = InImg.procPix(GrayScale);
    }

Program after translation (GrayScale is unchanged):

    int main() {
        RV_Img InImg, OutImg;
        OutImg = InImg.procPix(GrayScale, 4);
    }
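What a block-decomposed procPix might do internally can be sketched with std::thread. This is an assumption-laden illustration, not RaVioli's implementation: it treats the second argument as a number of contiguous blocks, and Pixel, grayScale, and this procPix are stand-ins defined here.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Pixel { int r, g, b; };  // stand-in for RV_Pix

// Same arithmetic as the GrayScale component function on the slide.
inline Pixel grayScale(Pixel p) {
    int y = static_cast<int>(p.r * 0.299 + p.g * 0.587 + p.b * 0.114);
    return Pixel{y, y, y};
}

// Stand-in for procPix(f, n): split the pixels into n contiguous blocks and
// apply f to each block on its own thread.
template <class F>
std::vector<Pixel> procPix(const std::vector<Pixel>& in, F f, unsigned blocks) {
    std::vector<Pixel> out(in.size());
    if (blocks == 0) blocks = 1;
    const std::size_t chunk = (in.size() + blocks - 1) / blocks;  // ceiling
    std::vector<std::thread> workers;
    for (unsigned b = 0; b < blocks; ++b) {
        const std::size_t begin = std::min(in.size(), b * chunk);
        const std::size_t end = std::min(in.size(), begin + chunk);
        if (begin == end) continue;  // nothing left for this block
        workers.emplace_back([&in, &out, f, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = f(in[i]);
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Because each output pixel depends only on its own input pixel, the blocks never write to the same element and no locking is needed. A call such as procPix(image, grayScale, 4) mirrors the translated line on the slide.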
17. For Reference: Example Code with OpenMP. OpenMP is a standardized model of parallel programming for C/C++ and Fortran.

    #define NUM_THREADS 4
    int i;
    int sum = 0;
    #pragma omp parallel for num_threads(NUM_THREADS) reduction(+:sum)
    for (i = 1; i <= 256; i++)
        sum += i;

With the reduction clause reduction(+:sum), Process 1 to Process 4 each accumulate a private partial sum (sum1 += i; sum2 += i; sum3 += i; sum4 += i), and the partial sums are then combined into sum.
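18. Reduction Operations can be Automatically Added. Original program:

    int sum = 0;

    void pixSum(RV_Pixel p) {   // component function
        sum += 1;
    }

    int main() {
        RV_Image InputImg;
        // read image data into "InputImg"
        InputImg.procPix(pixSum);
    }

The translator checks whether the update sum += 1 satisfies the associative law and the commutative law. Both hold, so it is treated as a reduction operation: sum += 1 in the component function becomes _localsum += 1, and sum += _localsum is performed in a generated reduction function. Translated program:

    int sum = 0;
    __thread int _localsum = 0;

    void pixSum(RV_Pixel p) {   // component function
        _localsum += 1;
    }

    void __pixSum(int threadNum) {   // reduction operation
        mutex_lock(&Mutex);
        sum += _localsum;
        mutex_unlock(&Mutex);
    }

    int main() {
        RV_Image InputImg;
        // read image data into "InputImg"
        InputImg.procPix(pixSum, 4);
        InputImg.reduction(__pixSum);
    }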
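The same local-sum-then-merge pattern can be written in standard C++ as a self-contained sketch. It assumes nothing about RaVioli itself: countPixels is a hypothetical function, a plain local variable plays the role of the __thread variable _localsum, and std::mutex replaces mutex_lock and mutex_unlock.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Counts pixels the way the translated code does: every thread adds into a
// private counter, then merges it into the shared total once, under a lock.
inline long countPixels(std::size_t numPixels, unsigned threads) {
    long sum = 0;  // shared result ("sum" on the slide)
    std::mutex m;
    if (threads == 0) threads = 1;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&sum, &m, numPixels, threads, t] {
            long localSum = 0;  // plays the role of "_localsum"
            for (std::size_t i = t; i < numPixels; i += threads) {
                localSum += 1;  // body of the component function
            }
            std::lock_guard<std::mutex> lock(m);  // reduction step
            sum += localSum;
        });
    }
    for (auto& w : workers) w.join();
    return sum;
}
```

The lock is taken once per thread rather than once per pixel, which is the point of moving the update into a private counter. The merge order does not matter because addition is associative and commutative.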
21. ... is troublesome for programmers. Figure: a pipeline of thread1, thread2, and thread3 with the stages binarize, edge detect, and Hough transform, connected by FIFO1, FIFO2, and FIFO3.
22. Interface for Pipelining.

    RV_Pipedata* GrayScale(RV_Pipedata* data) {
        // Grayscale processing for a frame
        return data;
    }

    RV_Pipedata* Laplacian(RV_Pipedata* data) {
        // Laplacian filter processing for a frame
        return data;
    }

    int main() {
        RV_Pipeline pipe;
        pipe.push(GrayScale);
        pipe.push(Laplacian);
        pipe.run();
        return 0;
    }

Figure: push adds GrayScale and Laplacian to the RV_Pipeline pipe, and run executes them on thread1 and thread2, which are connected by FIFO1 and FIFO2.
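A push/run interface of this shape can be sketched in standard C++ with one thread per stage and a blocking FIFO between stages. This is an illustration under assumptions, not RaVioli's implementation: Fifo and Pipeline are stand-ins defined here, and run takes the input frames as an argument, whereas the run() on the slide takes none.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Blocking FIFO; pop() returns false once the queue is closed and drained.
template <class T>
class Fifo {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> l(m_); closed_ = true; }
        cv_.notify_all();
    }
    bool pop(T& out) {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// Stand-in for RV_Pipeline: one thread per pushed stage, FIFOs in between.
template <class T>
class Pipeline {
public:
    void push(std::function<T(T)> stage) { stages_.push_back(std::move(stage)); }

    std::vector<T> run(const std::vector<T>& frames) {
        std::vector<Fifo<T>> fifos(stages_.size() + 1);
        std::vector<std::thread> threads;
        for (std::size_t s = 0; s < stages_.size(); ++s) {
            threads.emplace_back([this, &fifos, s] {
                T item;
                while (fifos[s].pop(item)) {
                    fifos[s + 1].push(stages_[s](std::move(item)));
                }
                fifos[s + 1].close();  // tell the next stage the input ended
            });
        }
        for (const T& f : frames) fifos[0].push(f);
        fifos[0].close();
        std::vector<T> out;
        T item;
        while (fifos.back().pop(item)) out.push_back(std::move(item));
        for (auto& t : threads) t.join();
        return out;
    }
private:
    std::vector<std::function<T(T)>> stages_;
};
```

Since each stage has exactly one thread and the queues are first-in first-out, frames leave the pipeline in the order they entered. Stages are bound to threads statically here, so a slow stage stalls the others.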
24. Load Imbalance between Stages. Figure: thread1, thread2, and thread3 run stages A, B, and C on frame1, frame2, and frame3 in turn. When the loads of the stages are imbalanced, the pipeline stalls.
25. Automatic Load Balancing. Figure: thread1, thread2, and thread3 process stages A, B, and C of frame1, frame2, and frame3, with threads reassigned between stages.
26. Automatic Load Balancing. Figure (continued): thread1, thread2, and thread3 and stages A, B, and C over frame1, frame2, and frame3.
28. Evaluation: Resolution Adjustment. Figure: frame rate (fps) and number of pixels for the priority sets (spatial resolution : temporal resolution) 0:1, 1:0, and 3:7.
33. Conclusion. RaVioli hides resolutions from programmers and provides pseudo real-time processing. It has semi-automatic parallelization functions: semi-automatic block decomposition and a load balancing mechanism between pipeline stages. Our future work includes implementing an automatic power-saving function in RaVioli, making RaVioli adaptive to various platforms such as the Cell Broadband Engine, and designing an easy-to-write language that cooperates with RaVioli.
34. Automatic Load Balancing. Figure: a manager alongside thread1, thread2, and thread3, stages A, B, and C, and frames 1 to 5.
35. Automatic Load Balancing. Figure: the manager holds the values A:1, B:1, C:4 for the stages, with thread1, thread2, and thread3 distributed over stages A, B, and C for frames 1 to 5.
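The slides do not spell out the manager's policy, so the following is only one plausible sketch, not RaVioli's mechanism. It assumes the values A:1, B:1, C:4 are relative stage costs and that an idle thread is sent to the stage with the largest backlog, measured as waiting frames times cost. pickStage is a hypothetical function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical manager policy: returns the index of the stage an idle thread
// should serve next, or -1 when no stage has frames waiting.
inline int pickStage(const std::vector<int>& pendingFrames,
                     const std::vector<int>& stageCost) {
    int best = -1;
    long bestBacklog = 0;
    for (std::size_t s = 0; s < pendingFrames.size() && s < stageCost.size(); ++s) {
        long backlog = static_cast<long>(pendingFrames[s]) * stageCost[s];
        if (backlog > bestBacklog) {  // ties keep the earlier stage
            bestBacklog = backlog;
            best = static_cast<int>(s);
        }
    }
    return best;
}
```

With costs {1, 1, 4} and one frame waiting at every stage, this choice sends the idle thread to stage C (index 2), so the heaviest stage receives more thread time than A or B.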