SlideShare ist ein Scribd-Unternehmen logo
1 von 1
Accelerating Particle Image Velocimetry using Hybrid Architectures
                                                                                         Vivek Venugopal, Cameron D. Patterson, Kevin Shinpaugh
                                                                                                                                                                                                                        vivekv@vt.edu, cdp@vt.edu, kashin@vt.edu


 Introduction                                                                            t                                                                                 (16 x16) or
                                                                                                                                                                           (32 x 32) or
                                                                                                                                                                                                                                                                                                                                                                           FFT-based PIV algorithm
      Cardio Vascular Disease (CVD)                        5%
                                                             4%3%                                                                                                          (64 x 64)
                                                                                                                                                                                                                                                                                                                                                                                                  Motion
                                                                                                                                                                                                                                                                                                                                                                                                                               • Each image is broken down into overlapping
      Cancer                                             6%                 35%                                                                                                                                                                                                                                                                                                                                                zones and the corresponding zones from both
      Other                                                                                                                                                                                                                                                                                                                                                                                       Vector
      Chronic Lung Respiratory Disease                                                       Image 1                                                                                                                                                                                                                                                                                                                           images are correlated.
      Accidents                                       24%                                t + dt
                                                                                                                                                                                                                                                               FFT
                                                                                                                                                                                                                                                                                                                                                                                                                               • The peak detection is done using the FFT
      Alzheimer’s Disease                                                                                                                                                            zone 1                                                                                                                                                                                                                                    block, which is then translated to depict the
      Diabetes                                                      23%                                                                                                                                                                                                                                                 Multiplication                                                 IFFT                Reduction           direction of the particles within the zone.
                                                                                                                                                                                                                                                                                                                                                                                                                               • The reduction block consists of a sub-pixel
                 Cause of Deaths in United States for 2005                                                                                                                                                                                                     FFT                                                                                                                                                             algorithm and a filtering routine to compute
                                                                                             Image 2
 • The Advanced Experimental Thermofluids Engineering                                                                                                                                 zone 2
                                                                                                                                                                                                                                                                                                                                                                                                                               the velocity value.
 (AEThER) Lab at Virginia Tech is involved in the area of
 cardiovascular fluid dynamics and stent hemodynamics,
 which is instrumental for designing intravascular stents
 for treating CVD.
                                                                                                                                                                                325 Apple Mac Pro nodes
                                                                                                                                                                                2 quad-core Intel Xeon processors with
                                                                                                                                                                                                                                                                                                                                                        Implementation platforms and Results
                                                                                                                                                                                                                                                                                                                                                                              zone_create
 • The AEThER lab uses the Particle Image Velocimetry                                                                                                                           8 GB RAM per node running Linux                                                                                                                                                             real2complex
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1.25
 (PIV) technique to track the motion of particles in an                                                                                                                         CentOS                                                                                                                                                                                      complex_mul
                                                                                                                                                                                                                                                                                                                                                                           complex_conj
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CPU
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  GPU
                                                                                                                                                                                System G platform
 illuminated flow field.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            FPGA




                                                                                                                                                                                                                                                                                                                                                   CUDA kernel functions
                                                                                                                                                                                                                                                                                                                                                                               conj_symm                                                                                           1.00   1.05
                                                                                                                 GPU                                                                                                                                                                                                                                                       c2c_radix2_sp
                                                                                                                                                                                                                                                                 Host (CPU)                                   Device (GPU)
                                                                                                                                                                                                                                                                                                                                                                            find_corrmax




                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Time in seconds
                                                                                                                                                                                                                    Multiprocessor 30



                                                                                                                                                                                                               Multiprocessor 2
                                                                                                                                                                                                                                                                                                  Grid
                                                                                                                                                                                                                                                                                                                                                                                transpose                                                                                          0.75
                                                                                                                                                                                                           Multiprocessor 1                                                                         Block
                                                                                                                                                                                                                                                                                                    (0,0)
                                                                                                                                                                                                                                                                                                                   Block
                                                                                                                                                                                                                                                                                                                   (0,1)
                                                                                                                                                                                                                                                                                                                                 Block
                                                                                                                                                                                                                                                                                                                                 (0,2)                                              x_field
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        0.65
                                                                                                                                            Shared Memory
                                                                                                                                                                                                                                                                   kernel 1
                                                                                                                                                                                                                                                                                                    Block          Block         Block                                              y_field
                                                                                                                      Registers             Registers                          Registers                                                                                                            (1,0)          (1,1)         (1,2)
                                                                                                                                                                                                                                                                                                                                                                            indx_reorder                                                                                           0.50
                                                                                                                                                                                                               Instruction                                                                                                                                                                                                     CUDA 2.1 with memcopy
                                                                                                                      Processor 1           Processor 2                        Processor 8                        Unit
                                                                                                                                                                                                                                                                                                               Block (1,1)                                                     meshgrid_x                                      CUDA 2.2 with memcopy
                                                                                                                                                                                                                                                                                                     Thread    Thread   Thread   Thread
                                                                                                                                                                                                                                                                                                                                                                               meshgrid_y                                      CUDA 2.2 with zerocopy                                            0.34
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   0.25
                                                                                                                                                                                                                  Constant                                                                            (0,0)     (0,1)    (0,2)    (0,3)
                                                                                                                                                                                                                    Cache
                                                                                                                                                                                                                                                                                                     Thread
                                                                                                                                                                                                                                                                                                      (1,0)
                                                                                                                                                                                                                                                                                                               Thread
                                                                                                                                                                                                                                                                                                                (1,1)
                                                                                                                                                                                                                                                                                                                        Thread
                                                                                                                                                                                                                                                                                                                         (1,2)
                                                                                                                                                                                                                                                                                                                                 Thread
                                                                                                                                                                                                                                                                                                                                  (1,3)
                                                                                                                                                                                                                                                                                                                                                                                  fft_indx
                                                                                                                                                                                                                      Texture
                                                                                                                                                                                                                       Cache
                                                                                                                                                                                                                                                                                                     Thread
                                                                                                                                                                                                                                                                                                    Block
                                                                                                                                                                                                                                                                                                      (2,0)
                                                                                                                                                                                                                                                                                                               Thread Thread
                                                                                                                                                                                                                                                                                                                    Block
                                                                                                                                                                                                                                                                                                                (2,1)   (2,2)
                                                                                                                                                                                                                                                                                                                                 Thread
                                                                                                                                                                                                                                                                                                                                  Block
                                                                                                                                                                                                                                                                                                                                  (2,3)
                                                                                                                                                                                                                                                                                                                                                                                  cc_indx
                                                                                                                                                                                                                                                                                                    (0,0)          (0,1)          (0,2)
                                                                                                                                                                                                                                                                                                                                                                                memcopy
                                                                                                                                                                                                                                                                   kernel 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     0
             Location of                    Pressure distribution                                                                                                         GPU memory
                                                                                                                                                                                                                                                                                                    Block
                                                                                                                                                                                                                                                                                                    (1,0)
                                                                                                                                                                                                                                                                                                                   Block
                                                                                                                                                                                                                                                                                                                   (1,1)
                                                                                                                                                                                                                                                                                                                                 Block
                                                                                                                                                                                                                                                                                                                                 (1,2)                                                    1.000      26.591          707.107         18803.015      500000.000                            Execution Device
          Region of Interest                  within the heart
                                                                                                                                                                                                                                                                                                                                                                                                               GPU time in usecs
                                                                                              CUDA hardware model for Tesla C1060                                                                                                                                         CUDA programming model
                                                                                                                                                                                                                                                                                                                                                                                                           CUDA profile graph comparison                                               Execution time comparison
                                                                                                                                                                                                                                                                 SFG representation
                                                                                                                                                                                                                                                                   of application
                                                     Case 1                                                                                                                                                                                                          algorithm                    PRO-PART
                                                                                                                                                                                                                                                                         +


                                                                                                                                                                                                                                                                                                                                                                           Conclusion
                                                                                                                                                                                                                                                                                                  Design flow
                                                                                                                                              ML310 board 1                                                                           ML310 board 2
                                                                                                                                                                                                                                                                     component
                                                                                                                                                                                                                                                                   specification of
                              Stent                                      Each case =                                      Aurora switches                                                                         Aurora switches
                                                                                                                                                                                                                                                               implementation platform
                                                     Case 2                                                                                                                                                                                                                                                                       NOFIS platform
                          configuration 1                              1250 image pairs
                                                                                                                              FSL                                                                                       FSL
                                                                                                                                                                                                                                                                 Input specifications
                                                                                                                                                                            Aurora




                                                                                                                                                                                                                                                                                                                                                             • Nvidia's Tesla C1060 GPU provides computational speedup as compared to both
                                                                                                                        PE1            PE2                                                                      PE1             PE2
                                                                      x 5 MB = 6.25 GB
                                                                                                                                                        Aurora switches




                                                                                                                                                                                                                                             Aurora switches
                                                                                              Aurora switches




                                                                                                                                                                                       Aurora switches




      PIV                                                                                                       FSL                                                                                      FSL

                              Stent
  experimental
                          configuration 2
                                                                                                                                                                                                                                                                                                                                                             the sequential CPU implementation and the NOFIS implementation for the PIV
                                                                                                                        PE3            PE4                                                                      PE3             PE4                                                      SAFC dataflow          structure and
     setup                                                                                                                                                                                                                                                                               capture using
                                                                                                                                                                                                                                                                                             SFG
                                                                                                                                                                                                                                                                                                               components
                                                                                                                                                                                                                                                                                                               specification
                                                    Case 100
                                                                                                                                                                                                                                                                                                                                                             application.
                                                                                                                          Aurora switches                                                                         Aurora switches




                                                                                                                                                                                                                                                                                                                                                             • The synchronization and the latency in data movement can be optimized between
                                                                                                                                            Aurora                                                         Aurora
                                                                                                                                              ML310 board 4                                                                           ML310 board 3                                              partitioning and
                               Stent                                                                                      Aurora switches                                                                         Aurora switches
                                                                                                                                                                                                                                                                                              communication resource
                                                                                                                                                                                                                                                                                                  specification


                                                                                                                                                                                                                                                                                                                                                             the FPGAs by having custom communication interfaces without an Operating
                          configuration 20                                                                                     FSL                                                                                      FSL
                                                                                                                                                                                                                                                                                                                  automated
                                                                                                                        PE1            PE2                                  Aurora                              PE1             PE2
                                                                                                                                                        Aurora switches




                                                                                                                                                                                                                                             Aurora switches
                                                                                              Aurora switches




                                                                                                                                                                                       Aurora switches




                                                                                                                                                                                                                                                                                                                                                             System (OS) overhead.
                                                                                                                FSL                                                                                      FSL                                                                                  configure and generate


 Time for FlowIQ analysis of one image pair on 2GHz                                                                                                                                                                                                                                          values for communication
                                                                                                                                                                                                                                                                                                       cores



                                                                                                                                                                                                                                                                                                                                                             • The PRO-PART design flow facilitates a fast and easier mapping of a streaming
                                                                                                                        PE3            PE4                                                                      PE3             PE4




 Xeon processor = 16 minutes. So time required for                                                                        Aurora switches                                                                         Aurora switches




 complete dataset, 1250 image pairs x 100 cases x 20
                                                                                                                                                                                                                                                                                               mapping to hardware
                                                                                                                                                                                                                                                                                                                                                             application on the specialized NOFIS platform with a significant advantage in
 stent configurations = 2.6 years                                                             Multi-Core SAFC hardware: NOFIS platform                                                                                                                                            PRO-PART design flow                                                         development time.


References
[1] American Heart Association. (Last Accessed: February 2009) Cardiovascular Disease Statistics. [Online]. Available: http://www.americanheart.org/presenter.jhtml?identifier=4478
[2] A. Eckstein, J. Charonko, and P.Vlachos. Phase Correlation Processing for DPIV Measurements: Part 1 Spatial Domain Analysis. In Proceedings of FEDSM2007,5th Joint ASME/JSME Fluids Engineering
Conference, number FEDSM2007-37286, San Diego, California, August 2007. ASME.
[3] Nvidia Inc. (Last Accessed: February 2009) Nvidia Tesla C1060 GPU Computing Processor. [Online]. Available: http://www.nvidia.com/object/product_tesla_c1060_us.html

Weitere ähnliche Inhalte

Andere mochten auch

Beautiful World of Snow
Beautiful World of SnowBeautiful World of Snow
Beautiful World of SnowRen
 
Quelle beaute d jk1
Quelle beaute d jk1Quelle beaute d jk1
Quelle beaute d jk1Renée Bukay
 
Hardware Trojans By - Anupam Tiwari
Hardware Trojans By - Anupam TiwariHardware Trojans By - Anupam Tiwari
Hardware Trojans By - Anupam TiwariOWASP Delhi
 
The Most Amazing Waterfalls
The Most Amazing Waterfalls The Most Amazing Waterfalls
The Most Amazing Waterfalls Joanne Creasy
 
Worlds scariest bridge 2014
Worlds scariest bridge 2014Worlds scariest bridge 2014
Worlds scariest bridge 2014Eugene Koh
 

Andere mochten auch (8)

Beautiful world places
Beautiful world placesBeautiful world places
Beautiful world places
 
30 Most Beautiful Waterfalls in India
30 Most Beautiful Waterfalls in India30 Most Beautiful Waterfalls in India
30 Most Beautiful Waterfalls in India
 
Beautiful World of Snow
Beautiful World of SnowBeautiful World of Snow
Beautiful World of Snow
 
Quelle beaute d jk1
Quelle beaute d jk1Quelle beaute d jk1
Quelle beaute d jk1
 
Hardware Trojans By - Anupam Tiwari
Hardware Trojans By - Anupam TiwariHardware Trojans By - Anupam Tiwari
Hardware Trojans By - Anupam Tiwari
 
Waterfalls
WaterfallsWaterfalls
Waterfalls
 
The Most Amazing Waterfalls
The Most Amazing Waterfalls The Most Amazing Waterfalls
The Most Amazing Waterfalls
 
Worlds scariest bridge 2014
Worlds scariest bridge 2014Worlds scariest bridge 2014
Worlds scariest bridge 2014
 

Mehr von Vivek Venugopalan

xDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
xDEFENSE: An Extended DEFENSE for mitigating Next Generation IntrusionsxDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
xDEFENSE: An Extended DEFENSE for mitigating Next Generation IntrusionsVivek Venugopalan
 
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGADesign, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGAVivek Venugopalan
 
Accelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsAccelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsVivek Venugopalan
 
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...Vivek Venugopalan
 
Real-time processing for ATST
Real-time processing for ATSTReal-time processing for ATST
Real-time processing for ATSTVivek Venugopalan
 
Accelerating Particle Image Velocimetry using Hybrid Architectures
Accelerating Particle Image Velocimetry using Hybrid ArchitecturesAccelerating Particle Image Velocimetry using Hybrid Architectures
Accelerating Particle Image Velocimetry using Hybrid ArchitecturesVivek Venugopalan
 

Mehr von Vivek Venugopalan (7)

xDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
xDEFENSE: An Extended DEFENSE for mitigating Next Generation IntrusionsxDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
xDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
 
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGADesign, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
 
Accelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsAccelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUs
 
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
 
Real-time processing for ATST
Real-time processing for ATSTReal-time processing for ATST
Real-time processing for ATST
 
Accelerating Particle Image Velocimetry using Hybrid Architectures
Accelerating Particle Image Velocimetry using Hybrid ArchitecturesAccelerating Particle Image Velocimetry using Hybrid Architectures
Accelerating Particle Image Velocimetry using Hybrid Architectures
 
CISL talk
CISL talkCISL talk
CISL talk
 

Kürzlich hochgeladen

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 

Kürzlich hochgeladen (20)

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

Accelerating PIV using hybrid architectures

  • 1. Accelerating Particle Image Velocimetry using Hybrid Architectures Vivek Venugopal, Cameron D. Patterson, Kevin Shinpaugh vivekv@vt.edu, cdp@vt.edu, kashin@vt.edu Introduction t (16 x16) or (32 x 32) or FFT-based PIV algorithm Cardio Vascular Disease (CVD) 5% 4%3% (64 x 64) Motion • Each image is broken down into overlapping Cancer 6% 35% zones and the corresponding zones from both Other Vector Chronic Lung Respiratory Disease Image 1 images are correlated. Accidents 24% t + dt FFT • The peak detection is done using the FFT Alzheimer’s Disease zone 1 block, which is then translated to depict the Diabetes 23% Multiplication IFFT Reduction direction of the particles within the zone. • The reduction block consists of a sub-pixel Cause of Deaths in United States for 2005 FFT algorithm and a filtering routine to compute Image 2 • The Advanced Experimental Thermofluids Engineering zone 2 the velocity value. (AEThER) Lab at Virginia Tech is involved in the area of cardiovascular fluid dynamics and stent hemodynamics, which is instrumental for designing intravascular stents for treating CVD. 325 Apple Mac Pro nodes 2 quad-core Intel Xeon processors with Implementation platforms and Results zone_create • The AEThER lab uses the Particle Image Velocimetry 8 GB RAM per node running Linux real2complex 1.25 (PIV) technique to track the motion of particles in an CentOS complex_mul complex_conj CPU GPU System G platform illuminated flow field. FPGA CUDA kernel functions conj_symm 1.00 1.05 GPU c2c_radix2_sp Host (CPU) Device (GPU) find_corrmax Time in seconds Multiprocessor 30 Multiprocessor 2 Grid transpose 0.75 Multiprocessor 1 Block (0,0) Block (0,1) Block (0,2) x_field 0.65 Shared Memory kernel 1 Block Block Block y_field Registers Registers Registers (1,0) (1,1) (1,2) indx_reorder 0.50 Instruction CUDA 2.1 with memcopy Processor 1 Processor 2 Processor 8 Unit Block (1,1) meshgrid_x CUDA 2.2 with memcopy Thread Thread Thread Thread meshgrid_y CUDA 2.2 with zerocopy 0.34 0.25 Constant (0,0) (0,1) (0,2) (0,3) Cache Thread (1,0) Thread (1,1) Thread (1,2) Thread (1,3) fft_indx Texture Cache Thread Block (2,0) Thread Thread Block (2,1) (2,2) Thread Block (2,3) cc_indx (0,0) (0,1) (0,2) memcopy kernel 2 0 Location of Pressure distribution GPU memory Block (1,0) Block (1,1) Block (1,2) 1.000 26.591 707.107 18803.015 500000.000 Execution Device Region of Interest within the heart GPU time in usecs CUDA hardware model for Tesla C1060 CUDA programming model CUDA profile graph comparison Execution time comparison SFG representation of application Case 1 algorithm PRO-PART + Conclusion Design flow ML310 board 1 ML310 board 2 component specification of Stent Each case = Aurora switches Aurora switches implementation platform Case 2 NOFIS platform configuration 1 1250 image pairs FSL FSL Input specifications Aurora • Nvidia's Tesla C1060 GPU provides computational speedup as compared to both PE1 PE2 PE1 PE2 x 5 MB = 6.25 GB Aurora switches Aurora switches Aurora switches Aurora switches PIV FSL FSL Stent experimental configuration 2 the sequential CPU implementation and the NOFIS implementation for the PIV PE3 PE4 PE3 PE4 SAFC dataflow structure and setup capture using SFG components specification Case 100 application. Aurora switches Aurora switches • The synchronization and the latency in data movement can be optimized between Aurora Aurora ML310 board 4 ML310 board 3 partitioning and Stent Aurora switches Aurora switches communication resource specification the FPGAs by having custom communication interfaces without an Operating configuration 20 FSL FSL automated PE1 PE2 Aurora PE1 PE2 Aurora switches Aurora switches Aurora switches Aurora switches System (OS) overhead. FSL FSL configure and generate Time for FlowIQ analysis of one image pair on 2GHz values for communication cores • The PRO-PART design flow facilitates a fast and easier mapping of a streaming PE3 PE4 PE3 PE4 Xeon processor = 16 minutes. So time required for Aurora switches Aurora switches complete dataset, 1250 image pairs x 100 cases x 20 mapping to hardware application on the specialized NOFIS platform with a significant advantage in stent configurations = 2.6 years Multi-Core SAFC hardware: NOFIS platform PRO-PART design flow development time. References [1] American Heart Association. (Last Accessed: February 2009) Cardiovascular Disease Statistics. [Online]. Available: http://www.americanheart.org/presenter.jhtml?identifier=4478 [2] A. Eckstein, J. Charonko, and P.Vlachos. Phase Correlation Processing for DPIV Measurements: Part 1 Spatial Domain Analysis. In Proceedings of FEDSM2007,5th Joint ASME/JSME Fluids Engineering Conference, number FEDSM2007-37286, San Diego, California, August 2007. ASME. [3] Nvidia Inc. (Last Accessed: February 2009) Nvidia Tesla C1060 GPU Computing Processor. [Online]. Available: http://www.nvidia.com/object/product_tesla_c1060_us.html