1. Accelerating Particle Image Velocimetry using Hybrid Architectures
Vivek Venugopal, Cameron D. Patterson, Kevin Shinpaugh
vivekv@vt.edu, cdp@vt.edu, kashin@vt.edu
Introduction t (16 x16) or
(32 x 32) or
FFT-based PIV algorithm
Cardio Vascular Disease (CVD) 5%
4%3% (64 x 64)
Motion
• Each image is broken down into overlapping
Cancer 6% 35% zones and the corresponding zones from both
Other Vector
Chronic Lung Respiratory Disease Image 1 images are correlated.
Accidents 24% t + dt
FFT
• The peak detection is done using the FFT
Alzheimer’s Disease zone 1 block, which is then translated to depict the
Diabetes 23% Multiplication IFFT Reduction direction of the particles within the zone.
• The reduction block consists of a sub-pixel
Cause of Deaths in United States for 2005 FFT algorithm and a filtering routine to compute
Image 2
• The Advanced Experimental Thermofluids Engineering zone 2
the velocity value.
(AEThER) Lab at Virginia Tech is involved in the area of
cardiovascular fluid dynamics and stent hemodynamics,
which is instrumental for designing intravascular stents
for treating CVD.
325 Apple Mac Pro nodes
2 quad-core Intel Xeon processors with
Implementation platforms and Results
zone_create
• The AEThER lab uses the Particle Image Velocimetry 8 GB RAM per node running Linux real2complex
1.25
(PIV) technique to track the motion of particles in an CentOS complex_mul
complex_conj
CPU
GPU
System G platform
illuminated flow field. FPGA
CUDA kernel functions
conj_symm 1.00 1.05
GPU c2c_radix2_sp
Host (CPU) Device (GPU)
find_corrmax
Time in seconds
Multiprocessor 30
Multiprocessor 2
Grid
transpose 0.75
Multiprocessor 1 Block
(0,0)
Block
(0,1)
Block
(0,2) x_field
0.65
Shared Memory
kernel 1
Block Block Block y_field
Registers Registers Registers (1,0) (1,1) (1,2)
indx_reorder 0.50
Instruction CUDA 2.1 with memcopy
Processor 1 Processor 2 Processor 8 Unit
Block (1,1) meshgrid_x CUDA 2.2 with memcopy
Thread Thread Thread Thread
meshgrid_y CUDA 2.2 with zerocopy 0.34
0.25
Constant (0,0) (0,1) (0,2) (0,3)
Cache
Thread
(1,0)
Thread
(1,1)
Thread
(1,2)
Thread
(1,3)
fft_indx
Texture
Cache
Thread
Block
(2,0)
Thread Thread
Block
(2,1) (2,2)
Thread
Block
(2,3)
cc_indx
(0,0) (0,1) (0,2)
memcopy
kernel 2
0
Location of Pressure distribution GPU memory
Block
(1,0)
Block
(1,1)
Block
(1,2) 1.000 26.591 707.107 18803.015 500000.000 Execution Device
Region of Interest within the heart
GPU time in usecs
CUDA hardware model for Tesla C1060 CUDA programming model
CUDA profile graph comparison Execution time comparison
SFG representation
of application
Case 1 algorithm PRO-PART
+
Conclusion
Design flow
ML310 board 1 ML310 board 2
component
specification of
Stent Each case = Aurora switches Aurora switches
implementation platform
Case 2 NOFIS platform
configuration 1 1250 image pairs
FSL FSL
Input specifications
Aurora
• Nvidia's Tesla C1060 GPU provides computational speedup as compared to both
PE1 PE2 PE1 PE2
x 5 MB = 6.25 GB
Aurora switches
Aurora switches
Aurora switches
Aurora switches
PIV FSL FSL
Stent
experimental
configuration 2
the sequential CPU implementation and the NOFIS implementation for the PIV
PE3 PE4 PE3 PE4 SAFC dataflow structure and
setup capture using
SFG
components
specification
Case 100
application.
Aurora switches Aurora switches
• The synchronization and the latency in data movement can be optimized between
Aurora Aurora
ML310 board 4 ML310 board 3 partitioning and
Stent Aurora switches Aurora switches
communication resource
specification
the FPGAs by having custom communication interfaces without an Operating
configuration 20 FSL FSL
automated
PE1 PE2 Aurora PE1 PE2
Aurora switches
Aurora switches
Aurora switches
Aurora switches
System (OS) overhead.
FSL FSL configure and generate
Time for FlowIQ analysis of one image pair on 2GHz values for communication
cores
• The PRO-PART design flow facilitates a fast and easier mapping of a streaming
PE3 PE4 PE3 PE4
Xeon processor = 16 minutes. So time required for Aurora switches Aurora switches
complete dataset, 1250 image pairs x 100 cases x 20
mapping to hardware
application on the specialized NOFIS platform with a significant advantage in
stent configurations = 2.6 years Multi-Core SAFC hardware: NOFIS platform PRO-PART design flow development time.
References
[1] American Heart Association. (Last Accessed: February 2009) Cardiovascular Disease Statistics. [Online]. Available: http://www.americanheart.org/presenter.jhtml?identifier=4478
[2] A. Eckstein, J. Charonko, and P.Vlachos. Phase Correlation Processing for DPIV Measurements: Part 1 Spatial Domain Analysis. In Proceedings of FEDSM2007,5th Joint ASME/JSME Fluids Engineering
Conference, number FEDSM2007-37286, San Diego, California, August 2007. ASME.
[3] Nvidia Inc. (Last Accessed: February 2009) Nvidia Tesla C1060 GPU Computing Processor. [Online]. Available: http://www.nvidia.com/object/product_tesla_c1060_us.html