CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

byteLAKE's presentation from the PPAM 2019 conference.

The goal of this work is to adapt four CFD kernels to the Xilinx Alveo U250 FPGA: the first-order step of the non-linear, iterative upwind advection scheme MPDATA (non-oscillatory forward-in-time); the divergence part of the matrix-free linear operator formulation in the iterative Krylov scheme; the tridiagonal Thomas algorithm for vertical matrix inversion inside the preconditioner of the iterative solver; and the computation of the pseudovelocity for the second pass of the upwind algorithm in MPDATA. All kernels operate on a 3-dimensional compute domain consisting of 7 to 11 arrays. Since all kernels are memory-bound, our main challenge is to achieve the highest possible utilization of global memory bandwidth. Our adaptation reduces the execution time by up to 4x.

Find out more at: www.byteLAKE.com/en/CFD

Footnote:
This is the presentation about the non-AI version of byteLAKE's CFD kernels, highly optimized for Alveo FPGA. Based on this research project and many others in the CFD space, we decided to shift the course of the CFD Suite product development and leverage AI to accelerate computations and enable new possibilities. Instead of adapting CFD solvers to accelerators, we use AI and work on a cross-platform solution. More on the latest: www.byteLAKE.com/en/CFDSuite.

Update for 2020: byteLAKE is currently developing CFD Suite as AI for CFD Suite, a collection of AI (Artificial Intelligence) models that accelerate CFD simulations and enable new features for them. It is a cross-platform solution (not only for FPGAs). More: www.byteLAKE.com/en/CFDSuite.

CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

  1. CFD code adaptation to the FPGA architecture
     DSc PhD Krzysztof ROJEK, byteLAKE’s CTO
     PPAM 2019, Bialystok, Poland, September 8-11, 2019
  2. Background
     • Current trends in the FPGA market
     • Common FPGA applications
     • FPGA access
     • Architecture of the Xilinx Alveo U250 FPGA
     • Evaluation metrics
     • Algorithm scenario
     • Development of FPGA codes
     • Algorithm design
     • OpenCL kernel processing
     • Memory queue
     • Limitations of memory access
     • Burst memory access
     • Vectorization
     • Code regionalization
     • CPU implementation overview
     • Performance and Energy results
     • Conclusion
  3. Current trends in the FPGA market
  4. Common FPGA applications
     • Confirmed effectiveness
       – Audio processing
       – Image processing
       – Cryptography
       – Routers/switches/gateways software
       – Digital displays
       – Scientific instruments (amplifiers, radio astronomy, radars)
     • Current challenges
       – Machine learning
       – Deep learning
       – High Performance Computing (HPC)
  5. FPGA access
     • Test Drive in the Cloud
       – Nimbix: High Performance Computing & Supercomputing Platform
       – Other cloud providers, soon…
     • Your own cluster
       – RAM memory: 80GB (16GB for deployment only)
       – Hard disk space: 100GB
       – OS: RedHat, CentOS, Ubuntu
       – Xilinx Runtime – driver for Alveo
       – Deployment Shell – the communication layer physically implemented and flashed into the card
       – The Xilinx SDAccel IDE – framework for development
  6. Xilinx Alveo U250 FPGA
     • Premiere: October 02, 2018
     • Built on the Xilinx 16nm UltraScale™ architecture
     Memory
       Off-chip Memory Capacity         64 GB
       Off-chip Total Bandwidth         77 GB/s
       Internal SRAM Capacity           54 MB
       Internal SRAM Total Bandwidth    38 TB/s
     Power and Thermal
       Maximum Total Power              225 W
       Thermal Cooling                  Passive
     Clocks
       KERNEL CLK                       500 MHz
       DATA CLK                         300 MHz
  7. Xilinx Alveo U250 FPGA
     • The deployment shell that handles device bring-up and configuration over PCIe is contained within the static region of the FPGA
     • The resources in the dynamic region are available for creating custom accelerators
     [Diagram: dynamic regions SLR0, SLR1, SLR2, and SLR3; one static region; four DDR banks]
     Resources
       Look-Up Tables (LUTs) (K)    1341
       Registers (K)                2749
       36 Kb Block RAMs             2000
       288 Kb UltraRAMs             1280
  8. Is it good for you?
     • Desired features of a data center
       – Low price
       – Low energy consumption
       – High performance
       – Technical support
       – Reliability and fast service
     • Important metrics
       – Execution time [s]
       – Data throughput of a simulation [MB/s]
       – Power dissipation [W]
       – Energy consumption [J]
     • How many cards are required to achieve a desired performance?
     • How many cards can I handle within a given energy budget?
     • What performance can be achieved within my energy budget?
     • How do these results compare with the CPU-based solution?
  9. Real scientific scenario
     • Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
     • Advection algorithm: a method to predict how the transport of a substance (fluid) or quantity by bulk motion changes in time
       – An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
       – It is also the transport of energy by water or air
     • Based on the upwind scheme
     • 3D compute domain
     • Dataset (9 arrays + scalar):
       – 3 x velocity vectors
       – 2 x forces (implosion, explosion)
       – 2 x density vectors
       – 2 x transported substance (in, out)
       – t – time interval
     • Configuration:
       – Job setting (size, timestep)
       – Boundary conditions (periodic, open)
       – Data accuracy (double, single, half)
     [Diagrams: periodic domain in X dimension; open domain]
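The upwind idea behind this scenario can be illustrated in one dimension. The following is a simplified Python model, not byteLAKE's kernel: it assumes a single transported field, a constant positive Courant number `c` (velocity × timestep / cell size), and zero inflow for the open boundary. The names `upwind_step` and `periodic` are hypothetical.

```python
def upwind_step(x, c, periodic=True):
    """One first-order upwind step for a constant Courant number 0 <= c <= 1.

    x        -- list of cell values of the transported substance (input)
    c        -- Courant number u*dt/dx, assumed constant and positive
    periodic -- True: the left neighbor of cell 0 is the last cell;
                False: open domain, modeled here as zero inflow (assumption)
    """
    n = len(x)
    out = [0.0] * n
    for i in range(n):
        if i > 0:
            left = x[i - 1]
        else:
            left = x[n - 1] if periodic else 0.0
        # Each cell keeps (1 - c) of its content and receives c from upstream.
        out[i] = x[i] - c * (x[i] - left)
    return out
```

With `c = 1.0` every value moves exactly one cell downstream per step. In a periodic domain the substance leaving the last cell re-enters at cell 0, while in the open variant it leaves the domain.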
  10. Development
      • Config, makefile, and source
  13. Algorithm design
      • The compute domain is divided into 4 sub-domains
      • Host sends data to the FPGA global memory
      • Host calls the kernel to execute it on the FPGA (the kernel is called many times)
      • Each kernel call represents a single time step
      • FPGA sends the output array back to the host
      [Diagram labels: CPU, FPGA, compute domain, four sub-domains, data sending, data receiving, kernel call (N x call), kernel processing, migrate memory objects, copy buffer]
  14. Kernel processing
      • The kernel is distributed across 4 SLRs
      • Each sub-domain is allocated in a different memory bank
      • Data transfer occurs between neighboring memory banks
      [Diagram: Kernel_A to Kernel_D run on SLR0 to SLR3; Bank0 to Bank3 each hold one sub-domain]
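The split into four sub-domains, with one kernel call per time step and data exchanged only between neighbors, can be modeled on the host side. This is a minimal Python sketch under stated assumptions: a 1D periodic domain whose length is divisible by the number of parts, a constant positive Courant number `c`, and a one-cell halo. All function names are hypothetical and the slides do not show the actual exchange code.

```python
def upwind_cells(cells, left_halo, c):
    """Upwind update of one sub-domain; left_halo is the neighbor's last cell."""
    out = []
    prev = left_halo
    for v in cells:
        out.append(v - c * (v - prev))
        prev = v
    return out


def split(domain, parts=4):
    """Divide the domain into equally sized sub-domains (one per memory bank)."""
    size = len(domain) // parts
    return [domain[p * size:(p + 1) * size] for p in range(parts)]


def step_decomposed(subdomains, c):
    """One time step: exchange halos between neighbors, then update each part."""
    parts = len(subdomains)
    # Periodic domain: sub-domain p receives the last cell of sub-domain p-1.
    halos = [subdomains[(p - 1) % parts][-1] for p in range(parts)]
    return [upwind_cells(subdomains[p], halos[p], c) for p in range(parts)]


def run(domain, c, steps, parts=4):
    """Host-side flow: send data once, call the kernel per step, read back."""
    subdomains = split(domain, parts)            # "send" data to the banks
    for _ in range(steps):                       # one kernel call per time step
        subdomains = step_decomposed(subdomains, c)
    return [v for sub in subdomains for v in sub]    # "receive" the output
```

Running `run(domain, c, steps, parts=4)` and `run(domain, c, steps, parts=1)` on the same input gives identical results, which is the property a decomposition has to preserve.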
  15. Kernels communication with pipes
      • A pipe stores data organized as a FIFO
      • Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
      • Pipes must be statically defined outside of all kernel functions
      • Pipe names must consist of lowercase alphanumeric characters
      • Xilinx extended OpenCL pipes by adding a blocking mode that allows users to synchronize kernels
        pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
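The blocking FIFO behavior of a pipe can be imitated on a CPU with a bounded queue between two threads. The sketch below is only an analogy in Python, not Xilinx's pipe API: the producer blocks when the queue is full, the consumer blocks when it is empty, and the `None` end-of-stream marker is an assumption made for this example.

```python
import queue
import threading

PIPE_DEPTH = 512  # mirrors xcl_reqd_pipe_depth(512) from the slide


def producer(pipe, values):
    """Plays the role of the upstream kernel writing into the pipe."""
    for v in values:
        pipe.put(v)        # blocks while the pipe is full
    pipe.put(None)         # hypothetical end-of-stream marker


def consumer(pipe, results):
    """Plays the role of the downstream kernel reading from the pipe."""
    while True:
        v = pipe.get()     # blocks while the pipe is empty
        if v is None:
            break
        results.append(v * 2)


def stream(values, depth=PIPE_DEPTH):
    """Stream values from producer to consumer through a bounded FIFO."""
    pipe = queue.Queue(maxsize=depth)
    results = []
    t_prod = threading.Thread(target=producer, args=(pipe, values))
    t_cons = threading.Thread(target=consumer, args=(pipe, results))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return results
```

Because both ends block, the two sides stay synchronized even when the stream is much longer than the pipe depth, and the data never passes through a shared global buffer.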
  16. Memory queue
      • Each array is transferred from the global memory to the fast BRAM memory
      • To minimize the data traffic we use a memory queue across iterations
      [Diagram: global memory, BRAM]
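One plausible reading of the memory queue is a sliding window: planes already held in fast memory are reused in the next iteration, so each plane is read from global memory only once. The Python sketch below assumes a three-plane stencil and hypothetical names. The slides do not state the actual window size.

```python
from collections import deque


def stencil_with_queue(global_mem):
    """Sum of three neighboring planes, loading each plane only once.

    global_mem -- list of planes (each plane is a list of floats)
    Returns (results, number_of_plane_loads).
    """
    loads = 0
    window = deque(maxlen=3)          # models the BRAM-resident queue
    results = []
    for plane in global_mem:
        window.append(list(plane))    # one global-memory read per iteration
        loads += 1
        if len(window) == 3:
            prev, cur, nxt = window
            results.append([a + b + c for a, b, c in zip(prev, cur, nxt)])
    return results, loads
```

For n input planes this performs n loads, whereas reloading all three planes for each of the n − 2 outputs would take 3(n − 2) loads.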
  19. Memory access within a kernel
      • 31 pins are available in the Alveo U250
        – Each pointer to the global memory passed as a kernel argument reserves one memory pin
        – Each kernel reserves one memory pin
      • Using 4 banks and 4 kernels we can set up to 6 global pointers to the global memory
      • To send all required arrays we need to pack them into larger buffers (different for input and output data)
      • All kernel ports require 512-bit data access to achieve the highest memory bandwidth
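The figure of 6 pointers can be reproduced with simple arithmetic if one assumes that the limit applies per kernel and that the pins left after each kernel reserves its own are shared evenly. That interpretation is an assumption, as are the helper names below. The sketch also shows the packing idea: several arrays are concatenated into one buffer and addressed through offsets.

```python
TOTAL_PINS = 31        # from the slide
KERNELS = 4            # one kernel per SLR
PORT_BITS = 512        # required port width
FLOAT_BITS = 32        # single precision


def pointers_per_kernel(total_pins=TOTAL_PINS, kernels=KERNELS):
    """Each kernel reserves one pin; the remaining pins are split evenly."""
    return (total_pins - kernels) // kernels


def pack(arrays):
    """Concatenate arrays into one buffer; return the buffer and offsets."""
    buffer, offsets = [], []
    for a in arrays:
        offsets.append(len(buffer))
        buffer.extend(a)
    return buffer, offsets


def floats_per_access(port_bits=PORT_BITS, float_bits=FLOAT_BITS):
    """Number of single-precision values moved by one full-width access."""
    return port_bits // float_bits
```

With these numbers, (31 − 4) // 4 gives 6 pointers, fewer than the 9 arrays of the dataset, which is why packing is needed. A 512-bit port carries 16 single-precision values per access, matching the `float16` type used in the kernels.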
  20. Burst memory access/vectorization
      • Burst memory access
        – Loop pipelining
        – Port data width: 512 bits
        – Data copying separated from the computation
        – Vectorization
        void copy(__global const float16 * __restrict globMem) {
          float16 bram[tKM];
          …
          write_0: __attribute__((xcl_pipeline_loop))
          for(int kj=0; kj<tKM; ++kj) {
            bram[kj] = globMem[gIdx+kj];
          }
          …
        }
      [Diagram: execution over time, traditional vs. pipelining]
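  21. Stencil vectorization
      • Shifting elements within a vector (standard shuffle API is not supported)
      Scalar form: X[i] = Y[i-1]
      Vectorized form: X[i]=getM1(Y[i-1][15], Y[i]);
        __attribute__((always_inline))
        inline float16 getM1(const float a, const float16 b) {
          // View both vectors as arrays of scalars; realX is the scalar
          // type (float here, judging by the original casts)
          const realX *ptr2=(const realX*)&b;
          float16 out;
          realX *o=(realX*)&out;
          o[0] = a;  // lane 0 comes from the previous vector
          __attribute__((opencl_unroll_hint(15)))
          for(int i=1; i<VECS; ++i) {
            o[i] = ptr2[i-1];  // remaining lanes are shifted by one
          }
          return out;
        }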
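The effect of `getM1` is easiest to see on plain lists. The Python model below assumes 16 lanes per vector, as in `float16`. The `first` argument (the value entering lane 0 of the very first vector) is a hypothetical boundary value that the slide does not specify.

```python
VECS = 16  # lanes in a float16 vector


def get_m1(a, b):
    """Return b shifted right by one lane, with a inserted in lane 0."""
    assert len(b) == VECS
    return [a] + list(b[:VECS - 1])


def shift_by_one(y, first=0.0):
    """Vectorized form of X[i] = Y[i-1] over a list of 16-lane vectors."""
    x = []
    carry = first                  # value entering lane 0 of vector 0 (assumed)
    for vec in y:
        x.append(get_m1(carry, vec))
        carry = vec[VECS - 1]      # the last lane feeds the next vector
    return x
```

Flattening the output of `shift_by_one` gives the flattened input delayed by one element, which is what the scalar statement `X[i] = Y[i-1]` expresses.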
  22. Memory ports
      • The memory supports at most two accesses to a single array at a time
      Before (three reads of bramY per iteration):
        calc_0: __attribute__((xcl_pipeline_loop))
        for(int kj=0; kj<tKM; ++kj) {
          bramX[kj] = bramY[kj-off]+bramY[kj]+bramY[kj+off];
        }
      After (at most two accesses to each array per iteration):
        calc_0: __attribute__((xcl_pipeline_loop))
        for(int kj=0; kj<tKM; ++kj) {
          bramX[kj] = bramY[kj-off]+bramY[kj];
        }
        calc_1: __attribute__((xcl_pipeline_loop))
        for(int kj=0; kj<tKM; ++kj) {
          bramX[kj] = bramX[kj]+bramY[kj+off];
        }
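That the split loops compute the same result as the single loop can be checked with a small Python model. The sketch restricts the loops to interior indices so that `kj-off` and `kj+off` stay in range. This restriction is an assumption, since the slide does not show how the borders are handled.

```python
def three_reads(y, off):
    """Single loop: three reads of y per iteration."""
    x = [0.0] * len(y)
    for kj in range(off, len(y) - off):
        x[kj] = y[kj - off] + y[kj] + y[kj + off]
    return x


def two_loops(y, off):
    """Split version: each loop touches any one array at most twice."""
    x = [0.0] * len(y)
    for kj in range(off, len(y) - off):      # calc_0
        x[kj] = y[kj - off] + y[kj]
    for kj in range(off, len(y) - off):      # calc_1
        x[kj] = x[kj] + y[kj + off]
    return x
```

Both versions add the three terms in the same left-to-right order, so the results match exactly, not just up to rounding.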
  23. Regionalization
      • Independent regions in the code should be explicitly separated
      • This helps the compiler distribute the code among the LUTs
      • The separation can be done by adding braces around independent code blocks
        {
          //the first block of instructions
        }
        {
          //the second block of instructions
        }
  24. CPU implementation
      • Our CPU implementation utilizes two processors:
        – Intel® Xeon® CPU E5-2695 v2 2.40 – 3.2 GHz (2x12 cores)
      • The code adaptation includes:
        – 24 cores utilization
        – Loop transformations
        – Memory alignment
        – Thread affinity
        – Data locality within nested loops
        – Compiler optimizations
      • The final simulation throughput is: 3.7 GB/s
      • The power dissipation is: 142 Watts
  25. FPGA optimizations
  26. Results
                            FPGA      2xCPU     Ratio 2xCPU/FPGA
      Exec. time [s]        11.4      18.0      1.6
      Throughput [MB/s]     5840.8    3699.2    0.6
      Power [W]             101.0     142.0     1.4
      Energy [J]            1151.4    2556.0    2.2
      [Charts: Throughput [MB/s], the higher the better (FPGA 5840.8, 2xCPU 3699.2); Energy [J], the lower the better (FPGA 1151.4, 2xCPU 2556.0)]
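The table is internally consistent, which a few lines of Python can confirm: energy equals power times execution time, both platforms process the same amount of data, and the published ratio column is reproduced by dividing the 2xCPU value by the FPGA value. The dictionary keys and function names are hypothetical. The numbers are taken from the table.

```python
fpga = {"time_s": 11.4, "throughput_mb_s": 5840.8, "power_w": 101.0}
cpu = {"time_s": 18.0, "throughput_mb_s": 3699.2, "power_w": 142.0}


def energy_j(run):
    """Energy [J] = power [W] x execution time [s]."""
    return run["power_w"] * run["time_s"]


def data_volume_mb(run):
    """Data processed [MB] = throughput [MB/s] x execution time [s]."""
    return run["throughput_mb_s"] * run["time_s"]


def ratios(cpu_run, fpga_run):
    """2xCPU value divided by FPGA value, rounded to one decimal place."""
    return {
        "time": round(cpu_run["time_s"] / fpga_run["time_s"], 1),
        "throughput": round(
            cpu_run["throughput_mb_s"] / fpga_run["throughput_mb_s"], 1),
        "power": round(cpu_run["power_w"] / fpga_run["power_w"], 1),
        "energy": round(energy_j(cpu_run) / energy_j(fpga_run), 1),
    }
```

Read this way, the FPGA finishes the advection run about 1.6 times faster than the two CPUs and uses about 2.2 times less energy.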
  27. byteLAKE’s ecosystem of partners
      Complete solutions for CFD market
      ➢ HPC system design, build-up and configuration
      ➢ HPC software applications development and optimization to make the most of the hardware
      … and more
  28. Accelerated CFD Kernels
      • Compatible with geophysical models like EULAG
      • CFD kernels: Advection, Pseudovelocity, Divergence, Thomas algorithm
      • Faster time to results and more efficient processing compared to CPU-only nodes
      • 4x faster
      • 80% lower energy consumption
      • 6x better performance per Watt
      More at: byteLAKE.com/en/CFD
      About byteLAKE
      • AI (highly optimized AI engines to analyze text, image, video, time series data)
      • HPC (highly optimized apps and kernels for HPC architectures)
  29. Contact me: krojek@byteLAKE.com
  30. byteLAKE (www.byteLAKE.com): Building solutions for real-life business problems
      We build AI and HPC solutions, focusing on software. We use machine/deep learning to bring automation and optimize operations in businesses across various industries. We create highly optimized software for supercomputers. Our researchers hold PhD and DSc degrees.
      • AI (highly optimized AI engines to analyze text, image, video, time series data)
      • HPC (highly optimized apps and kernels for HPC architectures)