Field Programmable Gate Arrays (FPGAs) are configurable integrated circuits that provide a good trade-off in terms of performance, power consumption, and flexibility with respect to other architectures, such as CPUs, GPUs, and ASICs. The main drawback of using FPGAs, however, is their steep learning curve. An emerging solution to this problem is to write algorithms in a Domain Specific Language (DSL) and let the DSL compiler generate efficient code targeting FPGAs. This work proposes FROST, a unified backend that enables different DSL compilers to target FPGA architectures. Unlike other code generation frameworks targeting FPGAs, FROST exploits a scheduling co-language that gives users full control over which optimizations to apply in order to generate efficient code (e.g. loop pipelining, array partitioning, vectorization). FROST first analyzes and manipulates the input Abstract Syntax Tree (AST) to apply FPGA-oriented transformations and optimizations, then generates a C/C++ implementation suitable for High-Level Synthesis (HLS) tools. Finally, the output of the HLS phase is synthesized and implemented on the target FPGA using the Xilinx SDAccel toolchain.
FROST currently supports Halide, an image processing DSL, and Tiramisu, a DSL optimizer, as frontends, and achieves significant speedups with respect to state-of-the-art FPGA implementations of the same algorithms.
A Common Backend for Hardware Acceleration of DSLs on FPGA
1. A Common Backend for Hardware Acceleration of DSLs on FPGA
E. Del Sozzo1, R. Baghdadi2, S. Amarasinghe2, M. D. Santambrogio1
emanuele.delsozzo@polimi.it
1 Politecnico di Milano
2 Massachusetts Institute of Technology
Google, Mountain View
05/29/18
2. Context Definition
• Domain Specific Languages (DSLs)[1,2,3,4] main features:
– Portability
– Productivity
– Specific optimizations
• Many of the computations described by DSLs (stencils, linear algebra, etc.) are ideal candidates for hardware acceleration
[1] Halide - http://halide-lang.org
[2] Tensorflow - https://www.tensorflow.org
[3] Julia - https://julialang.org
[4] Theano - http://deeplearning.net/software/theano/
3. Hardware Acceleration of DSLs
• FPGAs offer a good trade-off in terms of performance,
power consumption, and flexibility w.r.t. GPUs, CPUs,
and ASICs
• However, FPGAs have a steep learning curve and it is
complex to produce efficient designs
• Existing High-Level Synthesis (HLS) tools may help, but they provide:
– A low level of abstraction
– Limited support for high-level languages
4. Current DSL to FPGA Workflow
[Figure: a DSL frontend takes the input algorithm through translation and optimization passes (Pass 1 … Pass N), generates C/C++ code, lowers it to HDL, and finishes with FPGA synthesis]
[5] Hegarty, James, et al. “Darkroom: compiling high-level image processing code into hardware pipelines.” ACM Trans. Graph. 33.4 (2014): 144-1
[6] Chugh, Nitin, et al. “A DSL compiler for accelerating image processing pipelines on FPGAs.” Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 2016.
[7] Stewart, Robert, et al. “A dataflow IR for memory efficient RIPL compilation to FPGAs.” International Conference on Algorithms and Architectures for Parallel Processing. Springer International Publishing, 2016.
6. Current DSL to FPGA Workflow
Different flows for different DSLs
7. Current DSL to FPGA Workflow
Different flows for different DSLs, but most of the passes are in common
8. FROST
• A common backend for the hardware acceleration of
DSLs on FPGAs
• FROST is designed to express data parallel algorithms
operating on dense arrays:
– image processing
– dense linear and tensor algebra
– stencil computations
– etc.
• Unlike other similar tools[5-7], FROST leverages a high-level scheduling co-language to specify the optimizations
10. Frontends
• FROST currently supports as frontends:
– Halide
– Tiramisu
• The user can exploit both:
– High-level languages and abstractions for designing the algorithm
– Frontend features (e.g. scheduling and optimizations)
• An ad-hoc translator produces FROST IR from the input
[Figure: frontends (Halide, Tiramisu, Julia, Theano, other DSLs) feed into FROST, which targets the Xilinx SDAccel toolchain]
11. Halide
• Halide[8,9] is a state-of-the-art DSL and compiler for
image processing pipelines
• Halide exploits a separate scheduling co-language
• The user can optimize the computation at different
levels:
– Loop tiling, unrolling, fusion, …
– Parallelization, vectorization, …
– …
[8] Ragan-Kelley, Jonathan, et al. “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.” ACM
SIGPLAN Notices 48.6 (2013)
[9] Tyler Denniston, Shoaib Kamil, and Saman Amarasinghe. “Distributed Halide.” Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming. ACM, 2016.
12. Tiramisu
• Tiramisu[10] is organized in four layers:
– Layer I: Abstract Algorithm (written by the developer; portable performance across a range of platforms)
– Layer II: Computation Management (automatic or user-specified schedules)
– Layer III: Data Management (automatic or user-specified data mapping)
– Layer IV: Communication Management (distribution across nodes)
• Code generation lowers the layers to an Abstract Syntax Tree, from which Tiramisu targets multiple backends: vectorized parallel x86, GPU (Nvidia), FPGA (Xilinx), …
[10] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe. “Tiramisu: A Code Optimization Framework for High Performance Systems.” arXiv
14. FROST Scheduling Co-Language
• Generating efficient code for FPGAs is a complex task
• HLS tools provide a set of directives to apply optimizations
• However, most of the optimization work is still up to the designer
• Therefore, evaluating different hardware designs quickly turns into a
time-consuming and error-prone task
15. FROST Scheduling Commands
Computation Scheduling Commands
– pipeline(i): marks dimension i to be pipelined
– unroll(i): marks dimension i to be unrolled
– flatten(i): marks dimension i to be flattened
– vectorize(b, n): marks buffer b to be vectorized in chunks of n bits
Local Memory Scheduling Commands
– partition(b, d, t): marks buffer b to be partitioned along dimension d using partitioning type t
Architecture Scheduling Commands
– tiled(): marks the architecture to be implemented in a tiled way
– streaming(): marks the architecture to be implemented in a streaming way
FROST scheduling commands can be combined with the frontends' scheduling commands
19. Backend
• From the optimized IR, FROST generates C++ code for HLS tools
• This allows a preliminary analysis of the resulting design (e.g. latency, resource usage, …)
• Xilinx SDAccel* takes care of the bitstream generation
[*] https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
20. Experimental Evaluation
• Benchmarks (8-bit RGB Full-HD images):
– Halide: Threshold, Add Weighted, Erode
– Tiramisu: Scale, Sobel, Gaussian
• Target FPGA: ADM-PCIE-7V3
• We compared against the Vivado HLS Video Library[11]
[11] http://www.wiki.xilinx.com/HLS+Video+Library
21. Input Image Layout
• Vivado HLS Video Library expects input images arranged in a channel-interleaved way
• However, a different layout (i.e. plane-interleaved) could improve the level of parallelism and memory usage
[Figure: channel-interleaved layout, for which the HLS Video Library supports one pixel per clock cycle, vs. plane-interleaved layout, which allows up to N elements per clock cycle]
25. FROST designs 25
• We implemented a streaming architecture, just like
Vivado HLS Video Library does
• We relied on the frontends to enable vectorization (e.g. via loop splitting)
• Image layout:
– Halide:
• channel interleaved (3-element vectorization)
– Tiramisu:
• channel interleaved (3-element vectorization)
• plane interleaved (64-element vectorization)
26. FROST + Halide vs HLS Video Library
[Chart: normalized execution time (log scale) of the HLS Video Library vs. FROST + Halide (channel interleaved) on Threshold, AddWeighted, and Erode; FROST speedups of 1.004x, 1.28x, and 1.005x]
31. Conclusion & Future Work
• FROST is a common backend for hardware acceleration of DSLs on FPGAs
• FROST's high-level scheduling co-language enables different types of optimizations
• As future work, we plan to:
– Implement other scheduling commands
– Feedback loop with Xilinx toolchain
– Integration with more DSLs
E. Del Sozzo, R. Baghdadi, S. Amarasinghe, M. D. Santambrogio
emanuele.delsozzo@polimi.it