Field Programmable Gate Arrays (FPGAs) are configurable integrated circuits that provide a good trade-off in terms of performance, power consumption, and flexibility with respect to other architectures, such as CPUs, GPUs, and ASICs. The main drawback of using FPGAs, however, is their steep learning curve. An emerging solution to this problem is to write algorithms in a Domain Specific Language (DSL) and let the DSL compiler generate efficient code targeting FPGAs. This work proposes FROST, a unified backend that enables different DSL compilers to target FPGA architectures. Unlike other code generation frameworks targeting FPGAs, FROST exploits a scheduling co-language that gives users full control over which optimizations to apply in order to generate efficient code (e.g. loop pipelining, array partitioning, vectorization). FROST first analyzes and manipulates the input Abstract Syntax Tree (AST) to apply FPGA-oriented transformations and optimizations, then generates a C/C++ implementation suitable for High-Level Synthesis (HLS) tools. Finally, the output of the HLS phase is synthesized and implemented on the target FPGA using the Xilinx SDAccel toolchain.
FROST currently supports Halide, an image processing DSL, and Tiramisu, a DSL optimizer, as frontends, and achieves significant speedups with respect to state-of-the-art FPGA implementations of the same algorithms.
A Common Backend for Hardware Acceleration of DSLs on FPGA
1. A Common Backend for Hardware Acceleration of DSLs on FPGA
E. Del Sozzo1, R. Baghdadi2, S. Amarasinghe2, M. D. Santambrogio1
emanuele.delsozzo@polimi.it
1 Politecnico di Milano
2 Massachusetts Institute of Technology
Google, Mountain View
05/29/18
2. Context Definition
• Domain Specific Languages (DSLs)[1,2,3,4] main features:
– Portability
– Productivity
– Specific optimizations
• Many of the computations described by DSLs (stencils, linear algebra, etc.) are ideal candidates for hardware acceleration
[1] Halide - http://halide-lang.org
[2] Tensorflow - https://www.tensorflow.org
[3] Julia - https://julialang.org
[4] Theano - http://deeplearning.net/software/theano/
3. Hardware Acceleration of DSLs
• FPGAs offer a good trade-off in terms of performance,
power consumption, and flexibility w.r.t. GPUs, CPUs,
and ASICs
• However, FPGAs have a steep learning curve and it is
complex to produce efficient designs
• Existing High-Level Synthesis (HLS) tools may help, but they provide:
– A low level of abstraction
– Limited support for high-level languages
4. Current DSL to FPGA Workflow
[Figure: a DSL frontend takes the input algorithm through translation and optimization passes (Pass 1 … Pass N), generates C/C++ code, lowers it to HDL, and finishes with FPGA synthesis]
[5] Hegarty, James, et al. “Darkroom: compiling high-level image processing code into hardware pipelines.” ACM Trans. Graph. 33.4 (2014): 144-1
[6] Chugh, Nitin, et al. “A DSL compiler for accelerating image processing pipelines on FPGAs.” Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 2016.
[7] Stewart, Robert, et al. “A dataflow IR for memory efficient RIPL compilation to FPGAs.” International Conference on Algorithms and Architectures for Parallel Processing. Springer International Publishing, 2016.
6. Current DSL to FPGA Workflow
Different flows for different DSLs
7. Current DSL to FPGA Workflow
Different flows for different DSLs, but most of the passes are in common
8. FROST
• A common backend for the hardware acceleration of
DSLs on FPGAs
• FROST is designed to express data parallel algorithms
operating on dense arrays:
– image processing
– dense linear and tensor algebra
– stencil computations
– etc.
• Unlike other similar tools[5-7], FROST leverages a high-level scheduling co-language to specify the optimizations
10. Frontends
• FROST currently supports as frontends:
– Halide
– Tiramisu
• The user can exploit both:
– High-level languages and abstractions for designing the algorithm
– Frontend features (e.g. scheduling and optimizations)
• An ad-hoc translator produces FROST IR from the input
[Figure: frontends (Halide, Tiramisu, Julia, Theano, other DSLs) feed into FROST, which targets the Xilinx SDAccel toolchain]
11. Halide
• Halide[8,9] is a state-of-the-art DSL and compiler for
image processing pipelines
• Halide exploits a separate scheduling co-language
• The user can optimize the computation at different
levels:
– Loop tiling, unrolling, fusion, …
– Parallelization, vectorization, …
– …
[8] Ragan-Kelley, Jonathan, et al. “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.” ACM
SIGPLAN Notices 48.6 (2013)
[9] Tyler Denniston, Shoaib Kamil, and Saman Amarasinghe. “Distributed Halide.” Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming. ACM, 2016.
12. Tiramisu
• Tiramisu[10] is organized in four layers:
– Layer I: Abstract Algorithm (written by the developer; portable performance across a range of platforms)
– Layer II: Computation Management (automatic or user-specified schedules)
– Layer III: Data Management (automatic or user-specified data mapping)
– Layer IV: Communication Management (distribution across nodes)
• Code generation lowers the layers to an Abstract Syntax Tree, from which Tiramisu targets multiple backends: vectorized parallel x86, GPU (Nvidia), FPGA (Xilinx), …
[10] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe. “Tiramisu: A Code Optimization Framework for High Performance Systems.” arXiv
14. FROST Scheduling Co-Language
• Generating efficient code for FPGAs is a complex task
• HLS tools provide a set of directives to apply optimizations
• However, most of the optimization work is still up to the designer
• Therefore, evaluating different hardware designs quickly turns into a
time-consuming and error-prone task
15. FROST Scheduling Commands
Computation Scheduling Commands
– pipeline(i): marks dimension i to be pipelined
– unroll(i): marks dimension i to be unrolled
– flatten(i): marks dimension i to be flattened
– vectorize(b, n): marks buffer b to be vectorized in chunks of n bits
Local Memory Scheduling Commands
– partition(b, d, t): marks buffer b to be partitioned along dimension d using partitioning type t
Architecture Scheduling Commands
– tiled(): marks the architecture to be implemented in a tiled way
– streaming(): marks the architecture to be implemented in a streaming way
FROST scheduling commands can be combined with the frontends' scheduling commands
19. Backend
• From the optimized IR, FROST generates C++ code for HLS tools
• This allows a preliminary analysis of the resulting design (e.g. latency, resource usage, …)
• Xilinx SDAccel* takes care of the bitstream generation
[*] https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
20. Experimental Evaluation
• Benchmarks (8-bit RGB Full-HD images):
– Halide: Threshold, Add Weighted, Erode
– Tiramisu: Scale, Sobel, Gaussian
• Target FPGA: ADM-PCIE-7V3
• We compared against the Vivado HLS Video Library[11]
[11] http://www.wiki.xilinx.com/HLS+Video+Library
21. Input Image Layout
• Vivado HLS Video Library expects input images arranged in a channel-interleaved way
• However, a different layout (i.e. plane-interleaved) could improve the level of parallelism and memory usage
[Figure: channel-interleaved layout, for which the HLS Video Library supports one pixel per clock cycle, vs. plane-interleaved layout, which allows up to N elements per clock cycle]
25. FROST designs 25
• We implemented a streaming architecture, just like
Vivado HLS Video Library does
• We relied on the frontends to enable vectorization (e.g. via loop splitting)
• Image layout:
– Halide:
• channel interleaved (3-element vectorization)
– Tiramisu:
• channel interleaved (3-element vectorization)
• plane interleaved (64-element vectorization)
26. FROST + Halide vs HLS Video Library
[Chart: normalized execution time (log scale) of the HLS Video Library vs. FROST + Halide (channel interleaved) on Threshold, AddWeighted, and Erode; FROST speedups of 1.004x, 1.28x, and 1.005x]
31. Conclusion & Future Work
• FROST is a common backend for hardware acceleration of DSLs on FPGAs
• FROST's high-level scheduling co-language enables different types of optimizations
• As future work, we plan to:
– Implement other scheduling commands
– Feedback loop with Xilinx toolchain
– Integration with more DSLs
E. Del Sozzo, R. Baghdadi, S. Amarasinghe, M. D. Santambrogio
emanuele.delsozzo@polimi.it