A Common Backend for Hardware Acceleration of DSLs on FPGA
E. Del Sozzo (1), R. Baghdadi (2), S. Amarasinghe (2), M. D. Santambrogio (1)
emanuele.delsozzo@polimi.it
(1) Politecnico di Milano
(2) Massachusetts Institute of Technology
Google, Mountain View
05/29/18
Context Definition
• Domain Specific Languages (DSLs) [1,2,3,4] main features:
– Portability
– Productivity
– Specific optimizations
• Many of the computations described by DSLs (stencils, linear algebra, etc.) are ideal candidates for hardware acceleration
[1] Halide - http://halide-lang.org
[2] Tensorflow - https://www.tensorflow.org
[3] Julia - https://julialang.org
[4] Theano - http://deeplearning.net/software/theano/
Hardware Acceleration of DSLs
• FPGAs offer a good trade-off in terms of performance, power consumption, and flexibility compared with GPUs, CPUs, and ASICs
• However, FPGAs have a steep learning curve, and producing efficient designs is complex
• Existing High Level Synthesis (HLS) tools may help, but they provide:
– A low level of abstraction
– Limited support for high level languages
Current DSL to FPGA Workflow
[Figure: flow for a single DSL. DSL 1 frontend (input algorithm code) → translation and optimization passes (Pass 1 … Pass N) → C/C++ or HDL → FPGA synthesis]
[5] Hegarty, James, et al. “Darkroom: compiling high-level image processing code into hardware pipelines.” ACM Trans. Graph. 33.4 (2014): 144-1
[6] Chugh, Nitin, et al. “A DSL compiler for accelerating image processing pipelines on FPGAs.” Parallel Architecture and Compilation Techniques (PACT), 2016
International Conference on. IEEE, 2016.
[7] Stewart, Robert, et al. “A dataflow IR for memory efficient RIPL compilation to FPGAs.” International Conference on Algorithms and Architectures for Parallel
Processing. Springer International Publishing, 2016.
Different flows for different DSLs
Most of the passes are in common
FROST
• A common backend for the hardware acceleration of DSLs on FPGAs
• FROST is designed to express data parallel algorithms operating on dense arrays:
– image processing
– dense linear and tensor algebra
– stencil computations
– etc.
• Unlike similar tools [5-7], FROST uses a high-level scheduling co-language to specify the optimizations
FROST Workflow
[Figure: FROST workflow diagram]
Frontends
• FROST currently supports as frontends:
– Halide
– Tiramisu
• The user can exploit both:
– High level languages and abstractions for designing the algorithm
– Frontend features (e.g. scheduling and optimizations)
• An ad-hoc translator produces FROST IR from the input
[Figure: FROST placed between the frontends (Halide, Tiramisu, Julia, Theano, other DSLs) and the toolchain (Xilinx SDAccel)]
Halide
• Halide [8,9] is a state-of-the-art DSL and compiler for image processing pipelines
• Halide uses a separate scheduling co-language
• The user can optimize the computation at different levels:
– Loop tiling, unrolling, fusion, …
– Parallelization, vectorization, …
– …
[8] Ragan-Kelley, Jonathan, et al. “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.” ACM
SIGPLAN Notices 48.6 (2013)
[9] Tyler Denniston, Shoaib Kamil, and Saman Amarasinghe. “Distributed Halide.” Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming. ACM, 2016.
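As a rough illustration of what a schedule-level transformation such as loop tiling changes, the sketch below writes the same per-pixel computation twice in plain C++: once as a straightforward loop nest and once tiled. It does not use the Halide API; the function names, the brightening operation, and the tile size are all assumptions chosen for the example. The point is only that the algorithm (what is computed) stays fixed while the loop structure (how it is traversed) varies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference loop nest: brighten every pixel of a row-major grayscale image,
// saturating at 255. (Hypothetical example operation.)
std::vector<std::uint8_t> brighten_naive(const std::vector<std::uint8_t>& in,
                                         std::size_t width, std::size_t height,
                                         std::uint8_t delta) {
    assert(in.size() == width * height);
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const int v = in[y * width + x] + delta;
            out[y * width + x] = static_cast<std::uint8_t>(std::min(v, 255));
        }
    }
    return out;
}

// Same computation, traversed tile by tile. Only the iteration order changes;
// edge tiles are clipped so any image size works.
std::vector<std::uint8_t> brighten_tiled(const std::vector<std::uint8_t>& in,
                                         std::size_t width, std::size_t height,
                                         std::uint8_t delta, std::size_t tile) {
    assert(in.size() == width * height);
    assert(tile > 0);
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t yo = 0; yo < height; yo += tile) {
        for (std::size_t xo = 0; xo < width; xo += tile) {
            const std::size_t y_end = std::min(yo + tile, height);
            const std::size_t x_end = std::min(xo + tile, width);
            for (std::size_t y = yo; y < y_end; ++y) {
                for (std::size_t x = xo; x < x_end; ++x) {
                    const int v = in[y * width + x] + delta;
                    out[y * width + x] = static_cast<std::uint8_t>(std::min(v, 255));
                }
            }
        }
    }
    return out;
}
```

With a scheduling co-language, the second loop structure would be requested by a schedule directive rather than rewritten by hand, which is what makes trying several variants cheap.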
Tiramisu
[Figure: Tiramisu overview. A developer or a DSL compiler supplies high level code, which is lowered through four layers:
– Layer I: Abstract Algorithm
– Layer II: Computation Management (automatic or user specified schedules)
– Layer III: Data Management (automatic or user specified data mapping)
– Layer IV: Communication Management (communication, i.e. distribution across nodes)
Code generation: Abstract Syntax Tree
Backends: vectorized parallel x86, GPU (Nvidia), FPGA (Xilinx), ...
Goal: portable performance across a range of platforms]
[10] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe, "Tiramisu: A Code
Optimization Framework for High Performance Systems", arXiv
FROST Scheduling Co-Language
• Generating efficient code for FPGAs is a complex task
• HLS tools provide a set of directives to apply optimizations
• However, most of the optimization work is still up to the designer
• Therefore, evaluating different hardware designs quickly turns into a time-consuming and error-prone task
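To make the notion of HLS directives concrete, here is a minimal sketch of a kernel annotated by hand with Vivado HLS style pragmas (`PIPELINE` and `UNROLL`). The kernel, its name, and the tiny image size are assumptions for illustration, not code from FROST. A regular C++ compiler ignores the unknown pragmas, so the function also runs as ordinary software.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPixels = 8;    // hypothetical tiny image, for illustration only
constexpr std::size_t kChannels = 3;  // RGB, channel interleaved

// Binary threshold over a channel-interleaved buffer.
// The pragmas are hints for an HLS tool; a normal compiler skips them.
void threshold_kernel(const std::uint8_t in[kPixels * kChannels],
                      std::uint8_t out[kPixels * kChannels],
                      std::uint8_t level) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t p = 0; p < kPixels; ++p) {
#pragma HLS PIPELINE II=1
        for (std::size_t c = 0; c < kChannels; ++c) {
#pragma HLS UNROLL
            const std::size_t idx = p * kChannels + c;
            out[idx] = static_cast<std::uint8_t>(in[idx] > level ? 255 : 0);
        }
    }
}
```

Each design variant (a different initiation interval, a different unroll choice) means editing and re-synthesizing code like this by hand, which is the manual effort the slide refers to.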
FROST Scheduling Commands

Command               Description

Computation Scheduling Commands
pipeline(i)           Marks dimension i to be pipelined
unroll(i)             Marks dimension i to be unrolled
flatten(i)            Marks dimension i to be flattened
vectorize(b, n)       Marks buffer b to be vectorized in chunks of n bits

Local Memory Scheduling Commands
partition(b, d, t)    Marks buffer b to be partitioned on dimension d in a t way

Architecture Scheduling Commands
tiled()               Marks the architecture to be implemented in a tiled way
streaming()           Marks the architecture to be implemented in a streaming way
FROST scheduling commands can be combined with the frontends' scheduling commands
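The slides do not show how FROST lowers these commands, so the following is only a guess at a plausible correspondence between some of them and Vivado HLS directive names. The `Command` enum and `directive_for` function are hypothetical and are not part of FROST. The pragma names themselves are existing Vivado HLS directives. `vectorize`, `tiled`, and `streaming` are left out because they do not obviously reduce to a single directive.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration; not FROST's actual API.
enum class Command { Pipeline, Unroll, Flatten, Partition };

// Returns the HLS directive that a command might plausibly lower to.
// This mapping is an assumption, not something the slides state.
std::string directive_for(Command c) {
    switch (c) {
        case Command::Pipeline:  return "#pragma HLS PIPELINE";
        case Command::Unroll:    return "#pragma HLS UNROLL";
        case Command::Flatten:   return "#pragma HLS LOOP_FLATTEN";
        case Command::Partition: return "#pragma HLS ARRAY_PARTITION";
    }
    assert(false && "unhandled command");
    return "";
}
```

Keeping the schedule as data, separate from the algorithm, is what would let a backend regenerate differently annotated C++ without touching the algorithm description.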
Backend
• From the optimized IR, FROST generates C++ code for HLS tools
• This allows a preliminary analysis of the resulting design (e.g. latency, resource usage, …)
• Xilinx SDAccel [*] takes care of the bitstream generation
[*] https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
Experimental Evaluation
• Benchmarks (8-bit RGB Full-HD images):
– Halide: Threshold, Add Weighted, Erode
– Tiramisu: Scale, Sobel, Gaussian
• Target FPGA: ADM-PCIE-7V3
• We compared against the Vivado HLS Video Library [11]
[11] http://www.wiki.xilinx.com/HLS+Video+Library
Input Image Layout
• Vivado HLS Video Library expects input images arranged in a channel interleaved way
• However, a different layout (i.e. plane interleaved) could improve the level of parallelism and memory usage
[Figure: channel interleaved layout (HLS supports one pixel per clock cycle) next to plane interleaved layout (up to N elements per clock cycle)]
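The two layouts can be sketched as a simple reordering. The code below assumes that "plane interleaved" means a planar arrangement in which all values of one channel are stored contiguously (R0 R1 … G0 G1 … B0 B1 …), while "channel interleaved" stores each pixel's channels together (R0 G0 B0 R1 G1 B1 …). That reading, and the function name `to_planar`, are assumptions and are not spelled out in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts a channel-interleaved buffer (R0 G0 B0 R1 G1 B1 ...) into a
// planar one (R0 R1 ... G0 G1 ... B0 B1 ...). Assumed meaning of
// "plane interleaved"; see the text above.
std::vector<std::uint8_t> to_planar(const std::vector<std::uint8_t>& interleaved,
                                    std::size_t channels) {
    assert(channels > 0);
    assert(interleaved.size() % channels == 0);
    const std::size_t pixels = interleaved.size() / channels;
    std::vector<std::uint8_t> planar(interleaved.size());
    for (std::size_t p = 0; p < pixels; ++p) {
        for (std::size_t c = 0; c < channels; ++c) {
            planar[c * pixels + p] = interleaved[p * channels + c];
        }
    }
    return planar;
}
```

In the planar form, consecutive memory elements belong to the same channel, so a wide word can carry many same-channel values at once, whereas the interleaved form naturally groups only the three channels of one pixel.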
FROST designs
• We implemented a streaming architecture, just like Vivado HLS Video Library does
• We relied on the frontends to enable vectorization (e.g. loop splitting)
• Image layout:
– Halide: channel interleaved (3-element vectorization)
– Tiramisu: channel interleaved (3-element vectorization) and plane interleaved (64-element vectorization)
FROST + Halide vs HLS Video Library
[Figure: bar chart of normalized execution time (axis ticks 0.1 and 1) for Threshold, AddWeighted, and Erode, comparing HLS Video Library with FROST + Halide (channel interleaved). Bar labels as extracted: 1.004X, 1.28X, 1.005X]
Resource Usage: Halide Benchmarks

Application    BRAM_18K (2940)   DSP48E (3600)   Flip Flops (866400)   LookUp Tables (433200)

FROST
Threshold      2                 0               814                   1542
AddWeighted    0                 30              5487                  7383
Erode          8                 0               1023                  2012

Vivado HLS Video Library
Threshold      0                 0               307                   1300
AddWeighted    0                 30              6810                  11465
Erode          9                 0               2170                  2828
FROST + Tiramisu vs HLS Video Library
[Figure: bar chart of normalized execution time (axis ticks 0.1 and 1) for Scale, Sobel, and Gaussian, comparing HLS Video Library with FROST + Tiramisu (channel interleaved). Bar labels as extracted: 1.30X, 1.004X, 1.31X]
FROST + Tiramisu vs HLS Video Library
[Figure: bar chart of normalized execution time (axis ticks 0.1 and 1) for Scale, Sobel, and Gaussian, comparing HLS Video Library with FROST + Tiramisu (plane interleaved). Bar labels as extracted: 10.32X, 7.98X, 10.14X]
Resource Usage: Tiramisu Benchmarks

Application    BRAM_18K (2940)   DSP48E (3600)   Flip Flops (866400)   LookUp Tables (433200)

FROST (channel interleaved)
Scale          2                 15              3550                  4803
Sobel          8                 0               1047                  1787
Gaussian       8                 0               997                   1907

FROST (plane interleaved)
Scale          30                320             59642                 73933
Sobel          144               0               3287                  7543
Gaussian       60                0               2757                  7008

Vivado HLS Video Library
Scale          0                 15              4828                  8792
Sobel          9                 0               2220                  3255
Gaussian       9                 12              2560                  3355
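To read the raw counts as a share of the device, one can divide by the parenthesized numbers in the column headers, assuming (as seems intended) that they are the totals available on the target FPGA. The helper below is a small sketch of that arithmetic; the function name is hypothetical.

```cpp
#include <cassert>

// Share of a resource used, in percent. `available` is assumed to be the
// parenthesized total from the table header.
double utilization_percent(long used, long available) {
    assert(available > 0);
    assert(used >= 0);
    return 100.0 * static_cast<double>(used) / static_cast<double>(available);
}
```

For example, the plane interleaved Scale design, the largest in the table, uses 320 of 3600 DSP48E slices (about 8.9%) and 73933 of 433200 LookUp Tables (about 17.1%) under that assumption.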
Conclusion & Future Work
• FROST is a common backend for hardware acceleration of DSLs on FPGAs
• FROST's high-level scheduling co-language enables different types of optimizations
• As future work, we plan to:
– Implement other scheduling commands
– Add a feedback loop with the Xilinx toolchain
– Integrate with more DSLs
E. Del Sozzo, R. Baghdadi, S. Amarasinghe, M. D. Santambrogio
emanuele.delsozzo@polimi.it

A Common Backend for Hardware Acceleration of DSLs on FPGA

  • 1. A Common Backend for Hardware Acceleration of DSLs on FPGA E. Del Sozzo1, R. Baghdadi2, S. Amarasinghe2, M. D. Santambrogio1 emanuele.delsozzo@polimi.it 1 Politecnico di Milano 2 Massachusetts Institute of Technology Google, Mountain View 05/29/18
  • 2. Context Definition 2 • Domain Specific Languages (DSLs)1,2,3,4 main features: Portability Productivity Specific optimizations • Many of the computations described by DSLs (stencils, linear algebra, etc.) are ideal candidates for hardware acceleration [1] Halide - http://halide-lang.org [2] Tensorflow - https://www.tensorflow.org [3] Julia - https://julialang.org [4] Theano - http://deeplearning.net/software/theano/
  • 3. Hardware Acceleration of DSLs 3 • FPGAs offer a good trade-off in terms of performance, power consumption, and flexibility w.r.t. GPUs, CPUs, and ASICs • However, FPGAs have a steep learning curve and it is complex to produce efficient designs • Existing High Level Synthesis (HLS) tools may help, but they provide: – Low level of abstractions – Limited support for high level languages
  • 4. Current DSL to FPGA Workflow 4 Translation and Optimization passes FPGA synthesisDSL 1 Frontend C/C++C/C++HDL Code Input Algorithm Input AlgorithmCode Pass 1 Pass N … [5] Hegarty, James, et al. “Darkroom: compiling high-level image processing code into hardware pipelines.” ACM Trans. Graph. 33.4 (2014): 144-1 [6] Chugh, Nitin, et al. “A DSL compiler for accelerating image processing pipelines on FPGAs.” Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 2016. [7] Stewart, Robert, et al. “A dataflow IR for memory efficient RIPL compilation to FPGAs.” International Conference on Algorithms and Architectures for Parallel Processing. Springer International Publishing, 2016.
  • 5. Current DSL to FPGA Workflow 5 Translation and Optimization passes FPGA synthesisDSL 1 Frontend C/C++C/C++HDL Code Input Algorithm Input AlgorithmCode Pass 1 Pass N … [5] Hegarty, James, et al. “Darkroom: compiling high-level image processing code into hardware pipelines.” ACM Trans. Graph. 33.4 (2014): 144-1 [6] Chugh, Nitin, et al. “A DSL compiler for accelerating image processing pipelines on FPGAs.” Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 2016. [7] Stewart, Robert, et al. “A dataflow IR for memory efficient RIPL compilation to FPGAs.” International Conference on Algorithms and Architectures for Parallel Processing. Springer International Publishing, 2016.
  • 7. Current DSL to FPGA Workflow
      • Different flows for different DSLs [5-7]
      • Most of the passes are in common
  • 8. FROST
      • A common backend for the hardware acceleration of DSLs on FPGAs
      • FROST is designed to express data parallel algorithms operating on dense arrays:
          – image processing
          – dense linear and tensor algebra
          – stencil computations
          – etc.
      • Unlike other similar tools [5-7], FROST leverages a high-level scheduling co-language to specify the optimizations
  • 10. Frontends
      • FROST currently supports as frontends:
          – Halide
          – Tiramisu
      • The user can exploit both:
          – High-level languages and abstractions for designing the algorithm
          – Frontend features (e.g. scheduling and optimizations)
      • An ad-hoc translator produces FROST IR from the input
      [Figure: frontends (Halide, Tiramisu, Julia, Theano, other DSLs), FROST, and the Xilinx SDAccel toolchain]
  • 11. Halide
      • Halide [8,9] is a state-of-the-art DSL and compiler for image processing pipelines
      • Halide exploits a separate scheduling co-language
      • The user can optimize the computation at different levels:
          – Loop tiling, unrolling, fusion, …
          – Parallelization, vectorization, …
          – …
      [8] Ragan-Kelley, Jonathan, et al. “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.” ACM SIGPLAN Notices 48.6 (2013)
      [9] Denniston, Tyler, Shoaib Kamil, and Saman Amarasinghe. “Distributed Halide.” Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2016.
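The idea of a separate scheduling co-language can be illustrated without Halide itself. The sketch below is plain C++, not Halide's API: the function names, the `brighten` operation and the tile size are all hypothetical. It keeps one "algorithm" (a per-pixel operation) and runs it under two different "schedules" (row-major and tiled traversal), which must produce the same image.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Algorithm: add a constant to a pixel, clamped to the 8-bit range.
inline std::uint8_t brighten(std::uint8_t p, int delta) {
    int v = static_cast<int>(p) + delta;
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return static_cast<std::uint8_t>(v);
}

// Schedule A: plain row-major traversal of a width x height image.
std::vector<std::uint8_t> run_row_major(const std::vector<std::uint8_t>& in,
                                        std::size_t width, std::size_t height,
                                        int delta) {
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[y * width + x] = brighten(in[y * width + x], delta);
    return out;
}

// Schedule B: the same algorithm, visited in tile x tile blocks (loop tiling).
std::vector<std::uint8_t> run_tiled(const std::vector<std::uint8_t>& in,
                                    std::size_t width, std::size_t height,
                                    int delta, std::size_t tile) {
    assert(tile > 0);
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t ty = 0; ty < height; ty += tile)
        for (std::size_t tx = 0; tx < width; tx += tile)
            // Clip the tile at the image border.
            for (std::size_t y = ty; y < std::min(ty + tile, height); ++y)
                for (std::size_t x = tx; x < std::min(tx + tile, width); ++x)
                    out[y * width + x] = brighten(in[y * width + x], delta);
    return out;
}
```

Changing the schedule only changes the order in which pixels are visited, so both functions return identical images; a scheduling language lets the user make that choice without rewriting the algorithm.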
  • 12. Tiramisu
      [Figure: Tiramisu overview. High-level code from a developer or a DSL compiler goes through four layers and then code generation]
          – Layer I: Abstract Algorithm
          – Layer II: Computation Management (automatic or user-specified schedules)
          – Layer III: Data Management (automatic or user-specified data mapping)
          – Layer IV: Communication Management
          – Code generation: Abstract Syntax Tree
          – Tiramisu backends: FPGA (Xilinx), communication (distribution across nodes), vectorized parallel x86, GPU (Nvidia), ...
          – Portable performance across a range of platforms
      [10] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe, "Tiramisu: A Code Optimization Framework for High Performance Systems", arXiv
  • 14. FROST Scheduling Co-Language
      • Generating efficient code for FPGA is a complex task
      • HLS tools provide a set of directives to apply optimizations
      • However, most of the optimization work is still up to the designer
      • Therefore, evaluating different hardware designs quickly turns into a time-consuming and error-prone task
  • 15. FROST Scheduling Commands
      Command               Description
      Computation Scheduling Commands
        pipeline(i)         Marks dimension i to be pipelined
        unroll(i)           Marks dimension i to be unrolled
        flatten(i)          Marks dimension i to be flattened
        vectorize(b, n)     Marks buffer b to be vectorized in chunks of n bits
      Local Memory Scheduling Commands
        partition(b, d, t)  Marks buffer b to be partitioned on dimension d in a t way
      Architecture Scheduling Commands
        tiled()             Marks the architecture to be implemented in a tiled way
        streaming()         Marks the architecture to be implemented in a streaming way
  • 18. FROST Scheduling Commands
      • FROST scheduling commands can be combined with the frontends' scheduling commands
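The slides do not show the code FROST emits, so the following is only a sketch of what the commands above might correspond to in HLS C++. It assumes that pipeline, unroll and partition map onto the Vivado HLS directives `PIPELINE`, `UNROLL` and `ARRAY_PARTITION`; the kernel (`moving_sum`) and its constants are hypothetical. A standard C++ compiler ignores the `#pragma HLS` lines (possibly with a warning), so the function can also be checked on a CPU.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kTaps = 4;    // hypothetical window size
constexpr int kWidth = 16;  // hypothetical row length

// Sum of the last kTaps samples of a row (zero padded at the start).
void moving_sum(const std::uint8_t* in, std::uint16_t* out) {
    assert(in != nullptr && out != nullptr);
    std::uint8_t window[kTaps] = {0, 0, 0, 0};
// Roughly what partition(window, 1, complete) could lower to (assumption).
#pragma HLS ARRAY_PARTITION variable=window complete dim=1
    for (int x = 0; x < kWidth; ++x) {
// Roughly what pipeline(x) could lower to (assumption).
#pragma HLS PIPELINE II=1
        // Shift the window by one element.
        for (int t = 0; t < kTaps - 1; ++t) {
// Roughly what unroll(t) could lower to (assumption).
#pragma HLS UNROLL
            window[t] = window[t + 1];
        }
        window[kTaps - 1] = in[x];
        std::uint16_t acc = 0;
        for (int t = 0; t < kTaps; ++t) {
#pragma HLS UNROLL
            acc = static_cast<std::uint16_t>(acc + window[t]);
        }
        out[x] = acc;
    }
}
```

Placing these directives by hand for every design variant is the manual work that a scheduling co-language is meant to replace with one command per optimization.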
  • 19. Backend
      • From the optimized IR, FROST generates C++ code for HLS tools
      • This allows a preliminary analysis of the resulting design (e.g. latency, resource usage, …)
      • Xilinx SDAccel [*] takes care of the bitstream generation
      [*] https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
  • 20. Experimental Evaluation
      • Benchmarks (8-bit RGB Full-HD images):
          – Halide: Threshold, Add Weighted, Erode
          – Tiramisu: Scale, Sobel, Gaussian
      • Target FPGA: ADM-PCIE-7V3
      • We compared against the Vivado HLS Video Library [11]
      [11] http://www.wiki.xilinx.com/HLS+Video+Library
  • 24. Input Image Layout
      • Vivado HLS Video Library expects input images arranged in a channel interleaved way
      • However, a different layout (i.e. plane interleaved) could improve the level of parallelism and memory usage
      [Figure: the two layouts side by side]
          – Channel interleaved: HLS supports one pixel per clock cycle
          – Plane interleaved: up to N elements per clock cycle
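A small sketch of the two layouts, under the assumption that "plane interleaved" means each channel is stored as its own contiguous plane (all R values, then all G, then all B), while "channel interleaved" stores R, G, B of one pixel next to each other. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a channel-interleaved buffer (R0 G0 B0 R1 G1 B1 ...) into
// per-channel planes (R0 R1 ... G0 G1 ... B0 B1 ...).
// Interleaved index: p * channels + c; plane index: c * pixels + p.
std::vector<std::uint8_t> to_plane_layout(
        const std::vector<std::uint8_t>& interleaved,
        std::size_t pixels, std::size_t channels) {
    assert(interleaved.size() == pixels * channels);
    std::vector<std::uint8_t> planes(interleaved.size());
    for (std::size_t p = 0; p < pixels; ++p)
        for (std::size_t c = 0; c < channels; ++c)
            planes[c * pixels + p] = interleaved[p * channels + c];
    return planes;
}
```

With planes, neighbouring memory elements belong to the same channel, so a wide memory word holds many independent elements of one plane rather than a single pixel's three channels.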
  • 25. FROST designs
      • We implemented a streaming architecture, just like Vivado HLS Video Library does
      • We relied on the frontends to enable vectorization (e.g. loop splitting)
      • Image layout:
          – Halide: channel interleaved (3-element vectorization)
          – Tiramisu: channel interleaved (3-element vectorization) and plane interleaved (64-element vectorization)
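Loop splitting, mentioned above as the way vectorization is enabled, can be sketched in plain C++ as follows. This is an illustration under assumptions, not the code produced by Halide, Tiramisu or FROST: the threshold kernel, the function name and the remainder handling are hypothetical, and the lane count of 64 simply mirrors the 64-element vectorization on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 64;  // elements processed per chunk

// Threshold one image plane. The single loop over n elements is split into
// an outer loop over chunks and an inner loop over kLanes lanes; the inner
// loop has a fixed trip count, which is what a vectorizer or HLS tool needs.
void threshold_split(const std::vector<std::uint8_t>& in,
                     std::vector<std::uint8_t>& out, std::uint8_t level) {
    assert(out.size() == in.size());
    const std::size_t n = in.size();
    const std::size_t full = n / kLanes * kLanes;  // part covered by whole chunks
    for (std::size_t base = 0; base < full; base += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[base + lane] = in[base + lane] > level ? 255 : 0;
        }
    }
    // Remainder elements that do not fill a whole chunk.
    for (std::size_t i = full; i < n; ++i) {
        out[i] = in[i] > level ? 255 : 0;
    }
}
```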
  • 26. FROST + Halide vs HLS Video Library
      [Figure: normalized execution time per application, HLS Video Library vs FROST + Halide (channel interleaved)]
      Values labeled on the chart: Threshold 1.004X, AddWeighted 1.28X, Erode 1.005X
  • 27. Resource Usage: Halide Benchmarks
      Application      BRAM_18K (2940)   DSP48E (3600)   Flip Flops (866400)   LookUp Tables (433200)
      FROST
        Threshold      2                 0               814                   1542
        AddWeighted    0                 30              5487                  7383
        Erode          8                 0               1023                  2012
      Vivado HLS Video Library
        Threshold      0                 0               307                   1300
        AddWeighted    0                 30              6810                  11465
        Erode          9                 0               2170                  2828
  • 28. FROST + Tiramisu vs HLS Video Library
      [Figure: normalized execution time per application, HLS Video Library vs FROST + Tiramisu (channel interleaved)]
      Values labeled on the chart: Scale 1.30X, Sobel 1.004X, Gaussian 1.31X
  • 29. FROST + Tiramisu vs HLS Video Library
      [Figure: normalized execution time per application, HLS Video Library vs FROST + Tiramisu (plane interleaved)]
      Values labeled on the chart: Scale 10.32X, Sobel 7.98X, Gaussian 10.14X
  • 30. Resource Usage: Tiramisu Benchmarks
      Application      BRAM_18K (2940)   DSP48E (3600)   Flip Flops (866400)   LookUp Tables (433200)
      FROST (channel interleaved)
        Scale          2                 15              3550                  4803
        Sobel          8                 0               1047                  1787
        Gaussian       8                 0               997                   1907
      FROST (plane interleaved)
        Scale          30                320             59642                 73933
        Sobel          144               0               3287                  7543
        Gaussian       60                0               2757                  7008
      Vivado HLS Video Library
        Scale          0                 15              4828                  8792
        Sobel          9                 0               2220                  3255
        Gaussian       9                 12              2560                  3355
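To put the table in proportion, the sketch below converts raw counts into a percentage of the device. It assumes that the numbers in parentheses in the column headers are the totals available on the target FPGA; the struct and function names are hypothetical.

```cpp
#include <cassert>

struct Resources {
    double bram18k;
    double dsp48e;
    double flip_flops;
    double luts;
};

// Totals taken from the table headers (assumed to be the device capacity).
constexpr Resources kAvailable{2940.0, 3600.0, 866400.0, 433200.0};

// Share of one resource type in use, as a percentage.
double percent(double used, double available) {
    assert(available > 0.0);
    return 100.0 * used / available;
}

// Utilization of every resource type for one design.
Resources utilization_percent(const Resources& used) {
    return Resources{percent(used.bram18k, kAvailable.bram18k),
                     percent(used.dsp48e, kAvailable.dsp48e),
                     percent(used.flip_flops, kAvailable.flip_flops),
                     percent(used.luts, kAvailable.luts)};
}
```

For example, passing the plane interleaved Scale row, `Resources{30, 320, 59642, 73933}`, gives roughly 1% of BRAM, 9% of DSPs, 7% of flip flops and 17% of LUTs under that assumption.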
  • 31. Conclusion & Future Work
      • FROST is a common backend for hardware acceleration of DSLs on FPGAs
      • FROST's high-level scheduling co-language enables different types of optimizations
      • As future work, we plan to:
          – Implement other scheduling commands
          – Add a feedback loop with the Xilinx toolchain
          – Integrate with more DSLs
      E. Del Sozzo, R. Baghdadi, S. Amarasinghe, M. D. Santambrogio
      emanuele.delsozzo@polimi.it