High Level Synthesis Framework For a Coarse
Grain Reconfigurable Architecture
Omer Malik, Ahmed Hemani and Muhammad Ali Shami
Dept. of Electronic Systems, School of ICT
Royal Institute of Technology, KTH, Stockholm, Sweden
Email: {omerm, hemani, shami}@kth.se
978-1-4244-8971-8/10$26.00 © 2010 IEEE
Abstract: A High Level Synthesis framework for mapping DSP algorithms onto a Coarse Grain Reconfigurable Architecture is presented. The behavioral specification of the algorithm is written in C and annotated with pragmas in comments, and the tool generates configware after performing timing and synchronization synthesis. The pragmas identify SIMD-type concurrency and sweep the architectural space with allocation and binding annotations to produce implementations ranging from fully serial to fully parallel. This allows the user to stay at the algorithmic level and guide the HLS tool to search a restricted architectural space bounded by the pragmas, thus making the synthesis process more efficient and predictable.
Index Terms: High level synthesis; CGRA; Symbolic Assembler; High Level Language
I. INTRODUCTION
DRRA (Dynamically Reconfigurable Resource Array) is a CGRA (Coarse Grain Reconfigurable Architecture) for implementing DSP applications. DRRA offers DLP (Data Level Parallelism), where a large set of data is processed by the same set of instructions in parallel threads, as in SIMD (Single Instruction Multiple Data). DRRA also allows MIMD (Multiple Instructions Multiple Data), where multiple different SIMD clusters operate in parallel. Modems and codecs represent the physical layer in the ISO (International Organization for Standardization) 7-layer model, and the functions used in these applications are DSP functions characterized by a high degree of regularity. Due to their regular structure, the computation can be divided into threads of data-parallel tasks, and we call this a pattern in a DSP function.
VESYLA (VEctorizing SYmbolic Language Assembler) is a semi-automatic framework for implementing DSP functions on DRRA. VESYLA takes an untimed C specification of a DSP function and generates configware for DRRA after performing timing and synchronization synthesis, an activity that is cumbersome, time consuming and error prone. The developer makes and explicitly expresses critical implementation decisions on how many resources are allocated and how the operators and operands are mapped to the allocated resources. These activities are very much like HLS (High Level Synthesis), but in this case the VESYLA user knows the targeted RTL structure and guides the tool towards it with the allocation and binding pragmas, allocation and binding being well-known concepts from the HLS domain. VESYLA performs the scheduling, synthesizes the control and does the code generation. The algorithm developer guides VESYLA towards a specific architectural style by using VESYLA pragmas. A small change in these pragmas results in a different architectural implementation, and this change can also manipulate the serial/parallel structure of the implementation; thus the user is always in full control and can capture the architectural space effectively. These properties make VESYLA an interactive design tool that utilizes the human developer’s guidance and yet follows the “push button” methodology to generate the RTL implementation.
The main contributions of this paper are: (a) The design space is easily explorable and various architectural solutions can be implemented with minimum effort. (b) VESYLA hides unnecessary low-level details from the user by allowing the user to work at a higher abstraction level, which results in fewer chances of making mistakes and a reduced design time. (c) The developer can easily exploit the available parallelization options using pragmas. (d) A controlled HLS approach that would result in an optimal solution (discussed in the “Related Work” section).
Section II presents the related work, Section III outlines the DRRA fabric, Section IV describes VESYLA, Section V consists of experimental results and Section VI presents the conclusion and future work.
II. RELATED WORK
In this section we review some industry-standard HLS tools, followed by a few schemes for mapping algorithms onto CGRAs.
GAUT [5] is an open source HLS tool for DSP applications, which takes input in bit-accurate C format along with some design constraints. The C specification is converted into a DFG (Data Flow Graph) for extracting potential parallelism and data dependencies. Lastly, it generates the RTL after going through the allocation, scheduling and binding tasks.
[3] is an automatic synthesis tool which accepts code described in ANSI C++ and a few synthesis constraints which are used to explore the design space. The designer can guide the synthesis procedure towards an optimal solution. The tool generates RTL suitable for the targeted hardware.
[4] takes SystemC modules as input and produces optimized RTL for a specified target technology identified by the user in the form of a .lib file.
[6] is based on a design environment where the designer controls the HLS and can change the synthesis decisions about scheduling, allocation and binding by using a GUI (Graphical User Interface) at any stage.
DRESC [8] focuses on loop-level parallelization for different segments of application code and maps them onto a CGRA using modulo scheduling algorithms to achieve ILP (Instruction Level Parallelism) for optimal performance.
[7] is based on mapping the hyperops obtained from a DFG onto the CGRA. Code acceleration is achieved by run-time reconfiguration of the CGRA to accommodate these hyperops.
[9] uses the SUIF compiler framework for partitioning the input code with respect to the available resources, and the Native Mapping Language generates XPP’s PEs, which are automatically placed by their tool.
Our approach is similar to a traditional HLS tool, but in our case the search space is significantly reduced by explicitly identifying architectural elements with the help of pragmas. Automating the synthesis procedure is therefore much easier than in traditional HLS tools, which face an extremely large search space. Another key differentiator is that VESYLA generates distributed control, i.e., it generates multiple FSMs implemented in the micro-coded sequencer, with each thread having its own control and being able to execute independently or synchronize with the others if required. Traditional HLS techniques generate a thread as a single control scheme or a single FSM.
III. DRRA ARCHITECTURE
This section briefly describes the DRRA architecture. Figure 1 shows an instance of a 7x2 DRRA fabric. Every DRRA cell consists of the following components:
a. LATAAs (Logic And morphable daTA path unit) are 16-bit data path units which consist of: 1) a Logic Partition to deal with logical functions, and 2) an Arithmetic Partition to implement commonly used signal processing operations such as MAC, symmetric FIR, FFT butterfly, sum of difference, difference of sum, 4/2-input add/subtract, etc. [2]. The LATAA (also referred to as the mDPU) also has a 17-bit counter and a 13-bit status register. The values of the status registers are read by micro-coded state machines, which take decisions accordingly.
b. REFIs (REgister FIle) are 64-word, 16-bit register files with 2 read and 2 write ports. A REFI has an AGU (Address Generation Unit) that can generate addresses in vector mode, in circular buffer mode and in bit-reversing mode for FFT. These modes can execute once, a limited number of times or in an infinite loop, using an arbitrary initial delay, a middle delay between each read/write and an end delay before the loop iterates.
c. MAUSEEQAARs (Micro-coded hierarchical sequencing machine) control all the resources in a DRRA cell. A MAUSEEQAAR sends instructions to the AGUs of the register file, selects the LATAA modes, and configures the interconnects for proper operation. MAUSEEQAARs can send output signals to each other for control communication. They can also receive status bits from the mDPU. The MAUSEEQAAR has configurable interrupt handling capabilities, which allow the user to configure the inputs from the LATAA or inputs from other MAUSEEQAARs as interrupts.
d. SILSILAY (a seamless, sliding-window, circuit-switched interconnect fabric) is a regular, non-blocking, point-to-point and point-to-multipoint, low-latency interconnection network with sliding-window connectivity, which allows arbitrary parallelism among large subsystems. The DRRA interconnect fabric allows every resource to send output to, and receive input from, any other resource in its own column and 3 columns on each side [1]; i.e., every LATAA and REFI is connected to 28 LATAAs and REFIs, including itself.
Figure 1. The DRRA PHY Layer Fabric Fragment
IV. VESYLA
VESYLA is based on a subset of the C language with some restrictions imposed; for example, the user is not allowed to use the FILE handling features, the creation of dynamic data structures, or the dynamic functions available in C. The operators that are allowed in VESYLA are also restricted to those that correspond to the instructions offered by the DRRA LATAA. As the instructions of DRRA’s LATAA already correspond to typical DSP operations such as MAC and butterfly, and as the objective is to specify the RTL implementation of an algorithm and not the algorithm itself, this restriction is natural and not intrusive; the implementation in any case requires composing the algorithms in terms of DRRA operations.
Information about allocation and binding is provided to VESYLA with the help of pragmas, which are essentially directives that guide VESYLA in generating a specific RTL architectural style. These pragmas take the form of parameters and constraints; after analyzing them, VESYLA decides how operands and operators are mapped to the DRRA resources involved in the implementation. The topological relationship of the DRRA resources is also part of the mapping specification in the pragmas. With the help of these pragmas, the user specifies a
pattern of a particular implementation. This pattern sweeps the implementation space in terms of the degree of parallelism, from fully serial to fully parallel, and the number of SIMD/MIMD threads.
These pragmas are categorized as dimensional and positional generics. a) Dimensional generics identify the dimension of the problem and of the architecture. For example, an FFT can be characterized in multiple ways by setting its parameters to match the dimension required by a specific context (WLAN would need a 64-point FFT, whereas DVB would need a 4096-point FFT). Similarly, one has to decide the specific micro-architecture for that implementation (radix-2/4/mixed, the number of butterflies, etc.). Lastly, the number of resources to be used should also be taken into consideration. b) Positional generics decide the exact location, or range of locations, to be used in the DRRA fabric for that particular implementation (discussed later in this section).
Figure 2 shows the complete flow of the VESYLA HLS framework. The developer provides the behavioral specification as an untimed C model and is responsible for defining the allocation and binding constraints. VESYLA analyzes each statement of the code and builds a CDFG (Control Data Flow Graph). VESYLA then performs an extensive DDA (Data Dependency Analysis) and RDA (Resource Dependency Analysis) phase using the scheduling information implied by the relative ordering of the statements.
Figure 2. VESYLA Framework
VESYLA resolves dependencies while considering the allocation/binding pragmas in the resource/data dependency analysis phase and makes the critical decision of which sections of the algorithm/code can be executed in parallel and which parts have to be sequentialized. After this step, VESYLA schedules the sequences of operations and synchronizes the execution by calculating the exact number of cycles required for each operation. This synchronization is achieved by inserting wait instructions and by using the delays provided by the AGU and the counters present in the LATAAs. VESYLA resolves all these issues (related to synchronization of statements and dependencies) and generates configware for programming the MAUSEEQAARs; the configware contains the instructions for configuring the REFIs, LATAAs and SILSILAY in their respective modes.
Consider the VESYLA code in Figure 3, which generates an asymmetric FIR filter with N taps. This small piece of code can generate an Nth-order filter with a degree of parallelism M. A fully parallel implementation is achieved when M=N, the filter is fully serial when M=1, and anything in between is partially parallel.
Figure 3. VESYLA pattern for Asymmetric FIR filters
VESYLA code is structured in two parts: the first part consists of declaration statements and the second part consists of functional statements.
A. Declarative Statements
Statements in the declarative section involve the pragmas and the resources to be used.
1) Generics Declaration: Statement 1 declares the VESYLA generics. The usage of generics N and M has already been discussed above. Generics r and c specify the row and column indices in the DRRA fabric. All indices of DRRA resources in this pattern are specified relative to r and c. Statement 3 identifies the number of columns to be used in the DRRA architecture.
2) Resource Declaration: These statements allocate and bind operands and operations using pragmas. The “_REFI” and “_LATAA” pragmas identify the location and number of resources to be used in the DRRA fabric. Statements 4 to 8 are resource declaration statements. Statements 4, 5 and 6 use the “_REFI” pragma to specify that x (the delay line) and c (the coefficients) are to be placed in row r and will occupy columns c to c+M-1; i.e., x and c are distributed across M REFIs, and the distribution is as equitable as possible. Statement 11 specifies the number of LATAAs (mDPUs) required in the DRRA fabric. The generic r and the constant colRange indicate the exact location of the mDPUs in the DRRA fabric. This
scheme allows a) serial/parallel trade-offs to be captured in terms of generics like N and M and b) a specific implementation to be positioned anywhere in the fabric for an arbitrary value of r (and c).
B. Functional Statements
Functional statements consist of pre-defined functions for performing various operations. Statements 12 to 15 are the functional statements of VESYLA. Statement 12 computes the convolution sum using M MACs, and the values are held in the intermediate variables Lout, which are fed to an adder tree that sums up these values. Statement 15 implements a shift line for shifting the samples in the REFIs when the new sample X0 arrives. vAsymMac is a pre-defined function that performs the asymmetric MAC operations on a vector or slice of max k size created by the pre-defined (r)slice functions. VESYLA creates M parallel threads (using statement 12), each performing max k MACs. Similarly, adderTree is a pre-defined function that corresponds to the 4-input adder tree mode of the LATAAs.
Figure 4. Partially Parallel 125 Taps Asymmetric FIR Filter
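The partitioning and the functional statements described above can be mimicked in a small behavioural model. The following is a minimal sketch in plain Python, not VESYLA syntax and not generated configware. The function names (`slice_sizes`, `adder_tree`, `fir_output`, `shift_line`) are hypothetical, and the rule that leftover taps go to the last slices is an assumption made only to keep the distribution as even as possible.

```python
def slice_sizes(n_taps, m_threads):
    """Split n_taps into m_threads contiguous slices, as evenly as possible.

    Assumption (not from the paper): leftover taps go to the last slices.
    """
    if not 1 <= m_threads <= n_taps:
        raise ValueError("need 1 <= M <= N")
    base, rem = divmod(n_taps, m_threads)
    return [base + (1 if i >= m_threads - rem else 0) for i in range(m_threads)]


def adder_tree(values, fan_in=4):
    """Sum values level by level, with fan_in inputs per adder."""
    level = list(values)
    while len(level) > 1:
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def fir_output(x, c, m_threads):
    """One output sample: M slice-wise MAC results reduced by the adder tree."""
    lout, start = [], 0
    for size in slice_sizes(len(c), m_threads):
        stop = start + size
        # Each slice models one thread performing its share of the MACs.
        lout.append(sum(xi * ci for xi, ci in zip(x[start:stop], c[start:stop])))
        start = stop
    return adder_tree(lout)


def shift_line(x, x0):
    """Shift the delay line by one position and insert the new sample x0."""
    return [x0] + x[:-1]


if __name__ == "__main__":
    n, m = 126, 5
    coeffs = list(range(1, n + 1))
    delay = [0] * n
    for sample in (3, -1, 4):
        delay = shift_line(delay, sample)
    print(slice_sizes(n, m))             # [25, 25, 25, 25, 26]
    print(fir_output(delay, coeffs, m))  # 11
```

With M=1 the model degenerates to a single serial MAC chain, and with M=N every slice holds exactly one tap. All settings of M produce the same output sample, which is the property that lets the generics sweep from serial to parallel without changing the function being computed.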
C. VESYLA Configware Generation
VESYLA generates configware corresponding to specific values of generics like N, M, r and c. This configware is at the register transfer level of abstraction and has the absolute timing and synchronization details that VESYLA synthesizes. Suppose we want to compute the convolution sum of a 126-tap asymmetric FIR filter using only 5 computation threads (a partially parallel implementation) in the DRRA fabric. This can be achieved by using M=5 and N=126, as shown in Figure 4. Additionally, assume r and c to be zero, which implies that columns 0 to 4 are being used. Each REFI stores 25 samples except the last one, which contains 26 samples (4 x 25 + 26 = 126). VESYLA will program the micro-coded state machines in row 0, columns 0 to 4, to generate the REFI instructions (read and write using port A/B in streaming modes) and the LATAA instructions (asymmetric MAC mode, adder tree) accordingly. VESYLA similarly deals with the allocation of REFI locations for the coefficients, ensuring that they are aligned with their corresponding samples. Instructions in these MAUSEEQAARs are issued sequentially and are synchronized by VESYLA using the delays provided by the AGU. In each MAUSEEQAAR multiple FSMs are created, which handle streaming data to/from the REFIs and consuming data for computation in the LATAAs, including the adder trees.
D. VESYLA Optional Pragma Inference
There are some optional parameters which the programmer can omit; VESYLA will still generate the correct functionality by inferring these parameters. It can be seen from the FIR filter code that there is no information about the ports of the REFIs. VESYLA chooses by itself and assigns the unused ports of the particular REFI. Similarly, the address ranges mentioned in the code are logical addresses, which are resolved to physical addresses by VESYLA. If a user wants to use specific physical addresses, the user should inform VESYLA by using some additional syntax in the pragmas.
E. Flexibility offered by VESYLA
The primary target of any HLS tool is to meet the performance requirements while optimizing for area/power constraints. VESYLA supports this with a controlled mechanism for HLS, where the user can choose any resource in the DRRA fabric. The user can easily change the allocation parameters in the DRRA fabric using pragmas (positional generics). In the earlier example, coefficients and samples share the same REFIs, but this can easily be changed by making the small alteration shown in Figure 5. Although this change looks very simple, it has a significant effect at the low level: in each SIMD thread the bit stream pattern for the MAUSEEQAARs changes, and the interconnect between the storage and functional units is affected as well. Manually altering these bit streams at the lower level is a very difficult task (as shown in “Experimental Results”), and VESYLA unburdens the user from these tedious jobs through this simple mechanism.
Figure 5. Simple Mechanism to Select Resources in DRRA
One can switch between two different architectures without making any significant changes in the code and with full control over the degree of parallelism. This is achieved by making a few changes in the VESYLA code to generate a different architectural solution for the same algorithm. Consider the code presented in Figure 6, which generates a symmetric FIR filter instead of an asymmetric FIR filter. A few changes are made to the previous VESYLA code, as in a symmetric FIR filter the samples are summed together and then multiplied by the coefficients. Similarly, the distribution of samples and coefficients over the REFIs
is different compared to the code for the asymmetric FIR filter. Samples for symmetric FIR filters are divided into two halves and are distributed over the same resource, while the coefficients use a different REFI. A 64-tap symmetric FIR filter using M=1 and N=64 will generate the configuration shown in Figure 6.
Figure 6. Symmetric FIR filters in VESYLA
VESYLA can also exploit TLP (Thread Level Parallelism), where multiple execution units perform computations on data sets with different sets of instructions, just as in MIMD. Consider the code presented in Figure 7, which generates two threads that execute independently of each other and compute different instructions. Here the binding/allocation pragmas point to the exact row/column in the DRRA fabric. We can use the same code to generate multiple SIMD threads within each MIMD thread by replacing the 0’s and 1’s in the pragmas with generic values, which will sweep the architectural space accordingly. There can be other variants, where multiple MIMD threads communicate with each other, and VESYLA can handle the complexity.
Figure 7. VESYLA MIMD Code Threads
V. EXPERIMENTAL RESULTS
Some of the above-mentioned algorithms were implemented in VESYLA and were simulated using the generated configware. Figure 8 shows the partial configware generated for a single MAUSEEQAAR (FIR filter) with its instruction set and the corresponding FSMs being generated. This MAUSEEQAAR represents an instance from a single thread of multiple SIMD threads, and Figure 9 elaborates only one of the FSMs (instr. num. 5) generated by VESYLA. The FSMs related to read/write operations are similar to each other. It should also be noted that the manually mapped design would produce the same output as VESYLA.
Figure 8. Configware For N-Taps FIR Filter
VESYLA programs each MAUSEEQAAR involved by executing and synchronizing these FSMs properly. Imagine how complex the job would be when manually generating this configware for each architectural choice with multiple SIMD/MIMD threads. The chances of making mistakes and the time for debugging the code go up as the complexity of the algorithm increases.
Figure 9. Single Read FSM
Due to lack of space we cannot show the complete con-