Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Architectural Synthesis of DSP Structured
Datapaths
Shereef B. M. Shehata
OUTLINE
• An overview of the architectural-level synthesis problem.
• Subtasks of the high-level synthesis problem:
  - Scheduling
  - Binding
  - Architecture optimization
• NP-hard problems (heuristics versus mathematical programming techniques).
• Novel mathematical programming formulation of the synthesis problem:
  - Linearization of the quadratic nonlinear problem
  - Optimization of performance and structural complexity
  - Techniques to improve the solution time of the ILP formulation
  - Heuristics as bounds for mathematical programming
• Results for typical HLS benchmarks.
• Conclusion.

Motivation
• To develop an architectural synthesis technique specific to DSP architectures targeting FPGA implementations.
• The technique is general enough to accommodate other technologies, such as new submicron technologies.
• To provide an accurate evaluation method for our high-level synthesis methodologies:
  - The total execution time, not the number of control steps, is the yardstick for performance comparison.
• To exploit important features of FPGA technology:
  - Large number of registers.
  - FPGA utilization drops sharply when the interconnections are complex.
  - High multiplexer cost.
  - Wide difference between the delays of multiplications and additions.
  - Efficient RAM storage.
  - Dedicated high-speed carry-propagation circuits.
Chapter 1

The Symmetrical Array FPGA Module (Xilinx)
- CLB routing is associated with each row and column of the CLB array.
- Global routing consists of dedicated networks primarily designed to distribute clocks throughout the device with minimum delay and skew. It can also distribute high fan-out signals throughout the device with minimum delay.
- The number of global nets and buffers has increased in more recent Xilinx 4000 generations to allow more flexibility in routing.
[Figure: symmetrical array of programmable logic blocks with programmable connection matrices and programmable switching matrices.]

XC4000 family switch box architecture
- SRAM configuration cells allow reuse and prototyping: the hardware becomes reconfigurable and the designer can update the system on the fly.
- The total size of an SRAM configuration cell plus the transistor switch it drives is larger than the programming devices used in antifuse technologies.
- The horizontal and vertical single- and double-length lines intersect at a box called a programmable switch matrix. Each switch matrix consists of programmable pass transistors used to establish connections between the lines.
[Figure: switch matrix and its interconnect points on the data lines; six pass transistors per switch matrix interconnect point.]

The Xilinx 4000 Configurable logic block (dedicated carry logic is not shown)
- The inputs C1-C4 can also be used to control the use of the F and G LUTs as 32 bits of SRAM.
- The mux control maps the four control inputs (C1-C4) into: LUT input H1, direct in (DIN), enable clock (EC) and set/reset for the flip-flops.
- The XC4000 CLB also has special fast dedicated carry logic hardwired between the CLBs.
[Figure: CLB block diagram with inputs G1-G4, F1-F4 and C1-C4 (mapped to H1 and DIN), three LUTs, programmable multiplexers, two D flip-flops with set/reset (S/R) and outputs Q1 and Q2, combinational outputs F and G, a clock input, and carry in/carry out connections to and from adjacent CLBs.]

Carry propagation paths in Xilinx 4000 series
- The carry chain in the XC4000 can run either up or down. At the top or bottom of a column, where there are no more CLBs, the carry is propagated to the right.
- The fast carry logic can be accessed through Relationally Placed Macros, which already include the special library symbols for the fast carry logic.
- The carry logic shares operands and control with the function generators.
[Figure: 4 x 4 array of CLBs with the dedicated carry path running through the columns.]

Interconnect Overview for the XC4000 family
[Figure: interconnect resources surrounding a CLB: single-, double- and quad-length lines, long lines, global clock lines, direct connect, and the carry chain.]
Details of XC4000 dedicated carry logic.
- The two 4-input function generators can be configured as a 2-bit adder with built-in hidden carry that can be expanded to any length.
- This dedicated carry circuitry is so fast that conventional speed-up methods such as carry generate/propagate have marginal benefit at the 32-bit level and almost no effect at the 16-bit level.
[Figure: 2-bit adder slice. The F function generator takes A_i, B_i and C_i and produces S_i; the G function generator takes A_{i+1}, B_{i+1} and produces S_{i+1}; the carry out is C_{i+2}.]

Details of a Logic Array Block (LAB) in FLEX 8000 family
- Eight LEs are stacked to form a Logic Array Block (LAB).
[Figure: LAB with eight logic elements (LE), LAB local interconnect, LAB control signals, row interconnect, column interconnect, carry-in from the LAB on the left and carry-out to the LAB on the right.]

FLEX 8000 Logic Element (LE)
- The FLEX LE uses a four-input LUT, a flip-flop, cascade logic and carry logic.
[Figure: LE block diagram with data inputs DATA1-DATA4, control inputs LABCTRL1-LABCTRL4, look-up table (LUT), carry chain (Carry-In, Carry-Out), cascade chain (Cascade-In, Cascade-Out), clear/preset logic, clock select, and a D flip-flop (PRN, CLRN) driving LE Out.]

FLEX 8000 device block diagram
[Figure: array of logic array blocks (LABs) built from logic elements, surrounded by I/O elements (IOE) and connected by the FastTrack interconnect.]

General Architecture Model
[Figure: datapath model. Each FU module contains an FU, an optional FU mux, an optional register mux and register; function units FU_i and FU_j can be linked through a chaining interconnect or through a register; FU outputs reach one of the pipelined busses through a tristate bus driver; register file (RAM) modules provide storage; the control unit issues interconnect control signals and function unit and register control signals.]

[Flow diagram: synthesis steps from the CDFG/DFG and technology (Tech) data to VHDL]
STEP-1: Scheduling Bounds
  - Heuristics to determine the lower bound on the number of cycles.
  - Heuristics to tighten the ASAP/ALAP values under the given resource constraints.
STEP-2: C++: Constraint Generation for ILP
  - DFG exploration.
  - Dynamic set generation for chaining.
  - ILP constraint generation.
STEP-3: ILP: Random Topology
  - Scheduling and binding.
  - Chaining of operations.
  - Interconnect minimization.
  - FU pipelining choice.
  - Clock cycle minimization + minimization of the number of cycles, OR minimization of the total execution time (i.e. throughput maximization).
STEP-4: ILP: Bus Insertion
  - Bus transfer scheduling.
  - Bus allocation.
  - Interconnect minimization.
  - Bus loading minimization.
STEP-LAST: Register Allocation
  - Data storage assignment.
  - Storage minimization.
  - Bus loading minimization.
Output: VHDL generation of the datapath and the controller, passed to logic synthesis tools.
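
The ASAP/ALAP values named in STEP-1 can be illustrated with a minimal sketch. It computes only the unconstrained bounds (no resource limits, no chaining), so it is a starting point for the tightening heuristics and not the heuristics themselves. The function name `asap_alap`, the dictionary layout and the three-operation example are assumptions made for illustration.

```python
from collections import defaultdict

def asap_alap(delays, edges, num_steps):
    """Return (asap, alap) start steps (1-based) for every operation.

    delays:    {op: duration in control steps}
    edges:     list of (pred, succ) precedence pairs of the DFG
    num_steps: schedule length used as the ALAP deadline
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Topological order (Kahn's algorithm); the list grows while we walk it.
    indeg = {op: len(preds[op]) for op in delays}
    order = [op for op in delays if indeg[op] == 0]
    for op in order:
        for v in succs[op]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    asap = {}
    for op in order:  # earliest start: after every predecessor has finished
        asap[op] = max((asap[u] + delays[u] for u in preds[op]), default=1)
    alap = {}
    for op in reversed(order):  # latest start: finish before any successor starts
        latest_finish = min((alap[v] for v in succs[op]), default=num_steps + 1)
        alap[op] = latest_finish - delays[op]
    return asap, alap

# Hypothetical DFG: a 2-step multiply and a 1-step add both feed a second add.
asap, alap = asap_alap({"m1": 2, "a1": 1, "a2": 1},
                       [("m1", "a2"), ("a1", "a2")], num_steps=4)
```

The interval from `asap[op]` to `alap[op]` plays the role of Range(op) in the ILP constraints: the narrower it is, the fewer X variables the formulation needs.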

Flow of the Back-End Tools
- Stage 2 uses Synopsys tools (logic synthesis and FPGA mapping), and stage 3 uses the Xilinx XACT tools for partition, placement and routing (PPR).
[Flow diagram]
Input: VHDL source files, which are also simulated.
Stage-2 (SYNOPSYS): read HDL and insert pads; compile and optimize the datapath and controller, using area constraints, delay constraints, FU pipelining (i.e. register balancing) and the Xilinx library.
Stage-3 (Xilinx): partition, placement and routing, using Xilinx hard macros; the result goes to simulation.

Chapter 2
Basic Definitions.
- A polyhedron $P \subseteq \mathbb{R}^n$ is the set of points that satisfy a finite number of linear inequalities, that is:
  \[ P = \{\, x \in \mathbb{R}^n : A \cdot x \le b \,\} \]
- A polytope is a bounded polyhedron, that is:
  \[ \exists\, w \in \mathbb{R}^1 : \quad P \subseteq \{\, x \in \mathbb{R}^n : -w \le x_j \le w \;\; \forall j,\ j = 1 \dots n \,\} \]
- A polyhedron face: the set $F = \{\, x \in P : \pi \cdot x = \pi_0 \,\}$ is called a face of $P$, and the valid inequality $\pi \cdot x \le \pi_0$ is said to define the face $F$.
- The convex hull: given a set $S \subseteq \mathbb{R}^n$ and a point $x \in \mathbb{R}^n$, the convex hull of $S$, denoted by Conv(S), is the set of all points that can be written as a convex combination of a finite number of points in $S$:
  \[ \mathrm{Conv}(S) = \Big\{\, x \in \mathbb{R}^n_+ : x = \sum_{i=1}^{t} \lambda_i \cdot x^i,\ \ \sum_{i=1}^{t} \lambda_i = 1,\ \ \lambda \in \mathbb{R}^t_+ \,\Big\} \]
- Here $x^1, x^2, \dots, x^t$ is any finite set of points in $S$. The convex hull Conv(S) can be described by a finite set of linear inequalities.
- A partially ordered set $(X, B)$, or poset, is a non-empty set $X$ and a binary relation $B$ on $X$ which is reflexive, anti-symmetric and transitive. The elements of $X$ are called points and the binary relation $B$ is called a partial ordering on $X$.
- A strict partially ordered set $(X, \tilde{B})$, or Sposet, is a non-empty set $X$ and a binary relation $\tilde{B}$ on $X$ which is irreflexive, anti-symmetric and transitive.
- We use $x B y$ to denote that $(x, y) \in B$ and $x \tilde{B} y$ to denote that $(x, y) \in \tilde{B}$.
- A Hasse diagram of a poset $(X, P)$ is a drawing in which the points of $X$ are placed so that if $y$ covers $x$, then $y$ is placed at a higher level than $x$ and joined to $x$ by a line segment. The corresponding graph is called a Hasse graph of the poset.
- A clique in a graph $G = (V, E)$ is a set $C \subseteq V$ with the property that every pair of nodes in $C$ is joined by an edge.
- A subset $A \subseteq V$ of the vertices of the graph $G = (V, E)$ is an r-clique if it induces a complete subgraph, i.e. $G_A \cong K_r$.
- A stable set (or independent set) of vertices is a subset $X$ of the vertex set of a graph $G$, no two of which are adjacent.
- A comparability graph is an undirected graph that is transitively orientable. That is, each edge can be assigned a one-way direction such that the resulting directed graph $G = (V, E)$ satisfies the following condition: $(a, b) \in E$ and $(b, c) \in E$ imply $(a, c) \in E$, $\forall a, b, c \in V$.
- A graph $G$ is a triangulated graph if every simple cycle of length strictly greater than 3 possesses a chord.
- The stability number $\alpha(G)$ of $G$ is the number of vertices in a stable set of maximum cardinality.
- The chromatic number $\gamma(G)$ of $G$ is the smallest possible $k$ for which there exists a proper k-coloring of $G$.
- The clique number $\omega(G)$ of $G$ is the number of vertices in a clique of maximum cardinality.
- The clique cover number $\theta(G)$ is the fewest number of complete subgraphs needed to cover the vertices of $G$, i.e. the size of the smallest possible clique cover of the graph $G$.
- A vertex packing on a graph $G = (V, E)$ is a set of vertices $U \subseteq V$ with the property that no pair of vertices in $U$ is joined by an edge.
- The fractional vertex packing polytope of a graph $G = (V, E)$ is
  \[ P = \{\, x \in \mathbb{R}^n_+ : \kappa \cdot x \le 1 \,\} \]
  where $n = |V|$ and $\kappa$ is the maximal clique matrix of the graph $G$.

Chapter 3

Simultaneous Performance Optimization and Interconnect Minimization
• Exploration of a much larger solution space, guided by a highly selective objective function that rejects architectures whose larger interconnect makes them unsuitable for FPGA implementation.
• Developing an ILP formulation that incorporates:
  - Multilevel chaining of operations and deeply pipelined functional units, both of which are effective for FPGAs.
  - Optimal scheduling and binding of operations while minimizing interconnections.
  - Determination of the system clock duration.
  - Minimization of the total execution time rather than the number of control steps.

Details of the Integer Linear Programming Formulation
• Operation Assignment Constraints
  - This constraint assigns every operation of the DFG to exactly one control step and one FU.
  \[ \sum_{s \in Range(op)} \sum_{n=1}^{N_t} X_{op,n,s} = 1 \qquad \forall op \]
[Figure: grid of variables $X_{i,n,s}$ (three FU instances) and $X_{j,n,s}$ (two FU instances) over the control steps, with the ASAP/ALAP windows of op_i and op_j and the precedence op_i -> op_j marked. The variables in the shaded region add up to 1.]

Details of the Integer Linear Programming Formulation
• Function Unit Assignment Constraint
  - Each FU has at most one operation assigned at a given time.
  \[ \sum_{op \in FU_t} \ \sum_{p = s - L(op) + 1}^{s} X_{op,n,p} \le 1 \qquad \forall n,\ \forall s \]
[Figure: grid of variables $X_{i,n,s}$, $X_{j,n,s}$ and $X_{k,n,s}$ (two FU instances each) over c-step 1 to c-step 5, with the precedence op_i -> op_j marked. The summation of the highlighted variables is at most 1.]

Details of the Integer Linear Programming Formulation
• Scheduling partially ordered operations has to follow the precedence order (no chaining):
  \[ \sum_{p = s - D(op_i) + 1}^{ALAP(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = ASAP(op_j)}^{s} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1 \]
  \[ ASAP(op_j) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i \to op_j) \]
[Figure: grid of variables $X_{i,n,s}$ and $X_{j,n,s}$ over control steps 1 to 6, with ASAP(op_i), ALAP(op_i), ASAP(op_j) and the current c-step marked. The variables in the shaded region add up to at most 1.]
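
As a sanity check on the precedence inequality, the sketch below enumerates every pair of start steps for one edge op_i -> op_j and confirms that the rows are satisfied exactly when op_j starts after op_i has finished. It assumes a single FU instance per operation (so the sum over n collapses) and hypothetical ASAP/ALAP windows; the function names are illustrative only.

```python
from itertools import product

def precedence_rows_hold(si, sj, d_i, asap_j, alap_i):
    """Evaluate the no-chaining precedence row for every control step s."""
    for s in range(asap_j, alap_i + d_i):  # s = ASAP(op_j) .. ALAP(op_i)+D(op_i)-1
        lhs_i = 1 if s - d_i + 1 <= si <= alap_i else 0  # sum over p of X_opi
        lhs_j = 1 if asap_j <= sj <= s else 0            # sum over p of X_opj
        if lhs_i + lhs_j > 1:
            return False
    return True

def rows_match_precedence(d_i, range_i, range_j):
    """True if the rows accept exactly the schedules with sj >= si + D(op_i)."""
    asap_j, alap_i = range_j[0], range_i[1]
    for si, sj in product(range(range_i[0], range_i[1] + 1),
                          range(range_j[0], range_j[1] + 1)):
        feasible = sj >= si + d_i
        if precedence_rows_hold(si, sj, d_i, asap_j, alap_i) != feasible:
            return False
    return True

single_cycle_ok = rows_match_precedence(1, (1, 3), (2, 4))
two_cycle_ok = rows_match_precedence(2, (1, 3), (3, 5))
```

Both checks pass for these windows, which suggests the inequality is neither too weak nor too strong on such small cases; it is not a proof for arbitrary DFGs.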

To Determine the Total Length of the Schedule
- The following constraint determines the total number of steps $T$ from the schedule of the operations in the set $W$, where $W$ is the set of operations without successors in the DFG:
  \[ \sum_{s \in Range(op)} \sum_{n=1}^{N_t} s \times X_{op,n,s} \; - \; T \le -D(op) + 1 \qquad \forall op \in W \]
- The variable $T$ has both an upper and a lower bound (determined from heuristics):
  \[ T \ge T_{cr}, \qquad T \le T_{cr} + \Delta T \]

Constraints to minimize the structural complexity of the synthesized architecture
- Counting the number of motifs:
  \[ \sum_{\substack{s \in Range(op_i) \\ op_i \in FU_t}} X_{op_i,n,s} \; + \sum_{\substack{s \in Range(op_j) \\ op_j \in FU_{t'}}} X_{op_j,n',s} \; - \; Motif(FU_t, n, FU_{t'}, n') \le 1 \]
  \[ \forall (op_i \to op_j), \qquad \forall n \ (n = 1 \dots N_t), \qquad \forall n' \ (n' = 1 \dots N_{t'}) \]
- A corresponding term that minimizes MOTIFSUM is included in the objective function to increase the utilization of the interconnect already assigned between different function units.
[Figure: grid of variables $X_{i,n,s}$ and $X_{j,n,s}$ over c-step 1 to c-step 5, with the ASAP/ALAP windows of op_i and op_j marked. The summation of the highlighted variables sets the value of $Motif(A, 2, M, 1)$.]
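
What the Motif variables count can be sketched outside the ILP: given a binding of operations to FU instances, each DFG edge switches on the motif between its source and destination instances, and edges that reuse an existing link add nothing. The helper names, the `(type, instance)` tuple encoding and the small DFG are assumptions for illustration, not the formulation itself.

```python
def motifs(edges, binding):
    """Set of (src_type, src_inst, dst_type, dst_inst) links used by DFG edges."""
    return {binding[u] + binding[v] for u, v in edges}

def fan_in(motif_set):
    """Number of distinct source instances feeding each destination instance."""
    sources = {}
    for st, sn, dt, dn in motif_set:
        sources.setdefault((dt, dn), set()).add((st, sn))
    return {dst: len(srcs) for dst, srcs in sources.items()}

# Hypothetical DFG: two multiplies feed two adds, which feed a final add.
edges = [("m1", "a1"), ("m2", "a2"), ("a1", "a3"), ("a2", "a3")]

# Binding A reuses links: both multiplies on M1, a1 and a2 on adder 1, a3 on adder 2.
bind_a = {"m1": ("M", 1), "m2": ("M", 1),
          "a1": ("A", 1), "a2": ("A", 1), "a3": ("A", 2)}
# Binding B spreads the adds differently and needs more links.
bind_b = {"m1": ("M", 1), "m2": ("M", 1),
          "a1": ("A", 1), "a2": ("A", 2), "a3": ("A", 1)}

motifs_a, motifs_b = motifs(edges, bind_a), motifs(edges, bind_b)
```

Binding A needs two links and binding B needs four, and in B adder 1 receives data from three different sources, which is the kind of fan-in that translates into multiplexer inputs.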

Constraints to minimize the structural complexity of the synthesized architecture
- Counting the number of chaining motifs:
  \[ \sum_{op_i \in FU_t} X_{op_i,n,s} \; + \sum_{op_j \in FU_{t'}} X_{op_j,n',s} \; - \; CMotif(FU_t, n, FU_{t'}, n') \le 1 \]
  \[ \forall s, \quad \forall (op_i \to op_j), \qquad \forall n \ (n = 1 \dots N_t), \qquad \forall n' \ (n' = 1 \dots N_{t'}) \]
- A corresponding term that minimizes CMOTIFSUM is included in the objective function to increase the utilization of the chaining interconnect already assigned between different function units.
[Figure: grid of variables $X_{i,n,s}$ and $X_{j,n,s}$ over c-step 1 to c-step 5, with the precedence op_i -> op_j marked. The summation of the highlighted variables sets the value of $CMotif(A, 2, M, 1)$.]

- Counting incompatible motifs
- The idea is to minimize the number of motifs that terminate on the same function unit. This decreases the number of multiplexers in the synthesized architecture.
  \[ \sum_{FU_t} \sum_{n=1}^{N_t} Motif(FU_t, n, FU_{t'}, n') \; - \; Incomp(FU_{t'}) \le 0 \qquad \forall n',\ \forall FU_{t'} \]
[Figure: schedules and motifs on the left, the resulting architecture (with I/O) on the right.]

- Minimizing the maximum number of edges with the same FU destination type (incompatible motifs): an integer variable is introduced to count the number of incompatible motifs.
  \[ \sum_{FU_t} \sum_{n=1}^{N_t} Motif(FU_t, n, FU_{t'}, n') \; - \; Incomp(FU_{t'}) \le 0 \qquad \forall FU_{t'},\ \forall n' \]
[Figure: panels (a), (b) and (c).]
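
Minimizing the Maximum Number of Edge Overlap
\[
K \times \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \notin wrap}}
\Big( \sum_{p = ASAP(op_i)}^{s} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; - \sum_{p = ASAP(op_j)}^{s} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \Big)
\]
\[
+ \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \in wrap}}
\Big( \sum_{p = ASAP(op_i)}^{s} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s+1}^{ALAP(op_j)} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \Big)
\; - \; Maxovlap(FU_t) \le 0 \qquad \forall s,\ \forall FU_t
\]
(The slide also states a condition relating the constant $K$ to 1; the relation symbol is illegible.)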

Formulation for chaining of two operations per control step
• The destination operation cannot be scheduled before the source operation.
  - However, they can share the same control step.
  \[ \sum_{p = s - D(op_i) + 1}^{ALAP(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = ASAP(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1 \]
  \[ ASAP(op_j) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i \to op_j) \]
[Figure: grid of variables $X_{i,n,s}$ and $X_{j,n,s}$ over control steps 1 to 6, with ASAP(op_j), ALAP(op_i) and the current c-step marked. The variables in the shaded regions add up to at most 1.]

Formulation for chaining of two operations per control step
• The source operation cannot be scheduled after the destination operation.
  - However, they can share the same control step. This constraint and the previous one are not redundant: together they tighten the formulation.
  \[ \sum_{p = s - D(op_i) + 2}^{ALAP(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1 \]
  \[ ASAP(op_j) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i \to op_j) \]
[Figure: grid of variables $X_{i,n,s}$ and $X_{j,n,s}$ over control steps 1 to 6, with ASAP(op_i), ALAP(op_i), ASAP(op_j) and the current c-step marked. The variables in the shaded region add up to at most 1.]

Formulation for chaining of two operations per control step
• The following constraint prevents chaining of more than two operations in the same control step:
  \[ \sum_{p = s - D(op_i) + 1} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1 \]
  \[ ASAP(op_k) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i, op_k) \in \Re_2 \]
[Figure: grid of variables over control steps 1 to 6 for the chain op_i -> op_j -> op_k, with ASAP(op_i), ALAP(op_i), ASAP(op_k) and the current c-step marked. The variables in the shaded region add up to at most 1.]
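
The combined effect of the three chaining rows can be checked by brute force on a three-operation chain op_i -> op_j -> op_k. The sketch assumes every operation takes one control step, one FU instance per operation, and that every control step from 1 to 3 is inside every operation's range; these simplifications, and the function names, are illustrative assumptions.

```python
from itertools import product

STEPS = range(1, 4)  # control steps 1..3; every operation has D(op) = 1

def rows_hold(si, sj, sk):
    """Evaluate the 2-level chaining rows for the chain op_i -> op_j -> op_k."""
    for s in STEPS:
        for src, dst in ((si, sj), (sj, sk)):
            # Destination not before source: source at p >= s, destination at p <= s-1.
            if (src >= s) + (dst <= s - 1) > 1:
                return False
            # Source not after destination (tightening row; p starts at s+1 when D = 1).
            if (src >= s + 1) + (dst == s) > 1:
                return False
        # No third chained level: op_i and op_k never share a control step.
        if (si == s) + (sk == s) > 1:
            return False
    return True

def two_level_ok(si, sj, sk):
    """Intended semantics: order respected, at most two chained ops per step."""
    return si <= sj <= sk and si < sk

all_agree = all(rows_hold(*c) == two_level_ok(*c)
                for c in product(STEPS, repeat=3))
```

For example, the schedule (1, 1, 2) chains op_i with op_j and is accepted, while (1, 1, 1) would chain all three operations in one step and is rejected.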

Multi-Level Chaining
- Patterns to look for in the DFG:
[Figure: DFG fragments A, B, C and D in which op_i and op_k (additions) are separated by a multiplication (*).]
- Formulation: generate the set $\Delta_M \subseteq O^2$ such that $(op_1, op_2) \in \Delta_M$ if $op_1 \to op_M$ and $op_M \to op_2$, where $op_M$ is a multi-cycle operation (e.g. a multiply operation).
- The following constraint then applies to the members of this set:
  \[ \sum_{p = s - D(op_i) + 1} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1 \]
  \[ ASAP(op_k) \le s \le ALAP(op_i) + D(op_i) - 1 \quad \forall s, \qquad \forall (op_i, op_k) \in \Delta_M \]

Delay model for an N-bit adder implemented in Xilinx FPGAs
  \[ T_A = T_{OPCY} + \frac{N - 4}{2} \times T_{carry} + T_{sum} \]
For the Xilinx 4000 series, $T_{carry}$ is 0.7/1 ns and $T_{OPCY}$ is 4 ns.
- The delay is linear in the number of bits. The proportionality factor is $T_{carry}$, which makes these the fastest possible carry path circuits.
[Figure: N-bit adder built from CLBs, inputs A0,B0 (LSB) to AN-1,BN-1 (MSB), outputs S0 to SN-1; the carry ripples through (N-4)/2 CLBs with delay $T_{carry}$ each, preceded by $T_{OPCY}$ and followed by $T_{sum}$.]

Delay model for a pipelined multiplier chained with an adder
  \[ T_{pd} = T_{pipe} + T_{sum} \]
For the Xilinx 4000 series, $T_{sum}$ is 5 ns.
[Figure: last pipeline stage of a multiplier feeding an adder; both carry chains (with $T_{OPCY}$, $T_{carry}$ and $T_{sum}$ marked) run from LSB to MSB, producing S0 to SN-1.]
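
The two delay models are simple enough to evaluate directly. The sketch below plugs in the XC4000 figures quoted on the slides (T_OPCY = 4 ns, T_carry = 0.7 ns, T_sum = 5 ns); using 0.7 ns rather than 1 ns for T_carry, and the 20 ns value for T_pipe in the usage lines, are assumptions made only for illustration.

```python
def adder_delay_ns(n_bits, t_opcy=4.0, t_carry=0.7, t_sum=5.0):
    """T_A = T_OPCY + (N - 4)/2 * T_carry + T_sum, all in nanoseconds."""
    return t_opcy + (n_bits - 4) / 2 * t_carry + t_sum

def chained_mult_add_delay_ns(t_pipe, t_sum=5.0):
    """T_pd = T_pipe + T_sum for an adder chained to the last multiplier stage."""
    return t_pipe + t_sum

delay_16 = adder_delay_ns(16)   # 4 + 6 * 0.7 + 5
delay_32 = adder_delay_ns(32)   # 4 + 14 * 0.7 + 5
chained = chained_mult_add_delay_ns(t_pipe=20.0)  # hypothetical T_pipe
```

Under these numbers, doubling the word length from 16 to 32 bits adds only 8 carry stages, i.e. 5.6 ns, and the chained adder contributes only T_sum because its carry chain runs in parallel with that of the multiplier stage.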

Scheduling with Multi-level chaining and Interconnect minimization
[Figure: two addition-tree DFGs over inputs i1 to i13 with output "out", and the synthesized datapaths built from Adder 1, Adder 2, Adder 3 and registers R1 and R2.]
Reported results for the two architectures shown:
                              First     Second
Extra number of mux inputs    2         8
Number of CLBs                128       180
Execution time                84 ns     96 ns
Number of registers           2         2

Delaying interconnect optimization until after scheduling
- Comparison of our results for an addition tree with methods that restrict the solution space, or that do not minimize interconnect simultaneously with scheduling and binding of operations.
[Figure: addition-tree DFG over inputs i1 to i13 with output "out", and a synthesized datapath built from Adder 1 to Adder 4 and registers R1 to R4.]
Reported results:
Extra number of mux inputs    7
Number of CLBs                168
Execution time                84 ns
Number of registers           4

Scheduling and binding for the CDFG with non-pipelined multipliers and no chaining
• The schedule needs 7 control steps, with a clock duration of 150 ns.
[Figure: scheduled CDFG (additions and two multiplications) over c-step 1 to c-step 7.]
Clock cycle: 150 ns
Execution time: 7 * 150 = 1050 ns
Resources: 2 adders, 1 multiplier
No chaining
Non-pipelined multipliers

Effect of increasing the resources to three adders and one non-pipelined multiplier on the scheduling of the CDFG
• Increasing the resources by one adder does not affect the execution time for the CDFG.
[Figure: scheduled CDFG (additions and two multiplications) over c-step 1 to c-step 7.]
Clock cycle: 150 ns
Execution time: 7 * 150 = 1050 ns
Resources: 3 adders, 1 multiplier
No chaining
Non-pipelined multipliers

Scheduling and binding for the CDFG with pipelined multipliers and no chaining
• The schedule needs 8 control steps, with a clock duration of 80 ns.
[Figure: scheduled CDFG (additions and two multiplications) over c-step 1 to c-step 8.]
Clock cycle: 80 ns
Execution time: 8 * 80 = 640 ns
Resources: 2 adders, 1 multiplier
No chaining
Pipelined multipliers

Scheduling and binding of the CDFG, using a pipelined multiplier and chaining
• The schedule needs 5 control steps, with a clock duration of 90 ns.
Clock cycle: 90 ns
Execution time: 5 * 90 = 450 ns
Resources: 3 adders, 1 multiplier
Pipelined multipliers
2-level chaining allowed
[Figure: scheduled CDFG with two multiplications over c-step 1 to c-step 5.]
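
The schedules reported for this CDFG show why total execution time, and not the number of control steps, is used as the yardstick. A minimal sketch that tabulates the three configurations quoted on the slides (the labels and variable names are illustrative):

```python
# (label, control steps, clock period in ns), as quoted on the slides
configs = [
    ("non-pipelined multiplier, no chaining", 7, 150),
    ("pipelined multiplier, no chaining", 8, 80),
    ("pipelined multiplier, 2-level chaining", 5, 90),
]

def execution_time_ns(steps, clock_ns):
    """Total execution time = number of control steps * clock period."""
    return steps * clock_ns

times = {name: execution_time_ns(steps, clk) for name, steps, clk in configs}
fastest = min(times, key=times.get)

for name, t in times.items():
    print(f"{name}: {t} ns")
```

The pipelined, unchained schedule uses more control steps than the non-pipelined one (8 versus 7) yet finishes in 640 ns instead of 1050 ns, and the chained schedule is fastest at 450 ns even though its clock is slower than the 80 ns clock.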

Synthesized architecture for the scheduling and binding using pipelining and chaining
[Figure: synthesized datapath with registers R1 to R4 and a multiplier (*).]

Minimization of the Total Execution Time (Performance Optimization)
- The following constraint sets the clock duration during the solution:
  \[ \delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega, \qquad \forall \psi_{ijk} \in \Psi \]
- The constraint that sets the chaining variable $\psi_{MAA}$ is given below:
  \[ \sum_{p = s - D(op_i) + 1} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \; - \; \psi_{MAA} \le 1 \]
  \[ ALAP(op_i) + D(op_i) - 1 \ge s \ge ASAP(op_k) \quad \forall s, \qquad \forall (op_i, op_k) \in \Im_{2S}, \quad (op_i \in N_M) \text{ and } (op_k \in N_A) \]
- The upper and lower limits that exist for the clock duration:
  \[ \Omega_{min} \le \Omega \le \Omega_{max} \]
- The values of the upper and lower bounds are determined as follows:
  \[ \Omega_{max} = MAX\{\delta(\Psi)\}, \qquad \Omega_{min} = MIN\{\delta(\Psi)\} \]
- If the clock duration is allowed only discrete values:
  \[ \delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega_{relaxed}, \qquad \forall \psi_{ijk} \in \Psi \]
  \[ \Omega = \left\lceil \frac{\Omega_{relaxed}}{\Omega_{min}} \right\rceil \cdot \Omega_{min} \]
  (the rounding brackets are not legible on the slide; a ceiling is assumed here).
- $\Omega_{relaxed}$ is a relaxed version of the discrete-valued $\Omega$ that can assume any positive number.

Minimization of the DFG Total Execution Time
- The number of control steps (an integer) can be represented in terms of binary variables:
  \[ T = \sum_{i=0}^{n-1} 2^i \cdot \beta_i \]
- The part of the objective function that minimizes the total execution time is nonlinear:
  \[ I_N = \sum_{i=0}^{n-1} (2^i \cdot CLOCK) \cdot \beta_i \]
- The objective function can be conceptually presented as:
  \[ I = I_N + I_{L1} \]

Minimization of the DFG Total Execution Time
- Linearization of the nonlinear part of the objective function:
  \[ I_{L2} = \sum_{i=0}^{n-1} \left( 2^i \cdot CLKMIN \cdot \beta_i + \Theta_i \right) \]
- Linearization of the nonlinear part of the objective function (cont'd):
  \[ \Theta_i \ge 2^i \cdot CLOCK - 2^i \cdot CLKMIN \cdot \beta_i - 2^i \cdot CLKMAX \cdot (1 - \beta_i), \qquad i = 0, \dots, n-1 \]
  \[ \Theta_i \ge 0, \qquad i = 0, \dots, n-1 \]

Minimization of the DFG Total Execution Time
- Linearization does not increase the complexity of the formulation:
  \[ \Theta_i \ge \begin{cases} 2^i \cdot (CLOCK - CLKMAX), & \text{if } \beta_i \text{ is } 0 \quad (\Theta_i \ge 0) \\ 2^i \cdot (CLOCK - CLKMIN), & \text{if } \beta_i \text{ is } 1 \quad (\Theta_i \ge 0) \end{cases} \]
  \[ I_{L2} = \sum_{i,\ \beta_i = 1} 2^i \cdot CLOCK \]
  \[ n = (\log T) / (\log 2) \]
• where $n$ is the number of discrete variables added to the formulation.
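
A quick numeric check of the linearization: if each $\Theta_i$ sits at its smallest feasible value (which is what minimizing the objective drives it to), $I_{L2}$ should equal the product T x CLOCK for every bit pattern of T and every clock between CLKMIN and CLKMAX. The function name and the sample clock values below are assumptions for illustration.

```python
def linearized_time(t_steps, clock, clk_min, clk_max, n_bits):
    """Evaluate I_L2 with every Theta_i at its smallest feasible value."""
    total = 0.0
    for i in range(n_bits):
        beta = (t_steps >> i) & 1  # binary digit beta_i of T
        # Theta_i >= 0 and Theta_i >= 2^i*CLOCK - 2^i*CLKMIN*beta - 2^i*CLKMAX*(1-beta)
        theta = max(0.0,
                    2**i * clock
                    - 2**i * clk_min * beta
                    - 2**i * clk_max * (1 - beta))
        total += 2**i * clk_min * beta + theta
    return total

CLK_MIN, CLK_MAX, N_BITS = 80.0, 150.0, 4
exact = all(
    abs(linearized_time(t, clk, CLK_MIN, CLK_MAX, N_BITS) - t * clk) < 1e-9
    for t in range(2**N_BITS)
    for clk in (80.0, 90.0, 120.0, 150.0)
)
```

When $\beta_i = 0$ the lower bound on $\Theta_i$ is non-positive, so $\Theta_i$ drops to 0; when $\beta_i = 1$ the term becomes $2^i \cdot CLKMIN + 2^i \cdot (CLOCK - CLKMIN) = 2^i \cdot CLOCK$, which is why the sum reproduces T x CLOCK.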

Tree Height Reduction
- The performance of the architecture is bounded by the length of the critical path.

Before THR           After THR             Delay estimation
(A + B) + C + D      (A + B) + (C + D)     $\delta(\psi_{AAA}) = \delta(\psi_{AA})$
(A + B) - C + D      (A + B) + (D - C)     $\delta(\psi_{ASA}) = MAX\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A + B) + C - D      (A + B) + (C - D)     $\delta(\psi_{AAS}) = MAX\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A - B) + C + D      (A - B) + (C + D)     $\delta(\psi_{SAA}) = MAX\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A - B) - C + D      (A - B) + (D - C)     $\delta(\psi_{SSA}) = MAX\{\delta(\psi_{SA}), \delta(\psi_{SA})\}$
(A - B) - C - D      (A - B) - (C + D)     $\delta(\psi_{SSS}) = MAX\{\delta(\psi_{SS}), \delta(\psi_{AS})\}$
(A * B) + C + D      (A * B) + (C + D)     $\delta(\psi_{MAA}) = MAX\{\delta(\psi_{MA}), \delta(\psi_{AA})\}$
(A * B) - C + D      (A * B) + (D - C)     $\delta(\psi_{MSA}) = MAX\{\delta(\psi_{MA}), \delta(\psi_{SA})\}$
(A * B) - C - D      (A * B) - (C + D)     $\delta(\psi_{MSS}) = MAX\{\delta(\psi_{MS}), \delta(\psi_{AS})\}$
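
Each tree-height-reduction rewrite must preserve the computed value while shortening the chain from three sequential operations to two levels. The sketch below checks the nine before/after pairs from the slides on random integers (integers are used so that the comparison is exact); the list layout and function name are illustrative assumptions.

```python
import random

# (before THR, after THR) for the nine patterns listed on the slides
REWRITES = [
    (lambda a, b, c, d: ((a + b) + c) + d, lambda a, b, c, d: (a + b) + (c + d)),
    (lambda a, b, c, d: ((a + b) - c) + d, lambda a, b, c, d: (a + b) + (d - c)),
    (lambda a, b, c, d: ((a + b) + c) - d, lambda a, b, c, d: (a + b) + (c - d)),
    (lambda a, b, c, d: ((a - b) + c) + d, lambda a, b, c, d: (a - b) + (c + d)),
    (lambda a, b, c, d: ((a - b) - c) + d, lambda a, b, c, d: (a - b) + (d - c)),
    (lambda a, b, c, d: ((a - b) - c) - d, lambda a, b, c, d: (a - b) - (c + d)),
    (lambda a, b, c, d: ((a * b) + c) + d, lambda a, b, c, d: (a * b) + (c + d)),
    (lambda a, b, c, d: ((a * b) - c) + d, lambda a, b, c, d: (a * b) + (d - c)),
    (lambda a, b, c, d: ((a * b) - c) - d, lambda a, b, c, d: (a * b) - (c + d)),
]

def all_equivalent(trials=200, seed=0):
    """True if every rewrite gives the same value as the original expression."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = [rng.randint(-1000, 1000) for _ in range(4)]
        if any(before(*args) != after(*args) for before, after in REWRITES):
            return False
    return True

thr_is_value_preserving = all_equivalent()
```

After the rewrite the two inner operations are independent, so the chained delay is set by the slower of the two two-operation paths, which is what the MAX expressions in the delay column express.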

Chapter 4

Hasse Graph for scheduling with n-level chaining
[Figure: Hasse graph with operations (op) 1, 2, 3, ..., n-1, n on one axis and control steps (cstep, s) 1, 2, ..., n+1 on the other; chaining levels labeled $\alpha_1, \alpha_2, \dots, \alpha_{n-2}, \alpha_{n-1}, \alpha_n$; two edge types: assignment edges and timing edges.]

Topological sorting of the Hasse Graph can be modified to be used for coloring the graph
- Nodes are numbered according to topological sorting.
[Figure: Hasse graph for operations (op) 1 to 3 over control steps (cstep, s) 1 to 6, with nodes numbered 1 to 16 in topological order.]

Two different colorings for the Hasse Graph for scheduling with 2-level chaining
- Nodes are numbered according to the corresponding color.
[Figure: two copies of the Hasse graph for operations (op) 1 to 3 over control steps (cstep, s) 1 to 6, each node labeled with its color (0 to 5).]

Topological sorting of the Hasse Graph can be modified to be used for coloring the graph
[Figure: Hasse graph for operations (op) 1 to 4 over control steps (cstep, s) 1 to 6, with nodes numbered 1 to 22 in topological order.]

Two different colorings for the Hasse Graph for scheduling with 3-level chaining
- The graph has 22 nodes and 43 edges. The number of maximal cliques therefore cannot be greater than, or even equal to, 22.
- The transitive closure of the graph has 115 edges.
[Figure: two copies of the Hasse graph for operations (op) 1 to 4 over control steps (cstep, s) 1 to 6, each node labeled with its color (0 to 5).]

An Odd-Hole graph and a Wheel graph
[Figure: an odd-hole graph on nodes 1 to 5 and a wheel graph on nodes 1 to 6.]
An odd-hole graph:
  \[ x_1 + x_2 + x_3 + x_4 + x_5 \le 2 \]
A wheel graph:
  \[ x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 \le 2 \]
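
Both inequalities can be verified by enumerating every stable set of the two graphs. The sketch assumes the usual structure: the odd hole is the 5-cycle 1-2-3-4-5-1, and the wheel adds a hub node 6 joined to every rim node. The edge lists and names are assumptions, since the figure itself is not reproduced here.

```python
from itertools import product

RIM = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]   # odd hole: 5-cycle
WHEEL = RIM + [(i, 6) for i in range(1, 6)]      # hub 6 joined to every rim node

def stable_sets(n, edges):
    """Yield 0/1 vectors x (x[0] unused) in which no edge has both ends selected."""
    for bits in product((0, 1), repeat=n):
        x = (0,) + bits
        if all(x[u] + x[v] <= 1 for u, v in edges):
            yield x

# Largest left-hand side over all integer (stable-set) solutions.
max_hole = max(sum(x[1:6]) for x in stable_sets(5, RIM))
max_wheel = max(sum(x[1:6]) + 2 * x[6] for x in stable_sets(6, WHEEL))

# A fractional point that satisfies every edge inequality x_u + x_v <= 1 ...
half = (0, 0.5, 0.5, 0.5, 0.5, 0.5)
edge_feasible = all(half[u] + half[v] <= 1 for u, v in RIM)
# ... but is cut off by the odd-hole inequality.
violates_odd_hole = sum(half[1:]) > 2
```

No integer solution exceeds 2 on either left-hand side, so the inequalities are valid, while the all-halves point shows that they remove fractional solutions that the edge constraints alone allow. That is the sense in which such constraint classes tighten an ILP relaxation.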

The Extended Wheel Graph
[Figure: an extended-wheel graph on nodes 1 to 7.]
An extended-wheel graph:
  \[ x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 + 2 \cdot x_7 \le 2 \]

An Extended Wheel Graph Constraint Class for 3-level chaining
Example constraint class: constraint $(\alpha_2 \beta \alpha_1 \beta)$ for a 3-level chain.
[Figure: Hasse graph for operations (op) 1 to 4 over control steps (cstep, s) 1 to 5.]
Example, for s = 3:
  \[ X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2 \]
General form:
  \[ \sum_{a=1}^{N_{t_i}} X_{op_i,a,(s - D(op_i) + 1)} \; + \sum_{p = s-1}^{s} \sum_{a=1}^{N_{t_k}} X_{op_k,a,p} \; + \sum_{p = s-1}^{s} \sum_{a=1}^{N_{t_l}} X_{op_l,a,p} \; + \; 2 \cdot \Big( \sum_{p = s - D(op_i) + 2}^{ALAP(op_i)} \sum_{a=1}^{N_{t_i}} X_{op_i,a,p} \Big) \le 2 \]
  \[ \forall s \in (Range(op_i) \cap Range(op_l)), \qquad s - D(op_i) + 2 \in Range(op_i), \quad (s - 1) \in Range(op_l) \]
  \[ \forall (op_i, op_k) \in \Im_{2S}, \qquad \forall (op_i, op_l) \in \Im_{3S} \]

Exploring the Hasse diagram for schedules with 2-level chaining
[Figure: state diagram that begins at "start" and visits constraint classes 1 to 5, with transitions labeled $\alpha_1$, $\alpha_2$, $\alpha_1/\alpha_2$ and $\beta$.]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
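1- Clique Constraint Class for 2-level chaining: $\hat{\beta}$
[Figure: grid of operations (op) against control steps (cstep, s) marking the variables in the clique]
- The constraint class 1, $\hat{\beta}$:
$$\sum_{s \in \mathrm{Range}(op)} \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \qquad \forall\, op \in \mathrm{DFG}$$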
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
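2- Clique Constraint Class for 2-level chaining: $\beta_s \alpha_2 \beta_s$
[Figure: grid of operations (op) against control steps (cstep, s) marking the variables in the clique]
- The constraint class 2, $\beta_s \alpha_2 \beta_s$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_k)}^{s} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \left( \mathrm{Range}(op_i) \cap \mathrm{Range}(op_k) \right), \qquad \forall (op_i, op_k) \in \Im_{2S}$$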
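To make the index ranges of the class 2 constraint concrete, the following sketch lists the variables that appear on its left-hand side for one pair of operations and one control step s. It is a minimal illustration, not the generator used in this work: the function name, the dictionary layout, the clamping of the lower bound to ASAP(op_i), and the toy ASAP/ALAP values are all assumptions.

```python
def class2_clique_terms(s, op_i, op_k, asap, alap, delay, n_units):
    """Index triples (op, unit n, step p) of the class 2 clique constraint.

    The constraint itself states that the sum of these 0/1 variables is <= 1.
    """
    terms = []
    # op_i contributes steps p = s - D(op_i) + 1 .. ALAP(op_i).
    # Clamping to ASAP(op_i) is an assumption: earlier variables do not exist.
    first_i = max(s - delay[op_i] + 1, asap[op_i])
    for p in range(first_i, alap[op_i] + 1):
        for n in range(1, n_units[op_i] + 1):
            terms.append((op_i, n, p))
    # op_k contributes steps p = ASAP(op_k) .. s.
    for p in range(asap[op_k], s + 1):
        for n in range(1, n_units[op_k] + 1):
            terms.append((op_k, n, p))
    return terms

# Hypothetical data: op "k" is a 2-level successor of op "i".
asap = {"i": 1, "k": 2}
alap = {"i": 3, "k": 4}
delay = {"i": 1, "k": 1}
n_units = {"i": 1, "k": 1}

terms = class2_clique_terms(3, "i", "k", asap, alap, delay, n_units)
print(terms)
```

For s = 3 the constraint therefore reads X(i,1,3) + X(k,1,2) + X(k,1,3) <= 1 in this toy instance.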
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
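3- Clique Constraint Class for 2-level chaining: $\beta_s \alpha_1 \beta_{(s-1)}$
[Figure: grid of operations (op) against control steps (cstep, s) marking the variables in the clique]
- The constraint class 3, $\beta_s \alpha_1 \beta_{(s-1)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$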
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
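4- Clique Constraint Class for 2-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$
- The example illustrated in the figure for class 4 is for the case $i' = 1$.
[Figure: grid of operations (op) against control steps (cstep, s) marking the variables in the clique]
- The constraint class 4, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$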
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
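5- Clique Constraint Class for 2-level chaining: $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$
[Figure: grid of operations (op) against control steps (cstep, s) marking the variables in the clique]
- The constraint class 5, $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j) \in \Im_{1S}, \quad \forall (op_j, op_k) \in \Im_{2S}$$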
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Exploring the Hasse diagram for schedules with 3-level chaining.
[Figure: Hasse diagram explored from the start node along edges labeled α1, α2, α3 and β; the maximal paths end in class 1 through class 14]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
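1- Clique Constraint Class for 3-level chaining: $\hat{\beta}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 1, $\hat{\beta}$:
$$\sum_{s \in \mathrm{Range}(op)} \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \qquad \forall\, op \in \mathrm{DFG}$$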
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
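2- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_3 \beta_s$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 2 for 3-level chaining, $\beta_s \alpha_3 \beta_s$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{s} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \left( \mathrm{Range}(op_i) \cap \mathrm{Range}(op_l) \right), \qquad \forall (op_i, op_l) \in \Im_{3S}$$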
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
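3- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_2 \beta_{(s-1)}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 3, $\beta_s \alpha_2 \beta_{(s-1)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-1} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_k), \qquad \forall (op_i, op_k) \in \Im_{2S}$$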
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
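4- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_2 \tilde{\beta}^{\,i'}_{(s-2),k,l} \alpha_1 \beta_{(s-2-i')}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 4, $\beta_s \alpha_2 \tilde{\beta}^{\,i'}_{(s-2),k,l} \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{(s-1)} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{(s-2-i')} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \; 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$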
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
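5- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_2 \alpha_1 \beta_{(s-2)}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 5, $\beta_s \alpha_2 \alpha_1 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-2} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$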
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
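6- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \beta_{(s-1)}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 6, $\beta_s \alpha_1 \beta_{(s-1)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$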
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
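7- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_2 \beta_{(s-2-i')}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 7, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_2 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{(s-2-i')} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_j, op_l) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \; 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$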
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
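8- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 8, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-2-i'} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2-i') \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}$$
$$\forall i' : \; 1 \le i' \le s - 2 - \mathrm{ASAP}(op_k)$$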
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
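9- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'-i'')}$
- The example illustrated in the figure for class 9 is for the case of both $i', i'' = 1$.
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 9, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'-i'')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{p=s-2-i'-i''}^{s-2-i'} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3-i'-i''} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i'-i'') \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \; 1 \le i' \le s - 4 - \mathrm{ASAP}(op_l) \quad \text{and} \quad \forall i'' : \; 1 \le i'' \le \left( s - 3 - \mathrm{ASAP}(op_l) - i' \right)$$
$$\max(i' + i'') = s - 3 - \mathrm{ASAP}(op_l)$$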
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
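10- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \alpha_1 \beta_{(s-2-i')}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 10, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2-i')} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3-i'} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i') \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \; 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$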
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
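11- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 11, $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}$$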
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
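12- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'')}$
- The example illustrated in the figure for class 12 is for the case $i' = 1$.
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 12, $\beta_s \alpha_1 \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-2-i')}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3-i'} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i') \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \; 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$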
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
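13- Clique Constraint Class for 3-level chaining formulation: $\beta_s \alpha_1 \alpha_1 \alpha_1 \beta_{(s-3)}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 13, $\beta_s \alpha_1 \alpha_1 \alpha_1 \beta_{(s-3)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$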
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
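14- Clique Constraint Class for 3-level chaining formulation: $\beta_s \alpha_1 \alpha_2 \beta_{(s-2)}$
[Figure: grid of operations (op) against control steps (cstep, s) with chaining edges α1, α2, α3]
- The constraint class 14, $\beta_s \alpha_1 \alpha_2 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-2} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_j, op_l) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$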
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Maximal Clique Constraints are stronger than the Extended Wheel Constraints
[Figure: two grids of operations (op) against control steps (cstep, s), one for each constraint; the second shows chaining edges α1, α2, α3]
- The Extended Wheel Constraint:
$$X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2$$
- The combined maximal cliques constraint:
$$X_{1,3} + X_{1,4} + X_{1,5} + X_{3,1} + X_{3,2} + X_{3,3} + X_{3,4} + X_{3,5} + X_{3,6} + X_{4,2} + X_{4,3} \le 2$$
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the AR filter
- To reach a first integer solution, the maximal clique formulation takes more time.

| | Logical formulation | Maximal cliques formulation |
|---|---|---|
| Number of iterations (primal) | 1,200 | 1,480 |
| Number of iterations (integer) | 1,420 | 1,706 |
| Number of nodes of Branch and Bound | 54 | 103 |
| CPU time in sec (primal) | 12 (Ultra Sparc 2) | 24 (Ultra Sparc 2) |
| CPU time in sec (integer) | 19 (Ultra Sparc 2) | 38 (Ultra Sparc 2) |
| Total CPU time in sec | 31 | 62 |
| Optimality condition | first integer | first integer |
| Number of discrete variables in the formulation | 536 | 536 |
| Number of single inequalities | 7,363 | 9,256 (25.7% increase) |
| Termination condition | first integer solution | first integer solution |
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the AR filter
- The maximal clique formulation achieves an optimal solution within tolerance long before the logical formulation.

| | Logical formulation | Maximal cliques formulation |
|---|---|---|
| Number of iterations (primal) | 1,200 | 1,480 |
| Number of iterations (integer) | 8.45E5 | 14,577 |
| Number of nodes of Branch and Bound | 42,025 | 596 |
| CPU time in sec (primal) | 12 (Ultra Sparc 2) | 24 (Ultra Sparc 2) |
| CPU time in sec (integer) | 14,491 (Ultra Sparc 2) | 221 (Ultra Sparc 2) |
| Total CPU time in sec | 14,503 | 245 |
| Optimality condition | 0.07 (not achieved) | 0.07 (achieved) |
| Number of discrete variables in the formulation | 536 | 536 |
| Number of single inequalities | 7,363 | 9,256 |
| Termination condition | after 5 integer solutions | achieved optimal result within tolerance |
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the EWF benchmark
- The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.

| | Logical formulation | Maximal cliques formulation |
|---|---|---|
| Number of iterations (primal) | 3,192 | 3,659 |
| Number of iterations (integer) | 56,697 | 4,668 |
| Number of nodes in Branch and Bound | 1,827 | 190 |
| CPU time in sec (primal) | 86 (Ultra Sparc 2) | 150 (Ultra Sparc 2) |
| CPU time in sec (integer) | 5.4E3 (Ultra Sparc 2) | 518 (Ultra Sparc 2) |
| Total CPU time in sec | 5.48E3 | 668 |
| Optimality condition | 0.1 (not achieved) | 0.1 (achieved) |
| Number of discrete variables in the formulation | 940 | 940 |
| Number of single inequalities | 11,154 | 16,195 (45.2% increase) |
| Termination condition | after 5 integer solutions | achieved optimal result within tolerance |
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the DCT benchmark
- The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.

| | Logical formulation | Maximal cliques formulation |
|---|---|---|
| Number of iterations (primal) | 3,288 (Ultra Sparc 2) | 4,623 (Ultra Sparc 2) |
| Number of iterations (integer) | 23 (Ultra Sparc 2) | (Ultra Sparc 2) |
| Number of nodes in Branch and Bound | 1E4 | 168 |
| CPU time in sec (primal) | 83 | 312 |
| CPU time in sec (integer) | 2E5 | 2,575 |
| Total CPU time in sec | 2E5 | 2,887 |
| Optimality condition | 0.15 (not achieved) | 0.15 (achieved) |
| Number of discrete variables in the formulation | 1,066 | 1,066 |
| Number of single inequalities | 13,623 | 18,979 (39.3% increase) |
| Termination condition | after 5 integer solutions | achieved optimal result within tolerance |
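The total CPU times reported for the three benchmarks (AR filter, EWF, DCT) can be turned into speed-up ratios with a few lines. This is only a convenience sketch over the figures quoted in the comparison tables; because the logical formulation was stopped after 5 integer solutions without reaching the tolerance, each ratio is best read as a lower bound on the real gap.

```python
# (logical formulation, maximal clique formulation) total CPU time in seconds,
# copied from the "Total CPU time in sec" rows of the comparison tables.
total_cpu_sec = {
    "AR filter": (14503, 245),
    "EWF": (5.48e3, 668),
    "DCT": (2e5, 2887),
}

speedup = {name: logical / clique
           for name, (logical, clique) in total_cpu_sec.items()}

for name, ratio in speedup.items():
    print(f"{name}: at least {ratio:.1f}x faster with maximal cliques")
```

The ratios come out at roughly 59x for the AR filter, 8x for the EWF and 69x for the DCT.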
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chapter 5
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Convex Bipartite Graph Matching
- This bipartite graph corresponds to the multiply operations of the EWF benchmark. The function unit resources are 1 multiplier.
[Figure: convex bipartite graph with columns labeled FU_IOEI and InitialOperation. Interval labels, first column: [3,7], [3,7], [8,11], [8,11], [12,15], [13,15], [13,15], [14,15]; second column: [3,6], [4,7], [8,10], [9,11], [12,12], [13,13], [14,14], [15,15]. Operation nodes: 17, 18, 8, 29, 33, 24, 4, 12.]
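A matching on a convex bipartite graph can be found greedily. The sketch below is not taken from the slides: it uses the classic earliest-deadline rule (often attributed to Glover), and it assumes that each interval label in the figure is an inclusive [earliest, latest] window of control steps for one multiply operation and that the single multiplier can start one operation per step. The function name is hypothetical.

```python
import heapq

def match_intervals(intervals, capacity=1):
    """Greedy matching of operations to control steps.

    intervals: list of (earliest, latest) inclusive windows, one per operation.
    capacity: number of operations that may start in the same step.
    Returns {operation index: step} for every operation that could be placed.
    """
    assignment = {}
    if not intervals:
        return assignment
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    heap = []  # (latest step, op index) for operations whose window is open
    nxt = 0
    step = intervals[order[0]][0]
    last = max(hi for _, hi in intervals)
    while step <= last:
        # Open every window that starts at or before this step.
        while nxt < len(order) and intervals[order[nxt]][0] <= step:
            i = order[nxt]
            heapq.heappush(heap, (intervals[i][1], i))
            nxt += 1
        used = 0
        # Serve the open operation with the earliest deadline first.
        while heap and used < capacity:
            hi, i = heapq.heappop(heap)
            if hi < step:
                continue  # window already closed, operation stays unmatched
            assignment[i] = step
            used += 1
        step += 1
    return assignment

# First column of interval labels from the figure (interpretation assumed).
windows = [(3, 7), (3, 7), (8, 11), (8, 11), (12, 15), (13, 15), (13, 15), (14, 15)]
matching = match_intervals(windows)
print(matching)
```

With these windows all eight multiplications receive a distinct step, so one multiplier suffices under the stated assumptions.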
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Strong components corresponding to the bipartite graph matching
- Dotted edges can be pruned at this step.
[Figure: strong components of the matching graph over nodes 1-12]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chapter 6
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The fifth-order Elliptic Wave Filter benchmark
- Consists of 34 operations (8 multiplications and 26 additions).
[Figure: signal-flow graph of the filter from input to output, built from adders (+), multipliers (*) and delay elements (Z)]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The DFG of the EWF benchmark
[Figure: DFG of the EWF benchmark drawn against control steps 1-17, with inputs IN and a-i, and output OUT]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Effect of Chaining and Pipelining FUs on Datapath Performance.
[Figure: Exploration of the Design Space for the EWF benchmark. Axes: Cost (Number of CLBs) and Totexec, Λ (ns). Labeled design points: 1 (1+, 1*), 2-a (Bus, ours), 2-b (Best, others), 3 (Non-Pipe) and 4 through 11; group labels: 2-pipe and 4-pipe.]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Effect of Chaining and Pipelining FUs on EWF Datapath Performance

| Design Space | CSteps, T | Resources | Pipeline level | Chaining | Cost | T (ns) |
|---|---|---|---|---|---|---|
| 1 | 27 | 1+, 1* | 2-stages | yes | 140 | 2158 |
| 2-a (ours) | 17 | 2+, 1*, 1b | 2-stages | no | 160 | 1275 |
| 2-b [13] | 17 | 2+, 1* | 2-stages | no | 180 | 1275 |
| 3 | 10 | 3+, 1* | No-pipe | yes | 195 | 1650 |
| 4 | 12 | 3+, 1* | 2-stages | yes | 185 | 996 |
| 5 | 11 | 3+, 1* | 2-stages | yes | 190 | 935 |
| 6 | 11 | 3+, 1* | 2-stages | yes | 195 | 913 |
| 7 | 17 | 3+, 1* | 4-stages | yes | 225 | 731 |
| 8 | 19 | 3+, 1* | 4-stages | yes | 205 | 836 |
| 9 | 17 | 3+, 1* | 4-stages | yes | 210 | 765 |
| 10 | 18 | 3+, 1* | 4-stages | yes | 215 | 774 |
| 11 | 17 | 3+, 1* | 4-stages | yes | 220 | 765 |
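The table illustrates why the total execution time, and not the number of control steps, is used as the yardstick. As a rough sketch, dividing each reported T (ns) by the number of control steps gives the clock period each design point implies; this derived quantity is an assumption of the sketch and is not reported on the slide.

```python
# Design point -> (control steps, total execution time in ns), from the table.
designs = {
    "1": (27, 2158), "2-a": (17, 1275), "3": (10, 1650), "4": (12, 996),
    "5": (11, 935), "6": (11, 913), "7": (17, 731), "8": (19, 836),
    "9": (17, 765), "10": (18, 774), "11": (17, 765),
}

# Implied clock period (assumption: T = csteps * clock period).
clock_ns = {name: t_ns / csteps for name, (csteps, t_ns) in designs.items()}

fewest_steps = min(designs, key=lambda name: designs[name][0])
fastest = min(designs, key=lambda name: designs[name][1])

print(f"fewest control steps: design {fewest_steps}, T = {designs[fewest_steps][1]} ns")
print(f"shortest execution:   design {fastest}, T = {designs[fastest][1]} ns")
print(f"implied clock of design {fastest}: {clock_ns[fastest]:.0f} ns")
```

Design 3 needs the fewest control steps yet is among the slowest, while design 7 takes 17 steps at an implied 43 ns clock and finishes first.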
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Final FPGA Implementation on Xilinx 4000 series. †
† Using XACT 5.0 tools, the best area architecture would fit into an x4006 chip and require about 200 CLBs.

| | Our Best Area | Our Best Performance | Best in Literature (Simulated Evolution) |
|---|---|---|---|
| Controller | 33 | 27 | 30 |
| Register_File | 10 | Not used | Not used |
| ROM | 4 | 4 | 4 |
| Multiplier | 110 | 110 | 110 |
| Adder | 10 | 10 | 10 |
| 4/3/2 to 1 mux | 16/8 | 16/16/8 | 16/16/8 |
| Register / Tristate | 8/1 | 8/1 | 8/1 |
| 7/6/5 to 1 mux | Not used | 36/26/25 | 36/26/25 |
| Total # CLBs | 323 | 391 | 361 |
| Total execution time (nsec) | 1275 | 731 | 1275 |
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Scheduling and binding for the AR-filter, illustrating register binding.
[Figure: scheduled DFG of the AR filter over control steps 1-9, operations 1-28, with edges annotated by the registers R1, R2 and R3 they are bound to]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized architecture for the AR filter
- Resources: 2 multipliers (2-stage pipelined) and 2 adders; uses 3 registers and 12 multiplexer inputs.
[Figure: datapath with registers R1, R2, R3, adders A1, A2 and multipliers M1, M2]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The scheduling and binding for the AR filter, using 4-stage pipelined multipliers
[Figure: scheduled DFG of the AR filter over control steps 1-13, operations 1-28]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The DFG of the Fast Discrete Transform Benchmark
[Figure: DFG of the benchmark drawn against numbered control steps]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Fast Discrete Cosine Transform.

| | Ours | SODAS-DSP | MARS |
|---|---|---|---|
| Resources | 2*, 2+, 2- | 2*, 2+, 2- | 2*, 2+, 2- |
| # mux inputs | 37 | 66 | NA |
| # registers | 13 | 47 | NA |
| Clock (ns) | 60 | 100 | NA |
| # csteps | 10 | 12, dii = 8 (a) | 8 (b) |
| Totexec (ns) | 600 | 1200 | NA |
| Throughput (c) (MHz) | 1.67 | 1.25 | NA |

a. dii is the data initiation rate for the pipelined architecture used in SODAS-DSP.
b. MARS reports 8 cycles. No other details of the scheduling are available.
c. Throughput indicates the highest input-sampling rate of the architecture.
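The Totexec and Throughput rows follow from the control-step count, the clock period and, for a pipelined design, the data initiation rate. The sketch below reproduces the reported figures under the assumption that a non-pipelined design accepts a new sample only after all its control steps, while SODAS-DSP accepts one every dii = 8 steps; the function names are illustrative.

```python
def total_exec_ns(csteps, clock_ns):
    """Total execution time of one pass through the schedule."""
    return csteps * clock_ns

def throughput_mhz(initiation_steps, clock_ns):
    """Highest input-sampling rate: one sample every initiation_steps cycles."""
    return 1e3 / (initiation_steps * clock_ns)

# "Ours": 10 control steps at 60 ns, new sample every 10 steps (assumption).
ours = (total_exec_ns(10, 60), throughput_mhz(10, 60))
# SODAS-DSP: 12 control steps at 100 ns, data initiation rate of 8 steps.
sodas = (total_exec_ns(12, 100), throughput_mhz(8, 100))

print(f"ours:  {ours[0]} ns, {ours[1]:.2f} MHz")
print(f"sodas: {sodas[0]} ns, {sodas[1]:.2f} MHz")
```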
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Architecture for the Fast Discrete Cosine Transform benchmark.
- Resources: 2 multipliers, 2 adders and 2 subtracters. Uses 13 registers and 37 mux inputs.
[Figure: datapath with adders A1, A2, multipliers M1, M2 and subtracters S1, S2]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The DFG of the Discrete Cosine Transform Benchmark
[Figure: DFG of the DCT benchmark drawn against control steps 1-8, with inputs d0-d7]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Architecture for the Discrete Cosine Transform.
- Resources: 2 multipliers (4-stage pipelined) and 4 adders. Uses 11 registers and 28 mux inputs.
[Figure: datapath with multipliers M1, M2 and adders A1, A2, A3, A4]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chaining paths for the Discrete Cosine Transform
[Figure: chaining paths among multipliers M1, M2 and adders A1, A2, A3, A4]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chaining interconnections modeled for false paths detection
[Figure: three columns of nodes V1, V2, V3, each containing the units M1, M2, A1, A2, A3, A4, with the chaining interconnections drawn between columns]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Bus architecture of the DCT benchmark
- Resources: 1 multiplier (4-stage pipelined) and 3 adders/subtracters. Uses 9 registers, 18 mux inputs and 1 bus.
[Figure: bus architecture with Bus1, adders A1, A2, A3, multiplier M, registers R1-R9, a ROM and a register file]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Random topology architecture for the DCT benchmark
- Resources: 1 multiplier (4 pipeline stages) and 3 adders/subtracters. Uses 10 registers and 24 mux inputs.
[Figure: datapath with adders A1, A2, A3, registers R1-R10 and a ROM]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Random topology architecture for the DCT benchmark
- Resources: 1 multiplier (4 pipeline stages) and 3 adders/subtracters. Uses 12 registers and 20 mux inputs.
[Figure: datapath with adders A1, A2, A3, multiplier M, registers R1-R12 and a ROM]
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Discrete Cosine Transform Benchmark

| | Ours | PSGA_Syn [69] | Tool [23] (Chaudhuri/Walker) | SALSA [34] (Chain) | SALSA [34] |
|---|---|---|---|---|---|
| Resources | 2*, 4+ | 3*, 3+ | 3*, 4+ | 2*, 4+ | 2*, 4+ |
| # mux inputs | 28 | NA | NA | NA | 30 |
| # registers | 11 | 14 | NA | 15 | 13 |
| Clock (ns) | 45 | 120 (a) | 65 (b) | 135 (c) | 65 (d) |
| # csteps | 11 | 18 | 9 | 8 | 11 |
| Totexec (ns) | 495 | 2160 | 585 | 1080 | 715 |

a. This tool uses neither chaining nor pipelining for the DCT.
b. The tool described in [23] does not use chaining.
c. The level of chaining is not reported in [34].
d. SALSA [34] does not determine the clock duration of the total execution. However, we have used the same library for comparison.
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Discrete Cosine Transform Benchmark

| | Ours | PSGA_Syn, tool in [69] | SALSA (Chain) [34] | OSTA no-Chain [70] |
|---|---|---|---|---|
| Resources | 1*, 3+ | 3*, 3+ | 2*, 4+ | 3*, 6+ |
| # mux i/p | 24 | NA | NA | 38 |
| # registers | 10 | 14 | 15 | 24 |
| Clock (ns) | 45 | 120 | 130 | 120 |
| # csteps, T | 19 | 18 | 8 | 9 |
| Totexec (ns) | 855 | 2160 | 1080 | 1080 |
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chapter 7
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
CONCLUSIONS
• Our architectural model is suitable for a broad base of implementation technologies, specifically FPGAs, including bus/SRAM-based ones.
• Introduced optimization criteria for ILP solvers for datapath synthesis:
  - Our model and criteria can be used with other solvers (e.g. stochastic).
• The approach:
  - Scheduling with chaining and deep pipelining of FUs while minimizing "Structural Complexity".
  - Optimization of the total execution time of the architecture, with clock cycle determination.
  - Followed by bus assignment if it is supported by the FPGA.
• This approach has demonstrated that a discriminating search of a larger architectural space can produce:
  - Regular architectures with minimum interconnections, low resources and fast throughput.
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Contribution of this research
- Several interconnect minimization measures were incorporated in the formulation, which significantly improve the quality of the resulting synthesized architectures.
- This was demonstrated for different benchmarks, where the number of registers and multiplexer inputs were consistently smaller in architectures synthesized with this methodology than in previously published results. This is an important issue in developing a tool geared toward technologies with scarce interconnect resources, such as FPGAs.
- For the first time, an Integer Linear Programming (ILP) formulation that includes a non-tabular, non-restricted model of the system clock duration was developed. This has proved to be a significant step in modeling the total execution time of the architecture and, as a result, in successfully minimizing it.
- The architectural synthesis scheduling and binding was formulated as a performance optimization problem rather than as the mere minimization of the number of control steps. A theoretical linearization technique for the objective function of this formulation was presented. It was demonstrated that this linearization technique has negligible impact on the size of the problem.
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Contribution of this research
- Verification of the validity of the overall methodology by integrating this tool with logic synthesis and back-end tools.
- The development of the set of valid inequalities for the scheduling and binding problem: the identification and derivation of both the extended wheel graph inequalities and the maximal clique inequalities. For the first time, this guarantees the tightest formulation for schedules with n levels of chaining and multicycled/pipelined resources.
- An algorithmic approach for the generation of the minimum set of inequality classes necessary for the general scheduling and binding problem was developed. This algorithm explores a Hasse graph representing the scheduling problem and classifies all the maximal paths into maximal path classes. These classes can be incorporated into the automatic generation of the maximal clique constraints, which represent the tightest description of the scheduling and binding problem with n-level chaining.
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Flow of the Architectural synthesis methodology.
CDFG
-Data Storage Assignment
STEP-LAST: Register Allocation
STEP-4: ILP: Bus Insertion
-Bus transfer scheduling
-Bus allocation
-Storage Minimization
-Bus loading Minim.
-Interconnect minimization.
-Bus loading minimization.
- Scheduling and Binding
- Chaining of Operations
STEP-3: ILP: Random Topology
-Interconnect minimization.
- Clock cycle minimization + minimization of the number of cycles.
OR
- Minimization of the total execution time (i.e. throughput maximization).
- FU pipelining choice
- VHDL generation of the
Datapath and the Controller
- Heuristics to determine the lower bound on the
number of cycles.
- Heuristics to tighten the ASAP/ALAP values
under the given resource constraints.
DFG
-DFG exploration.
-Dynamic Set generation for chaining
-ILP constraint generation
STEP-2: C++: Constraint Generation for ILP
STEP-1: Scheduling Bounds
Tech
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Flow of the Back-End Tools
- Stage-2 uses Synopsys tools (logic synthesis and FPGA mapping), and Stage-3 uses Xilinx XACT tools for partition, placement and routing (PPR).
VHDL SOURCE FILES
- Xilinx Hard-macros
Simulate
Read HDL and insert pads
- Area Constraints
- Delay Constraints
- FU-Pipelining (i.e.
Register-balancing)
- Xilinx Library
To simulation
Partition, Placement
and Routing
Xilinx
SYNOPSYS
compile and optimize the
datapath and controller
Stage-3
Stage-2
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
ASAP Scheduling
Input: Data Flow Graph G
Output: node_array<int> schedule_I, representing the as-soon-as-possible
scheduling of the nodes of the DFG for a maximum chaining level
"Max_Chain_Length".
ASAP {
1- G.for_all_nodes(v) {
     if (input_degree(v) = 0)
       { schedule_I(v) = 1; }
     else
       { schedule_I(v) = 0; insert v into the node set S; }
   }
2- while (node set S ≠ Φ)
   {
     G.for_all_nodes(v) {
       if ((v ∈ S) and all_pred_scheduled(G, v, schedule_I))
       {
         G.all_input_edges(e, v) {
           w = G.source(e);
           // both operations are single-cycle: chain if the level allows it
           if ((G.type(w) ≠ "multicycle") and (G.type(v) ≠ "multicycle"))
           {
             if (Ch_Level_ASAP(w) ≤ Max_Chain_Length)
               { temp_schedule = schedule_I(w); }
             else
               { temp_schedule = schedule_I(w) + delay(v); }
           }
           // predecessor is multicycle
           if (G.type(w) = "multicycle")
           {
             if (G.type(v) ≠ "multicycle")
               { temp_schedule = schedule_I(w) + delay(w) - 1; }
             else
               { temp_schedule = schedule_I(w) + delay(w); }
           }
           // keep the latest start imposed by any predecessor
           if (temp_schedule > schedule_I(v))
             { schedule_I(v) = temp_schedule; }
         }
3-       Adj_Ch_Level_ASAP(G, v, schedule_I, Ch_Level_ASAP);
4-       delete node v from the node set S;
       }
     }
   }
}
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Adjust Chaining level of a node
Input: Data Flow Graph G, node v, the node array schedule_I representing the current schedule, and the node array
Ch_Level_ASAP representing the current chaining level.
Output: Adjusted version of Ch_Level_ASAP for node v, according to the current schedule schedule_I
Adj_Ch_Level_ASAP {
  G.all_input_edges(e, v) {
    w = G.source(e);
    // v is chained behind a single-cycle predecessor in the same control step
    if ((G.type(w) ≠ "multicycle") and (schedule_I(v) = schedule_I(w))
        and (Ch_Level_ASAP(w) < Max_Chain_Length)
        and (Ch_Level_ASAP(v) < Ch_Level_ASAP(w) + 1))
      { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
    // v is chained onto the last cycle of a multicycle predecessor
    if ((G.type(w) = "multicycle") and (G.type(v) ≠ "multicycle")
        and (schedule_I(v) = schedule_I(w) + mul_delay - 1)
        and (Ch_Level_ASAP(v) ≤ 2))
      { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
  }
}
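The two procedures above can be condensed into a short runnable sketch. It is a simplified variant and not the implementation described here: it visits nodes in topological order instead of looping over a node set, treats every operation with delay greater than 1 as multicycle, never chains onto or out of a multicycle operation, and uses a strict test (level < max_chain) so that no chain exceeds max_chain operations. All names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def asap_with_chaining(preds, delay, max_chain):
    """ASAP schedule with operation chaining.

    preds: {node: [predecessor nodes]}; delay: {node: cycles}.
    Returns (schedule, level): first control step (from 1) and the position
    of each node inside its chain (1 = starts a chain).
    """
    schedule, level = {}, {}
    for v in TopologicalSorter(preds).static_order():
        step = 1
        for w in preds.get(v, ()):
            can_chain = (delay[w] == 1 and delay[v] == 1
                         and level[w] < max_chain)
            # Chained: share w's step. Otherwise: wait until w has finished.
            step = max(step, schedule[w] if can_chain else schedule[w] + delay[w])
        schedule[v] = step
        # v extends the chain of every single-cycle predecessor in its step.
        chained = [level[w] for w in preds.get(v, ())
                   if delay[w] == 1 and delay[v] == 1 and schedule[w] == step]
        level[v] = 1 + max(chained, default=0)
    return schedule, level

# Toy DFG: a -> b -> c -> d are additions, m is a 2-cycle multiply feeding e.
preds = {"b": ["a"], "c": ["b"], "d": ["c"], "e": ["m"]}
delay = {"a": 1, "b": 1, "c": 1, "d": 1, "m": 2, "e": 1}
schedule, level = asap_with_chaining(preds, delay, max_chain=2)
print(schedule)
print(level)
```

With a maximum chain length of 2, the four additions occupy two control steps instead of four, and the operation after the multiplier starts in step 3.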
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
procedure create_classes_with_β
create_classes_with_β (active_edge_αi, distance, j, classBase) {
  if (distance = 0) {
    class_j = classBase + {β};
  }
  else
    for x = 0 to distance {
      if (x = 0) {
        class_j = classBase + {β};
        class_new = class_j;
        create_classes_without_β (active_edge_αi, distance, j, class_new);
      }
      if (x > 0) {
        j = j + 1;
        if (x = n - i) {
          class_j = classBase + {α_x} + {β};
        }
        if (x < n - i) {
          class_j = classBase + {α_x};
          distance_new = distance - x;
          class_new = class_j;
          create_classes_with_β (active_edge_αi, distance_new, j, class_new);
        }
      }
    }
}
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
procedure create_classes_without_β:
create_classes_without_β (active_edge_αi, distance, j, classBase) {
  for t = distance down to 1 {
    if (t = n - i) {
      j = j + 1;
      class_j = classBase + {α_t} + {β};
    }
    if (t < n - i) {
      j = j + 1;
      class_j = classBase + {α_t};
      distance_new = distance - t;
      class_new = class_j;
      create_classes_with_β (active_edge_αi, distance_new, j, class_new);
    }
  }
}
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata

DSP Structured Datapath Synthesis

  • 1. Architectural Synthesis of DSP Structured Datapaths. Shereef B. M. Shehata
  • 2. OUTLINE
      - An overview of the architectural-level synthesis problem.
      - Subtasks of the high-level synthesis problem: scheduling, binding, architecture optimization.
      - NP-hard problems (heuristics versus mathematical programming techniques).
      - Novel mathematical programming formulation of the synthesis problem:
          - Linearization of the quadratic nonlinear problem.
          - Optimization of performance and structural complexity.
          - Techniques to improve the solution time for the ILP formulation.
          - Heuristics as bounds for mathematical programming.
      - Results for typical HLS benchmarks.
      - Conclusion.
  • 3. Motivation
      - To develop an architectural synthesis technique specific to the synthesis of architectures for DSP targeting FPGA implementations.
      - The technique is general enough to accommodate other technologies, such as new submicron technologies.
      - To provide an accurate evaluation method for our high-level synthesis methodologies: the total execution time, not the number of control steps, is the yardstick for performance comparison.
      - To exploit important features of FPGA technology:
          - Large number of registers.
          - FPGA utilization drops sharply when interconnections are complex.
          - High multiplexer cost.
          - Wide difference between the delays of multiplications and additions.
          - Efficient RAM storage.
          - Dedicated high-speed carry-propagation circuit.
  • 4. Chapter 1
  • 5. The Symmetrical Array FPGA Module (Xilinx)
      - CLB routing is associated with each row and column of the CLB array.
      - Global routing consists of dedicated networks primarily designed to distribute clocks throughout the device with minimum delay and skew. It can also be used to distribute high-fan-out signals throughout the device with minimum delay.
      - The number of global nets and buffers has increased in the more recent Xilinx 4000 generation to allow more flexibility in routing.
      - Figure labels: Programmable Connection Matrix, Programmable Switching Matrix, Programmable Logic Block.
  • 6. XC4000 family switch box architecture
      - The SRAM configuration cell implies reuse and prototyping: the hardware becomes reconfigurable and the designer can update the system on the fly.
      - The total size of the SRAM configuration cell and the transistor switch that the SRAM drives is larger than the programming devices used in antifuse technologies.
      - The horizontal and vertical single- and double-length lines intersect at a box called a programmable switch matrix. Each switch matrix consists of programmable pass transistors used to establish connections between the lines.
      - Figure labels: Interconnect Points, Switch Matrix, Data Lines; six pass transistors per switch matrix interconnect point.
  • 7. The Xilinx 4000 configurable logic block (dedicated carry logic is not shown)
      - The inputs C1-C4 can also be used to control the use of the F and G LUTs as 32 bits of SRAM.
      - Mux control maps the four control inputs (C1-C4) into: LUT input H1, direct in (DIN), enable clock (EC), and set/reset for the flip-flops.
      - The XC4000 CLB also has special fast dedicated carry logic hardwired between the CLBs.
      - Figure labels: inputs G1-G4, F1-F4, and C1-C4; H1; DIN; three LUTs; programmable multiplexer; two D flip-flops with S/R state control; outputs G, Q2, Q1, F; clock; carry in and carry out to/from adjacent CLBs.
  • 8. Carry propagation paths in the Xilinx 4000 series
      - The carry chain in the XC4000 can run either up or down. At the top or bottom of the columns, where there are no more CLBs, the carry is propagated to the right.
      - The fast carry logic can be accessed by using Relationally Placed Macros (RPMs) that already include special library symbols for the fast carry logic.
      - The carry logic shares operands and control with the function generators.
      - Figure: array of 16 CLBs linked by the dedicated carry path.
  • 9. Interconnect overview for the XC4000 family
      - Figure labels: CLB, Single, Double, Quad, Long, Global Clock, Direct Connect, Carry Chain.
  • 10. [Slide without text]
  • 11. Details of XC4000 dedicated carry logic
      - The two 4-input function generators can be configured as a 2-bit adder with built-in hidden carry that can be expanded to any length.
      - This dedicated carry circuitry is so fast that conventional speed-up methods such as carry generate/propagate have marginal benefit at the 32-bit level and almost no effect at the 16-bit level.
      - Figure labels: G-Function Generator, F-Function Generator; signals Ai, Bi, Ci, Ai+1, Bi+1, Si, Si+1, Ci+2.
  • 12. Details of a Logic Array Block (LAB) in the FLEX 8000 family
      - Eight LEs are stacked to form a Logic Array Block (LAB).
      - Figure labels: Row Interconnect, Column Interconnect, LAB Local Interconnect, LAB Control Signals, carry-in from the LAB on the left, carry-out to the LAB on the right, eight LEs.
  • 13. FLEX 8000 Logic Element (LE)
      - The FLEX LE uses a four-input LUT, a flip-flop, cascade logic, and carry logic.
      - Figure labels: DATA1-DATA4, LABCTRL1-LABCTRL4, Look-Up Table (LUT), Carry Chain (Carry-In, Carry-Out), Cascade Chain (Cascade-In, Cascade-Out), Clear/Preset Logic, Clock Select, flip-flop (D, Q, PRN, CLRN), LE Out.
  • 14. FLEX 8000 device block diagram
      - Figure labels: I/O Element (IOE), FastTrack Interconnect, Logic Element, Logic Array Block (LAB).
  • 15. General Architecture Model
      - Figure labels: function units FUi and FUj; FU Module; FU Mux (optional); Register Mux (optional); Register; Chaining; Interconnect; FU output tristate bus driver; one of the pipelined busses; Register File (RAM) Modules; Sub-Module (optional); Control Unit; interconnect control signals; function unit and register control signals.
  • 16. Figure: synthesis flow from the DFG/CDFG to the logic synthesis tools
      - STEP-1: Scheduling Bounds Tech
      - STEP-2: C++: Constraint Generation for ILP
      - STEP-3: ILP: Random Topology
      - STEP-4: ILP: Bus Insertion
      - STEP-LAST: Register Allocation
      - Activities labeled in the figure: heuristics to determine the lower bound on the number of cycles; heuristics to tighten the ASAP/ALAP values under the given resource constraints; DFG exploration; dynamic set generation for chaining; ILP constraint generation; scheduling and binding; chaining of operations; interconnect minimization; clock cycle minimization plus FU pipelining choice; minimization of the number of cycles, or minimization of the total execution time (i.e. throughput maximization); bus transfer scheduling; bus allocation; storage minimization; bus loading minimization; data storage assignment; VHDL generation of the datapath and the controller.
  • 17. Flow of the Back-End Tools
      - Stage 2 uses Synopsys tools (logic synthesis and FPGA mapping), and stage 3 uses the Xilinx XACT tools for PPR (partition, placement, and routing).
      - Figure labels: VHDL source files; Xilinx hard macros; simulate; read HDL and insert pads; area constraints; delay constraints; FU pipelining (i.e. register balancing); Xilinx library; Synopsys: compile and optimize the datapath and controller (Stage-2); Xilinx: partition, placement, and routing (Stage-3); to simulation.
  • 18. Chapter 2
  • 19. Basic Definitions
      - A polyhedron $P \subseteq \mathbb{R}^n$ is the set of points that satisfy a finite number of linear inequalities, that is: $P = \{ x \in \mathbb{R}^n : A x \le b \}$.
      - A polytope is a bounded polyhedron, that is: $\exists\, w \in \mathbb{R}^1$ such that $P \subseteq \{ x \in \mathbb{R}^n : -w \le x_j \le w,\ \forall j = 1, \dots, n \}$.
      - A polyhedron face: the set $F = \{ x \in P : \pi x = \pi_0 \}$ is called a face of $P$, and the valid inequality $\pi x \le \pi_0$ is said to define the face $F$.
  • 20. The Convex Hull
      - Given a set $S \subseteq \mathbb{R}^n$, a point $x \in \mathbb{R}^n$ is a convex combination of points of $S$ if it can be written as a weighted sum of finitely many points of $S$ with non-negative weights that sum to one. The convex hull of $S$, denoted by $\mathrm{Conv}(S)$, is the set of all points that can be written as a convex combination of points in $S$:
        $\mathrm{Conv}(S) = \{ x \in \mathbb{R}^n_+ : x = \sum_{i=1}^{t} \lambda_i x^i,\ \sum_{i=1}^{t} \lambda_i = 1,\ \lambda \in \mathbb{R}^t_+ \}$,
        where $x^1, x^2, \dots, x^t$ is any finite set of points in $S$.
      - The convex hull $\mathrm{Conv}(S)$ can be described by a finite set of linear inequalities.
  • 21. Basic Definitions (cont'd)
      - A partially ordered set $(X, B)$, or poset, is a non-empty set $X$ and a binary relationship $B$ on $X$ which is reflexive, anti-symmetric and transitive. The elements of $X$ are called points, and the binary relationship $B$ is called a partial ordering on $X$.
      - A strict partially ordered set $(X, \tilde{B})$, or Sposet, is a non-empty set $X$ and a binary relationship $\tilde{B}$ on $X$ which is irreflexive, anti-symmetric and transitive.
      - We use $x B y$ to denote that $(x, y) \in B$, and $x \tilde{B} y$ to denote that $(x, y) \in \tilde{B}$.
      - A Hasse diagram of a poset $(X, P)$ is a drawing in which the points of $X$ are placed so that if $y$ covers $x$, then $y$ is placed at a higher level than $x$ and joined to $x$ by a line segment. The corresponding graph is called a Hasse graph of the poset.
      - A clique in a graph $G = (V, E)$ is a set $C \subseteq V$ with the property that every pair of nodes in $C$ is joined by an edge.
      - A subset $A \subseteq V$ of the vertices of the graph $G = (V, E)$ is an r-clique if it induces a complete subgraph, i.e. $G_A \cong K_r$.
      - A stable set (or independent set) of vertices is a subset $X$ of the vertex set of a graph $G$, no two of which are adjacent.
  • 22. Basic Definitions (cont'd)
      - A comparability graph is an undirected graph that is transitively orientable. That is, each edge can be assigned a one-way direction such that the resulting directed graph $G = (V, E)$ satisfies the following condition: $(a, b) \in E$ and $(b, c) \in E$ imply $(a, c) \in E$, $\forall a, b, c \in V$.
      - A graph $G$ is a triangulated graph if every simple cycle of length strictly greater than 3 possesses a chord.
      - The stability number $\alpha(G)$ is the number of vertices in a stable set of maximum cardinality.
      - The chromatic number $\gamma(G)$ is the smallest possible $k$ for which there exists a proper k-coloring of $G$.
      - The clique number $\omega(G)$ is the number of vertices in a clique of maximum cardinality.
      - The clique cover number $\theta(G)$ is the fewest number of complete subgraphs needed to cover the vertices of $G$, i.e. the size of the smallest possible clique cover of the graph $G$.
  • 23. Basic Definitions (cont'd)
      - A vertex packing on a graph $G = (V, E)$ is a set of vertices $U \subseteq V$ with the property that no pair of vertices in $U$ is joined by an edge.
      - The fractional vertex packing polytope of a graph $G = (V, E)$ is
        $$P = \{\, x \in \mathbb{R}^n_+ : \kappa \cdot x \le 1 \,\}$$
        where $n = |V|$ and $\kappa$ is the maximal clique matrix of the graph $G$.
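      The membership test for the fractional vertex packing polytope can be sketched in a few lines. The sketch below is only an illustration under assumptions: the helper names (`maximal_cliques`, `in_fractional_packing_polytope`) and the small example graph are hypothetical, and the clique enumeration is a plain Bron-Kerbosch recursion, not necessarily what the author's tools use. Each maximal clique plays the role of one row of the maximal clique matrix.

```python
def maximal_cliques(vertices, edges):
    """Enumerate maximal cliques with a basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(vertices), set())
    return cliques


def in_fractional_packing_polytope(x, cliques, eps=1e-9):
    """x maps vertex -> value; checks x >= 0 and one row (clique) sum <= 1 per clique."""
    if any(value < -eps for value in x.values()):
        return False
    return all(sum(x[v] for v in clique) <= 1 + eps for clique in cliques)


# Hypothetical example: a triangle {1, 2, 3} with a pendant vertex 4 attached to 3.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
cliques = maximal_cliques(vertices, edges)

inside = {1: 0.5, 2: 0.5, 3: 0.0, 4: 1.0}
# Satisfies every single-edge inequality, but violates the triangle's clique row.
edge_feasible_only = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5}
```

      The second point shows why clique rows are stronger than edge rows: every edge sum is at most 1, yet the triangle sum is 1.5.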
  • 24. Chapter 3
  • 25. Simultaneous Performance Optimization and Interconnect Minimization
      - Exploration of a much larger solution space, guided by a highly selective objective function that rejects architectures whose heavy interconnection makes them unsuitable for FPGA implementation.
      - Developing an ILP formulation that incorporates:
        - Multilevel chaining of operations and deeply pipelined functional units, both of which are effective for FPGAs.
        - Optimal scheduling and binding of operations while minimizing interconnections.
        - Determination of the system clock duration.
        - Minimization of the total execution time rather than the number of control steps.
  • 26. Details of the Integer Linear Programming Formulation
      - Operation Assignment Constraints
        - This constraint assigns every operation of the DFG to exactly one control step and one FU:
          $$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} X_{op,n,s} = 1 \qquad \forall op$$
      (Figure: grid of variables $X_{i,n,s}$ for op i (FU instances 1 to 3, c-steps ASAP(op i) to ALAP(op i)) and $X_{j,n,s}$ for op j (FU instances 1 to 2, c-steps ASAP(op j) to ALAP(op j)), with op i preceding op j. The variables in the shaded region add up to 1.)
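      As a rough illustration of how the assignment variables could be enumerated, the sketch below lists, for each operation, the variables (op, n, s) that must sum to 1, and checks a candidate 0/1 assignment. The input format, the function names and the example windows are assumptions made for this sketch (they mirror the figure: op i with 3 FU instances over c-steps 1 to 4, op j with 2 FU instances over c-steps 2 to 5).

```python
def assignment_constraints(ranges, n_units):
    """Return {op: [(op, n, s), ...]}: the variables whose sum must equal 1.

    ranges:  op -> (ASAP, ALAP) control-step window (hypothetical input format)
    n_units: op -> number of FU instances N_t of the type that executes op
    """
    constraints = {}
    for op, (asap, alap) in ranges.items():
        constraints[op] = [
            (op, n, s)
            for s in range(asap, alap + 1)
            for n in range(1, n_units[op] + 1)
        ]
    return constraints


def is_assigned_once(constraints, x):
    """x maps (op, n, s) -> 0/1; missing keys count as 0."""
    return all(sum(x.get(v, 0) for v in vs) == 1 for vs in constraints.values())


ranges = {"i": (1, 4), "j": (2, 5)}
n_units = {"i": 3, "j": 2}
cons = assignment_constraints(ranges, n_units)

good = {("i", 2, 1): 1, ("j", 1, 3): 1}                   # each op placed once
bad = {("i", 2, 1): 1, ("i", 3, 2): 1, ("j", 1, 3): 1}    # op i placed twice
```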
  • 27. Details of the Integer Linear Programming Formulation
      - Function Unit Assignment Constraint
        - Each FU has at most one operation assigned at a given time:
          $$\sum_{op \in FU_t} \; \sum_{p = s - L(op) + 1}^{s} X_{op,n,p} \le 1 \qquad \forall n,\ \forall s$$
      (Figure: grid of variables $X_{i,n,s}$, $X_{j,n,s}$, $X_{k,n,s}$ for ops i, j, k on FU instances 1 to 2 over c-steps 1 to 5, with op i preceding op j. The summation of the highlighted variables is at most 1.)
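      The effect of the FU assignment constraint can be checked on a concrete schedule. The sketch below assumes that L(op) is the number of control steps during which an operation occupies its FU instance; the schedule format and the function name are hypothetical.

```python
def fu_occupancy_ok(schedule, latency):
    """Check that no FU instance holds two operations in the same control step.

    schedule: op -> (fu_type, n, start)  (hypothetical format)
    latency:  op -> L(op), steps during which the FU instance is occupied
    """
    busy = {}
    for op, (fu_type, n, start) in schedule.items():
        for step in range(start, start + latency[op]):
            key = (fu_type, n, step)
            if key in busy:
                return False
            busy[key] = op
    return True


latency = {"m1": 2, "m2": 2, "a1": 1}

# m2 starts while m1 still occupies multiplier instance 1.
overlapping = {"m1": ("mul", 1, 1), "m2": ("mul", 1, 2), "a1": ("add", 1, 1)}
# m2 starts after m1 has released the multiplier.
sequential = {"m1": ("mul", 1, 1), "m2": ("mul", 1, 3), "a1": ("add", 1, 1)}
```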
  • 28. Details of the Integer Linear Programming Formulation
      - Scheduling of partially ordered operations has to follow the precedence order (no chaining):
        $$\sum_{n=1}^{N_{t_i}} \; \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_j}} \; \sum_{p = \mathrm{ASAP}(op_j)}^{s} X_{op_j,n,p} \le 1$$
        $$\mathrm{ASAP}(op_j) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i \to op_j)$$
      (Figure: grid of variables $X_{i,n,s}$ (FU instances 1 to 3) and $X_{j,n,s}$ (FU instances 1 to 2) over c-steps 1 to 6, with op i preceding op j and the current c-step marked between ASAP(op j) and ALAP(op i). The variables in the shaded region add up to 1.)
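      A quick brute-force check can make the windowed precedence constraint more concrete. The sketch assumes the form in which, for every step s between ASAP(op j) and ALAP(op i) + D(op i) - 1, the variables of op i starting at s - D(op i) + 1 or later and the variables of op j starting at s or earlier may sum to at most 1. It then compares that with the direct rule "op j starts only after op i has finished". The windows and delay used are hypothetical.

```python
def violates_precedence(start_i, start_j, d_i, asap_j, alap_i):
    """True if some step s makes the windowed inequality exceed 1."""
    for s in range(asap_j, alap_i + d_i):  # s = ASAP(op_j) .. ALAP(op_i) + D(op_i) - 1
        i_term = int(s - d_i + 1 <= start_i <= alap_i)   # op_i still running at s
        j_term = int(asap_j <= start_j <= s)             # op_j already started by s
        if i_term + j_term > 1:
            return True
    return False


# Hypothetical windows: op_i may start in steps 1..4 and takes 2 steps,
# op_j may start in steps 3..6.
D_I, ASAP_I, ALAP_I = 2, 1, 4
ASAP_J, ALAP_J = 3, 6

mismatches = [
    (si, sj)
    for si in range(ASAP_I, ALAP_I + 1)
    for sj in range(ASAP_J, ALAP_J + 1)
    if violates_precedence(si, sj, D_I, ASAP_J, ALAP_I) != (sj < si + D_I)
]
```

      An empty `mismatches` list means that, on this example, the inequality family rejects exactly the start pairs in which op j would begin before op i completes.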
  • 29. To Determine the Total Length of the Schedule
      - The following constraint determines the total number of steps $T$ from the schedule of the operations in the set $W$, where $W$ is the set of operations without successors in the DFG:
        $$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} s \times X_{op,n,s} \; - \; T \le -D(op) + 1 \qquad \forall op \in W$$
      - The variable $T$ has both an upper and a lower bound (determined from heuristics):
        $$T \ge T_{cr}, \qquad T \le T_{cr} + \Delta T$$
  • 30. Constraints to Minimize the Structural Complexity of the Synthesized Architecture
      - Counting the number of motifs:
        $$\sum_{\substack{s \in \mathrm{Range}(op_i) \\ op_i \in FU_t}} X_{op_i,n,s} \; + \; \sum_{\substack{s \in \mathrm{Range}(op_j) \\ op_j \in FU_{t'}}} X_{op_j,n',s} \; - \; \mathrm{Motif}(FU_t, n, FU_{t'}, n') \le 1$$
        $$\forall (op_i \to op_j), \qquad \forall n\ (n = 1 \dots N_t), \qquad \forall n'\ (n' = 1 \dots N_{t'})$$
      - A corresponding term that minimizes MOTIFSUM is included in the objective function to increase the utilization of the interconnect already assigned between different function units.
      (Figure: grid of variables for op i (FU instances 1 to 3) and op j (FU instances 1 to 2) over c-steps 1 to 5, between ASAP and ALAP of each operation. The summation of the highlighted variables sets the value of Motif(A, 2, M, 1).)
  • 31. Constraints to Minimize the Structural Complexity of the Synthesized Architecture
      - Counting the number of chaining motifs:
        $$\sum_{op_i \in FU_t} X_{op_i,n,s} \; + \; \sum_{op_j \in FU_{t'}} X_{op_j,n',s} \; - \; \mathrm{CMotif}(FU_t, n, FU_{t'}, n') \le 1$$
        $$\forall s, \qquad \forall (op_i \to op_j), \qquad \forall n\ (n = 1 \dots N_t), \qquad \forall n'\ (n' = 1 \dots N_{t'})$$
      - A corresponding term that minimizes CMOTIFSUM is included in the objective function to increase the utilization of the chaining interconnect already assigned between different function units.
      (Figure: grid of variables for op i (FU instances 1 to 3) and op j (FU instances 1 to 2) over c-steps 1 to 5, with op i preceding op j. The summation of the highlighted variables sets the value of CMotif(A, 2, M, 1).)
  • 32. Counting Incompatible Motifs
      - The idea is to minimize the number of motifs that terminate on the same function unit. This decreases the number of multiplexers in the synthesized architecture.
        $$\sum_{FU_t} \; \sum_{n=1}^{N_t} \mathrm{Motif}(FU_t, n, FU_{t'}, n') \; - \; \mathrm{Incomp}(FU_{t'}) \le 0 \qquad \forall n',\ \forall FU_{t'}$$
      (Figure: schedules and motifs, and the resulting architecture.)
  • 33. Minimizing the Maximum Number of Edges with the Same FU Destination Type (Incompatible Motifs)
      - An integer variable $\mathrm{Incomp}(FU_{t'})$ is introduced to count the number of incompatible motifs:
        $$\sum_{FU_t} \; \sum_{n=1}^{N_t} \mathrm{Motif}(FU_t, n, FU_{t'}, n') \; - \; \mathrm{Incomp}(FU_{t'}) \le 0 \qquad \forall FU_{t'},\ \forall n'$$
      (Figure: panels (a), (b), (c).)
  • 35. Minimizing the Maximum Number of Edge Overlap
        $$K \times \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \notin wrap}} \Bigg( \sum_{n=1}^{N_{t_i}} \sum_{p = \mathrm{ASAP}(op_i)}^{s} X_{op_i,n,p} \; - \; \sum_{n=1}^{N_{t_j}} \sum_{p = \mathrm{ASAP}(op_j)}^{s} X_{op_j,n,p} \Bigg)$$
        $$+ \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \in wrap}} \Bigg( \sum_{n=1}^{N_{t_i}} \sum_{p = \mathrm{ASAP}(op_i)}^{s} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_j}} \sum_{p = s+1}^{\mathrm{ALAP}(op_j)} X_{op_j,n,p} \Bigg) \; - \; \mathrm{Maxovlap}(FU_t) \le 0 \qquad \forall s,\ \forall FU_t$$
      - $K$ is a weighting constant (its value is not legible on the slide).
  • 36. Formulation for Chaining of Two Operations per Control Step
      - The destination operation cannot be scheduled before the source operation.
        - However, they can share the same control step.
        $$\sum_{n=1}^{N_{t_i}} \; \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_j}} \; \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} X_{op_j,n,p} \le 1$$
        $$\mathrm{ASAP}(op_j) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i \to op_j)$$
      (Figure: grid of variables $X_{i,n,s}$ (FU instances 1 to 3) and $X_{j,n,s}$ (FU instances 1 to 2) over c-steps 1 to 6, with op i preceding op j and the current c-step marked between ASAP(op j) and ALAP(op i). The variables in the shaded regions add up to 1.)
  • 37. Formulation for Chaining of Two Operations per Control Step
      - The source operation cannot be scheduled after the destination operation.
        - However, they can share the same control step. This constraint and the previous one are not redundant: together they tighten the formulation.
        $$\sum_{n=1}^{N_{t_i}} \; \sum_{p = s - D(op_i) + 2}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_j}} \; \sum_{p = s} X_{op_j,n,p} \le 1$$
        $$\mathrm{ASAP}(op_j) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i \to op_j)$$
      (Figure: grid of variables $X_{i,n,s}$ (FU instances 1 to 3) and $X_{j,n,s}$ (FU instances 1 to 2) over c-steps 1 to 6, with op i preceding op j and the current c-step marked between ASAP(op j) and ALAP(op i). The variables in the shaded region add up to 1.)
  • 38. Formulation for Chaining of Two Operations per Control Step
      - The following constraint prevents chaining of more than two operations in the same control step:
        $$\sum_{n=1}^{N_{t_i}} \; \sum_{p = s - D(op_i) + 1} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_k}} \; \sum_{p = s} X_{op_k,n,p} \le 1$$
        $$\mathrm{ASAP}(op_k) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i, op_k) \in \Re_2$$
      (Figure: grid of variables for op i (FU instances 1 to 3) and op k (FU instances 1 to 2) over c-steps 1 to 6, with op i preceding op j and op j preceding op k, and the current c-step marked between ASAP(op k) and ALAP(op i). The variables in the shaded region add up to 1.)
  • 39. Multi-Level Chaining
      - Patterns to look for in the DFG.
        (Figure: DFG fragments in which an addition op i feeds a multiplication that in turn feeds an addition op k.)
      - Formulation
        - Generate the set $\Delta_M \subseteq O^2$ such that $(op_1, op_2) \in \Delta_M$ if $op_1 \to op_M$, $op_M \to op_2$, and $op_M$ is a multi-cycle operation (e.g. a multiply operation).
        - The following constraint then applies to the members of this set:
          $$\sum_{n=1}^{N_{t_i}} \; \sum_{p = s - D(op_i) + 1} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_k}} \; \sum_{p = s} X_{op_k,n,p} \le 1$$
          $$\mathrm{ASAP}(op_k) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1, \qquad \forall s,\ \forall (op_i, op_k) \in \Delta_M$$
  • 40. Delay Model for an N-bit Adder Implemented in Xilinx FPGAs
        $$T_A = T_{OPCY} + \frac{N - 4}{2} \times T_{carry} + T_{sum}$$
      - For the Xilinx 4000 series, $T_{carry}$ is 0.7/1 ns, and $T_{OPCY}$ is 4 ns.
      - The delay is linear with the number of bits. The proportionality factor is $T_{carry}$, and as such the dedicated carry chains make the fastest possible carry path circuits.
      (Figure: N-bit adder built from CLBs, inputs $A_0, B_0$ (LSB) to $A_{N-1}, B_{N-1}$ (MSB), outputs $S_0$ to $S_{N-1}$; the carry ripples through $(N-4)/2$ CLBs, each contributing $T_{carry}$, between the $T_{OPCY}$ and $T_{sum}$ stages.)
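      The adder delay model is easy to evaluate numerically. The sketch below is only an illustration: the function name is hypothetical, the timing values are passed in as parameters, and the numbers in the example (4 ns for T_OPCY, 1 ns for T_carry, 5 ns for T_sum as quoted on the following slide) are one reading of the slide values rather than data-sheet figures.

```python
def adder_delay_ns(n_bits, t_opcy, t_carry, t_sum):
    """T_A = T_OPCY + (N - 4) / 2 * T_carry + T_sum, all times in ns."""
    if n_bits < 4:
        raise ValueError("the model assumes an adder of at least 4 bits")
    return t_opcy + (n_bits - 4) / 2 * t_carry + t_sum


# Assumed example values (ns): T_OPCY = 4, T_carry = 1, T_sum = 5.
delay_16 = adder_delay_ns(16, t_opcy=4.0, t_carry=1.0, t_sum=5.0)
delay_18 = adder_delay_ns(18, t_opcy=4.0, t_carry=1.0, t_sum=5.0)

# Two extra bits add one more CLB to the carry path, i.e. one more T_carry.
extra_per_two_bits = delay_18 - delay_16
```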
  • 41. Delay Model for a Pipelined Multiplier Chained with an Adder
        $$T_{pd} = T_{pipe} + T_{sum}$$
      - For the Xilinx 4000 series, $T_{sum}$ is 5 ns.
      (Figure: last pipeline stage of a multiplier feeding an adder with outputs $S_0$ (LSB) to $S_{N-1}$ (MSB); the carry chains of both units are annotated with $T_{OPCY}$, $T_{carry}$ and $T_{sum}$.)
  • 43. Scheduling with Multi-Level Chaining and Interconnect Minimization
      (Figure: two schedules of an addition tree with inputs i1 to i13 and output "out", each with its synthesized datapath of adders and registers R1, R2.)
      - Result set 1: extra number of mux inputs: 2; number of CLBs: 128; execution time: 84 ns; number of registers: 2.
      - Result set 2: extra number of mux inputs: 8; number of CLBs: 180; execution time: 96 ns; number of registers: 2.
  • 44. Delaying Interconnect Optimization until after Scheduling
      - Comparison of our results for an addition tree with methods that restrict the solution space, or that do not minimize interconnect simultaneously with the scheduling and binding of operations.
      (Figure: addition tree with inputs i1 to i13 and output "out", and the synthesized datapath with Adders 1 to 4 and registers R1 to R4.)
      - Result: extra number of mux inputs: 7; number of CLBs: 168; execution time: 84 ns; number of registers: 4.
  • 45. Scheduling and Binding for the CDFG with Non-Pipelined Multipliers and No Chaining
      - The schedule needs 7 control steps, with a clock duration of 150 ns.
      (Figure: schedule of the CDFG over c-steps 1 to 7.)
      - Clock cycle: 150 ns. Execution time: 7 * 150 = 1050 ns. Resources: 2 adders, 1 multiplier. No chaining, non-pipelined multipliers.
  • 46. Effect of Increasing the Resources to Three Adders and One Non-Pipelined Multiplier on the Scheduling of the CDFG
      - Increasing the resources by one adder does not affect the execution time for the CDFG.
      (Figure: schedule of the CDFG over c-steps 1 to 7.)
      - Clock cycle: 150 ns. Execution time: 7 * 150 = 1050 ns. Resources: 3 adders, 1 multiplier. No chaining, non-pipelined multipliers.
  • 47. Scheduling and Binding for the CDFG with Pipelined Multipliers and No Chaining
      - The schedule needs 8 control steps, with a clock duration of 80 ns.
      (Figure: schedule of the CDFG over c-steps 1 to 8.)
      - Clock cycle: 80 ns. Execution time: 8 * 80 = 640 ns. Resources: 2 adders, 1 multiplier. No chaining, pipelined multipliers.
  • 48. Scheduling and Binding of the CDFG Using a Pipelined Multiplier and Chaining
      - The schedule needs 5 control steps, with a clock duration of 90 ns.
      (Figure: schedule of the CDFG over c-steps 1 to 5.)
      - Clock cycle: 90 ns. Execution time: 5 * 90 = 450 ns. Resources: 3 adders, 1 multiplier. Pipelined multipliers, 2-level chaining allowed.
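      The four schedules above illustrate why total execution time, and not the number of control steps, is used as the yardstick. A minimal sketch that tabulates the figures quoted on the slides (the tuple layout and names are just for this illustration):

```python
# (description, control steps, clock duration in ns), as quoted on the slides
configs = [
    ("non-pipelined, no chaining, 2 adders", 7, 150),
    ("non-pipelined, no chaining, 3 adders", 7, 150),
    ("pipelined, no chaining, 2 adders", 8, 80),
    ("pipelined, 2-level chaining, 3 adders", 5, 90),
]


def execution_time_ns(steps, clock_ns):
    """Total execution time = number of control steps x clock duration."""
    return steps * clock_ns


times = [execution_time_ns(steps, clock) for _, steps, clock in configs]
fastest = min(configs, key=lambda c: execution_time_ns(c[1], c[2]))

for (name, steps, clock), t in zip(configs, times):
    print(f"{name}: {steps} steps x {clock} ns = {t} ns")
```

      Note that the 8-step pipelined schedule is faster than either 7-step schedule, because its clock is much shorter.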
  • 49. Synthesized Architecture for the Scheduling and Binding Using Pipelining and Chaining
      (Figure: synthesized datapath with registers R1 to R4 and a multiplier.)
  • 50. Minimization of the Total Execution Time (Performance Optimization)
      - The following constraint sets the clock duration during the solution:
        $$\delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega \qquad \forall \psi_{ijk} \in \Psi$$
      - The constraint that sets the chaining variable $\psi_{MAA}$ is given below:
        $$\sum_{n=1}^{N_{t_i}} \; \sum_{p = s - D(op_i) + 1} X_{op_i,n,p} \; + \; \sum_{n=1}^{N_{t_k}} \; \sum_{p = s} X_{op_k,n,p} \; - \; \psi_{MAA} \le 1$$
        $$\mathrm{ALAP}(op_i) + D(op_i) - 1 \ge s \ge \mathrm{ASAP}(op_k), \qquad \forall s,\ \forall (op_i, op_k) \in \Im_{2S}, \qquad (op_i \in N_M) \text{ and } (op_k \in N_A)$$
      - The upper and lower limits that exist for the clock duration:
        $$\Omega_{min} \le \Omega \le \Omega_{max}$$
  • 51. Minimization of the Total Execution Time (cont'd)
      - The values of the upper and lower bounds are determined as follows:
        $$\Omega_{max} = \mathrm{MAX}\{\delta(\Psi)\}, \qquad \Omega_{min} = \mathrm{MIN}\{\delta(\Psi)\}$$
      - If the clock duration is allowed only discrete values:
        $$\delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega_{relaxed} \qquad \forall \psi_{ijk} \in \Psi$$
        $$\Omega = \left\lceil \frac{\Omega_{relaxed}}{\Omega_{min}} \right\rceil \cdot \Omega_{min}$$
      - $\Omega_{relaxed}$ is a relaxed version of the discrete-valued $\Omega$ that can assume any positive number.
  • 53. Minimization of the DFG Total Execution Time
      - The number of control steps (an integer) can be represented in terms of binary variables:
        $$T = \sum_{i=0}^{n-1} 2^i \cdot \beta_i$$
      - The part of the objective function that minimizes the total execution time is nonlinear:
        $$I_N = \sum_{i=0}^{n-1} (2^i \cdot CLOCK) \cdot \beta_i$$
      - The objective function can be conceptually presented as:
        $$I = I_N + I_{L1}$$
  • 54. Minimization of the DFG Total Execution Time
      - Linearization of the nonlinear part of the objective function:
        $$I_{L2} = \sum_{i=0}^{n-1} \big( 2^i \cdot CLKMIN \cdot \beta_i + \Theta_i \big)$$
      - Linearization of the nonlinear part of the objective function (cont'd):
        $$\Theta_i \ge 2^i \cdot CLOCK - 2^i \cdot CLKMIN \cdot \beta_i - 2^i \cdot CLKMAX \cdot (1 - \beta_i), \qquad i = 0, \dots, n-1$$
        $$\Theta_i \ge 0, \qquad i = 0, \dots, n-1$$
  • 55. Minimization of the DFG Total Execution Time
      - Linearization does not increase the complexity of the formulation:
        $$\Theta_i \ge \begin{cases} 2^i \cdot (CLOCK - CLKMAX), & \text{if } \beta_i \text{ is } 0 \quad (\Theta_i \ge 0) \\ 2^i \cdot (CLOCK - CLKMIN), & \text{if } \beta_i \text{ is } 1 \quad (\Theta_i \ge 0) \end{cases}$$
        $$I_{L2} = \sum_{i,\ \beta_i = 1} 2^i \cdot CLOCK$$
        $$n = (\log T) / (\log 2)$$
      - where $n$ is the number of discrete variables added to the formulation.
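      The exactness of this linearization can be checked by brute force. The sketch below assumes that, because the objective is minimized, each Θ_i settles at the smallest value its two linear constraints allow; it then compares the linearized sum with the nonlinear product for every bit pattern. The clock bounds (80 and 150 ns) are borrowed from the example schedules purely for illustration, and all function names are hypothetical.

```python
from itertools import product


def theta_min(i, clock, beta, clk_min, clk_max):
    """Smallest Theta_i allowed by the two linear constraints."""
    bound = (2**i) * clock - (2**i) * clk_min * beta - (2**i) * clk_max * (1 - beta)
    return max(0, bound)


def linearized_time(betas, clock, clk_min, clk_max):
    """I_L2 = sum_i (2^i * CLKMIN * beta_i + Theta_i) with Theta_i at its minimum."""
    return sum(
        (2**i) * clk_min * b + theta_min(i, clock, b, clk_min, clk_max)
        for i, b in enumerate(betas)
    )


def nonlinear_time(betas, clock):
    """I_N = sum_i (2^i * CLOCK) * beta_i, i.e. T * CLOCK."""
    return sum((2**i) * clock * b for i, b in enumerate(betas))


CLK_MIN, CLK_MAX = 80, 150  # assumed bounds in ns, for illustration only
all_match = all(
    linearized_time(betas, clock, CLK_MIN, CLK_MAX) == nonlinear_time(betas, clock)
    for clock in range(CLK_MIN, CLK_MAX + 1, 10)
    for betas in product((0, 1), repeat=3)
)
```

      For β_i = 1 the bound evaluates to 2^i (CLOCK - CLKMIN), which tops the CLKMIN term up to 2^i CLOCK; for β_i = 0 the bound is non-positive, so Θ_i stays at 0.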
  • 56. Tree Height Reduction
      - The performance of the architecture is bounded by the length of the critical path.
      Before THR | After THR | Delay estimation
      (A + B) + C + D | (A + B) + (C + D) | $\delta(\psi_{AAA}) = \delta(\psi_{AA})$
      (A + B) - C + D | (A + B) + (D - C) | $\delta(\psi_{ASA}) = \mathrm{MAX}\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
      (A + B) + C - D | (A + B) + (C - D) | $\delta(\psi_{AAS}) = \mathrm{MAX}\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
  • 57. Tree Height Reduction (cont'd)
      Before THR | After THR | Delay estimation
      (A - B) + C + D | (A - B) + (C + D) | $\delta(\psi_{SAA}) = \mathrm{MAX}\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
      (A - B) - C + D | (A - B) + (D - C) | $\delta(\psi_{SSA}) = \mathrm{MAX}\{\delta(\psi_{SA}), \delta(\psi_{SA})\}$
  • 58. Tree Height Reduction (cont'd)
      Before THR | After THR | Delay estimation
      (A - B) - C - D | (A - B) - (C + D) | $\delta(\psi_{SSS}) = \mathrm{MAX}\{\delta(\psi_{SS}), \delta(\psi_{AS})\}$
      (A * B) + C + D | (A * B) + (C + D) | $\delta(\psi_{MAA}) = \mathrm{MAX}\{\delta(\psi_{MA}), \delta(\psi_{AA})\}$
  • 59. Tree Height Reduction (cont'd)
      Before THR | After THR | Delay estimation
      (A * B) - C + D | (A * B) + (D - C) | $\delta(\psi_{MSA}) = \mathrm{MAX}\{\delta(\psi_{MA}), \delta(\psi_{SA})\}$
      (A * B) - C - D | (A * B) - (C + D) | $\delta(\psi_{MSS}) = \mathrm{MAX}\{\delta(\psi_{MS}), \delta(\psi_{AS})\}$
  • 60. Chapter 4
  • 61. Hasse Graph for Scheduling with n-Level Chaining
      (Figure: Hasse graph with nodes indexed by operation op (1, 2, 3, ...) and control step s (1 to n+1), chaining edge types $\alpha_1, \alpha_2, \dots, \alpha_{n-2}, \alpha_{n-1}, \alpha_n$. Legend: assignment edges, timing edges.)
  • 62. Topological Sorting of the Hasse Graph Can Be Modified to Be Used for Coloring the Graph
      - Nodes are numbered according to topological sorting.
      (Figure: Hasse graph for ops 1 to 3 over c-steps 1 to 6, nodes numbered 1 to 16 in topological order.)
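      One possible way to turn a topological sort into a coloring is sketched below; it is an assumption about what such a modification could look like, not a description of the author's procedure. Nodes are ordered with Kahn's algorithm, and each node is then labelled with the length of the longest path from it to a sink. Any two nodes joined by a directed path get different labels, so the labels form a proper coloring of the comparability graph of the order. The small example DAG is hypothetical.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm; returns (order, successor lists)."""
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order, succ


def height_coloring(nodes, edges):
    """Color = length of the longest path from the node to a sink (sinks get 0)."""
    order, succ = topological_order(nodes, edges)
    color = {}
    for v in reversed(order):
        color[v] = 1 + max((color[w] for w in succ[v]), default=-1)
    return color


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "d")]
order, _ = topological_order(nodes, edges)
colors = height_coloring(nodes, edges)
```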
  • 63. Two Different Colorings for the Hasse Graph for Scheduling with 2-Level Chaining
      - Nodes are numbered according to the corresponding color.
      (Figure: two colorings of the Hasse graph for ops 1 to 3 over c-steps 1 to 6, using colors 0 to 5.)
  • 64. Topological Sorting of the Hasse Graph Can Be Modified to Be Used for Coloring the Graph
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 6, nodes numbered 1 to 22 in topological order.)
  • 65. Two Different Colorings for the Hasse Graph for Scheduling with 3-Level Chaining
      - The graph has 22 nodes and 43 edges. The number of maximal cliques therefore cannot be greater than 22 (or even equal to 22).
      - The transitive closure of the graph has 115 edges.
      (Figure: two colorings of the Hasse graph for ops 1 to 4 over c-steps 1 to 6, using colors 0 to 5.)
  • 66. An Odd-Hole Graph and a Wheel Graph
      - An odd-hole graph (nodes 1 to 5):
        $$x_1 + x_2 + x_3 + x_4 + x_5 \le 2$$
      - A wheel graph (nodes 1 to 5 on the rim, node 6 at the hub):
        $$x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 \le 2$$
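      Both inequalities can be verified by enumerating every stable set of the two small graphs. The sketch below assumes the odd hole is the 5-cycle on nodes 1 to 5 and that the wheel adds a hub (node 6) adjacent to all rim nodes; nodes are indexed from 0 in the code.

```python
from itertools import product


def stable_vectors(n, edges):
    """Yield every 0/1 vector in which no edge has both endpoints selected."""
    for x in product((0, 1), repeat=n):
        if all(x[a] + x[b] <= 1 for a, b in edges):
            yield x


hole_edges = [(i, (i + 1) % 5) for i in range(5)]        # 5-cycle
wheel_edges = hole_edges + [(i, 5) for i in range(5)]    # hub is index 5

# Largest left-hand side over all stable sets of each graph.
max_hole_lhs = max(sum(x) for x in stable_vectors(5, hole_edges))
max_wheel_lhs = max(sum(x[:5]) + 2 * x[5] for x in stable_vectors(6, wheel_edges))

# The fractional point (1/2, ..., 1/2) satisfies every edge inequality of the
# 5-cycle but is cut off by the odd-hole inequality.
half = [0.5] * 5
half_is_edge_feasible = all(half[a] + half[b] <= 1 for a, b in hole_edges)
half_lhs = sum(half)
```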
  • 67. The Extended Wheel Graph
      - An extended-wheel graph (nodes 1 to 7):
        $$x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 + 2 \cdot x_7 \le 2$$
  • 68. Example Constraint Class
      - Example for $s = 3$, constraint $(\alpha_2 \beta \alpha_1 \beta)$ for a 3-level chain:
        $$X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2$$
      - General form:
        $$\sum_{a=1}^{N_{t_i}} X_{op_i,a,(s - D(op_i) + 1)} + \sum_{a=1}^{N_{t_k}} \sum_{p=s-1}^{s} X_{op_k,a,p} + \sum_{a=1}^{N_{t_l}} \sum_{p=s-1}^{s} X_{op_l,a,p} + 2 \cdot \Bigg( \sum_{a=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 2}^{\mathrm{ALAP}(op_i)} X_{op_i,a,p} \Bigg) \le 2$$
        $$\forall s \in (\mathrm{Range}(op_i) \cap \mathrm{Range}(op_l)), \quad s - D(op_i) + 2 \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_k) \in \Im_{2S}, \quad \forall (op_i, op_l) \in \Im_{3S}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 5 with the variables of the constraint highlighted.)
  • 69. An Extended Wheel Graph Constraint Class for 3-Level Chaining (same example and constraint as slide 68).
  • 71. Exploring the Hasse Diagram for Schedules with 2-Level Chaining
      (Figure: decision tree that starts at "start" and follows edges of type $\alpha_1$, $\alpha_2$ and $\beta$ to reach constraint classes 1 to 5.)
  • 72. 1-Clique Constraint Class for 2-Level Chaining: $(\beta)$
      - The constraint class 1, $(\beta)$:
        $$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \qquad \forall op \in DFG$$
      (Figure: Hasse graph for ops 1 to 3 over c-steps 1 to 6.)
  • 73. 2-Clique Constraint Class for 2-Level Chaining: $\beta_s \alpha_2 \beta_s$
      - The constraint class 2, $\beta_s \alpha_2 \beta_s$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} \sum_{p = \mathrm{ASAP}(op_k)}^{s} X_{op_k,n,p} \le 1$$
        $$\forall s \in (\mathrm{Range}(op_i) \cap \mathrm{Range}(op_k)), \qquad \forall (op_i, op_k) \in \Im_{2S}$$
      (Figure: Hasse graph for ops 1 to 3 over c-steps 1 to 6.)
  • 74. 3-Clique Constraint Class for 2-Level Chaining: $\beta_s \alpha_1 \beta_{(s-1)}$
      - The constraint class 3, $\beta_s \alpha_1 \beta_{(s-1)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} X_{op_j,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$
      (Figure: Hasse graph for ops 1 to 3 over c-steps 1 to 6.)
  • 75. 4-Clique Constraint Class for 2-Level Chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$
      - The example illustrated in the figure for class 4 is for the case $i' = 1$.
      - The constraint class 4, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$ (as printed on the slide):
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} X_{op_j,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$
      (Figure: Hasse graph for ops 1 to 3 over c-steps 1 to 6.)
  • 76. 5-Clique Constraint Class for 2-Level Chaining: $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$
      - The constraint class 5, $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{n=1}^{N_{t_k}} \sum_{p = \mathrm{ASAP}(op_k)}^{s-2} X_{op_k,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_k), \qquad \forall (op_i, op_j) \in \Im_{1S}, \quad \forall (op_j, op_k) \in \Im_{2S}$$
      (Figure: Hasse graph for ops 1 to 3 over c-steps 1 to 6.)
  • 77. Exploring the Hasse Diagram for Schedules with 3-Level Chaining
      (Figure: decision tree that starts at "start" and follows edges of type $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\beta$ to reach constraint classes 1 to 14.)
  • 78. 1-Clique Constraint Class for 3-Level Chaining: $(\beta)$
      - The constraint class 1, $(\beta)$:
        $$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \qquad \forall op \in DFG$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 79. 2-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_3 \beta_s$
      - The constraint class 2 for 3-level chaining, $\beta_s \alpha_3 \beta_s$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{s} X_{op_l,n,p} \le 1$$
        $$\forall s \in (\mathrm{Range}(op_i) \cap \mathrm{Range}(op_l)), \qquad \forall (op_i, op_l) \in \Im_{3S}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 80. 3-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_2 \beta_{(s-1)}$
      - The constraint class 3, $\beta_s \alpha_2 \beta_{(s-1)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} \sum_{p = \mathrm{ASAP}(op_k)}^{s-1} X_{op_k,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_k), \qquad \forall (op_i, op_k) \in \Im_{2S}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 81. 4-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_2 \tilde{\beta}^{\,i'}_{(s-2),k,l} \alpha_1 \beta_{(s-2-i')}$
      - The constraint class 4, $\beta_s \alpha_2 \tilde{\beta}^{\,i'}_{(s-2),k,l} \alpha_1 \beta_{(s-2-i')}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} \sum_{p = (s-1-i')}^{(s-1)} X_{op_k,n,p} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{(s-2-i')} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
        $$\forall i', \quad 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 82. 5-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_2 \alpha_1 \beta_{(s-2)}$
      - The constraint class 5, $\beta_s \alpha_2 \alpha_1 \beta_{(s-2)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-1)} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{s-2} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 83. 6-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \beta_{(s-1)}$
      - The constraint class 6, $\beta_s \alpha_1 \beta_{(s-1)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} X_{op_j,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 84. 7-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_2 \beta_{(s-2-i')}$
      - The constraint class 7, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_2 \beta_{(s-2-i')}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = (s-1-i')}^{s-1} X_{op_j,n,p} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{(s-2-i')} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_j, op_l) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
        $$\forall i', \quad 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 85. 8-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$
      - The constraint class 8, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = (s-1-i')}^{s-1} X_{op_j,n,p} + \sum_{n=1}^{N_{t_k}} \sum_{p = \mathrm{ASAP}(op_k)}^{s-2-i'} X_{op_k,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-2-i') \in \mathrm{Range}(op_k)$$
        $$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}$$
        $$\forall i', \quad 1 \le i' \le s - 2 - \mathrm{ASAP}(op_k)$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 86. 9-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'-i'')}$
      - The example illustrated in the figure for class 9 is for the case of both $i', i'' = 1$.
      - The constraint class 9, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'-i'')}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = (s-1-i')}^{s-1} X_{op_j,n,p} + \sum_{n=1}^{N_{t_k}} \sum_{p = s-2-i'-i''}^{s-2-i'} X_{op_k,n,p} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{s-3-i'-i''} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i'-i'') \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
        $$\forall i'\ \big(1 \le i' \le s - 4 - \mathrm{ASAP}(op_l)\big) \ \text{and} \ \forall i''\ \big(1 \le i'' \le s - 3 - \mathrm{ASAP}(op_l) - i'\big), \qquad \max(i' + i'') = s - 3 - \mathrm{ASAP}(op_l)$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 87. 10-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \alpha_1 \beta_{(s-2-i')}$
      - The constraint class 10, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \alpha_1 \beta_{(s-2-i')}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} \sum_{p = (s-1-i')}^{s-1} X_{op_j,n,p} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2-i')} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{s-3-i'} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i') \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
        $$\forall i', \quad 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 88. 11-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$
      - The constraint class 11, $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{n=1}^{N_{t_k}} \sum_{p = \mathrm{ASAP}(op_k)}^{s-2} X_{op_k,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_k)$$
        $$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 89. 12-Clique Constraint Class for 3-Level Chaining: $\beta_s \alpha_1 \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'')}$
      - The example illustrated in the figure for class 12 is for the case $i' = 1$.
      - The constraint class 12, $\beta_s \alpha_1 \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'')}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} \sum_{p = (s-2-i')}^{s-2} X_{op_k,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{s-3-i'} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i') \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
        $$\forall i', \quad 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 90. 13-Clique Constraint Class for the 3-Level Chaining Formulation: $\beta_s \alpha_1 \alpha_1 \alpha_1 \beta_{(s-3)}$
      - The constraint class 13, $\beta_s \alpha_1 \alpha_1 \alpha_1 \beta_{(s-3)}$:
        $$\sum_{n=1}^{N_{t_i}} \sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2)} + \sum_{n=1}^{N_{t_l}} \sum_{p = \mathrm{ASAP}(op_l)}^{s-3} X_{op_l,n,p} \le 1$$
        $$\forall s \in \mathrm{Range}(op_i), \quad (s-3) \in \mathrm{Range}(op_l)$$
        $$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
      (Figure: Hasse graph for ops 1 to 4 over c-steps 1 to 8 with edge types $\alpha_1$, $\alpha_2$, $\alpha_3$.)
  • 91. 14- Clique Constraint Class for 3-level chaining formulation: class \(\beta_s\,\alpha_1\,\alpha_2\,\beta_{(s-2)}\)
    [Figure: operations (op) versus control steps (cstep, s), with chaining levels α1, α2, α3.]
    - The constraint class 14:
    \[
    \sum_{p=s-D(op_i)+1}^{ALAP(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p}
    + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)}
    + \sum_{p=ASAP(op_l)}^{s-2} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1
    \]
    \[
    \forall s \in Range(op_i),\ (s-2) \in Range(op_l);\quad
    \forall (op_i,op_j),(op_j,op_k),(op_k,op_l) \in \Im_{1S},\ (op_j,op_l) \in \Im_{2S},\ (op_i,op_l) \in \Im_{3S}
    \]
  • 92. Maximal Clique Constraints are stronger than the Extended Wheel Constraints
    [Figure: two schedules of operations (op) versus control steps (cstep, s), with chaining levels α1, α2, α3.]
    - The Extended Wheel Constraint:
    \[
    X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2
    \]
    - The combined maximal cliques constraint:
    \[
    X_{1,3} + X_{1,4} + X_{1,5} + X_{3,1} + X_{3,2} + X_{3,3} + X_{3,4} + X_{3,5} + X_{3,6} + X_{4,2} + X_{4,3} \le 2
    \]
  • 93. Comparing the logical formulation vs. the maximal clique formulation for the AR filter
    - To reach a first integer solution, the maximal clique formulation takes more time.
    Metric | Logical formulation | Maximal cliques formulation
    Number of iterations (Primal) | 1,200 | 1,480
    Number of iterations (Integer) | 1,420 | 1,706
    Number of nodes of Branch and Bound | 54 | 103
    CPU time in sec (primal) | 12 (Ultra Sparc 2) | 24 (Ultra Sparc 2)
    CPU time in sec (integer) | 19 (Ultra Sparc 2) | 38 (Ultra Sparc 2)
    Total CPU time in sec | 31 | 62
    Optimality condition | first integer | first integer
    Number of discrete variables in the formulation | 536 | 536
    Number of single inequalities | 7,363 | 9,256 (25.7% increase)
    Termination condition | first integer solution | first integer solution
  • 94. Comparing the logical formulation vs. the maximal clique formulation for the AR filter
    - The maximal clique formulation achieves an optimal solution within tolerance long before the logical formulation.
    Metric | Logical formulation | Maximal cliques formulation
    Number of iterations (Primal) | 1,200 | 1,480
    Number of iterations (Integer) | 8.45E5 | 14,577
    Number of nodes of Branch and Bound | 42,025 | 596
    CPU time in sec (primal) | 12 (Ultra Sparc 2) | 24 (Ultra Sparc 2)
    CPU time in sec (integer) | 14,491 (Ultra Sparc 2) | 221 (Ultra Sparc 2)
    Total CPU time in sec | 14,503 | 245
    Optimality condition | 0.07 (not achieved) | 0.07 (achieved)
    Number of discrete variables in the formulation | 536 | 536
    Number of single inequalities | 7,363 | 9,256
    Termination condition | after 5 integer solutions | achieved optimal result within tolerance
  • 95. Comparing the logical formulation vs. the maximal clique formulation for the EWF benchmark
    - The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.
    Metric | Logical formulation | Maximal cliques formulation
    Number of iterations (Primal) | 3,192 | 3,659
    Number of iterations (Integer) | 56,697 | 4,668
    Number of nodes in Branch and Bound | 1,827 | 190
    CPU time in sec (primal) | 86 (Ultra Sparc 2) | 150 (Ultra Sparc 2)
    CPU time in sec (integer) | 5.4E3 (Ultra Sparc 2) | 518 (Ultra Sparc 2)
    Total CPU time in sec | 5.48E3 | 668
    Optimality condition | 0.1 (not achieved) | 0.1 (achieved)
    Number of discrete variables in the formulation | 940 | 940
    Number of single inequalities | 11,154 | 16,195 (45.2% increase)
    Termination condition | after 5 integer solutions | achieved optimal result within tolerance
  • 96. Comparing the logical formulation vs. the maximal clique formulation for the DCT benchmark
    - The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.
    Metric | Logical formulation | Maximal cliques formulation
    Number of iterations (Primal) | 3,288 | 4,623
    Number of iterations (Integer) | 23 |
    Number of nodes in Branch and Bound | 1E4 | 168
    CPU time in sec (primal) | 83 (Ultra Sparc 2) | 312 (Ultra Sparc 2)
    CPU time in sec (integer) | 2E5 (Ultra Sparc 2) | 2,575 (Ultra Sparc 2)
    Total CPU time in sec | 2E5 | 2,887
    Optimality condition | 0.15 (not achieved) | 0.15 (achieved)
    Number of discrete variables in the formulation | 1,066 | 1,066
    Number of single inequalities | 13,623 | 18,979 (39.3% increase)
    Termination condition | after 5 integer solutions | achieved optimal result within tolerance
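    The trade-off in these comparisons (more inequalities, far less branch-and-bound time) can be checked with a few lines of arithmetic. The sketch below only re-derives ratios from the figures reported on the slides; the dictionary layout and helper names are illustrative assumptions. Since the logical formulation was stopped before reaching the tolerance, the CPU-time ratios should be read as lower bounds on the real speedup.

```python
# (logical, maximal-clique) figures copied from the comparison slides
runs = {
    "AR filter": {"inequalities": (7363, 9256), "total_cpu_s": (14503, 245)},
    "EWF": {"inequalities": (11154, 16195), "total_cpu_s": (5480, 668)},
    "DCT": {"inequalities": (13623, 18979), "total_cpu_s": (2e5, 2887)},
}


def pct_increase(old, new):
    """Relative growth of the constraint count, in percent."""
    return 100.0 * (new - old) / old


def speedup(slow, fast):
    """Ratio of total CPU times (logical / maximal clique)."""
    return slow / fast


for name, r in runs.items():
    growth = pct_increase(*r["inequalities"])
    ratio = speedup(*r["total_cpu_s"])
    print(f"{name}: +{growth:.1f}% inequalities, at least {ratio:.1f}x faster")
```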
  • 97. Chapter 5
  • 98. Convex Bipartite Graph Matching
    - This bipartite graph corresponds to the multiply operations of the EWF benchmark. The function unit resources are 1 multiplier.
    [Figure: convex bipartite graph of the eight multiply operations (17, 18, 8, 29, 33, 24, 4, 12). Interval rows labeled "Initial" and "FU_IOEI" in the figure: [3,7] [3,7] [8,11] [8,11] [12,15] [13,15] [13,15] [14,15] and [3,6] [4,7] [8,10] [9,11] [12,12] [13,13] [14,14] [15,15].]
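    The slide does not state which matching algorithm is used. As an illustration only, the sketch below applies the classical greedy rule for convex bipartite graphs (sweep the control steps in order and give each step to the available operation with the earliest deadline), under the simplifying assumption that the single multiplier can start one multiplication per control step. The function name is hypothetical; the intervals are the wider of the two interval rows shown in the figure.

```python
import heapq


def match_ops_to_steps(intervals):
    """Greedy maximum matching for a convex bipartite graph.

    intervals[i] = (earliest, latest) control step allowed for operation i.
    Returns {operation index: control step}; unmatched operations are omitted.
    """
    by_release = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    first = min(lo for lo, _ in intervals)
    last = max(hi for _, hi in intervals)
    ready, matching, nxt = [], {}, 0
    for step in range(first, last + 1):
        # Release every operation whose interval has opened by this step.
        while nxt < len(by_release) and intervals[by_release[nxt]][0] <= step:
            op = by_release[nxt]
            heapq.heappush(ready, (intervals[op][1], op))
            nxt += 1
        # Drop operations whose deadline has already passed.
        while ready and ready[0][0] < step:
            heapq.heappop(ready)
        # Give this step to the released operation with the earliest deadline.
        if ready:
            _, op = heapq.heappop(ready)
            matching[op] = step
    return matching


ewf_multiplies = [(3, 7), (3, 7), (8, 11), (8, 11),
                  (12, 15), (13, 15), (13, 15), (14, 15)]
print(match_ops_to_steps(ewf_multiplies))
```

    With these intervals every multiplication receives its own control step, so one multiplier suffices under the stated assumption.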
  • 99. Strong components corresponding to the bipartite graph matching
    - Dotted edges can be pruned at this step.
    [Figure: strong components of the matching graph over nodes 1 to 12.]
  • 100. Chapter 6
  • 101. The fifth-order Elliptic Wave Filter benchmark
    - Consists of 34 operations (8 multiplications and 26 additions).
    [Figure: signal-flow graph of the filter from input to output, built from adders (+), multipliers (*) and delay elements (Z).]
  • 102. The DFG of the EWF benchmark
    [Figure: data flow graph of the EWF benchmark, with operations 1 to 34 placed on control steps 1 to 17 between IN and OUT, and signals labeled a to i.]
  • 103. Effect of Chaining and Pipelining FUs on Datapath Performance
    [Figure: Exploration of the design space for the EWF benchmark. Axes: Cost (number of CLBs) and Totexec, Λ (ns). Plotted design points 1 to 11, including 1 (1+, 1*), 2-a (bus, ours), 2-b (best of others) and 3 (non-pipelined), grouped into 2-pipe and 4-pipe designs.]
  • 104. Effect of Chaining and Pipelining FUs on EWF Datapath Performance
    Design Space | CSteps, T | Resources | Pipeline level | Chaining | Cost | T (ns)
    1 | 27 | 1+, 1* | 2-stages | yes | 140 | 2158
    2-a (ours) | 17 | 2+, 1*, 1b | 2-stages | no | 160 | 1275
    2-b [13] | 17 | 2+, 1* | 2-stages | no | 180 | 1275
    3 | 10 | 3+, 1* | No-pipe | yes | 195 | 1650
    4 | 12 | 3+, 1* | 2-stages | yes | 185 | 996
    5 | 11 | 3+, 1* | 2-stages | yes | 190 | 935
    6 | 11 | 3+, 1* | 2-stages | yes | 195 | 913
    7 | 17 | 3+, 1* | 4-stages | yes | 225 | 731
    8 | 19 | 3+, 1* | 4-stages | yes | 205 | 836
    9 | 17 | 3+, 1* | 4-stages | yes | 210 | 765
    10 | 18 | 3+, 1* | 4-stages | yes | 215 | 774
    11 | 17 | 3+, 1* | 4-stages | yes | 220 | 765
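    The table illustrates why total execution time, rather than the number of control steps, is used as the performance yardstick. The short sketch below ranks the listed designs both ways. The tuple layout is an illustrative assumption, and the "implied clock" is simply T (ns) divided by the number of control steps; it is derived here and not reported on the slide.

```python
# (design, control steps, total execution time in ns), copied from the EWF table
designs = [
    ("1", 27, 2158), ("2-a", 17, 1275), ("2-b", 17, 1275), ("3", 10, 1650),
    ("4", 12, 996), ("5", 11, 935), ("6", 11, 913), ("7", 17, 731),
    ("8", 19, 836), ("9", 17, 765), ("10", 18, 774), ("11", 17, 765),
]


def implied_clock_ns(csteps, totexec_ns):
    """Clock period implied by the table: total time divided by control steps."""
    return totexec_ns / csteps


fewest_steps = min(designs, key=lambda d: d[1])
fastest = min(designs, key=lambda d: d[2])

for label, (name, csteps, totexec) in (("fewest csteps", fewest_steps),
                                       ("shortest Totexec", fastest)):
    clock = implied_clock_ns(csteps, totexec)
    print(f"{label}: design {name}, {csteps} csteps, {totexec} ns, "
          f"implied clock {clock:.0f} ns")
```

    Design 3 needs the fewest control steps, but its long unpipelined clock makes it more than twice as slow as design 7, which uses 17 steps with 4-stage pipelined units.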
  • 105. Final FPGA Implementation on Xilinx 4000 series †
    Block (CLBs) | Our Best Area | Our Best Performance | Best in Literature (Simulated Evolution)
    Controller | 33 | 27 | 30
    Register_File | 10 | Not used | Not used
    ROM | 4 | 4 | 4
    Multiplier | 110 | 110 | 110
    Adder | 10 | 10 | 10
    4/3/2 to 1 mux | 16/8 | 16/16/8 | 16/16/8
    Register / Tristate | 8/1 | 8/1 | 8/1
    7/6/5 to 1 mux | Not used | 36/26/25 | 36/26/25
    Total # CLBs | 323 | 391 | 361
    Total execution time (ns) | 1275 | 731 | 1275
    † Using XACT 5.0 tools, the best area architecture would fit into x4006 chip and require about 200 CLBs.
  • 107. Scheduling and binding for the AR filter, illustrating register binding.
    [Figure: operations 1 to 28 scheduled over control steps 1 to 9, with values bound to registers R1, R2 and R3.]
  • 108. Synthesized architecture for the AR filter
    - Resources: 2 multipliers (2-stage pipelined) and 2 adders; uses 3 registers and 12 multiplexer inputs.
    [Figure: datapath with registers R1, R2, R3, adders A1, A2 and multipliers M1, M2.]
  • 109. The scheduling and binding for the AR filter, using 4-stage pipelined multipliers
    [Figure: operations 1 to 28 scheduled over control steps 1 to 13.]
  • 110. The DFG of the Fast Discrete Cosine Transform Benchmark
    [Figure: data flow graph with operations 1 to 42 placed on control steps 1 to 19.]
  • 111. The Fast Discrete Cosine Transform.
    Metric | Ours | SODAS-DSP | MARS
    Resources | 2*, 2+, 2- | 2*, 2+, 2- | 2*, 2+, 2-
    # mux inputs | 37 | 66 | NA
    # registers | 13 | 47 | NA
    Clock (ns) | 60 | 100 | NA
    # csteps | 10 | 12, dii=8 (a) | 8 (b)
    Totexec (ns) | 600 | 1200 | NA
    Throughput (c) (MHz) | 1.67 | 1.25 | NA
    a. dii is the data initiation rate for the pipelined architecture used in SODAS-DSP.
    b. MARS reports 8 cycles. No other details of the scheduling are available.
    c. Throughput indicates the highest input-sampling rate of the architecture.
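    The throughput row follows from the clock period and the number of control steps between successive inputs. A minimal sketch, assuming a new input can be accepted every `csteps` cycles for the non-pipelined schedule and every `dii` cycles for the pipelined SODAS-DSP architecture (the function name and this interpretation are assumptions; the numbers are from the table):

```python
def throughput_mhz(clock_ns, cycles_between_inputs):
    """Highest input-sampling rate in MHz for a given clock and initiation interval."""
    return 1e3 / (clock_ns * cycles_between_inputs)


# Ours: 60 ns clock, a new input every 10 control steps (Totexec = 600 ns).
print(round(throughput_mhz(60, 10), 2))
# SODAS-DSP: 100 ns clock, data initiation interval dii = 8 cycles.
print(round(throughput_mhz(100, 8), 2))
```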
  • 112. Synthesized Architecture for the Fast Discrete Cosine Transform benchmark.
    - Resources: 2 multipliers, 2 adders and 2 subtracters. Uses 13 registers and 37 mux inputs.
    [Figure: datapath with adders A1, A2, multipliers M1, M2 and subtracters S1, S2.]
  • 113. The DFG of the Discrete Cosine Transform Benchmark
    [Figure: data flow graph with operations 1 to 48 placed on control steps 1 to 8, taking inputs d0 to d7.]
  • 114. Synthesized Architecture for the Discrete Cosine Transform.
    - Resources: 2 multipliers (4-stage pipelined) and 4 adders. Uses 11 registers and 28 mux inputs.
    [Figure: datapath with multipliers M1, M2 and adders A1, A2, A3, A4.]
  • 115. Chaining paths for the Discrete Cosine Transform
    [Figure: chaining paths among multipliers M1, M2 and adders A1, A2, A3, A4.]
  • 116. Chaining interconnections modeled for false paths detection
    [Figure: three views V1, V2, V3 of the chaining interconnections among M1, M2, A1, A2, A3, A4.]
  • 117. Synthesized Bus architecture of the DCT benchmark
    - Resources: 1 multiplier (4-stage pipelined) and 3 adders/subtracters. Uses 9 registers, 18 mux inputs and 1 bus.
    [Figure: datapath with Bus1, adders A1, A2, A3, multiplier M, a ROM, and a register file holding R1 to R9.]
  • 118. Synthesized Random topology architecture for the DCT benchmark
    - Resources: 1 multiplier (4 pipe stages) and 3 adders/subtracters. Uses 10 registers and 24 mux inputs.
    [Figure: datapath with adders A1, A2, A3, registers R1 to R10 and a ROM.]
  • 119. Synthesized Random topology architecture for the DCT benchmark
    - Resources: 1 multiplier (4 pipe stages) and 3 adders/subtracters. Uses 12 registers and 20 mux inputs.
    [Figure: datapath with adders A1, A2, A3, multiplier M, registers R1 to R12 and a ROM.]
  • 120. The Discrete Cosine Transform Benchmark
    Metric | Ours | PSGA_Syn [69] | Tool [23] (Chaudhuri/Walker) | SALSA [34] (Chain) | SALSA [34]
    Resources | 2*, 4+ | 3*, 3+ | 3*, 4+ | 2*, 4+ | 2*, 4+
    # mux inputs | 28 | NA | NA | NA | 30
    # registers | 11 | 14 | NA | 15 | 13
    Clock (ns) | 45 | 120 (a) | 65 (b) | 135 (c) | 65 (d)
    # csteps | 11 | 18 | 9 | 8 | 11
    Totexec (ns) | 495 | 2160 | 585 | 1080 | 715
    a. This tool does not use chaining nor pipelining for the DCT.
    b. The tool described in [23] does not use chaining.
    c. The level of chaining is not reported in [34].
    d. SALSA [34] does not determine the clock duration of the total execution. However, we have used the same library for comparison.
  • 121. The Discrete Cosine Transform Benchmark
    Metric | Ours | PSGA_Syn, tool in [69] | SALSA (Chain) [34] | OSTA no-Chain [70]
    Resources | 1*, 3+ | 3*, 3+ | 2*, 4+ | 3*, 6+
    # mux inputs | 24 | NA | NA | 38
    # registers | 10 | 14 | 15 | 24
    Clock (ns) | 45 | 120 | 135 | 120
    # csteps, T | 19 | 18 | 8 | 9
    Totexec (ns) | 855 | 2160 | 1080 | 1080
  • 122. Chapter 7
  • 123. CONCLUSIONS
    • Our architectural model is suitable for a broad base of technology implementations, specifically FPGAs, including bus/SRAM-based ones.
    • Introduced optimization criteria for ILP solvers for datapath synthesis:
      - Our model and criteria can be used for other solvers (e.g. stochastic).
    • The approach:
      - Scheduling with chaining and deep pipelining of FUs while minimizing "Structural Complexity".
      - Optimization of the total execution time of the architecture, with clock cycle determination.
      - Followed by bus assignment if it is supported by the FPGA.
    • This approach has demonstrated that a discriminating search of a larger architectural space can produce:
      - Regular architectures with minimum interconnections, low resources and fast throughput.
  • 124. Contribution of this research
    - Several interconnect minimization measures were incorporated in the formulation, which significantly improve the quality of the resulting synthesized architectures.
    - This was demonstrated for different benchmarks, where the number of registers and multiplexer inputs was consistently smaller in architectures synthesized with this methodology than in previously published results. This is an important issue in developing a tool geared toward technologies with scarce interconnect resources, such as FPGAs.
    - For the first time, an Integer Linear Programming (ILP) formulation that includes a non-tabular, non-restricted model of the system clock duration was developed. This has proved to be a significant step in modeling the total execution time of the architecture and, as a result, in successful performance optimization.
    - The architectural synthesis scheduling and binding problem was formulated as a performance optimization problem rather than the mere minimization of the number of control steps. A theoretical linearization technique for the objective function of this formulation was presented. It was demonstrated that this linearization technique has negligible impact on the size of the problem.
  • 125. Contribution of this research
    - Verification of the validity of the overall methodology by integrating this tool with logic synthesis and back-end tools.
    - The development of the set of valid inequalities for the scheduling and binding problem: the identification and derivation of both the extended wheel graph inequalities and the maximal clique inequalities. For the first time, this guarantees the tightest formulation for schedules with n levels of chaining and multicycled/pipelined resources.
    - An algorithmic approach was developed for generating the minimum set of inequality classes necessary for the general scheduling and binding problem. The algorithm explores a Hasse graph representing the scheduling problem and classifies all the maximal paths into maximal path classes. These classes can be incorporated into the automatic generation of the maximal clique constraints, which represent the tightest description of the scheduling and binding problem with n-level chaining.
  • 128. VERTICAL PAGES
  • 129. Flow of the Architectural synthesis methodology.
  • 131. [Flow chart: architectural synthesis methodology, from the DFG/CDFG input to VHDL output.]
    STEP-1: Scheduling Bounds Tech
      - Heuristics to determine the lower bound on the number of cycles.
      - Heuristics to tighten the ASAP/ALAP values under the given resource constraints.
    STEP-2: C++: Constraint Generation for ILP (input: DFG)
      - DFG exploration.
      - Dynamic set generation for chaining.
      - ILP constraint generation.
    STEP-3: ILP: Random Topology
      - Scheduling and binding.
      - Chaining of operations.
      - Interconnect minimization.
      - Clock cycle minimization + FU pipelining choice.
      - Minimization of the number of cycles, OR minimization of the total execution time (i.e. throughput maximization).
    STEP-4: ILP: Bus Insertion
      - Bus transfer scheduling.
      - Bus allocation.
      - Storage minimization.
      - Bus loading minimization.
      - Interconnect minimization.
    STEP-LAST: Register Allocation (input: CDFG)
      - Data storage assignment.
      - VHDL generation of the datapath and the controller.
  • 133. Flow of the Back-End Tools
    - Stage-2 uses Synopsys tools (logic synthesis and FPGA mapping), and Stage-3 uses Xilinx XACT tools for PPR (partition, placement and routing).
  • 135. [Flow chart: back-end tool flow.]
    Inputs: VHDL source files and Xilinx hard-macros (the VHDL sources are also simulated).
    Stage-2 (SYNOPSYS):
      - Read HDL and insert pads.
      - Apply area constraints, delay constraints, FU pipelining (i.e. register balancing) and the Xilinx library.
      - Compile and optimize the datapath and controller.
    Stage-3 (Xilinx):
      - Partition, placement and routing.
      - Output to simulation.
  • 138. ASAP Scheduling
    Input: Data Flow Graph G
    Output: node array <int> schedule_I, representing the as-soon-as-possible schedule of the nodes of the DFG for a maximum chaining level "Max_Chain_Length".
    ASAP {
      1- G.for_all_nodes(v) {
           if (input_degree(v) = 0) { schedule_I(v) = 1; }
           else { schedule_I(v) = 0; insert v into the node set S; }
         }
      2- While (node set S ≠ Φ) {
           G.for_all_nodes(v) {
             if ((v ∈ S) and (all_pred_scheduled(G, v, schedule_I))) {
               G.all_input_edges(e, v) {
                 w = G.source(e);
                 if (G.type(w) and G.type(v) ≠ "multicycle")
                   if (Ch_Level_ASAP(w) ≤ Max_Chain_Length) { temp_schedule = schedule_I(w); }
                   else { temp_schedule = schedule_I(w) + delay(v); }
  • 139.
                 if (G.type(w) = "multicycle") {
                   if (G.type(v) ≠ "multicycle") { temp_schedule = schedule_I(w) + delay(w) - 1; }
                   else { temp_schedule = schedule_I(w) + delay(w); }
                 }
                 if (temp_schedule > schedule_I(v)) { schedule_I(v) = temp_schedule; }
               }
      3-       Adj_Ch_Level_ASAP(G, v, schedule_I, Ch_Level_ASAP);
      4-       delete node v from the node set S;
             }
           }
         }
    }
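    For readers who want to experiment, the following is a simplified, runnable reading of the ASAP pseudocode, not the author's implementation. It assumes that single-cycle operations can be chained in one control step as long as the chain holds at most `max_chain` operations, that a single-cycle operation may share the last cycle of a multicycle predecessor, and that multicycle operations never chain onto a predecessor. The function name, the dictionary-based graph encoding and the sample graph are hypothetical.

```python
from graphlib import TopologicalSorter


def asap_with_chaining(preds, delay, max_chain):
    """ASAP start steps and chaining levels.

    preds: op -> list of predecessor ops; delay: op -> cycles (1 = single-cycle).
    """
    start, level = {}, {}
    for v in TopologicalSorter(preds).static_order():  # predecessors come first
        start[v], level[v] = 1, 1
        for w in preds.get(v, ()):
            if delay[w] > 1:
                # Multicycle predecessor: a single-cycle successor may share
                # its last cycle when chaining is enabled at all.
                share = delay[v] == 1 and max_chain > 1
                cand = start[w] + delay[w] - (1 if share else 0)
            elif delay[v] == 1 and level[w] < max_chain:
                cand = start[w]          # chain in the same control step
            else:
                cand = start[w] + 1      # chain is full, or v is multicycle
            start[v] = max(start[v], cand)
        if delay[v] == 1:
            # Chaining level: one more than the deepest predecessor that
            # finishes in the step where v starts.
            for w in preds.get(v, ()):
                if start[w] + delay[w] - 1 == start[v]:
                    level[v] = max(level[v], level[w] + 1)
    return start, level


# Hypothetical DFG: a chain of four additions, plus a 2-cycle multiply feeding an add.
preds = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "m": [], "e": ["m"]}
delay = {"a": 1, "b": 1, "c": 1, "d": 1, "m": 2, "e": 1}
print(asap_with_chaining(preds, delay, max_chain=2))
```

    With `max_chain=2` the four additions fit into two control steps; with `max_chain=1` the same call degenerates to plain ASAP scheduling without chaining.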
  • 140. Adjust chaining level of a node
    Input: Data Flow Graph G, node v, the node array schedule_I representing the current schedule, and the node array Ch_Level_ASAP representing the current chaining level.
    Output: Adjusted version of Ch_Level_ASAP for node v, according to the current schedule schedule_I.
    Adj_Ch_Level_ASAP {
      G.all_input_edges(e, v) {
        w = G.source(e);
        if ((G.type(w) ≠ "multicycle") and (schedule_I(v) = schedule_I(w))
            and (Ch_Level_ASAP(w) < Max_Chain_Length)
            and (Ch_Level_ASAP(v) < Ch_Level_ASAP(w) + 1))
          { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
        if ((G.type(w) = "multicycle") and (G.type(v) ≠ "multicycle")
            and (schedule_I(v) = schedule_I(w) + mul_delay - 1)
            and (Ch_Level_ASAP(v) ≤ 2))
          { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
      }
    }
  • 142. procedure create_classes_with_β
    create_classes_with_β (active_edge α_i, distance, j, classBase) {
      if (distance = 0) { class_j = classBase + {β}; }
      else for x = 0 to distance {
        if (x = 0) {
          class_j = classBase + {β};
          class_new = class_j;
          create_classes_without_β (active_edge α_i, distance, j, class_new);
        }
        if (x > 0) {
          j = j + 1;
          if (x = n - i) { class_j = classBase + {α_x} + {β}; }
          if (x < n - i) {
            class_j = classBase + {α_x};
            distance_new = distance - x;
            class_new = class_j;
            create_classes_with_β (active_edge α_i, distance_new, j, class_new);
          }
        }
      }
    }
  • 143. procedure create_classes_without_β
    create_classes_without_β (active_edge α_i, distance, j, classBase) {
      for t = distance down to 1 {
        if (t = n - i) {
          j = j + 1;
          class_j = classBase + {α_t} + {β};
        }
        if (t < n - i) {
          j = j + 1;
          class_j = classBase + {α_t};
          distance_new = distance - t;
          class_new = class_j;
          create_classes_with_β (active_edge α_i, distance_new, j, class_new);
        }
      }
    }