An Improved Self-Reconfigurable Interconnection Scheme for a Coarse Grain Reconfigurable Architecture

Muhammad Ali Shami
School of ICT
Royal Institute of Technology, KTH
Stockholm, Sweden
Email: shami@kth.se

Ahmed Hemani
School of ICT
Royal Institute of Technology, KTH
Stockholm, Sweden
Email: hemani@kth.se
Abstract—An improved dynamic, partial and self-reconfigurable interconnection network (Hybrid-2 network) is presented for the Dynamically Reprogrammable Resource Array (DRRA), which is a Coarse Grain Reconfigurable Architecture (CGRA). To justify the design decisions, the Hybrid-2 network implementation is compared against possible implementations using Multiplexer, NoC, Crossbar and the already published Hybrid-1 interconnection network. Results show that the newly presented Hybrid-2 interconnection network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations respectively. The Hybrid-2 network is also 2.87x and 5.86x faster than the Multiplexer and Hybrid-1 networks.

I. INTRODUCTION

Flexibility of a reconfigurable architecture comes from a) its ability to reconfigure computational logic and b) the ability to reconfigure the interconnection network to connect the computational logic blocks with each other. The interconnection network, in any Coarse Grain Reconfigurable Architecture (CGRA), is a key component which makes a reconfigurable architecture flexible. This paper presents an improved interconnection network for the Dynamically Reprogrammable Resource Array (DRRA), which is a CGRA fabric. The old interconnection network, published in [10], will be referred to as Hybrid-1, and the new interconnection network, presented in this paper, will be referred to as Hybrid-2 in the rest of the paper. The DRRA fabric has the following properties:
1) Creation of Coarse Grain Instructions (CGIs): The interconnection network enables creation of coarse grain instructions by connecting two or more computational resources with each other. The maximum size of the CGI which can be created depends on the maximum connectivity of the reconfigurable system.
2) Arbitrary Parallelism: The interconnection network allows creation of many such CGIs and runs them in parallel. This is a property of a computational fabric like an FPGA. A CGRA which has this property is called a CGRA fabric.
3) Implementation of Large Sub-systems: In addition to creation of CGIs, the interconnection network is also able to compose bigger systems using these CGIs by connecting them together. This is also a property of a computational fabric.
4) Local Connectivity: To reduce delay and energy consumption, the interconnection network has local connectivity which is limited to 3-hop communication.
5) Non-blocking and Point-to-Point/Multi-Point: The DRRA interconnection network is a non-blocking, Point-to-Point and Point-to-Multipoint network.
6) Sliding Window Connectivity: Local connectivity in non-overlapping segments restricts the interconnection network to creating a fixed maximum-size CGI. By having connectivity in overlapping segments, a sliding-window style of local connectivity is created which allows creation of arbitrary-size CGIs.
7) Dynamic Reconfiguration: Dynamic reconfiguration of a network allows the system to reconfigure the network at run-time. For dynamic reconfiguration, the number of configuration bits and the configuration time should be low. The DRRA network is reconfigurable during runtime on a cycle basis.
8) Partial Reconfiguration: An interconnection network which allows configuration of only a segment of the network is a partially reconfigurable interconnection network. Configuring only a segment of the network results in fewer bits being generated, and allows configuration of a part of the network without disturbing the network connectivity in the surroundings. In DRRA, even a single network connection can be reconfigured without disturbing the other network connections.
9) Self Reconfiguration: The DRRA interconnection network is self-configurable, which means that the CGIs, which are created by the combination of the CGRA resources, can reconfigure the interconnection network. This allows the algorithms running on a CGI to reprogram the interconnection network, and hence the CGIs, according to their need. It also reduces the configuration time since the main configuration manager doesn't have to generate and send the configurations. This improvement eliminates the need for a separate configuration network for the interconnection network.

978-1-4244-8971-8/10/$26.00 © 2010 IEEE
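The 3-hop local connectivity of property 4 combined with the overlapping windows of property 6 can be captured by a simple reachability predicate. The sketch below is illustrative only; the function names and the fabric-width parameter are our own, not part of DRRA:

```python
def in_window(src_col, dst_col, hops=3):
    """True if dst_col lies within src_col's sliding communication window."""
    return abs(src_col - dst_col) <= hops

def window(col, num_cols, hops=3):
    """All columns reachable from `col` in a fabric of `num_cols` columns."""
    return [c for c in range(num_cols) if in_window(col, c, hops)]

# A column away from the fabric edge sees a 7-column window (itself plus
# 3 columns on each side), and the windows of neighboring columns overlap,
# which is what allows arbitrary-size CGIs to be composed.
assert window(4, 16) == [1, 2, 3, 4, 5, 6, 7]
assert window(0, 16) == [0, 1, 2, 3]            # truncated at the edge
assert set(window(4, 16)) & set(window(5, 16))  # overlapping windows
```

Two resources can then be placed in the same CGI only if every producer-consumer pair among them satisfies this predicate.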
Properties 1-7 were implemented with Hybrid-1 in the DRRA fabric. Hybrid-2 implements properties 8 and 9, in addition to properties 1-7. Property 7 has also been improved by reducing the dynamic reconfiguration time of the interconnection fabric. This paper has two main contributions:
• An improvement over the existing Hybrid-1 interconnection scheme of the DRRA fabric. The improvement not only includes new and improved functionality (properties 7, 8 and 9) but also a redesigned switchbox which reduces the number of configuration bits and the configuration memory size.
• A quantitative comparison to Multiplexer, Crossbar and NoC based interconnect schemes, including the Hybrid-1.

Fig. 1. Dynamically Reprogrammable Resource Array (DRRA) Fabric
Section II discusses the related work. Section III contains a brief introduction to DRRA. Section IV presents the different implementations of the DRRA interconnection network. Section V presents the results, while Section VI concludes the paper.

II. RELATED WORK

Two decades of research on CGRAs have produced a number of CGRA architectures with different interconnection properties and implementation styles. These architectures have been reviewed in [3] and [1]. This section discusses the interconnection schemes in some of these architectures. ADRS [7] is a CGRA with a multiplexer-based mesh network with topologies like nearest-neighbor connectivity, next-hop connectivity, an extra connection to a central register file, vertical busses, etc. REMARC [8] also has multiplexer-based nearest-neighbor connectivity along with full row and column bus connectivity. Multiplexer-based networks are good at providing Point-to-Multipoint connectivity, but this comes at the cost of long wires and high capacitance to drive. This has been recognized by ADRS, which proposed a full-custom transistor [7] to disconnect those segments of the wires which are not used during a specific network configuration. A crossbar provides full connectivity but requires the maximum number of configuration bits and is not scalable. Colt uses a crossbar to communicate between the data port and an array of 4x4 elements which are connected in a mesh network with nearest-neighbor connectivity. The VIRAM [6] processor also uses a crossbar for communication between DRAM banks and vector lanes. The crossbar is not scalable and has huge area and configuration overhead. Chameleon [4] and Imagine [5] use circuit-switched NoCs for their interconnection networks. Recently a Multistage Interconnection Network (MIN) [2] has also been proposed for CGRAs. This network provides arbitrary routing by connecting together different stages of the network. Since creating a communication path in a NoC-based network requires the involvement of many geographically distributed switches, creating a self-reconfigurable network is not possible using this approach. MorphoSys [11] has a three-level hybrid interconnection network: the first level offers nearest-neighbor connectivity, while the second and third levels consist of local and global buses.

Interconnect exploration for mapping of algorithms helps to find the best routing and interconnection scheme. This paper is an effort in exploring the implementation style for the DRRA interconnection network discussed in the introduction, to find the best implementation for area, configuration bits, and power. The DRRA interconnection network is different from the above-mentioned architectures because it is a computational fabric like an FPGA, and allows creation of a number of arbitrary-size partitions executing different algorithms.

III. DRRA

The Dynamically Reprogrammable Resource Array (DRRA) is a CGRA fabric, as shown in Figure 1, which consists of a pool of a) arithmetic/logic (mDPU) [9], b) storage (RFile) and c) control (sequencer) resources. These resources are seamlessly partitionable to compose Coarse Grain Instructions (CGIs). The arithmetic resources are used to create the data-path for the CGI. Two or more mDPUs can be connected together to create a complex data-path which matches the granularity of the algorithm. The RFile provides not only the storage, but also enough memory ports to feed this complex data-path. The sequencers are used to control these resources by instantiating them in the appropriate mode. The sequencers have an instruction memory of only 64 words.

In DRRA a CGI is composed by configuring the interconnection network which connects these arithmetic and storage resources with each other. Our goal is to design an interconnection network which can create a CGI as complex as a Radix-4 FFT butterfly or bigger. To compose such big data-paths, we found that a sliding-window communication of 3 hops would be required. A 3-hop communication window means that every DRRA resource can communicate with every other DRRA resource, in either the right or the left direction, up to 3 columns away, as shown in Figure 1. Sliding window means that these communication windows slide with respect
to DRRA columns in a way that they are overlapping. Figure 1 shows a 2x8 fabric of DRRA which is created with these properties. It is important to mention that this fabric is a fragment: in 90nm technology, a 10x10mm chip can accommodate 324 DRRA cells.

IV. INTERCONNECTION IMPLEMENTATION EXPLORATION

An interconnection network for an architecture is designed with two main considerations: a) the functionality of the interconnection network, and b) the physical overheads, e.g. area, power, speed, and configuration bits. An interconnection network with the functionality discussed in the introduction can be implemented using multiple implementation styles. Hence it becomes important to do an implementation exploration of all these styles to find the physical overheads. To do so, we have implemented this interconnection network in Multiplexer, Crossbar, NoC, Hybrid-1 and Hybrid-2 implementation styles. The implementation details and results are discussed in the subsections below.

A. Multiplexer Based DRRA Network

Fig. 2. Multiplexer Based DRRA Interconnection Network

A DRRA interconnection network, as discussed in the introduction, can be implemented using multiplexers. Every resource input in the DRRA fabric can receive data from resources up to 3 columns away on both sides, as shown in Figure 2. This creates an interconnection window of 7 columns. This window of connectivity moves with the resources, which is why it is called a sliding window. Each column has four resources with two outputs from every resource. This results in selecting one out of 56 (7x4x2) possible outputs for every single input, and requires a multiplexer of size 56x1. Since a column has 12 inputs, twelve 56x1 multiplexers will be required for every column in a multiplexer-based DRRA interconnection network. A 56x1 multiplexer requires 6 bits to configure, therefore a DRRA column will require 72 bits to configure. This interconnection scheme is partially and dynamically reconfigurable as well as self-reconfigurable, and doesn't require a dedicated interconnect reconfiguration network. A sequencer can configure one input per cycle by providing 6 bits, so a complete DRRA column having 12 inputs can be configured in 6 cycles. Since all the sequencers can program their interconnects in parallel, it takes at most 6 cycles to completely program this interconnection network in DRRA. A configuration memory for one DRRA column can be designed which is connected to both sequencers. This enables the two sequencers in a DRRA column to configure all four switch-boxes by just configuring the memory. The memory is organized as 12x10 (12 rows and 10 columns): the first four column bits decide which input multiplexer is to be configured, while the remaining 6 bits configure the 56x1 multiplexer.

The multiplexer-based network has two associated problems: a) the large multiplexers cause routing congestion during floorplanning, and b) a Point-to-Multipoint connection results in every output driving all the inputs (7x12) in the interconnection window, as shown in Figure 2. This not only increases the length of the interconnection wire, but also increases the driving load of the output. The result is a slower interconnection network which consumes more energy. We can break the wire length by driving every output in either the right or the left direction only; that would still leave each output driving 42 inputs, which is still huge.

B. Circuit Switched Network (NoC)

Fig. 3. Circuit Switched NoC Based DRRA Interconnection Network

A circuit-switched network can be created for this kind of fabric, as shown in Figure 3. A fully non-blocking, sliding-window interconnection network with 3-hop connectivity requires 48 rows. Every column has 12 inputs and 8 outputs; these 20 inputs/outputs are connected to the 48 rows, resulting in 480 4-way switches. Every NoC switch requires four configuration bits, resulting in 1920 bits of configuration memory in every column.

The problem with this network is that if a physical communication channel is to be established between two resources, the geographically distributed switchboxes in the path between these two resources have to be configured. This can be done only by an external configuration unit, since the sequencers can only configure local switchboxes; so self-reconfiguration of this network is not possible. This kind of NoC can also communicate beyond 3 hops. Such communication will be blocking, and the synthesis tool will report a lower clock frequency. To avoid this, the NoC switches would have to be pipelined, which would increase their power consumption and area.
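The multiplexer and NoC sizing above is pure counting and can be sanity-checked mechanically. The sketch below only restates the numbers from the text; the variable names are ours:

```python
from math import ceil, log2

# Multiplexer network: 7-column window x 4 resources x 2 outputs
# = 56 candidate sources per resource input.
mux_sources = 7 * 4 * 2                         # 56
select_bits = ceil(log2(mux_sources))           # 6 bits per 56x1 multiplexer
inputs_per_column = 12
column_bits = inputs_per_column * select_bits   # 72 select bits per column
config_cycles = inputs_per_column // 2          # two sequencers, one input each per cycle

# NoC network: 480 4-way switches per column, 4 configuration bits each.
noc_config_bits = 480 * 4                       # 1920 bits per column

assert (mux_sources, select_bits, column_bits, config_cycles) == (56, 6, 72, 6)
assert noc_config_bits == 1920
```

The 12x10 configuration memory additionally prefixes each 6-bit select field with a 4-bit input-multiplexer address, which gives the 120-bit column memory reported in Table I.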
C. Crossbar Based Network

Fig. 4. Crossbar Based DRRA Interconnection Network

A crossbar-based sliding-window network can be created by cascading small crossbars together, as shown in Figure 4. To provide connectivity to resources on both sides up to 3 hops away, 48x56 crossbars are required. This results in a configuration memory of 2688 bits per column. These crossbars are used in sliding-window fashion, i.e. every crossbar is connected to every other crossbar up to 3 hops away to create a 3-hop sliding-window network. A crossbar-based network can be used for communication beyond 3 hops, but that communication will be blocking and will decrease the system clock because of the longer network delay.

The problem with this implementation is its huge size, configuration bit count and large network delay. A crossbar has 2688 possible connections to configure. If a self-reconfiguration requires one cycle to configure one connection, it will take the two sequencers 1344 cycles to completely configure the crossbar.

D. Hybrid-1 Network with Crossbars and BUSes

Fig. 5. DRRA Hybrid-1 Network

A single column of the DRRA Hybrid-1 interconnection network using crossbars and buses is shown in Figure 5. This interconnection network is organized in horizontal and vertical BUSes with 14x12 crossbars, called H2V crossbars, at the intersections. The horizontal BUSes consist of the outputs of the DRRA resources, which are connected to the inputs of the H2V crossbars in sliding-window fashion as discussed before. These crossbars receive inputs from resources on both sides up to 3 hops (3 columns) away. Each column has four H2V crossbars. One H2V crossbar requires 14x12=168 bits to configure, so a single DRRA column requires 4x168=672 bits. This memory is configured by an external configuration unit through an interconnect configuration network, so self-reconfiguration is not possible for this network. The horizontal inputs to the H2V crossbars are configured to connect to the 12 vertical BUSes, which are then connected to the inputs of the resources. This organization, with H2V crossbar-based switchboxes, prevents an output from a DRRA resource from having to drive all the inputs in the 7-column communication window. However, this interconnection network suffers from the delay of the crossbar-based switchboxes.

E. Hybrid-2 Network with Tri-state Multiplexers and BUSes

Fig. 6. DRRA Hybrid-2 Network

Two problems are identified in the Hybrid-1 interconnection network: a) its configuration bits are more numerous than those of the multiplexer-based network, and b) its network delay is also greater than that of the multiplexer-based network because of the crossbar-based switchboxes. Therefore the Hybrid-1 interconnection network is improved by redesigning the switchboxes. Figure 6 shows the Hybrid-2 interconnection network with the newer switchbox design. This switchbox consists of twelve 14x1 multiplexers, each connected to a tri-state buffer. These tri-state buffers are permanently connected to one of the twelve vertical buses.
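A configuration row of this switchbox memory, per the field layout used in this section (a 2-bit switchbox select, a 4-bit vertical-bus select and a 4-bit horizontal-bus select, twelve rows in all), can be sketched as a packed word. The helper names and the bit ordering are our own assumptions, not the actual DRRA encoding:

```python
ROWS, SBOX_BITS, VBUS_BITS, HBUS_BITS = 12, 2, 4, 4
WORD_BITS = SBOX_BITS + VBUS_BITS + HBUS_BITS   # 10 bits per row

def pack(sbox, vbus, hbus):
    """Pack one row as [sbox | vbus | hbus] (bit ordering assumed)."""
    # 4 switchboxes per column, 12 vertical buses, 14 horizontal bus inputs.
    assert sbox < 4 and vbus < 12 and hbus < 14
    return (sbox << (VBUS_BITS + HBUS_BITS)) | (vbus << HBUS_BITS) | hbus

def unpack(word):
    return (word >> (VBUS_BITS + HBUS_BITS),
            (word >> HBUS_BITS) & ((1 << VBUS_BITS) - 1),
            word & ((1 << HBUS_BITS) - 1))

assert WORD_BITS == 10
assert ROWS * WORD_BITS == 120          # the 120-bit column memory of Table I
assert unpack(pack(3, 11, 13)) == (3, 11, 13)
```

With the two sequencers each writing one row per cycle, the twelve rows are written in the 6 configuration cycles quoted for Hybrid-2.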
This design has three advantages over the previous design: a) the configuration bits are reduced, b) the area of the switchbox is reduced, and c) the delay of the switchbox is also reduced. The new switchbox requires 48 bits to configure in this interconnection network. Since all four switchboxes drive the same vertical buses, their tri-state drivers are mutually exclusive to each other. We can use this property to create a memory organized as 12x10 bits (12 rows and 10 columns). Every row corresponds to the output connected to one of the vertical BUSes. The first two column bits select one of the four switchboxes in the column, the next 4 bits select the vertical BUS which is to be driven, and the last 4 bits select the horizontal BUS which will drive the selected vertical BUS.

1) Self Reconfiguration: The new configuration memory has very few bits to configure and is designed as a two-port memory to allow connectivity with the two sequencers present in the same DRRA column. This allows the sequencers to program the configuration memory, hence creating a self-reconfiguring system. Using the sequencers, we can dynamically and partially reprogram the interconnection network without the need of an external configuration unit, so the external reconfiguration network for interconnects has been completely removed. The interconnect configurations are stored inside the sequencer during storage of the program/configware. The application mapping flow is shown in Figure 7. A DRRA program/configware contains Memory, Data-path and Interconnect instructions. This program is loaded into the DRRA sequencer. When the sequencer starts, it executes the interconnect instructions to configure the interconnection network. Once the network is configured, the data-path and memory instructions are executed. During execution of the algorithm, the sequencer can issue new Interconnect instructions to re-configure the interconnection network. To configure one complete DRRA column, twelve inputs are configured by the two sequencers. A sequencer takes a single cycle to configure one input, hence it takes 6 cycles to completely configure a DRRA column. Since all the DRRA columns are configured independently by their own sequencers, a complete DRRA fabric, no matter how big, can be configured in 6 cycles.

Fig. 7. Application Mapping Flow

V. RESULTS

The above-mentioned interconnect implementations are synthesized for DRRA using TSMC 90nm technology in Cadence RTL Compiler. Table I contains the area, configuration bits, configuration cycles and network delay of these implementations after synthesis.

TABLE I
COMPARISON BETWEEN DIFFERENT IMPLEMENTATIONS

            Area (Gates)   Cfg. Bits   Cfg. Cycles   Network Delay (pS)
MUX         8402           120         6             707
NoC         87840          1920        Variable      Variable
Crossbar    43008          2688        1344          Variable
Hybrid-1    13416          672         6*CND         1443
Hybrid-2    9147           120         6             246

This data shows that:
1) Multiplexer-based networks are the best in terms of area, configuration bits and number of cycles to configure the network. However, multiplexer-based networks are slow because of the long Point-to-Multipoint wires. This problem has been realized by ADRS as well; to remove it, they designed pass-transistor based full-custom switches to break the wires [7].
2) Crossbar and NoC based solutions are very expensive in terms of area, configuration bits and configuration cycles. In NoC-based solutions, the configuration of a link depends on the number of switches in the path. Partial and dynamic reconfiguration can be supported in NoC and crossbar based networks using an external configuration network. Self-reconfiguration cannot be supported in a NoC because the sequencers cannot reconfigure the geographically distributed switches involved in establishing a communication channel between two resources. The configuration cycles in NoC and crossbar based interconnection networks also depend on the number of switches/crossbars involved and the configuration network delay (CND).
3) Hybrid-1 is better than the NoC and crossbar based networks. However, it takes more area and configuration bits compared to the multiplexer-based network, and it is also slower. The number of cycles to configure a DRRA column depends on the Configuration Network Delay (CND) of the Hybrid-1 network.
4) The Hybrid-2 network, as can be seen in Table I, has almost the same area and configuration bits as a multiplexer-based network.
Since the network is self-reconfigurable, the configuration network delay in this network is one. Hence it takes only 6 cycles to completely reconfigure a DRRA column. Furthermore, all DRRA columns can be reconfigured in parallel, therefore it takes only 6 cycles to completely reconfigure the whole DRRA fabric. The Hybrid-2 network is also faster than the multiplexer-based network. The Hybrid-2 network is, in reality, a multiplexer-based network with tri-state buffers; the increase in area is because of these tri-state buffers. Using this Hybrid-2 approach, we have broken down the long Point-to-Multipoint wires of the multiplexer-based network into Point-to-Point wires using switchboxes. This doesn't affect the Point-to-Multipoint capability of the network. This approach is better than [7], in which a pass-transistor based switch was used to break the long wires of the multiplexer-based network. Furthermore, the switchboxes in Hybrid-2 have been designed completely in standard cell technology, which keeps the design flow simple and reduces the time to market.
5) DRRA with the Hybrid-2 network is synthesized and floorplanned in 90nm using Cadence RTL Compiler and SoC Encounter. Using this network, the 2x8 fabric of DRRA shown in Figure 1 runs at a frequency of 720MHz and can support a peak local bandwidth of 138GB/s.

VI. CONCLUSION AND FUTURE WORK

An improved interconnection network implementation (Hybrid-2) for the Dynamically Reprogrammable Resource Array has been presented. To justify the design decisions, an interconnect exploration was done by implementing the same network using Multiplexer, NoC and Crossbar based networks. The Hybrid-2 network was then compared against the Multiplexer, NoC, Crossbar and previously published Hybrid-1 networks. Results show that the newly presented network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations. The Hybrid-2 network is 2.87x and 5.86x better in terms of speed compared to the Multiplexer and Hybrid-1 networks. The Hybrid-2 network also takes the minimum number of cycles to configure/reconfigure a complete DRRA column.

A future version of the interconnection network with an adjustable sliding window is planned. By lowering the clock frequency, the width of the sliding window can be increased to allow mapping of more complex data-paths than what is possible today. Similarly, at higher clock frequencies this width can be reduced. The future version of DRRA will have voltage-frequency scaling and power shut-off methodology. This may result in some parts of DRRA working in different voltage/frequency ranges or being completely turned off. The DRRA switchboxes will be improved to handle such situations by having level shifters or isolators.

ACKNOWLEDGMENT

The authors are thankful to the Swedish Research Council and the Higher Education Commission of Pakistan for funding this research.

REFERENCES

[1] M. Baron. Trends in use of reconfigurable platforms. In Proceedings of the 41st Design Automation Conference, pages 415–415. IEEE, July 2004.
[2] R. Ferreira, M. Laure, A. C. Beck, T. Lo, M. Rutzig, and L. Carro. A low cost and adaptable routing network for reconfigurable systems. In Proc. IEEE Int. Symp. Parallel & Distributed Processing (IPDPS 2009), pages 1–8, 2009.
[3] R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In Design, Automation and Test in Europe, pages 642–649. IEEE, March 2001.
[4] P. M. Heysters. Coarse-Grained Reconfigurable Processors: Flexibility Meets Efficiency. PhD Thesis, ISBN 90-365-2076-2, Netherlands, 2003.
[5] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: media processing with streams. IEEE Micro, 21(2):35–46, 2001.
[6] C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable processors in the billion-transistor era: IRAM. Computer, 30(9):75–78, 1997.
[7] Z. Kwok and S. J. E. Wilton. Register file architecture optimization in a coarse-grained reconfigurable architecture. In Proc. 13th Annual IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 2005), pages 35–44, 2005.
[8] T. Miyamori and K. Olukotun. REMARC: reconfigurable multimedia array coprocessor. IEICE Transactions on Information and Systems, 82(5):389–397, November 1998.
[9] M. A. Shami and A. Hemani. Morphable DPU: smart and efficient data path for signal processing applications. In Proc. IEEE Workshop on Signal Processing Systems (SiPS 2009), pages 167–172, 2009.
[10] M. A. Shami and A. Hemani. Partially reconfigurable interconnection network for dynamically reprogrammable resource array. In IEEE 8th International Conference on ASIC, pages 122–125. IEEE, October 2009.
[11] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho. MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers, 49(5):465–481, 2000.