Exploiting Instruction-Level Parallelism:
The Multithreaded Approach*
Philip Lenir, R. Govindarajan and S.S. Nemawarkar
Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, CANADA
{lenir@ee470,govindr@pike,shashank@pike}.ee.mcgill.ca
Abstract
The main challenge in the field of Very Large
Instruction Word (VLIW) and superscalar architec-
tures is exploiting as much instruction-level parallelism as possible. In this paper an execution model which uses multiple instruction sequences and extracts instruction-level parallelism at runtime from a set of enabled threads is presented. A new multi-ring architecture has been proposed to support the execution model. The multithreaded architecture features (i) large resident activations to improve program and data locality, (ii) a novel high-speed buffer organization which ensures zero load/store stalls for the local variables of an activation, and (iii) a dynamic instruction scheduler that groups operations from multiple threads for execution. Initial performance evaluation studies predict that the proposed architecture is capable of executing 7 concurrent operations per cycle with 8 execution pipes and 6 rings.
1 Introduction
Very Large Instruction Word (VLIW) architec-
tures [7, 1, 9] and superscalar architectures [11, 4] have
the potential to execute multiple operations in a sin-
gle machine cycle. The main challenge in these archi-
tectures is to detect and exploit as much parallelism
as is available in the application program. Surpris-
ingly, the reported values of instruction-level parallelism exploited in many applications have been rather low [11, 9]. This is because the parallelism extracted
by VLIW compilers is limited by the basic block size,
while in superscalar architectures it is limited by the
size of the instruction window [4]. Techniques to in-
crease basic block size or instruction window size often
place higher demands on resources, such as processor
registers, and hence have limited effect.
*This work was supported by MICRONET - Network Centres of Excellence, Canada.
In this paper we propose a new approach based on
the multithreaded execution model to exploit higher
instruction-level parallelism. The multithreaded exe-
cution model [10] has the potential to tolerate long and unpredictable memory latencies and to address the issue of synchronization cost in a satisfactory man-
ner. In the proposed model, instructions from mul-
tiple threads that are scheduled for execution based
on data availability are grouped together for execu-
tion. Such an approach has the advantage of exploit-
ing instruction-level parallelism in conjunction with
coarser-grained algorithmic parallelism. Also, the
instruction-level parallelism exploited in this model is
not limited by the size of basic blocks.
In the following section, we describe our execution
model. The details of the multi-ring multithreaded
architecture are presented in Section 3. The perfor-
mance of the architecture is reported in Section 4.
2 The Execution Model
2.1 Program Hierarchy
In our model, a program is represented by a col-
lection of code blocks each corresponding to a func-
tion/procedure body. A code block is divided into
several threads. At the lowest level of program hier-
archy is the instruction sequence in a thread.
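The three-level hierarchy above can be sketched as a simple data structure. This is our own illustration, not a structure from the paper; the class and field names are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the program hierarchy: a program is a collection of
# code blocks (one per function/procedure body), each code block is divided
# into several threads, and each thread is a sequential instruction sequence.

@dataclass
class Thread:
    instructions: list          # straight-line instruction sequence

@dataclass
class CodeBlock:
    name: str                   # the function/procedure this block implements
    threads: list = field(default_factory=list)

@dataclass
class Program:
    code_blocks: list = field(default_factory=list)
```

For example, `Program([CodeBlock("main", [Thread(["ld", "add", "st"])])])` builds a one-function, one-thread program.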
2.2 Synchronization and Scheduling
Each invocation of a code block is called an activation. Associated with each activation is a set of
memory locations, called the frame of the activation,
to store the local variables. Initially an activation is
dormant. It becomes enabled whenever at least one
thread in the activation is enabled. The frame corre-
sponding to an enabled activation is loaded in a high-
speed memory, if enough free space is available. Then
the activation is said to be resident. An activation re-
mains resident as long as there is at least one enabled
thread in that activation to be executed. An activa-
tion either terminates when all its threads have terminated, or suspends when there is at least one thread in the activation which is not enabled.

0-8186-3175-9/92 $3.00 © 1992 IEEE
A thread becomes enabled when it has received the
necessary operands on all its input arcs. An enabled
thread becomes ready when the corresponding activation is made resident. A ready thread waits for processor resources (see Section 3.5). Once a thread has
acquired a free resource it is said to be in the active
state. A thread in the active state can run to com-
pletion without requiring further synchronization. In-
structions in a thread are executed sequentially. Ad-
ditional details on the execution model can be found
in [2].
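The thread life cycle described above can be sketched as a small state machine. The state names come from the paper; the class, methods, and operand counting are our own illustration.

```python
# Minimal sketch of the thread life cycle: a thread waits for its operands,
# becomes enabled when all of them have arrived, becomes ready when its
# activation is made resident, and becomes active once it acquires a free
# processor resource, after which it runs to completion.

class SimThread:
    def __init__(self, num_inputs):
        self.pending_inputs = num_inputs   # operands still awaited on input arcs
        self.state = "waiting"

    def receive_operand(self):
        # Enabled only when operands on all input arcs have been received.
        self.pending_inputs -= 1
        if self.pending_inputs == 0:
            self.state = "enabled"

    def make_resident(self):
        # Enabled thread becomes ready when its activation is resident.
        if self.state == "enabled":
            self.state = "ready"

    def acquire_resource(self):
        # A ready thread that obtains a free resource becomes active and can
        # run to completion without further synchronization.
        if self.state == "ready":
            self.state = "active"
```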
3 The Multi-Ring Architecture
In this section we describe our architecture, the
multi-ring Large Context Multithreaded (LCM) archi-
tecture. Our initial study on the performance of the
basic LCM architecture [6] clearly shows that the synchronizing capability of the architecture is too low to fully utilize multiple execution pipes. As a consequence, only low instruction-level parallelism could
be exploited in the LCM architecture. Hence, a multi-
ring structure with multiple synchronizing units (refer
to Fig. 1) has been proposed.
3.1 Memory Unit
The memory unit consists of a data memory and
multiple memory managers, one for each ring in the
multi-ring structure. The data memory is divided into
two segments, one for storing data structures and the
other for storing frames of various activations. Currently, only the array data structure is supported in our
architecture. The memory manager is responsible for
allocating frames dynamically when new activations
are invoked and deallocating the frame when an acti-
vation terminates. Once a frame is allocated for a new
activation, tokens belonging to that frame are sent to
the thread synchronization unit.
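The memory manager's frame bookkeeping can be sketched as follows. This is a hedged simplification: the class, the free-list scheme, and the parameters are our own, with the 64-word frame size taken from the simulation in Section 4.

```python
# Sketch of the memory manager's responsibilities: allocate a frame from the
# frame segment of data memory when a new activation is invoked, and free it
# when the activation terminates. Free-list management is our simplification.

FRAME_SIZE = 64  # constant frame size assumed in the Section 4 simulation

class MemoryManager:
    def __init__(self, segment_base, num_frames):
        # Carve the frame segment into fixed-size frames.
        self.free = [segment_base + i * FRAME_SIZE for i in range(num_frames)]
        self.allocated = {}            # activation id -> frame base address

    def allocate(self, activation_id):
        base = self.free.pop()
        self.allocated[activation_id] = base
        return base                    # base address of the activation's frame

    def deallocate(self, activation_id):
        # Return the terminated activation's frame to the free list.
        self.free.append(self.allocated.pop(activation_id))
```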
3.2 Thread Synchronization Unit
As shown in Fig. 1, there are multiple thread syn-
chronization units to perform thread synchronization.
Each is similar to the explicit token store match-
ing unit of the Monsoon architecture [8]. When all
operands to a thread arrive in the thread synchro-
nization unit, the thread becomes enabled and is then
sent to the frame management unit in the same ring.
3.3 Frame Management Unit
The frame management unit performs the schedul-
ing of threads. It maintains a table of resident acti-
Figure 1: The Multi-Ring LCM Architecture
vations, and the base addresses of their frames. If the
incoming thread belongs to an already resident acti-
vation, the frame management unit extracts the base
address of the frame from the table. If the thread be-
longs to an activation which is not already resident,
the frame management unit checks whether the frame
for this activation can be loaded in a high-speed mem-
ory, called the high-speed buffer (refer to Section 3.4). If so, it instructs the high-speed buffer unit to load
the frame for this activation. When the high-speed
buffer has been successfully loaded, the base address of the frame is sent to the frame management unit.
The frame management unit then passes the incoming
thread along with the base address to the instruction
fetch unit. Lastly, if the frame corresponding to the
incoming thread cannot be loaded, then the thread is
queued inside the frame management unit until a block
is freed in the high-speed buffer. The frame manage-
ment unit, however, proceeds to service other threads
in its input queue. The frame management unit is also
responsible for instructing the high-speed buffer either
to off-load the frame of a terminated activation or to
flush the frame of a suspended activation.
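The frame management unit's dispatch decision for an incoming enabled thread can be sketched as below. The class and method names are ours, and the high-speed buffer is modeled as a simple capacity-limited table rather than the real hardware.

```python
from collections import deque

# Sketch of the frame management unit's decision logic for an incoming thread:
# forward it if its activation is resident, load the frame if a buffer block
# is free, otherwise queue the thread and keep servicing others.

class FrameManagementUnit:
    def __init__(self, buffer_capacity):
        self.resident = {}              # activation id -> frame base address
        self.capacity = buffer_capacity # blocks available in high-speed buffer
        self.waiting = deque()          # threads whose frames cannot load yet
        self.next_base = 0

    def dispatch(self, thread, activation_id):
        if activation_id in self.resident:
            # Already resident: pass the thread on with its frame base address.
            return (thread, self.resident[activation_id])
        if len(self.resident) < self.capacity:
            # Load the frame into the high-speed buffer, then forward.
            base = self.next_base
            self.next_base += 64        # fixed 64-word frames (Section 4)
            self.resident[activation_id] = base
            return (thread, base)
        # No free block: queue this thread until a block is freed, but do not
        # block the unit from servicing other threads in its input queue.
        self.waiting.append((thread, activation_id))
        return None
```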
3.4 High-speed Buffer
A novel implementation idea used in our architec-
ture is to pre-load the necessary frame in the high-
speed buffer prior to scheduling the instructions. This,
together with dataflow synchronization, ensures that
all operands necessary for executing the instructions
in an enabled thread are available in the high-speed
buffer. Thus load/store stalls in accessing frame lo-
cations have been completely eliminated in our archi-
tecture. To reduce the load/store stalls in accessing
data structure elements, long-latency operations are
performed in a split-phase manner [8].
A single high-speed buffer loader services the re-
quests from all frame management units. This is because the number of requests received by the high-speed buffer loader is small compared to that received by the thread synchronization units. Simulation experiments
confirm this fact.
3.5 Instruction Scheduler
The instruction scheduler consists of two units, the
instruction fetch unit and the scheduler unit. The
instruction fetch unit receives ready threads from var-
ious frame management units. A ready thread waits
for the availability of a free resource. A resource cor-
responds to a program counter, an intermediate in-
struction register, and an instruction register. A num-
ber of resources are available in the instruction fetch
unit. The instruction fetch unit also contains an in-
struction cache, similar to the conventional instruction
cache unit. The code-block corresponding to the ready
thread is loaded, if not already loaded, in the instruc-
tion cache. The instruction pointed to by the program counter is fetched and loaded into the intermediate in-
struction register. The set of intermediate registers
together form the instruction window for the multiple
execution pipes. At each execution cycle, the sched-
uler unit checks the intermediate instruction registers.
It selects up to n available instructions for execution,
where n is the number of execution pipes. Pipeline
stalls due to data dependencies are avoided by using
a novel thread-interleaving scheme (refer to [2]).
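The scheduler unit's per-cycle grouping can be sketched in a few lines. The function is our own illustration; in particular, the availability test here is a placeholder (a non-empty intermediate register), standing in for the architecture's actual readiness check.

```python
# Sketch of one scheduling cycle: scan the intermediate instruction registers
# (which together form the instruction window) and select up to n available
# instructions, where n is the number of execution pipes.

def schedule_cycle(intermediate_registers, n):
    group = []
    for reg in intermediate_registers:
        if reg is not None:            # register holds a fetched instruction
            group.append(reg)
            if len(group) == n:        # at most one instruction per pipe
                break
    return group
```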
3.6 Execution Pipe
The execution pipes used in our architecture are
generic in nature, each capable of performing any operation. The processor architecture is load/store in
nature. The register file is logically divided into a
number of register banks, one corresponding to each
resident activation. In the execution model, the logi-
cal register name specified in an instruction is used as
an offset within the register bank.
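The logical-to-physical register mapping described above amounts to a base-plus-offset computation. This is a minimal sketch under an assumed bank size; the function name and parameters are ours.

```python
# Sketch of the register-bank mapping: the register file is logically divided
# into equal banks, one per resident activation, and the logical register name
# in an instruction is used as an offset within that activation's bank.

def physical_register(bank_id, logical_name, bank_size=8):
    assert 0 <= logical_name < bank_size   # name must fit within one bank
    return bank_id * bank_size + logical_name
```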
Table 1. Performance of the Multi-ring LCM Architecture

    Benchmark Program      8 pipes,   8 pipes,   6 pipes,
                           8 rings    6 rings    4 rings
    Matrix Mult.             7.86       7.57       5.47
    SAXPBY - unroll 4        7.91       7.92       5.94
    SAXPBY - unroll 2        5.22       5.22       5.21

3.7 Router Unit

The router unit is a cross-bar network that connects each execution pipe to every memory unit.
4 Simulation Results
The performance of the multi-ring LCM architec-
ture is evaluated using a deterministic discrete-event
simulation. The benchmark programs considered in
the simulation are SAXPBY and matrix multiplica-
tion. In the simulation, each frame was assumed to be
a constant size (64 words). Likewise, the code block
size was assumed to be 128 words. The processing el-
ement can support up to 32 resident activations and
32 resources. The main output performance param-
eter is the average number of operations executed in
a single cycle, obtained by averaging the number of
instructions scheduled for execution in each cycle.
We considered three different configurations of the
architecture, namely 8 pipes & 8 rings, 8 pipes & 6
rings, and 6 pipes & 4 rings. In this work, we do not
investigate the feasibility of a VLSI implementation.
However, architectures with similar complexity have been claimed to be feasible with modern technology [5].
Table 3. Effect of Number of Resident Activations

    Number of Resident Activations     2     4     8     12    16    24
    Parallelism                       2.07  3.84  5.78  6.74  7.86  7.84
Table 2. Effect of Number of Resources

    Number of Resources     2     4     8     12    16    18
    Parallelism            1.09  2.16  4.22  6.07  7.20  7.54
Table 1 summarizes the average number of oper-
ations executed in each cycle for the three different
configurations. We observe that the instruction-level
parallelism is quite high in all three cases.
To determine the effect of the number of rings
and the number of pipes on the parallelism exploited,
we varied both parameters independently. The effect
of these parameters on the parallelism exploited for
the SAXPBY program is shown in Fig. 2. We note
that the exploited parallelism is strongly dependent
on both parameters.
Figure 2: Effect of Multiple Rings (parallelism vs. number of execution pipes)
The effect of the size of the instruction window (or
the number of resources) on the average operations
per cycle is shown in Table 2. Lastly, the effect of the
number of resident activations on the performance is
tabulated in Table 3. In these two experiments, the
number of the execution pipes and the number of rings
are fixed at 8 and 6 respectively.
5 Conclusions
In this paper, we have proposed a new architec-
ture for exploiting instruction-level parallelism, which
groups instructions from multiple active threads to
achieve high instruction-level parallelism. Initial per-
formance results based on simulation show that the
architecture can exploit a parallelism of at least 7 instructions per cycle with 8 execution pipes and 6 rings. This
improvement in instruction-level parallelism is mainly
due to our approach - that of extracting fine-grain
instruction-level parallelism from a set of threads. Our
approach is similar to the ones discussed in [5, 3].
However, the multi-ring LCM architecture benefits
from (i) large resident activations, (ii) the layered ap-
proach to synchronization, and (iii) the high-speed
buffer organization.
Acknowledgements
The authors acknowledge the help of Vincent
Collini for his initial involvement in the work and the
anonymous reviewers for their useful comments.
References
[1] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A VLIW architecture for a trace scheduling compiler. IEEE Symposium on Computer Architecture, 1988.

[2] R. Govindarajan and S. S. Nemawarkar. SMALL: A scalable multithreaded architecture to exploit large locality. In Proc. of the 4th IEEE Symp. on Parallel and Distributed Processing, December 1992. To appear.

[3] H. Hirata et al. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proceedings of the 19th International Symposium on Computer Architecture, pages 136-145. ACM and IEEE, 1992.

[4] Mike Johnson. Superscalar Microprocessor Design. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1991.

[5] S. W. Keckler and W. J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proceedings of the 19th International Symposium on Computer Architecture, pages 202-213. ACM and IEEE, 1992.

[6] Philip Lenir and Vincent Collini. A large context multithreaded architecture with multiple pipelining. Report, McGill University, Dept. of Electrical Engineering, April 1992.

[7] A. Nicolau and J. A. Fisher. Measuring the parallelism available for very long instruction word architectures. IEEE Transactions on Computers, 1984.

[8] G. M. Papadopoulos and D. E. Culler. Monsoon: An explicit token-store architecture. In Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, WA, pages 82-91, 1990.

[9] B. R. Rau, D. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12-35, January 1989.

[10] Burton J. Smith. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real-Time Signal Processing IV, volume 298, pages 241-248, 1981.

[11] M. D. Smith, M. S. Lam, and M. A. Horowitz. Boosting beyond static scheduling in a superscalar processor. In Proceedings of the 17th Annual Symposium on Computer Architecture, 1990.