Exploiting Instruction-Level Parallelism:
The Multithreaded Approach*
Philip Lenir, R. Govindarajan and S.S. Nemawarkar
Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, CANADA
{lenir@ee470,govindr@pike,shashank@pike}.ee.mcgill.ca
Abstract

The main challenge in the field of Very Long Instruction Word (VLIW) and superscalar architectures is exploiting as much instruction-level parallelism as possible. In this paper an execution model which uses multiple instruction sequences and extracts instruction-level parallelism at runtime from a set of enabled threads is presented. A new multi-ring architecture is proposed to support the execution model. The multithreaded architecture features (i) large resident activations to improve program and data locality, (ii) a novel high-speed buffer organization which ensures zero load/store stalls for the local variables of an activation, and (iii) a dynamic instruction scheduler that groups operations from multiple threads for execution. Initial performance evaluation studies predict that the proposed architecture is capable of executing 7 concurrent operations per cycle with 8 execution pipes and 6 rings.
1 Introduction
Very Long Instruction Word (VLIW) architectures [7, 1, 9] and superscalar architectures [11, 4] have the potential to execute multiple operations in a single machine cycle. The main challenge in these architectures is to detect and exploit as much parallelism as is available in the application program. Surprisingly, the reported values of instruction-level parallelism exploited in many applications have been rather low [11, 9]. This is because the parallelism extracted by VLIW compilers is limited by the basic block size, while in superscalar architectures it is limited by the size of the instruction window [4]. Techniques to increase basic block size or instruction window size often place higher demands on resources, such as processor registers, and hence have limited effect.

*This work was supported by MICRONET - Network Centres of Excellence, Canada.
In this paper we propose a new approach based on the multithreaded execution model to exploit higher instruction-level parallelism. The multithreaded execution model [10] has the potential to tolerate long and unpredictable memory latencies and to address the issue of synchronization costs in a satisfactory manner. In the proposed model, instructions from multiple threads that are scheduled for execution based on data availability are grouped together for execution. Such an approach has the advantage of exploiting instruction-level parallelism in conjunction with coarser-grained algorithmic parallelism. Also, the instruction-level parallelism exploited in this model is not limited by the size of basic blocks.
In the following section, we describe our execution model. The details of the multi-ring multithreaded architecture are presented in Section 3. The performance of the architecture is reported in Section 4.
2 The Execution Model
2.1 Program Hierarchy
In our model, a program is represented by a collection of code blocks, each corresponding to a function/procedure body. A code block is divided into several threads. At the lowest level of the program hierarchy is the instruction sequence in a thread.
2.2 Synchronization and Scheduling
Each invocation of a code block is called an activation. Associated with each activation is a set of memory locations, called the frame of the activation, to store the local variables. Initially an activation is dormant. It becomes enabled whenever at least one thread in the activation is enabled. The frame corresponding to an enabled activation is loaded into a high-speed memory, if enough free space is available. Then the activation is said to be resident. An activation remains resident as long as there is at least one enabled thread in that activation to be executed. An activation either terminates, when all its threads have terminated, or suspends, when there is at least one thread in the activation which is not enabled.

A thread becomes enabled when it has received the necessary operands on all its input arcs. An enabled thread becomes ready when the corresponding activation is made resident. A ready thread waits for processor resources (see Section 3.5). Once a thread has acquired a free resource it is said to be in the active state. A thread in the active state can run to completion without requiring further synchronization. Instructions in a thread are executed sequentially. Additional details on the execution model can be found in [2].
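To make these transitions concrete, the following Python sketch models the thread and activation life cycle described above. The class and method names (Thread, Activation, on_operand, try_make_resident) are our own illustrative choices, not part of the paper.

    from enum import Enum, auto

    class ThreadState(Enum):
        DORMANT = auto()   # still waiting for operands on input arcs
        ENABLED = auto()   # all operands received
        READY = auto()     # enclosing activation is resident
        ACTIVE = auto()    # holds a processor resource; runs to completion

    class Thread:
        def __init__(self, num_inputs):
            self.pending_inputs = num_inputs
            self.state = ThreadState.DORMANT

        def on_operand(self):
            # A thread becomes enabled once operands have arrived
            # on all of its input arcs.
            self.pending_inputs -= 1
            if self.pending_inputs == 0:
                self.state = ThreadState.ENABLED

    class Activation:
        def __init__(self, threads):
            self.threads = threads
            self.resident = False

        def try_make_resident(self, buffer_has_space):
            # An activation with at least one enabled thread becomes
            # resident only if its frame fits in the high-speed buffer;
            # its enabled threads then become ready.
            if buffer_has_space and any(t.state == ThreadState.ENABLED
                                        for t in self.threads):
                self.resident = True
                for t in self.threads:
                    if t.state == ThreadState.ENABLED:
                        t.state = ThreadState.READY
            return self.resident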
3 The Multi-Ring Architecture
In this section we describe our architecture, the multi-ring Large Context Multithreaded (LCM) architecture. Our initial study on the performance of the basic LCM architecture [6] clearly shows that the synchronizing capability of the architecture is too low to fully utilize multiple execution pipes. As a consequence, only a low instruction-level parallelism could be exploited in the LCM architecture. Hence, a multi-ring structure with multiple synchronizing units (refer to Fig. 1) has been proposed.

Figure 1: The Multi-Ring LCM Architecture
3.1 Memory Unit
The memory unit consists of a data memory and multiple memory managers, one for each ring in the multi-ring structure. The data memory is divided into two segments, one for storing data structures and the other for storing the frames of the various activations. Currently, only the array data structure is supported in our architecture. The memory manager is responsible for allocating frames dynamically when new activations are invoked and deallocating the frame when an activation terminates. Once a frame is allocated for a new activation, tokens belonging to that frame are sent to the thread synchronization unit.
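A rough sketch of this allocation protocol, using the 64-word frame size assumed later in Section 4; the free-list policy and all names here are our own assumptions, not the paper's mechanism:

    class MemoryManager:
        # One manager per ring: allocates activation frames in data
        # memory and forwards the new activation's tokens for matching.

        def __init__(self, frame_words=64, num_frames=1024):
            self.frame_words = frame_words
            self.free_frames = list(range(num_frames))  # free base slots

        def invoke(self, activation, sync_unit):
            # Allocate a frame dynamically for the new activation...
            if not self.free_frames:
                raise MemoryError("no free frames")  # simplified handling
            base = self.free_frames.pop() * self.frame_words
            activation.frame_base = base
            # ...then send the tokens belonging to that frame to the
            # thread synchronization unit.
            for token in activation.initial_tokens:
                sync_unit.receive(token, base)

        def terminate(self, activation):
            # Deallocate the frame when the activation terminates.
            self.free_frames.append(activation.frame_base // self.frame_words)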
3.2 Thread Synchronization Unit
As shown in Fig. 1, there are multiple thread synchronization units to perform thread synchronization. Each is similar to the explicit token store matching unit of the Monsoon architecture [8]. When all operands to a thread arrive in the thread synchronization unit, the thread becomes enabled and is then sent to the frame management unit in the same ring.
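The matching step can be pictured as follows; this is a schematic rendering of explicit-token-store-style matching, not the paper's hardware, and the token fields (thread_id, arity, value) are assumptions:

    class ThreadSyncUnit:
        # Counts arriving operands per (frame, thread); when the count
        # reaches the thread's arity, the thread becomes enabled and the
        # caller forwards it to the frame management unit of this ring.

        def __init__(self):
            self.pending = {}    # (frame_base, thread_id) -> values so far

        def receive(self, token, frame_base):
            key = (frame_base, token.thread_id)
            values = self.pending.setdefault(key, [])
            values.append(token.value)
            if len(values) == token.arity:     # all operands present
                del self.pending[key]
                return ("enabled", token.thread_id, frame_base)
            return None                        # still waiting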
3.3 Frame Management Unit
The frame management unit performs the scheduling of threads. It maintains a table of resident activations and the base addresses of their frames. If the incoming thread belongs to an already resident activation, the frame management unit extracts the base address of the frame from the table. If the thread belongs to an activation which is not already resident, the frame management unit checks whether the frame for this activation can be loaded in a high-speed memory, called the high-speed buffer (refer to Section 3.4). If so, it instructs the high-speed buffer unit to load the frame for this activation. When the high-speed buffer has been successfully loaded, the base address of the frame is sent to the frame management unit. The frame management unit then passes the incoming thread along with the base address to the instruction fetch unit. Lastly, if the frame corresponding to the incoming thread cannot be loaded, then the thread is queued inside the frame management unit until a block is freed in the high-speed buffer. The frame management unit, however, proceeds to service other threads in its input queue. The frame management unit is also responsible for instructing the high-speed buffer either to off-load the frame of a terminated activation or to flush the frame of a suspended activation.
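The decision logic of the frame management unit, reduced to a sketch; the queue discipline, method names, and the try_load interface are our assumptions:

    class FrameManagementUnit:
        def __init__(self, hs_buffer):
            self.resident = {}   # activation id -> frame base in buffer
            self.waiting = []    # threads whose frames could not be loaded
            self.hs_buffer = hs_buffer

        def dispatch(self, thread, fetch_unit):
            base = self.resident.get(thread.activation_id)
            if base is None:
                # Not resident: ask the high-speed buffer to load the frame.
                base = self.hs_buffer.try_load(thread.activation_id)
                if base is None:
                    # No free block: queue this thread but keep servicing
                    # the other threads in the input queue.
                    self.waiting.append(thread)
                    return
                self.resident[thread.activation_id] = base
            # Pass the thread with its frame base address to the
            # instruction fetch unit.
            fetch_unit.accept(thread, base)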
3.4 High-speed Buffer
A novel implementation idea used in our architecture is to pre-load the necessary frame in the high-speed buffer prior to scheduling the instructions. This, together with dataflow synchronization, ensures that all operands necessary for executing the instructions in an enabled thread are available in the high-speed buffer. Thus load/store stalls in accessing frame locations have been completely eliminated in our architecture. To reduce the load/store stalls in accessing data structure elements, long-latency operations are performed in a split-phase manner [8].
A single high-speed buffer loader services the requests from all the frame management units. This suffices because the high-speed buffer loader receives far fewer requests than the thread synchronization units do; simulation experiments confirm this.
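The split-phase idea can be sketched as two halves of one access; this is our own schematic rendering under assumed interfaces (request_read, deliver, a pending-request table), not the paper's mechanism:

    def split_phase_load(addr, waiting_thread, memory_unit, pending):
        # Phase 1: issue the read and return at once, so the pipes can
        # execute instructions from other enabled threads meanwhile.
        tag = memory_unit.request_read(addr)
        pending[tag] = waiting_thread

    def on_memory_response(tag, value, pending, sync_unit):
        # Phase 2: the response arrives later as a token and re-enables
        # the waiting thread through ordinary dataflow synchronization.
        thread = pending.pop(tag)
        sync_unit.deliver(thread, value)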
3.5 Instruction Scheduler
The instruction scheduler consists of two units, the instruction fetch unit and the scheduler unit. The instruction fetch unit receives ready threads from the various frame management units. A ready thread waits for the availability of a free resource. A resource corresponds to a program counter, an intermediate instruction register, and an instruction register. A number of resources are available in the instruction fetch unit. The instruction fetch unit also contains an instruction cache, similar to a conventional instruction cache unit. The code block corresponding to the ready thread is loaded, if not already loaded, in the instruction cache. The instruction pointed to by the program counter is fetched and loaded in the intermediate instruction register. The set of intermediate registers together forms the instruction window for the multiple execution pipes. At each execution cycle, the scheduler unit checks the intermediate instruction registers. It selects up to n available instructions for execution, where n is the number of execution pipes. Pipeline stalls due to data dependencies are avoided by using a novel thread-interleaving scheme (refer to [2]).
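The per-cycle selection step reduces to scanning the intermediate instruction registers; a minimal sketch, in which the register representation is an assumption and the thread-interleaving scheme of [2] is not modelled:

    def schedule_cycle(intermediate_regs, n_pipes):
        # Select up to n available instructions, where n is the number
        # of execution pipes; at most one instruction per resource.
        issued = []
        for reg in intermediate_regs:
            if len(issued) == n_pipes:
                break
            if reg.instruction is not None:    # resource has a fetched op
                issued.append(reg.instruction)
                reg.instruction = None         # the fetch unit refills it
        return issued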
3.6 Execution Pipe
The execution pipes used in our architecture are generic in nature, each capable of performing any operation. The processor architecture is load/store in nature. The register file is logically divided into a number of register banks, one corresponding to each resident activation. In the execution model, the logical register name specified in an instruction is used as an offset within the register bank.
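Register addressing is thus a simple base-plus-offset computation; in the sketch below the bank size is an assumed parameter, since the paper does not give the bank dimensions:

    def physical_register(bank_index, logical_reg, bank_size=32):
        # bank_index identifies the resident activation's bank; the
        # logical register name in the instruction is an offset into it.
        assert 0 <= logical_reg < bank_size
        return bank_index * bank_size + logical_reg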
3.7 Router Unit

The router unit is a cross-bar network that connects each execution pipe to every memory unit.
4 Simulation Results
The performance of the multi-ring LCM architecture is evaluated using a deterministic discrete-event simulation. The benchmark programs considered in the simulation are SAXPBY and matrix multiplication. In the simulation, each frame was assumed to be of a constant size (64 words). Likewise, the code block size was assumed to be 128 words. The processing element can support up to 32 resident activations and 32 resources. The main output performance parameter is the average number of operations executed in a single cycle, obtained by averaging the number of instructions scheduled for execution in each cycle.
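For reference, the benchmark kernel and the reported metric are simple. The loop body below is our guess at what SAXPBY computes (z[i] = a*x[i] + b*y[i], by analogy with SAXPY; the paper does not spell the kernel out), and avg_ops_per_cycle restates the output parameter:

    def saxpby(a, x, b, y):
        # Assumed form of the SAXPBY kernel.
        return [a * xi + b * yi for xi, yi in zip(x, y)]

    def avg_ops_per_cycle(issued_per_cycle):
        # Average, over the run, of the number of instructions
        # scheduled for execution in each cycle.
        return sum(issued_per_cycle) / len(issued_per_cycle)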
We considered three different configurations of the architecture, namely 8 pipes & 8 rings, 8 pipes & 6 rings, and 6 pipes & 4 rings. In this work, we do not investigate the feasibility of a VLSI implementation. However, architectures of similar complexity have been claimed to be feasible with modern technology [5].
Table 1. Performance of the Multi-ring LCM Architecture

    Benchmark Program     8 pipes,  8 pipes,  6 pipes,
                          8 rings   6 rings   4 rings
    Matrix Mult.            7.86      7.57      5.47
    SAXPBY - unroll 4       7.91      7.92      5.94
    SAXPBY - unroll 2       5.22      5.22      5.21

Table 2. Effect of Number of Resources

    Number of Resources    2      4      8      12     16     18
    Parallelism            1.09   2.16   4.22   6.07   7.20   7.54

Table 3. Effect of Number of Resident Activations

    Number of Activations  2      4      8      12     16     24
    Parallelism            2.07   3.84   5.78   6.74   7.86   7.84
Table 1 summarizes the average number of operations executed in each cycle for the three different configurations. We observe that the instruction-level parallelism is quite high in all three cases.
To determine the effect of the number of rings and the number of pipes on the parallelism exploited, we varied both parameters independently. The effect of these parameters on the parallelism exploited for the SAXPBY program is shown in Fig. 2. We note that the exploited parallelism is strongly dependent on both parameters.
Figure 2: Effect of Multiple Rings (parallelism versus number of execution pipes)
The effect of the size of the instruction window (or the number of resources) on the average operations per cycle is shown in Table 2. Lastly, the effect of the number of resident activations on the performance is tabulated in Table 3. In these two experiments, the number of execution pipes and the number of rings are fixed at 8 and 6, respectively.
5 Conclusions
In this paper, we have proposed a new architecture for exploiting instruction-level parallelism, which groups instructions from multiple active threads to achieve high instruction-level parallelism. Initial performance results based on simulation show that the architecture can exploit a parallelism of at least 7 instructions with 8 execution pipes and 6 rings. This improvement in instruction-level parallelism is mainly due to our approach of extracting fine-grain instruction-level parallelism from a set of threads. Our approach is similar to the ones discussed in [5, 3]. However, the multi-ring LCM architecture benefits from (i) large resident activations, (ii) the layered approach to synchronization, and (iii) the high-speed buffer organization.
Acknowledgements
The authors acknowledge the help of Vincent
Collini for his initial involvement in the work and the
anonymous reviewers for their useful comments.
References

[1] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A VLIW architecture for a trace scheduling compiler. IEEE Symposium on Computer Architecture, 1988.

[2] R. Govindarajan and S. S. Nemawarkar. SMALL: A scalable multithreaded architecture to exploit large locality. In Proc. of the 4th IEEE Symp. on Parallel and Distributed Processing, December 1992. To appear.

[3] H. Hirata et al. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proceedings of the 19th International Symposium on Computer Architecture, pages 136-145. ACM and IEEE, 1992.

[4] Mike Johnson. Superscalar Microprocessor Design. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1991.

[5] S. W. Keckler and W. J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proceedings of the 19th International Symposium on Computer Architecture, pages 202-213. ACM and IEEE, 1992.

[6] Philip Lenir and Vincent Collini. A large context multithreaded architecture with multiple pipelining. Report, McGill University, Dept. of Electrical Engineering, April 1992.

[7] A. Nicolau and J. A. Fisher. Measuring the parallelism available for very long instruction word architectures. IEEE Transactions on Computers, 1984.

[8] G. M. Papadopoulos and D. E. Culler. Monsoon: An explicit token-store architecture. In Proceedings of the Seventeenth Annual International Symposium on Computer Architecture, Seattle, WA, pages 82-91, 1990.

[9] B. R. Rau, D. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12-35, January 1989.

[10] Burton J. Smith. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real-Time Signal Processing IV, volume 298, pages 241-248, 1981.

[11] M. D. Smith, M. S. Lam, and M. A. Horowitz. Boosting beyond static scheduling in a superscalar processor. Proceedings of the 17th Annual Symposium on Computer Architecture, 1990.
  • 4. NO. oi Exocution ~ l p o -f Figure 2: Effect of Multiple Rings The effect of the size of the instruction window (or the number of resources) on the average operations per cycle is shown in Table 2. Lastly, the effect of the number of resident activations on the performance is tabulated in Table 3. In these two experiments, the number of the execution pipes and the number of rings are fixed at 8 and 6 respectively. 5 Conclusions In this paper, we have proposed a new architec- ture for exploiting instruction-level parallelism, which groups instructions from multiple active threads to achieve high instruction-level parallelism. Initial per- formance results based on simulation show that the architecture can exploit a parallelism of at least 7 in- structions with 8 execution pipes and 6 rings. This improvement in instruction-level parallelism is mainly due to our approach - that of extracting fine-grain instruction-level parallelism from a set of threads. Our approach is similar to the ones discussed in [5,31. However, the multi-ring LCM architecture benefits from (i) large resident activations, (ii) the layered ap- proach to synchronization, and (iii) the high-speed buffer organization. Acknowledgements The authors acknowledge the help of Vincent Collini for his initial involvement in the work and the anonymous reviewers for their useful comments. References tecture for a trace scheduling compiler. Symposium on Computer Architecture, 1988. IEEE [2]R. Govindarajan and S.S. Nemwarkar. Small: A scalable multithreaded architecture to exploit large locality. In Proc. of the 4th IEEE Symp. on Parallel and Distributed Processing, December 1992. to appear. [3]H. Hirata et al. An efkmentary processor architec- ture with simultaneous instruction issuing from multiple threads. In Proceedings of the 19th Inter- national Symposium on Computer Architecture, pages 136-145. ACM and IEEE, 1992. [4]Mike Johnson. Superscalar Microprocessor De- sign. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1991. [5] S.W. Keckler and W.J. Dally. Processor coupling: Integration compile time and runtime scheduling for parallelism. In Proceedings of the 19th Inter- national Symposium on Computer Architecture, pages 202-213. ACM and IEEE, 1992. [6]Philip Lenir and Vincent Collini. A large context multithreaded architecture with mu1tiple pipelin- ing. Report, McGill University, Dept. of Electri- cal Engineering, April 1992. [7]A.Nicolau and J. A. Fisher. Measuring the par- allelism available for very long instruction word architectures. IEEE %ansactiow on Computers, 1984. [8]G. M. Papadopoulos and D. E. Culler. Monsoon: An explicit token-store architecture. In Pro- ceedings of the Seventeenth Annual International Symposium of Computer Architecture, Seattle, WA, pages 82-91, 1990. [9]B. R. Rau, D.Yen, W.Yen, and R. A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12-35, January 1989. [lo] Burton J. Smith. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real-Time Signal Processing IV, volume 298,pages 241-248, 1981. [ll] M.D.Smith, M. S. Lam, and M. A. Horowitz. Boosting beyond static scheduling in a super- scalar processor. Proceedings of the 17th Annual Symposium on Computer Architecture, 1990. [l] R. P. Colwell, R. P. Nix, J. J. O’Donnell, D. B. Papworth, and P. K. Rodman. A VLIW archi- 192