2. The semantic gap is the difference between the
operations provided in HLLs (High Level Languages) and
those provided in computer architecture.
Symptoms of this gap are alleged to include execution
inefficiency, excessive machine program size, and
compiler complexity. Designers responded with
architectures intended to close this gap. Key features
include large instruction sets, dozens of addressing
modes, and various HLL statements implemented in
hardware. An example of the latter is the CASE machine
instruction on the VAX. Such complex instruction sets
are intended to
• Ease the task of the compiler writer.
• Improve execution efficiency, because complex
sequences of operations can be implemented in
microcode.
• Provide support for even more complex and sophisticated
HLLs.
3. Reduced instruction set
computer (RISC) architecture
The RISC architecture is a dramatic departure from the
historical trend in processor architecture. An analysis of
the RISC architecture brings into focus many of the
important issues in computer organization and
architecture.
Although RISC systems have been defined and
designed in a variety of ways by different groups, the key
elements shared by most designs are these:
• A large number of general-purpose registers, and/or the
use of compiler technology to optimize register usage
• A limited and simple instruction set
• An emphasis on optimizing the instruction pipeline
4. Semantic gap
In order to improve the efficiency of
software development, new and powerful
programming languages have been
developed (Ada, C++, Java).
They provide: high level of abstraction,
conciseness, power.
• Through this evolution, the semantic gap grows
5. Problem: How should new HLL programs
be compiled and executed efficiently on a
processor architecture?
Two possible answers:
1. The CISC approach: design very complex
architectures including a large number of
instructions and addressing modes;
include also instructions close to those
present in HLL.
2. The RISC approach: simplify the
instruction set and adapt it to the real
requirements of user programs
6. Why RISC is needed
RISC architectures represent an important
innovation in the area of computer organization.
• The RISC architecture is an attempt to produce
more CPU power by simplifying the instruction
set of the CPU.
• The opposed trend to RISC is that of complex
instruction set computers (CISC).
Both RISC and CISC architectures have been
developed as attempts to bridge the
semantic gap.
7. INSTRUCTION EXECUTION
CHARACTERISTICS IN RISC
Operations performed: These determine the
functions to be performed by the processor and
its interaction with memory.
Operands used: The types of operands and the
frequency of their use determine the memory
organization for storing them and the addressing
modes for accessing them.
Execution sequencing: This determines the
control and pipeline organization.
8. Evaluation of Program execution
Several studies have been conducted to
determine the execution characteristics of
machine instruction sequences generated from
HLL programs.
• Aspects of interest:
1. The frequency of operations performed.
2. The types of operands and their frequency of
use.
3. Execution sequencing (frequency of jumps,
loops, subprogram calls).
9. Frequency of Instructions Executed
• Frequency distribution of executed machine
instructions:
moves: 33%
conditional branches: 20%
arithmetic/logic: 16%
others: between 0.1% and 10%
• Addressing modes: the overwhelming majority of
instructions uses simple addressing modes, in
which the address can be calculated in a single
cycle (register, register indirect, displacement);
complex addressing modes (memory indirect,
indexed+indirect, displacement+indexed, stack)
are used only by ~18% of the instructions.
10. Operand Types
• 74 to 80% of the operands are scalars (integers,
reals, characters, etc.), which can be held in
registers;
• the rest (20-26%) are arrays/structures; 90% of
them are global variables;
• 80% of the scalars are local variables.
NB: The majority of operands are local variables of
scalar type, which can be stored in registers.
11. Some statistics concerning procedure
calls:
• Only 1.25% of called procedures have
more than six parameters.
• Only 6.7% of called procedures have more
than six local variables.
• Chains of nested procedure calls are
usually short and only very seldom longer
than 6.
12. Conclusions from Evaluation of Program
Execution
• An overwhelming preponderance of simple (ALU and
move) operations over complex operations.
• Preponderance of simple addressing modes.
• Large frequency of operand accesses; on average each
instruction references 1.9 operands.
• Most of the referenced operands are scalars (so they can
be stored in a register) and are local variables or
parameters.
• Optimizing the procedure CALL/RETURN mechanism
promises large benefits in speed.
These conclusions were the starting point for the
Reduced Instruction Set Computer (RISC) approach.
13. Characteristics of Reduced
Instruction Set Architectures
Although a variety of different approaches
to reduced instruction set architecture
have been taken, certain characteristics
are common to all of them:
• One instruction per cycle
• Register-to-register operations
• Simple addressing modes
• Simple instruction formats
14. The first characteristic listed is that there is one
machine instruction per machine cycle. A
machine cycle is defined to be the time it takes
to fetch two operands from registers, perform an
ALU operation, and store the result in a register.
Thus, RISC machine instructions should be no
more complicated than, and execute about as
fast as, microinstructions on CISC machines.
With simple, one-cycle instructions, there is little
or no need for microcode; the machine
instructions can be hardwired. Such instructions
should execute faster than comparable machine
instructions on other machines, because it is not
necessary to access a microprogram control
store during instruction execution.
15. The goal is to create an instruction set containing
instructions that execute quickly; most RISC
instructions execute in a single machine cycle (after
being fetched and decoded).
- RISC instructions, being simple, are hard-wired, while
CISC architectures have to use microprogramming in
order to implement complex instructions.
- Having only simple instructions results in reduced
complexity of the control unit and the data path; as a
consequence, the processor can work at a high clock
frequency.
- The pipelines are used efficiently if instructions are simple
and of similar execution time.
- Complex operations on RISCs are executed as a
sequence of simple RISC instructions. On
CISCs they are executed as a single complex
instruction or a few complex instructions.
16. Example
We have a program with 80% of executed
instructions being simple and 20% complex;
- on a CISC machine, simple instructions take 4
cycles and complex instructions take 8 cycles; the
cycle time is 100 ns (10^-7 s);
- on a RISC machine, simple instructions are
executed in one cycle; complex operations are
implemented as sequences of instructions; we
assume on average 14 instructions (14 cycles)
for a complex operation; the cycle time is 75 ns
(0.75 * 10^-7 s).
17. How much time does a program of 1 000 000
instructions take?
CISC: (10^6 * 0.80 * 4 + 10^6 * 0.20 * 8) * 10^-7 s = 0.48 s
RISC: (10^6 * 0.80 * 1 + 10^6 * 0.20 * 14) * 0.75 * 10^-7 s = 0.27 s
• complex operations take more time on the RISC,
but their number is small;
• because of its simplicity, the RISC works at a
smaller cycle time; with the CISC, simple
instructions are slowed down because of the
increased data path length and the increased
control complexity.
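The arithmetic above can be checked directly with the instruction mix and cycle times given in the example:

```python
# Cycle-time comparison from the example: 10^6 instructions,
# 80% simple / 20% complex, CISC vs. RISC as described above.
N = 1_000_000

# CISC: simple = 4 cycles, complex = 8 cycles, cycle time = 100 ns
cisc_time = (N * 0.80 * 4 + N * 0.20 * 8) * 100e-9

# RISC: simple = 1 cycle, complex operations average 14
# instructions (14 cycles), cycle time = 75 ns
risc_time = (N * 0.80 * 1 + N * 0.20 * 14) * 75e-9

print(f"CISC: {cisc_time:.2f} s")   # 0.48 s
print(f"RISC: {risc_time:.2f} s")   # 0.27 s
```

Note that the RISC executes more instructions in total (3.6 million cycles' worth versus 4.8 million cycles on the CISC), yet finishes sooner because of the shorter cycle time.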
18. A second characteristic is that most
operations should be register to register,
with only simple LOAD and STORE
operations accessing memory.
Only LOAD and STORE instructions
reference data in memory; all other
instructions operate only with registers
(are register-to-register instructions); thus,
only the few instructions accessing
memory need more than one cycle to
execute (after being fetched and decoded).
19. Third Characteristic: Instructions use only a few
addressing modes
- Addressing modes are usually register, direct,
register indirect, and displacement.
Almost all RISC instructions use simple register
addressing.
Fourth Characteristic: Instructions are of fixed
length and uniform format
- This makes the loading and decoding of
instructions simple and fast; there is no need to
wait until the length of an instruction is known in
order to start decoding the following one;
- Decoding is simplified because the opcode and
address fields are located in the same position
for all instructions.
20. Fifth Characteristic: A large number of
registers is available
- Variables and intermediate results can be
stored in registers and do not require
repeated loads and stores from/to
memory.
- All local variables of procedures and the
passed parameters can be stored in
registers.
21. What happens when a new
procedure is called?
- Normally the registers have to be saved in
memory (they contain values of variables and
parameters for the calling procedure); at return
to the calling procedure, the values have to be
again loaded from memory. This takes a lot of
time.
- If a large number of registers is available, a new
set of registers can be allocated to the called
procedure and the register set assigned to the
calling one remains untouched.
22. Is the strategy above realistic?
- The strategy is realistic, because the number of
local variables in procedures is not large, and
chains of nested procedure calls are only
exceptionally longer than six.
- If the chain of nested procedure calls becomes
long, at a certain call there will be no registers
left to assign to the called procedure; in this
case local variables and parameters have to be
stored in memory.
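The overflow behavior described above can be sketched as a toy register-window allocator (a simplified model; the window size, register-file size, and spill-oldest policy are illustrative assumptions, not any specific machine's design):

```python
# Toy model of register windows: each procedure call gets a fresh
# window of registers; when the register file is exhausted, the
# oldest window is spilled to memory (illustrative policy).
TOTAL_REGISTERS = 64
WINDOW_SIZE = 8                       # registers per procedure (assumption)
MAX_WINDOWS = TOTAL_REGISTERS // WINDOW_SIZE

windows = []                          # active (in-register) windows
spilled = []                          # windows saved to "memory"

def call(name):
    if len(windows) == MAX_WINDOWS:
        spilled.append(windows.pop(0))    # spill the oldest window
    windows.append(name)

def ret():
    windows.pop()
    if spilled:
        windows.insert(0, spilled.pop())  # restore on underflow

for depth in range(10):               # a call chain deeper than MAX_WINDOWS
    call(f"proc-{depth}")
print(len(windows), len(spilled))     # 8 active windows, 2 spilled
```

As the slide notes, spilling is rare in practice because call chains deeper than six are exceptional, so most calls complete without any memory traffic.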
23. Why is a large number of registers typical for RISC
architectures?
- Because of the reduced complexity of the
processor there is enough space on the
chip to be allocated to a large number of
registers. This, usually, is not the case
with CISCs.
24. The delayed load problem
• LOAD instructions (like STORE instructions)
require a memory access, so their execution
cannot be completed in a single clock cycle.
However, in the next cycle a new instruction is
started by the processor.
Two possible solutions:
1. The hardware should delay the execution of the
instruction following the LOAD, if this instruction
needs the loaded value
2. A more efficient, compiler based, solution, which
has similarities with the delayed branching, is
the delayed load:
25. With delayed load the processor always
executes the instruction following a LOAD,
without a stall; it is the programmer's
(compiler's) responsibility to ensure that this
instruction does not need the loaded
value.
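The compiler's job under delayed load can be sketched as a pass over the instruction sequence (a simplified model; encoding instructions as (opcode, dest, sources) tuples is an illustrative assumption):

```python
# After each LOAD, the next instruction must not read the loaded
# register; if it does, insert a NOP (a real compiler would instead
# try to move an independent instruction into the delay slot).
def fill_load_delay_slots(program):
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:        # hazard: value used too soon
                out.append(("NOP", None, ()))
    return out

prog = [("LOAD", "R1", ("mem",)),
        ("ADD",  "R2", ("R1", "R3"))]    # uses R1 right after the LOAD
print(fill_load_delay_slots(prog))       # NOP inserted between them
```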
26. CISC versus RISC Characteristics
After the initial enthusiasm for RISC machines,
there has been a growing realization that
(1) RISC designs may benefit from the inclusion of
some CISC features
(2) CISC designs may benefit from the inclusion of
some RISC features.
The result is that the more recent RISC designs,
notably the PowerPC, are no longer “pure” RISC
and the more recent CISC designs, notably the
Pentium II and later Pentium models, do
incorporate some RISC characteristics.
27. Typical RISC characteristics
1. A single instruction size.
2. That size is typically 4 bytes.
3. A small number of data addressing modes, typically less
than five. This parameter is difficult to pin down. In the
table, register and literal modes are not counted and
different formats with different offset sizes are counted
separately.
4. No indirect addressing that requires you to make one
memory access to get the address of another operand in
memory.
5. No operations that combine load/store with arithmetic
(e.g., add from memory, add to memory).
28. 6. No more than one memory-addressed operand
per instruction.
7. Does not support arbitrary alignment of data for
load/store operations.
8. No more than one use of the memory
management unit (MMU) for a data address in
an instruction.
9. Number of bits for integer register specifier
equal to five or more. This means that at least
32 integer registers can be explicitly referenced
at a time.
10. Number of bits for floating-point register
specifier equal to four or more. This means that
at least 16 floating-point registers can be
explicitly referenced at a time.
29. RISC PIPELINING
Instruction pipelining is often used to enhance
performance.
Let us reconsider this in the context of a RISC
architecture. Most instructions are register to register,
and an instruction cycle has the following two stages:
• I: Instruction fetch.
• E: Execute. Performs an ALU operation with register input
and output.
For load and store operations, three stages are required:
• I: Instruction fetch.
• E: Execute. Calculates memory address
• D: Memory. Register-to-memory or memory-to-register
operation
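The two-stage I/E overlap can be simulated cycle by cycle; this is a minimal sketch assuming one instruction fetched per cycle and no memory-port or branch conflicts:

```python
# Simple two-stage pipeline (I = fetch, E = execute): instruction
# k is fetched in cycle k and executed in cycle k+1, overlapping
# with the fetch of instruction k+1.
def schedule(n_instructions):
    table = {}
    for k in range(n_instructions):
        table[k] = {"I": k, "E": k + 1}   # cycle numbers, from 0
    return table

sched = schedule(3)
print(sched)
# Total cycles = n + 1 instead of 2n for a purely serial machine:
print(max(stages["E"] for stages in sched.values()) + 1)   # 4 for 3 instr.
```

For large n the ratio (2n)/(n + 1) approaches 2, which is the "up to twice the execution rate" claim made below.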
30. The two stages of the pipeline are an instruction fetch
stage, and an execute/memory stage that executes the
instruction, including register-to-memory and memory-to-
register operations. Thus we see that the instruction
fetch stage of the second instruction can be performed in
parallel with the first part of the execute/memory stage.
However, the execute/memory stage of the second
instruction must be delayed until the first instruction
clears the second stage of the pipeline. This scheme can
yield up to twice the execution rate of a serial scheme.
Two problems prevent the maximum speedup from
being achieved. First, we assume that a single-port
memory is used and that only one memory access is
possible per stage. This requires the insertion of a wait
state in some instructions. Second, a branch instruction
interrupts the sequential flow of execution. To
accommodate this with minimum circuitry, a NOOP
instruction can be inserted into the instruction stream by
the compiler or assembler.
31. Pipelining can be improved further by
permitting two memory accesses per
stage. Now, up to three instructions can
be overlapped, and the improvement is as
much as a factor of 3. Again, branch
instructions cause the speedup to fall
short of the maximum possible. Also, note
that data dependencies have an effect. If
an instruction needs an operand that is
altered by the preceding instruction, a
delay is required. Again, this can be
accomplished by a NOOP.
32. The pipelining discussed so far works best if the three
stages are of approximately equal duration. Because the E
stage usually involves an ALU operation, it may be longer.
In this case, we can divide it into two substages:
• Register file read
• ALU operation and register write
Because of the simplicity and regularity of a RISC
instruction set, the design of the phasing into three or four
stages is easily accomplished. Figure 13.6d shows the
result with a four-stage pipeline. Up to four instructions at
a time can be under way, and the maximum potential
speedup is a factor of 4. Note again the use of NOOPs to
account for data and branch delays.
33. Optimization of Pipelining
Because of the simple and regular nature of RISC
instructions, pipelining schemes can be efficiently
employed. There are few variations in instruction execution
duration, and the pipeline can be tailored to reflect this.
However, we have seen that data and branch dependencies
reduce the overall execution rate.
DELAYED BRANCH To compensate for these
dependencies, code reorganization techniques have been
developed. First, let us consider branching instructions.
Delayed branch, a way of increasing the efficiency of the
pipeline, makes use of a branch that does not take effect
until after execution of the following instruction (hence the
term delayed).
34. LOOP UNROLLING Another compiler technique
to improve instruction parallelism is loop
unrolling [BACO94]. Unrolling replicates the
body of a loop some number of times called the
unrolling factor (u) and iterates by step u instead
of step 1.
Unrolling can improve the performance by
• reducing loop overhead
• increasing instruction parallelism by improving
pipeline performance
• improving register, data cache, or TLB locality
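Unrolling by a factor of u = 4 can be illustrated with a sketch (the loop body, a simple accumulation, is an illustrative assumption):

```python
# Original loop: iterates by step 1.
def total(xs):
    s = 0
    for i in range(len(xs)):
        s += xs[i]
    return s

# Unrolled with factor u = 4: four copies of the body per
# iteration, plus a cleanup loop for leftover elements.
def total_unrolled(xs):
    s = 0
    n = len(xs)
    i = 0
    while i + 4 <= n:
        s += xs[i]
        s += xs[i + 1]
        s += xs[i + 2]
        s += xs[i + 3]
        i += 4
    while i < n:              # cleanup when n is not a multiple of 4
        s += xs[i]
        i += 1
    return s

data = list(range(10))
print(total(data), total_unrolled(data))   # both print 45
```

In Python this only illustrates the transformation; the payoff (fewer branch and index-update instructions per element, and more independent operations for the pipeline) appears when a compiler applies it to machine code.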
35. Instruction Set
Table below lists the basic instruction set for all MIPS R
series processors. All processor instructions are
encoded in a single 32-bit word format. All data
operations are register to register; the only memory
references are pure load/store operations.
The R4000 makes no use of condition codes. If an
instruction generates a condition, the corresponding
flags are stored in a general-purpose register. This
avoids the need for special logic to deal with condition
codes as they affect the pipelining mechanism and the
reordering of instructions by the compiler. Instead, the
mechanisms already implemented to deal with register-
value dependencies are employed. Further, conditions
mapped onto the register files are subject to the same
compile-time optimizations in allocation and reuse as
other values stored in registers.
36. As with most RISC-based machines, the MIPS
uses a single 32-bit instruction length. This
single instruction length simplifies instruction
fetch and decode, and it also simplifies the
interaction of instruction fetch with the virtual
memory management unit (i.e., instructions do
not cross word or page boundaries). The three
instruction formats share common formatting of
opcodes and register references, simplifying
instruction decode. The effect of more complex
instructions can be synthesized at compile time.
Only the simplest and most frequently used
memory-addressing mode is implemented in
hardware. All memory references consist of a
16-bit offset from a 32-bit register.
37. MIPS R-Series Instruction Set (OP
& Description)
Load/Store Instructions
LB Load Byte
LBU Load Byte Unsigned
LH Load Halfword
LHU Load Halfword Unsigned
LW Load Word
LWL Load Word Left
LWR Load Word Right
SB Store Byte
SH Store Halfword
SW Store Word
SWL Store Word Left
SWR Store Word Right
38. Arithmetic Instructions (3-operand, R-type)
ADD Add
ADDU Add Unsigned
SUB Subtract
SUBU Subtract Unsigned
SLT Set on Less Than
SLTU Set on Less Than Unsigned
AND AND
OR OR
XOR Exclusive-OR
NOR NOR
Arithmetic Instructions (ALU Immediate)
ADDI Add Immediate
ADDIU Add Immediate Unsigned
SLTI Set on Less Than Immediate
SLTIU Set on Less Than Immediate Unsigned
ANDI AND Immediate
ORI OR Immediate
XORI Exclusive-OR Immediate
LUI Load Upper Immediate
39. Multiply/Divide Instructions
MULT Multiply
MULTU Multiply Unsigned
DIV Divide
DIVU Divide Unsigned
MFHI Move From HI
MTHI Move To HI
MFLO Move From LO
MTLO Move To LO
Shift Instructions
SLL Shift Left Logical
SRL Shift Right Logical
SRA Shift Right Arithmetic
SLLV Shift Left Logical Variable
SRLV Shift Right Logical Variable
SRAV Shift Right Arithmetic Variable
40. Coprocessor Instructions
LWCz Load Word to Coprocessor
SWCz Store Word to Coprocessor
MTCz Move To Coprocessor
MFCz Move From Coprocessor
CTCz Move Control To Coprocessor
CFCz Move Control From Coprocessor
COPz Coprocessor Operation
BCzT Branch on Coprocessor z True
BCzF Branch on Coprocessor z False
Special Instructions
SYSCALL System Call
BREAK Break
Jump and Branch Instructions
J Jump
JAL Jump and Link
JR Jump to Register
JALR Jump and Link Register
BEQ Branch on Equal
BNE Branch on Not Equal
BLEZ Branch on Less Than or Equal to Zero
BGTZ Branch on Greater Than Zero
BLTZ Branch on Less Than Zero
BGEZ Branch on Greater Than or Equal to Zero
BLTZAL Branch on Less Than Zero And Link
BGEZAL Branch on Greater Than or Equal to Zero And Link
41. Instruction Pipeline
With its simplified instruction architecture, the MIPS can
achieve very efficient pipelining. It is instructive to look at
the evolution of the MIPS pipeline, as it illustrates the
evolution of RISC pipelining in general.
The initial experimental RISC systems and the first
generation of commercial RISC processors achieved
execution speeds that approach one instruction per
system clock cycle. To improve on this performance, two
classes of processors have evolved to offer execution of
multiple instructions per clock cycle: superscalar and
superpipelined architectures. In essence, a superscalar
architecture replicates each of the pipeline stages so that
two or more instructions at the same stage of the
pipeline can be processed simultaneously.
42. A superpipelined architecture is one that makes
use of more, and more fine-grained, pipeline
stages. With more stages, more instructions can
be in the pipeline at the same time, increasing
parallelism.
Both approaches have limitations. With
superscalar pipelining, dependencies between
instructions in different pipelines can slow down
the system. Also, overhead logic is required to
coordinate these dependencies. With
superpipelining, there is overhead associated with
transferring instructions from one stage to the
next.
43. RISC VERSUS CISC CONTROVERSY
The work that has been done on
assessing merits of the RISC approach
can be grouped into two categories:
• Quantitative: Attempts to compare
program size and execution speed of
programs on RISC and CISC machines
that use comparable technology
• Qualitative: Examines issues such as high-
level language support and optimum use
of VLSI real estate
44. Most of the work on quantitative assessment has been done by those
working on RISC systems [PATT82b, HEAT84, PATT84], and it has
been, by and large, favorable to the RISC approach. Others have
examined the issue and come away unconvinced [COLW85a,
FLYN87, DAVI87]. There are several problems with attempting such
comparisons [SERL86]:
• There is no pair of RISC and CISC machines that are comparable in
life-cycle cost, level of technology, gate complexity, sophistication of
compiler, operating system support, and so on.
• No definitive test set of programs exists. Performance varies with the
program.
• It is difficult to sort out hardware effects from effects due to skill in
compiler writing.
• Most of the comparative analysis on RISC has been done on “toy”
machines rather than commercial products. Furthermore, most
commercially available machines advertised as RISC possess a
mixture of RISC and CISC characteristics. Thus, a fair comparison
with a commercial, “pure-play” CISC machine (e.g., VAX, Pentium) is
difficult.
The qualitative assessment is, almost by definition, subjective.
46. A superscalar processor is one in which
multiple independent instruction pipelines are
used. Each pipeline consists of multiple
stages, so that each pipeline can handle
multiple instructions at a time. Multiple
pipelines introduce a new level of
parallelism, enabling multiple streams of
instructions to be processed at a time. A
superscalar processor exploits what is
known as instruction-level parallelism,
which refers to the degree to which the
instructions of a program can be executed in
parallel
47. A superscalar processor typically fetches multiple
instructions at a time and then attempts to find nearby
instructions that are independent of one another and can
therefore be executed in parallel. If the input to one
instruction depends on the output of a preceding
instruction, then the latter instruction cannot complete
execution at the same time or before the former
instruction. Once such dependencies have been identified,
the processor may issue and complete instructions in an
order that differs from that of the original machine code.
The processor may eliminate some unnecessary
dependencies by the use of additional registers and the
renaming of register references in the original code.
Whereas pure RISC processors often employ delayed
branches to maximize the utilization of the instruction
pipeline, this method is less appropriate to a superscalar
machine. Instead, most superscalar machines use
traditional branch prediction methods to improve efficiency
48. A superscalar implementation of a processor
architecture is one in which common
instructions—integer and floating-point
arithmetic, loads, stores, and conditional
branches—can be initiated simultaneously
and executed independently. Such
implementations raise a number of complex
design issues related to the instruction
pipeline
49. The term superscalar, first coined in 1987
[AGER87], refers to a machine that is
designed to improve the performance of the
execution of scalar instructions. In most
applications, the bulk of the operations are
on scalar quantities. Accordingly, the
superscalar approach represents the next
step in the evolution of high-performance
general-purpose processors
The essence of the superscalar approach is
the ability to execute instructions
independently and concurrently in different
pipelines
50. Superscalar versus Superpipelined
An alternative approach to achieving
greater performance is referred to as
superpipelining. Superpipelining exploits
the fact that many pipeline stages perform
tasks that require less than half a clock
cycle. Thus, a doubled internal clock speed
allows the performance of two tasks in one
external clock cycle.
51. The pipeline has four stages: instruction fetch,
operation decode, operation execution, and
result write back. The execution stage is
crosshatched for clarity. Note that although
several instructions are executing concurrently,
only one instruction is in its execution stage at
any one time.
Both the superpipeline and the superscalar
implementations have the same number of
instructions executing at the same time in the
steady state. The superpipelined processor falls
behind the superscalar processor at the start of
the program and at each branch target
52. Limitations
The superscalar approach depends on the ability to execute multiple
instructions in parallel.
The term instruction-level parallelism refers to the degree to
which, on average, the instructions of a program can be executed in
parallel. A combination of compiler-based optimization and hardware
techniques can be used to maximize instruction-level parallelism.
Before examining the design techniques used in superscalar
machines to increase instruction-level parallelism, we need to look
at the fundamental limitations to parallelism with which the system
must cope. [JOHN91] lists five limitations:
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency
53. A typical RISC processor takes two or more
cycles to perform a load from memory when the
load is a cache hit. It can take tens or even
hundreds of cycles for a cache miss on all cache
levels, because of the delay of an off-chip
memory access.
One way to compensate for this delay is for the
compiler to reorder instructions so that one or
more subsequent instructions that do not
depend on the memory load can begin flowing
through the pipeline. This scheme is less
effective in the case of a superscalar pipeline:
The independent instructions executed during
the load are likely to be executed on the first
cycle of the load, leaving the processor with
nothing to do until the load completes.
54. DESIGN ISSUES
Instruction-level parallelism exists when
instructions in a sequence are
independent and thus can be executed in
parallel by overlapping
As an example of the concept of
instruction-level parallelism, consider the
following two code fragments [JOUP89b]:
Fragment 1 (independent):     Fragment 2 (dependent):
Load  R1 ← R2                 Add   R3 ← R3, “1”
Add   R3 ← R3, “1”            Add   R4 ← R3, R2
Add   R4 ← R4, R2             Store [R4] ← R0
55. The three instructions on the left are independent, and in
theory all three could be executed in parallel. In contrast,
the three instructions on the right cannot be executed in
parallel because the second instruction uses the result of
the first, and the third instruction uses the result of the
second.
The degree of instruction-level parallelism is determined
by the frequency of true data dependencies and
procedural dependencies in the code. These factors, in
turn, are dependent on the instruction set architecture
and on the application.
Instruction-level parallelism is also determined by what
[JOUP89a] refers to as operation latency: the time until
the result of an instruction is available for use as an
operand in a subsequent instruction. The latency
determines how much of a delay a data or procedural
dependency will cause
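True data dependencies like those in the second fragment can be detected mechanically (a sketch; encoding each instruction as a (dest, sources) pair is an illustrative assumption):

```python
# A later instruction truly depends on an earlier one if it reads
# a register that the earlier instruction writes (read-after-write).
def has_true_dependency(earlier, later):
    dest, _ = earlier
    _, srcs = later
    return dest in srcs

# Second fragment from the text: each instruction reads the
# result of the previous one, so nothing can run in parallel.
frag2 = [("R3", ("R3",)),        # Add   R3 <- R3, 1
         ("R4", ("R3", "R2")),   # Add   R4 <- R3, R2
         (None, ("R4", "R0"))]   # Store [R4] <- R0

print(has_true_dependency(frag2[0], frag2[1]))  # True
print(has_true_dependency(frag2[1], frag2[2]))  # True
```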
56. Machine parallelism is a measure of the ability of the
processor to take advantage of instruction-level
parallelism. Machine parallelism is determined by the
number of instructions that can be fetched and executed
at the same time (the number of parallel pipelines) and
by the speed and sophistication of the mechanisms that
the processor uses to find independent instructions.
Both instruction-level and machine parallelism are
important factors in enhancing performance. A program
may not have enough instruction-level parallelism to take
full advantage of machine parallelism. The use of a
fixed-length instruction set architecture, as in a RISC,
enhances instruction-level parallelism. On the other
hand, limited machine parallelism will limit performance
no matter what the nature of the program
57. Instruction Issue Policy
The processor must also be able to identify instruction-level
parallelism and orchestrate the fetching, decoding, and execution of
instructions in parallel.
We use the term instruction issue to refer to the process of initiating
instruction execution in the processor’s functional units and the term
instruction issue policy to refer to the protocol used to issue
instructions.
In general, we can say that instruction issue occurs when an instruction
moves from the decode stage of the pipeline to the first execute stage
of the pipeline. In essence, the processor is trying to look ahead of the
current point of execution to locate instructions that can be brought
into the pipeline and executed. Three types of orderings are important
in this regard:
• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of register and
memory locations
58. In general terms, we can group
superscalar instruction issue policies into
the following categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order
completion
59. IN-ORDER ISSUE WITH IN-ORDER
COMPLETION The simplest instruction issue
policy is to issue instructions in the exact order
that would be achieved by sequential execution
(in-order issue) and to write results in that same
order (in-order completion). Not even scalar
pipelines follow such a simple-minded policy.
However, it is useful to consider this policy as a
baseline for comparing more sophisticated
approaches.
60. IN-ORDER ISSUE WITH OUT-OF-
ORDER COMPLETION Out-of-order
completion is used in scalar RISC
processors to improve the performance of
instructions that require multiple cycles.
With out-of-order completion, any number
of instructions may be in the execution
stage at any one time, up to the maximum
degree of machine parallelism across all
functional units. Instruction issuing is
stalled by a resource conflict, a data
dependency, or a procedural dependency.
61. OUT-OF-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION
With in-order issue, the processor will only decode instructions up to
the point of a dependency or conflict. No additional instructions are
decoded until the conflict is resolved. As a result, the processor
cannot look ahead of the point of conflict to subsequent instructions
that may be independent of those already in the pipeline and that
may be usefully introduced into the pipeline.
To allow out-of-order issue, it is necessary to decouple the decode
and execute stages of the pipeline. This is done with a buffer
referred to as an instruction window. With this organization, after
a processor has finished decoding an instruction, it is placed in the
instruction window. As long as this buffer is not full, the processor
can continue to fetch and decode new instructions. When a
functional unit becomes available in the execute stage, an
instruction from the instruction window may be issued to the execute
stage. Any instruction may be issued, provided that
(1) it needs the particular functional unit that is available, and
(2) no conflicts or dependencies block this instruction.
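The issue test in points (1) and (2) can be sketched as follows. The model is hypothetical: the functional-unit names, the register-readiness test, and the one-instruction-per-unit-per-cycle assumption are all illustrative, not details from the text.

```python
# Sketch of issuing from an instruction window: decoded instructions
# wait in a buffer and issue out of order as soon as (1) a matching
# functional unit is free and (2) all source values are ready.

def issue_from_window(window, free_units, ready_regs):
    """Pick every instruction in `window` that can issue this cycle.

    window: list of (name, unit, srcs) tuples in program order.
    free_units: set of available functional-unit kinds (mutated).
    ready_regs: set of registers whose values are available.
    """
    issued = []
    for name, unit, srcs in list(window):
        if unit in free_units and all(s in ready_regs for s in srcs):
            issued.append(name)
            free_units.discard(unit)            # unit is busy this cycle
            window.remove((name, unit, srcs))   # leaves the window
    return issued

window = [("I1", "alu", {"r1"}),      # blocked: r1 not ready
          ("I2", "mul", {"r2"}),      # can issue on the multiplier
          ("I3", "alu", {"r3"})]      # can issue ahead of I1
print(issue_from_window(window, {"alu", "mul"}, {"r2", "r3"}))
```

Note that I3 issues ahead of the older I1, which is precisely the lookahead capability described next.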
62. The result of this organization is that the
processor has a lookahead capability,
allowing it to identify independent
instructions that can be brought into the
execute stage. Instructions are issued
from the instruction window with little
regard for their original program order. As
before, the only constraint is that the
program execution behaves correctly
63. One common technique that is used to support
out-of-order completion is the reorder buffer. The
reorder buffer is temporary storage for results
completed out of order that are then committed
to the register file in program order. A related
concept is Tomasulo’s algorithm.
The term antidependency (also called a write-after-read
hazard) is used because the constraint is similar to that
of a true data dependency, but reversed: instead of the
first instruction producing a value that the second
instruction uses, the second instruction destroys a value
that the first instruction uses.
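A minimal model of a reorder buffer, under the simplifying assumption of one buffer entry per instruction, allocated at dispatch in program order:

```python
# Sketch of a reorder buffer: results may arrive out of order, but
# they commit to the register file only from the head, in order.

class ReorderBuffer:
    def __init__(self, size):
        self.entries = []          # [name, done] pairs in program order
        self.size = size

    def dispatch(self, name):
        assert len(self.entries) < self.size
        self.entries.append([name, False])

    def complete(self, name):
        # a result arrived, possibly out of program order
        for e in self.entries:
            if e[0] == name:
                e[1] = True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.pop(0)[0])
        return retired

rob = ReorderBuffer(8)
for i in ("I1", "I2", "I3"):
    rob.dispatch(i)
rob.complete("I3")                 # I3 finishes first...
print(rob.commit())                # ...but nothing retires: I1 not done
rob.complete("I1")
rob.complete("I2")
print(rob.commit())                # now I1, I2, I3 retire in order
```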
64. Register Renaming
One method for coping with these types of
storage conflicts is based on a traditional
resource-conflict solution: duplication of
resources. In this context, the technique is
referred to as register renaming. In
essence, registers are allocated
dynamically by the processor hardware,
and they are associated with the values
needed by instructions at various points in
time.
65. When a new register value is created (i.e., when
an instruction executes that has a register as a
destination operand), a new register is allocated
for that value. Subsequent instructions that
access that value as a source operand in that
register must go through a renaming process:
the register references in those instructions must
be revised to refer to the register containing the
needed value. Thus, the same original register
reference in several different instructions may
refer to different actual registers, if different
values are intended
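This renaming process can be sketched as follows. The model is simplified: it assumes an unbounded supply of physical registers p0, p1, ... and performs no register reclamation. In the example program, r3 is written twice; after renaming, the two values live in different physical registers, so the anti- and output dependencies on r3 disappear.

```python
# Sketch of register renaming: each write to an architectural register
# allocates a fresh physical register, and later reads are redirected
# to the newest mapping.

def rename(program):
    """program: list of (dest, src1, src2) tuples of architectural names.
    Returns the renamed program using physical registers p0, p1, ..."""
    mapping = {}                   # architectural -> current physical
    next_phys = 0
    renamed = []
    for dest, *srcs in program:
        # sources read the current mapping (or the original name if the
        # register was never written inside this fragment)
        phys_srcs = [mapping.get(s, s) for s in srcs]
        mapping[dest] = f"p{next_phys}"   # fresh register for the new value
        next_phys += 1
        renamed.append((mapping[dest], *phys_srcs))
    return renamed

# r3 := r3 op r5 ; r4 := r3 + 1 ; r3 := r5 + 1 ; r7 := r3 op r4
prog = [("r3", "r3", "r5"), ("r4", "r3", "1"),
        ("r3", "r5", "1"), ("r7", "r3", "r4")]
for line in rename(prog):
    print(line)
```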
66. An alternative to register renaming is
scoreboarding. In essence, scoreboarding
is a bookkeeping technique that allows
instructions to execute whenever they are
not dependent on previous instructions
and no structural hazards are present.
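A minimal sketch of the scoreboard's issue test. This is a hypothetical simplification: a real scoreboard also tracks write-after-read hazards at write-back; only the structural, read-after-write, and write-after-write checks at issue are shown here.

```python
# The scoreboard records which registers have a write pending; an
# instruction may start only if its functional unit is free and
# neither its operands nor its destination collide with pending writes.

def can_start(instr, pending_writes, busy_units):
    name, unit, dest, srcs = instr
    if unit in busy_units:                        # structural hazard
        return False
    if any(s in pending_writes for s in srcs):    # RAW on a source
        return False
    if dest in pending_writes:                    # WAW on the destination
        return False
    return True

pending = {"r1"}                  # an earlier instruction will write r1
busy = {"mul"}
print(can_start(("ADD", "alu", "r4", ["r2", "r3"]), pending, busy))  # True
print(can_start(("SUB", "alu", "r5", ["r1", "r3"]), pending, busy))  # False
```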
67. Branch Prediction
Any high-performance pipelined machine
must address the issue of dealing with
branches. For example, the Intel 80486
addressed the problem by fetching both
the next sequential instruction after a
branch and speculatively fetching the
branch target instruction. However,
because there are two pipeline stages
between prefetch and execution, this
strategy incurs a two-cycle delay when a
branch is taken.
68. With the advent of RISC machines, the delayed branch
strategy was explored. This allows the processor to
calculate the result of a conditional branch instruction
before any unusable instructions have been prefetched:
the processor always executes the single instruction that
immediately follows the branch, which keeps the pipeline
full while the processor fetches a new instruction stream.
With the development of superscalar machines, the
delayed branch strategy has less appeal. The reason is
that multiple instructions need to execute in the delay
slot, raising several problems relating to instruction
dependencies. Thus, superscalar machines have
returned to pre-RISC techniques of branch prediction.
Some, like the PowerPC 601, use a simple static branch
prediction technique. More sophisticated processors,
such as the PowerPC 620 and the Pentium 4, use
dynamic branch prediction based on branch history
analysis.
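Branch-history analysis is commonly based on a 2-bit saturating counter per branch. The sketch below is illustrative only: the table organization, the initial counter state, and the addresses are assumptions, not details of the PowerPC 620 or Pentium 4.

```python
# Sketch of dynamic branch prediction with a 2-bit saturating counter
# per branch address: two consecutive mispredictions are needed to
# flip the prediction, so a loop's single exit branch costs one miss.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}         # branch address -> state 0..3

    def predict(self, pc):
        # states 2,3 predict taken; 0,1 predict not taken
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        c = min(3, c + 1) if taken else max(0, c - 1)
        self.counters[pc] = c

bp = TwoBitPredictor()
outcomes = [True, True, True, False, True, True, True]  # loop branch
hits = 0
for taken in outcomes:
    hits += (bp.predict(0x400) == taken)
    bp.update(0x400, taken)
print(hits)   # 5 of 7 predictions correct
```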
69. Superscalar Implementation
Based on our discussion so far, we can make some
general comments about the processor hardware
required for the superscalar approach. [SMIT95] lists the
following key elements:
• Instruction fetch strategies that simultaneously fetch
multiple instructions, often by predicting the outcomes of,
and fetching beyond, conditional branch instructions.
These functions require the use of multiple pipeline fetch
and decode stages, and branch prediction logic.
• Logic for determining true dependencies involving register
values, and mechanisms for communicating these
values to where they are needed during execution
70. Mechanisms for initiating, or issuing,
multiple instructions in parallel.
• Resources for parallel execution of multiple
instructions, including multiple pipelined
functional units and memory hierarchies
capable of simultaneously servicing
multiple memory references.
• Mechanisms for committing the process
state in correct order.
71. PENTIUM 4
Although the concept of superscalar design is generally associated with the
RISC architecture, the same superscalar principles can be applied to a
CISC machine. Perhaps the most notable example of this is the Pentium.
The evolution of superscalar concepts in the Intel line is interesting to note.
The 386 is a traditional CISC nonpipelined machine.
The 486 was the first pipelined x86 processor, reducing the average
latency of integer operations from between two and four cycles to one cycle,
but it was still limited to executing a single instruction each cycle, with no
component, consisting of the use of two separate integer execution units.
The Pentium Pro introduced a full-blown superscalar design. Subsequent
Pentium models have refined and enhanced the superscalar design.
A general block diagram of the Pentium 4 is shown in the figure below, which
depicts its structure in a way suitable for the pipeline discussion in this
section. The operation of the Pentium 4 can be summarized as follows:
1. The processor fetches instructions from memory in the order of the static
program.
2. Each instruction is translated into one or more fixed-length RISC instructions,
known as micro-operations, or micro-ops.
72. 3. The processor executes the micro-ops on
a superscalar pipeline organization, so
that the micro-ops may be executed out of
order.
4. The processor commits the results of
each micro-op execution to the
processor’s register set in the order of the
original program flow
75. Front End
GENERATION OF MICRO-OPS The Pentium 4
organization includes an in-order front end that
can be considered outside the scope of the
pipeline depicted in figure above. This front end
feeds into an L1 instruction cache, called the
trace cache, which is where the pipeline proper
begins. Usually, the processor operates from the
trace cache; when a trace cache miss occurs,
the in-order front end feeds new instructions into
the trace cache.