Reduced Instruction
Set Computers
Lesson 5 (RISC)
The semantic gap is the difference between the
operations provided in HLLs (High Level Languages) and
those provided in computer architecture.
Symptoms of this gap are alleged to include execution
inefficiency, excessive machine program size, and
compiler complexity. Designers responded with
architectures intended to close this gap. Key features
include large instruction sets, dozens of addressing
modes, and various HLL statements implemented in
hardware. An example of the latter is the CASE machine
instruction on the VAX. Such complex instruction sets
are intended to
• Ease the task of the compiler writer.
• Improve execution efficiency, because complex
sequences of operations can be implemented in
microcode.
• Provide support for even more complex and sophisticated
HLLs.
Reduced instruction set
computer (RISC) architecture
 The RISC architecture is a dramatic departure from the
historical trend in processor architecture. An analysis of
the RISC architecture brings into focus many of the
important issues in computer organization and
architecture.
 Although RISC systems have been defined and
designed in a variety of ways by different groups, the key
elements shared by most designs are these:
• A large number of general-purpose registers, and/or the
use of compiler technology to optimize register usage
• A limited and simple instruction set
• An emphasis on optimizing the instruction pipeline
Semantic gap
 In order to improve the efficiency of
software development, new and powerful
programming languages have been
developed (Ada, C++, Java).
 They provide: high level of abstraction,
conciseness, power.
• Through this evolution, the semantic gap grows.
 Problem: How should new HLL programs
be compiled and executed efficiently on a
processor architecture?
Two possible answers:
1. The CISC approach: design very complex
architectures including a large number of
instructions and addressing modes; also
include instructions close to those present
in HLLs.
2. The RISC approach: simplify the
instruction set and adapt it to the real
requirements of user programs
Why RISC is needed
 RISC architectures represent an important
innovation in the area of computer organization.
• The RISC architecture is an attempt to produce
more CPU power by simplifying the instruction
set of the CPU.
• The opposed trend to RISC is that of complex
instruction set computers (CISC).
 Both RISC and CISC architectures have been
developed as an attempt to cover the
semantic gap.
INSTRUCTION EXECUTION
CHARACTERISTICS IN RISC
 Operations performed: These determine the
functions to be performed by the processor and
its interaction with memory.
 Operands used: The types of operands and the
frequency of their use determine the memory
organization for storing them and the addressing
modes for accessing them.
 Execution sequencing: This determines the
control and pipeline organization.
Evaluation of Program execution
 Several studies have been conducted to
determine the execution characteristics of
machine instruction sequences generated from
HLL programs.
• Aspects of interest:
1. The frequency of operations performed.
2. The types of operands and their frequency of
use.
3. Execution sequencing (frequency of jumps,
loops, subprogram calls).
 Frequency of Instructions Executed
• Frequency distribution of executed machine
instructions:
 moves: 33%
 conditional branch: 20%
 arithmetic/logic: 16%
 others: between 0.1% and 10%
• Addressing modes: the overwhelming majority of
instructions use simple addressing modes, in
which the address can be calculated in a single
cycle (register, register indirect, displacement);
complex addressing modes (memory indirect,
indexed+indirect, displacement+indexed, stack)
are used by only ~18% of the instructions.
 Operand Types
• 74 to 80% of the operands are scalars (integers,
reals, characters, etc.), which can be held in
registers;
• the rest (20-26%) are arrays/structures; 90% of
them are global variables;
• 80% of the scalars are local variables.
NB:The majority of operands are local variables of
scalar type, which can be stored in registers
 Some statistics concerning procedure
calls:
• Only 1.25% of called procedures have
more than six parameters.
• Only 6.7% of called procedures have more
than six local variables.
• Chains of nested procedure calls are
usually short and only very seldom longer
than 6.
Conclusions from Evaluation of Program
Execution
• An overwhelming preponderance of simple (ALU and
move) operations over complex operations.
• Preponderance of simple addressing modes.
• Large frequency of operand accesses; on average each
instruction references 1.9 operands.
• Most of the referenced operands are scalars (so they can
be stored in a register) and are local variables or
parameters.
• Optimizing the procedure CALL/RETURN mechanism
promises large benefits in speed.
 These conclusions were the starting point for the
Reduced Instruction Set Computer (RISC) approach.
Characteristics of Reduced
Instruction Set Architectures
 Although a variety of different approaches
to reduced instruction set architecture
have been taken, certain characteristics
are common to all of them:
• One instruction per cycle
• Register-to-register operations
• Simple addressing modes
• Simple instruction formats
 The first characteristic listed is that there is one
machine instruction per machine cycle. A
machine cycle is defined to be the time it takes
to fetch two operands from registers, perform an
ALU operation, and store the result in a register.
Thus, RISC machine instructions should be no
more complicated than, and execute about as
fast as, microinstructions on CISC machines.
With simple, one-cycle instructions, there is little
or no need for microcode; the machine
instructions can be hardwired. Such instructions
should execute faster than comparable machine
instructions on other machines, because it is not
necessary to access a microprogram control
store during instruction execution.
 The goal is to create an instruction set containing
instructions that execute quickly; most RISC
instructions are executed in a single machine cycle (after
being fetched and decoded).
- RISC instructions, being simple, are hard-wired, while
CISC architectures have to use microprogramming in
order to implement complex instructions.
- Having only simple instructions results in reduced
complexity of the control unit and the data path; as a
consequence, the processor can work at a high clock
frequency.
- The pipelines are used efficiently if instructions are simple
and of similar execution time.
- Complex operations on RISCs are executed as a
sequence of simple RISC instructions. On CISCs they
are executed as a single complex instruction, or a few
such instructions.
Example
 We have a program with 80% of executed
instructions being simple and 20% complex;
- on a CISC machine, simple instructions take 4 cycles
and complex instructions take 8 cycles; the cycle time
is 100 ns (10^-7 s);
- on a RISC machine, simple instructions are executed
in one cycle; complex operations are implemented as
a sequence of instructions; we consider on average
14 instructions (14 cycles) for a complex operation;
the cycle time is 75 ns (0.75 * 10^-7 s).
 How much time does a program of 1 000 000
instructions take?
 CISC: (10^6 * 0.80 * 4 + 10^6 * 0.20 * 8) * 10^-7 s = 0.48 s
 RISC: (10^6 * 0.80 * 1 + 10^6 * 0.20 * 14) * 0.75 * 10^-7 s = 0.27 s
• complex operations take more time on the RISC,
but their number is small;
• because of its simplicity, the RISC works at a
shorter cycle time; with the CISC, simple
instructions are slowed down because of the
increased data path length and the increased
control complexity.
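The arithmetic in the example above can be checked with a short Python sketch; all the figures here are the assumed values from the example, not measurements:

```python
# Execution-time comparison from the example above.
INSTRUCTIONS = 1_000_000
SIMPLE, COMPLEX = 0.80, 0.20              # assumed instruction mix

# CISC: simple = 4 cycles, complex = 8 cycles, cycle time = 100 ns
cisc_cycles = INSTRUCTIONS * (SIMPLE * 4 + COMPLEX * 8)
cisc_time = cisc_cycles * 100e-9          # seconds

# RISC: simple = 1 cycle, a complex operation = ~14 simple instructions,
# cycle time = 75 ns
risc_cycles = INSTRUCTIONS * (SIMPLE * 1 + COMPLEX * 14)
risc_time = risc_cycles * 75e-9

print(f"CISC: {cisc_time:.2f} s, RISC: {risc_time:.2f} s")
# CISC: 0.48 s, RISC: 0.27 s
```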
 A second characteristic is that most
operations should be register to register,
with only simple LOAD and STORE
operations accessing memory.
Only LOAD and STORE instructions
reference data in memory; all other
instructions operate only on registers
(they are register-to-register instructions); thus,
only the few instructions accessing
memory need more than one cycle to
execute (after being fetched and decoded).
Third characteristic: instructions use only a few
addressing modes
- Addressing modes are usually register, direct,
register indirect, displacement.
 Almost all RISC instructions use simple register
addressing.
Fourth characteristic: instructions are of fixed
length and uniform format
- This makes the loading and decoding of
instructions simple and fast; there is no need to
wait until the length of an instruction is known in
order to start decoding the following one;
- Decoding is simplified because the opcode and
address fields are located in the same position
for all instructions
 Fifth characteristic: a large number of
registers is available
- Variables and intermediate results can be
stored in registers and do not require
repeated loads and stores from/to
memory.
- All local variables of procedures and the
passed parameters can be stored in
registers.
What happens when a new
procedure is called?
- Normally the registers have to be saved in
memory (they contain values of variables and
parameters for the calling procedure); at return
to the calling procedure, the values have to be
again loaded from memory. This takes a lot of
time.
- If a large number of registers is available, a new
set of registers can be allocated to the called
procedure and the register set assigned to the
calling one remains untouched.
Is the strategy above realistic?
- The strategy is realistic, because the number of
local variables in procedures is not large, and
chains of nested procedure calls are only
exceptionally longer than 6.
- If the chain of nested procedure calls becomes
large, at a certain call there will be no registers
to be assigned to the called procedure; in this
case local variables and parameters have to be
stored in memory
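The overflow behavior described above can be illustrated with a deliberately simplified Python model. The `WindowFile` name, the window count, and the spill policy are all hypothetical; this is a sketch of the idea, not a description of any real register file:

```python
# Simplified register-window model: each call gets a fresh window; when
# the windows run out, the oldest one is spilled to memory (here, a list).
class WindowFile:
    def __init__(self, n_windows):
        self.n = n_windows
        self.depth = 0          # current call-nesting depth
        self.spilled = []       # window numbers saved to "memory"

    def call(self):
        self.depth += 1
        if self.depth > self.n:             # overflow: spill the oldest window
            self.spilled.append(self.depth - self.n)

    def ret(self):
        self.depth -= 1

wf = WindowFile(6)              # six windows, matching the typical chain depth
for _ in range(8):              # a nesting chain deeper than 6
    wf.call()
print(wf.spilled)               # two windows had to be spilled: [1, 2]
```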
Why is a large number of registers typical for RISC
architectures?
- Because of the reduced complexity of the
processor there is enough space on the
chip to be allocated to a large number of
registers. This, usually, is not the case
with CISCs.
The delayed load problem
 LOAD instructions (and similarly STORE instructions)
require memory access, so their execution
cannot be completed in a single clock cycle.
However, in the next cycle a new instruction is
started by the processor.
Two possible solutions:
1. The hardware should delay the execution of the
instruction following the LOAD, if this instruction
needs the loaded value
2. A more efficient, compiler based, solution, which
has similarities with the delayed branching, is
the delayed load:
 With delayed load, the processor always
executes the instruction following a LOAD,
without a stall; it is the programmer's
(compiler's) responsibility to ensure that this
instruction does not need the loaded
value.
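A minimal Python sketch of the simpler fallback: when the instruction right after a LOAD needs the loaded value and nothing can be scheduled into the slot, a NOP is inserted. The `(op, dest, srcs)` instruction encoding is a made-up illustration, not a real ISA:

```python
def insert_load_delay_nops(program):
    """Insert a NOP after each LOAD whose result is needed immediately.
    Instructions are hypothetical (op, dest, srcs) tuples."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:           # next instruction needs the loaded value
                out.append(("NOP", None, ()))
    return out

prog = [("LOAD", "R1", ("R2",)),            # R1 <- memory[R2]
        ("ADD",  "R3", ("R1", "R4"))]       # uses R1 immediately
print(insert_load_delay_nops(prog))         # a NOP appears between them
```

A smarter compiler would instead hoist an independent instruction into the slot; the NOP is the safe default.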
CISC versus RISC Characteristics
 After the initial enthusiasm for RISC machines,
there has been a growing realization that
(1) RISC designs may benefit from the inclusion of
some CISC features
(2) CISC designs may benefit from the inclusion of
some RISC features.
 The result is that the more recent RISC designs,
notably the PowerPC, are no longer “pure” RISC
and the more recent CISC designs, notably the
Pentium II and later Pentium models, do
incorporate some RISC characteristics.
Typical RISC characteristics
1. A single instruction size.
2. That size is typically 4 bytes.
3. A small number of data addressing modes, typically less
than five. This parameter is difficult to pin down. In the
table, register and literal modes are not counted and
different formats with different offset sizes are counted
separately.
4. No indirect addressing that requires you to make one
memory access to get the address of another operand in
memory.
 5. No operations that combine load/store with arithmetic
(e.g., add from memory, add to memory).
6. No more than one memory-addressed operand
per instruction.
7. Does not support arbitrary alignment of data for
load/store operations.
8. Maximum number of uses of the memory
management unit (MMU) for a data address in
an instruction.
9. Number of bits for integer register specifier
equal to five or more. This means that at least
32 integer registers can be explicitly referenced
at a time.
10. Number of bits for floating-point register
specifier equal to four or more. This means that
at least 16 floating-point registers can be
explicitly referenced at a time.
RISC PIPELINING
Instruction pipelining is often used to enhance
performance.
 Let us reconsider this in the context of a RISC
architecture. Most instructions are register to register,
and an instruction cycle has the following two stages:
• I: Instruction fetch.
• E: Execute. Performs an ALU operation with register input
and output.
For load and store operations, three stages are required:
• I: Instruction fetch.
• E: Execute. Calculates memory address
• D: Memory. Register-to-memory or memory-to-register
operation
 The two stages of the pipeline are an instruction fetch
stage, and an execute/memory stage that executes the
instruction, including register-to-memory and
memory-to-register operations. Thus we see that the instruction
fetch stage of the second instruction can be performed in
parallel with the first part of the execute/memory stage.
However, the execute/memory stage of the second
instruction must be delayed until the first instruction
clears the second stage of the pipeline. This scheme can
yield up to twice the execution rate of a serial scheme.
Two problems prevent the maximum speedup from
being achieved. First, we assume that a single-port
memory is used and that only one memory access is
possible per stage. This requires the insertion of a wait
state in some instructions. Second, a branch instruction
interrupts the sequential flow of execution. To
accommodate this with minimum circuitry, a NOOP
instruction can be inserted into the instruction stream by
the compiler or assembler.
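The "up to twice the execution rate" claim follows from simple cycle counting; here is a sketch assuming an ideal two-stage pipeline with no wait states or branches:

```python
def pipeline_cycles(n_instr, stages):
    """Cycles for an ideal pipeline: after the pipe fills (stages - 1 cycles),
    one instruction completes per cycle."""
    return n_instr + stages - 1

n = 100
serial = n * 2                  # two-stage instructions with no overlap
piped = pipeline_cycles(n, 2)   # I and E stages overlapped
print(serial / piped)           # approaches 2 as n grows
```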
 Pipelining can be improved further by
permitting two memory accesses per
stage. Now, up to three instructions can
be overlapped, and the improvement is
as much as a factor of 3. Again, branch
instructions cause the speedup to fall
short of the maximum possible. Also,
note that data dependencies have an
effect. If an instruction needs an operand
that is altered by the preceding
instruction, a delay is required. Again,
this can be accomplished by a NOOP.
 The pipelining discussed so far works best if the three stages are of
approximately equal duration. Because the E stage usually involves an
ALU operation, it may be longer. In this case, we can divide it into two
substages:
• Register file read
• ALU operation and register write
 Because of the simplicity and regularity of a RISC instruction set, the
design of the phasing into three or four stages is easily accomplished.
Figure 13.6d shows the result with a four-stage pipeline. Up to four
instructions at a time can be under way, and the maximum potential
speedup is a factor of 4. Note again the use of NOOPs to account for
data and branch delays.
Optimization of Pipelining
 Because of the simple and regular nature of RISC
instructions, pipelining schemes can be efficiently
employed. There are few variations in instruction execution
duration, and the pipeline can be tailored to reflect this.
However, we have seen that data and branch dependencies
reduce the overall execution rate.
 DELAYED BRANCH To compensate for these
dependencies, code reorganization techniques have been
developed. First, let us consider branching instructions.
Delayed branch, a way of increasing the efficiency of the
pipeline, makes use of a branch that does not take effect
until after execution of the following instruction (hence the
term delayed).
 LOOP UNROLLING Another compiler technique
to improve instruction parallelism is loop
unrolling [BACO94]. Unrolling replicates the
body of a loop some number of times, called the
unrolling factor (u), and iterates by step u instead
of step 1.
 Unrolling can improve the performance by
• reducing loop overhead
• increasing instruction parallelism by improving
pipeline performance
• improving register, data cache, or TLB locality
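A small Python illustration of unrolling with factor u = 4. The function names are made up, and the sketch assumes the trip count is a multiple of 4; a real compiler would emit a cleanup loop for the remainder:

```python
def dot(a, b):
    """Plain loop: one dependent accumulation per iteration."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    """Unrolled by u = 4: four independent partial sums per iteration,
    exposing instruction-level parallelism and reducing loop overhead."""
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, len(a), 4):      # iterate by step u = 4
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    return s0 + s1 + s2 + s3

v = [1.0] * 8
print(dot(v, v), dot_unrolled4(v, v))  # both compute the same dot product
```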
Instruction Set
 The table below lists the basic instruction set for all MIPS
R-series processors. All processor instructions are
encoded in a single 32-bit word format. All data
operations are register to register; the only memory
references are pure load/store operations.
 The R4000 makes no use of condition codes. If an
instruction generates a condition, the corresponding
flags are stored in a general-purpose register. This
avoids the need for special logic to deal with condition
codes as they affect the pipelining mechanism and the
reordering of instructions by the compiler. Instead, the
mechanisms already implemented to deal with register-
value dependencies are employed. Further, conditions
mapped onto the register files are subject to the same
compile-time optimizations in allocation and reuse as
other values stored in registers.
 As with most RISC-based machines, the MIPS
uses a single 32-bit instruction length. This
single instruction length simplifies instruction
fetch and decode, and it also simplifies the
interaction of instruction fetch with the virtual
memory management unit (i.e., instructions do
not cross word or page boundaries). The three
instruction formats share common formatting of
opcodes and register references, simplifying
instruction decode. The effect of more complex
instructions can be synthesized at compile time.
 Only the simplest and most frequently used
memory-addressing mode is implemented in
hardware. All memory references consist of a
16-bit offset from a 32-bit register.
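The effective-address calculation can be sketched as follows. This is a simplified model of the displacement addressing mode (base register plus sign-extended 16-bit offset), not a description of the actual R4000 hardware:

```python
def effective_address(base_reg, offset16):
    """MIPS-style displacement addressing: a 32-bit base register plus a
    sign-extended 16-bit offset, with the result wrapped to 32 bits."""
    if offset16 & 0x8000:                  # high bit set: negative offset
        offset16 -= 0x10000                # sign-extend to a Python int
    return (base_reg + offset16) & 0xFFFFFFFF

print(hex(effective_address(0x10000000, 0x0004)))   # 0x10000004
print(hex(effective_address(0x10000000, 0xFFFC)))   # 0xffffffc (base - 4)
```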
MIPS R-Series Instruction Set (OP
& Description)
Load/Store Instructions
 LB Load Byte
 LBU Load Byte Unsigned
 LH Load Halfword
 LHU Load Halfword Unsigned
 LW Load Word
 LWL Load Word Left
 LWR Load Word Right
 SB Store Byte
 SH Store Halfword
 SW Store Word
 SWL Store Word Left
 SWR Store Word Right
Arithmetic Instructions
(3-operand, R-type)
ADD Add
ADDU Add Unsigned
SUB Subtract
SUBU Subtract Unsigned
SLT Set on Less Than
SLTU Set on Less Than
Unsigned
AND AND
OR OR
XOR Exclusive-OR
NOR NOR
 Arithmetic Instructions
(ALU Immediate)
 ADDI Add Immediate
 ADDIU Add Immediate
Unsigned
 SLTI Set on Less Than
Immediate
 SLTIU Set on Less Than
Immediate Unsigned
 ANDI AND Immediate
 ORI OR Immediate
 XORI Exclusive-OR
Immediate
 LUI Load Upper Immediate
Multiply/Divide Instructions
MULT Multiply
MULTU Multiply Unsigned
DIV Divide
DIVU Divide Unsigned
MFHI Move From HI
MTHI Move To HI
MFLO Move From LO
MTLO Move To LO
Shift Instructions
 SLL Shift Left Logical
 SRL Shift Right Logical
 SRA Shift Right Arithmetic
 SLLV Shift Left Logical
Variable
 SRLV Shift Right Logical
Variable
 SRAV Shift Right Arithmetic
Variable
Coprocessor Instructions
LWCz Load Word to Coprocessor
SWCz Store Word to Coprocessor
MTCz Move To Coprocessor
MFCz Move From Coprocessor
CTCz Move Control To Coprocessor
CFCz Move Control From
Coprocessor
COPz Coprocessor Operation
BCzT Branch on Coprocessor z True
BCzF Branch on Coprocessor z False
Special Instructions
SYSCALL System Call
BREAK Break
Jump and Branch Instructions
 J Jump
 JAL Jump and Link
 JR Jump to Register
 JALR Jump and Link Register
 BEQ Branch on Equal
 BNE Branch on Not Equal
 BLEZ Branch on Less Than or
Equal to Zero
 BGTZ Branch on Greater Than
Zero
 BLTZ Branch on Less Than Zero
 BGEZ Branch on Greater Than
or Equal to Zero
 BLTZAL Branch on Less Than
Zero And Link
 BGEZAL Branch on Greater
Than or Equal to Zero And Link
Instruction Pipeline
 With its simplified instruction architecture, the MIPS can
achieve very efficient pipelining. It is instructive to look at
the evolution of the MIPS pipeline, as it illustrates the
evolution of RISC pipelining in general.
 The initial experimental RISC systems and the first
generation of commercial RISC processors achieved
execution speeds that approach one instruction per
system clock cycle. To improve on this performance, two
classes of processors have evolved to offer execution of
multiple instructions per clock cycle: superscalar and
superpipelined architectures. In essence, a superscalar
architecture replicates each of the pipeline stages so that
two or more instructions at the same stage of the
pipeline can be processed simultaneously.
 A superpipelined architecture is one that makes
use of more, and more fine-grained, pipeline
stages. With more stages, more instructions can
be in the pipeline at the same time, increasing
parallelism.
 Both approaches have limitations. With
superscalar pipelining, dependencies between
instructions in different pipelines can slow down
the system. Also, overhead logic is required to
coordinate these dependencies. With
superpipelining, there is overhead associated with
transferring instructions from one stage to the
next.
RISC VERSUS CISC CONTROVERSY
 The work that has been done on
assessing merits of the RISC approach
can be grouped into two categories:
• Quantitative: Attempts to compare
program size and execution speed of
programs on RISC and CISC machines
that use comparable technology
• Qualitative: Examines issues such as high-level
language support and optimum use
of VLSI real estate
 Most of the work on quantitative assessment has been done by those
working on RISC systems [PATT82b, HEAT84, PATT84], and it has
been, by and large, favorable to the RISC approach. Others have
examined the issue and come away unconvinced [COLW85a,
FLYN87, DAVI87]. There are several problems with attempting such
comparisons [SERL86]:
• There is no pair of RISC and CISC machines that are comparable in
life-cycle cost, level of technology, gate complexity, sophistication of
compiler, operating system support, and so on.
• No definitive test set of programs exists. Performance varies with the
program.
• It is difficult to sort out hardware effects from effects due to skill in
compiler writing.
• Most of the comparative analysis on RISC has been done on “toy”
machines rather than commercial products. Furthermore, most
commercially available machines advertised as RISC possess a
mixture of RISC and CISC characteristics. Thus, a fair comparison
with a commercial, “pure-play” CISC machine (e.g., VAX, Pentium) is
difficult.
The qualitative assessment is, almost by definition, subjective.
INSTRUCTION-LEVEL
PARALLELISM
AND SUPERSCALAR
PROCESSORS
Lesson 6
 A superscalar processor is one in which
multiple independent instruction pipelines are
used. Each pipeline consists of multiple
stages, so that each pipeline can handle
multiple instructions at a time. Multiple
pipelines introduce a new level of
parallelism, enabling multiple streams of
instructions to be processed at a time. A
superscalar processor exploits what is
known as instruction-level parallelism,
which refers to the degree to which the
instructions of a program can be executed in
parallel
 A superscalar processor typically fetches multiple
instructions at a time and then attempts to find nearby
instructions that are independent of one another and can
therefore be executed in parallel. If the input to one
instruction depends on the output of a preceding
instruction, then the dependent instruction cannot complete
execution at the same time as, or before, the preceding
one. Once such dependencies have been identified,
the processor may issue and complete instructions in an
order that differs from that of the original machine code.
 The processor may eliminate some unnecessary
dependencies by the use of additional registers and the
renaming of register references in the original code.
 Whereas pure RISC processors often employ delayed
branches to maximize the utilization of the instruction
pipeline, this method is less appropriate to a superscalar
machine. Instead, most superscalar machines use
traditional branch prediction methods to improve efficiency
 A superscalar implementation of a processor
architecture is one in which common
instructions—integer and floating-point
arithmetic, loads, stores, and conditional
branches—can be initiated simultaneously
and executed independently. Such
implementations raise a number of complex
design issues related to the instruction
pipeline
 The term superscalar, first coined in 1987
[AGER87], refers to a machine that is
designed to improve the performance of the
execution of scalar instructions. In most
applications, the bulk of the operations are
on scalar quantities. Accordingly, the
superscalar approach represents the next
step in the evolution of high-performance
general-purpose processors
 The essence of the superscalar approach is
the ability to execute instructions
independently and concurrently in different
pipelines
Superscalar versus Superpipelined
 An alternative approach to achieving
greater performance is referred to as
superpipelining. Superpipelining exploits
the fact that many pipeline stages perform
tasks that require less than half a clock
cycle. Thus, a doubled internal clock speed
allows the performance of two tasks in one
external clock cycle.
 The pipeline has four stages: instruction fetch,
operation decode, operation execution, and
result write back. The execution stage is
crosshatched for clarity. Note that although
several instructions are executing concurrently,
only one instruction is in its execution stage at
any one time.
 Both the superpipeline and the superscalar
implementations have the same number of
instructions executing at the same time in the
steady state. The superpipelined processor falls
behind the superscalar processor at the start of
the program and at each branch target
Limitations
 The superscalar approach depends on the ability to execute multiple
instructions in parallel.
 The term instruction-level parallelism refers to the degree to
which, on average, the instructions of a program can be executed in
parallel. A combination of compiler-based optimization and hardware
techniques can be used to maximize instruction-level parallelism.
 Before examining the design techniques used in superscalar
machines to increase instruction-level parallelism, we need to look
at the fundamental limitations to parallelism with which the system
must cope. [JOHN91] lists five limitations:
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency
 A typical RISC processor takes two or more
cycles to perform a load from memory when the
load is a cache hit. It can take tens or even
hundreds of cycles for a cache miss on all cache
levels, because of the delay of an off-chip
memory access.
 One way to compensate for this delay is for the
compiler to reorder instructions so that one or
more subsequent instructions that do not
depend on the memory load can begin flowing
through the pipeline. This scheme is less
effective in the case of a superscalar pipeline:
the independent instructions executed during
the load are likely to be executed on the first
cycle of the load, leaving the processor with
nothing to do until the load completes.
DESIGN ISSUES
 Instruction-level parallelism exists when
instructions in a sequence are
independent and thus can be executed in
parallel by overlapping
 As an example of the concept of
instruction-level parallelism, consider the
following two code fragments [JOUP89b]:
   (left)                    (right)
   Load R1 ← R2              Add R3 ← R3, “1”
   Add R3 ← R3, “1”          Add R4 ← R3, R2
   Add R4 ← R4, R2           Store [R4] ← R0
 The three instructions on the left are independent, and in
theory all three could be executed in parallel. In contrast,
the three instructions on the right cannot be executed in
parallel because the second instruction uses the result of
the first, and the third instruction uses the result of the
second.
 The degree of instruction-level parallelism is determined
by the frequency of true data dependencies and
procedural dependencies in the code. These factors, in
turn, are dependent on the instruction set architecture
and on the application.
 Instruction-level parallelism is also determined by what
[JOUP89a] refers to as operation latency: the time until
the result of an instruction is available for use as an
operand in a subsequent instruction. The latency
determines how much of a delay a data or procedural
dependency will cause
 Machine parallelism is a measure of the ability of the
processor to take advantage of instruction-level
parallelism. Machine parallelism is determined by the
number of instructions that can be fetched and executed
at the same time (the number of parallel pipelines) and
by the speed and sophistication of the mechanisms that
the processor uses to find independent instructions.
 Both instruction-level and machine parallelism are
important factors in enhancing performance. A program
may not have enough instruction-level parallelism to take
full advantage of machine parallelism. The use of a
fixed-length instruction set architecture, as in a RISC,
enhances instruction-level parallelism. On the other
hand, limited machine parallelism will limit performance
no matter what the nature of the program
Instruction Issue Policy
 The processor must also be able to identify instruction-level
parallelism and orchestrate the fetching, decoding, and execution of
instructions in parallel.
 The term instruction issue refers to the process of initiating
instruction execution in the processor’s functional units, and the term
instruction issue policy refers to the protocol used to issue
instructions.
 In general, we can say that instruction issue occurs when an instruction
moves from the decode stage of the pipeline to the first execute stage
of the pipeline. In essence, the processor is trying to look ahead of the
into the pipeline and executed. Three types of orderings are important
in this regard:
• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of register and
memory locations
 In general terms, we can group
superscalar instruction issue policies into
the following categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order
completion
 IN-ORDER ISSUE WITH IN-ORDER
COMPLETION The simplest instruction issue
policy is to issue instructions in the exact order
that would be achieved by sequential execution
(in-order issue) and to write results in that same
order (in-order completion). Not even scalar
pipelines follow such a simple-minded policy.
However, it is useful to consider this policy as a
baseline for comparing more sophisticated
approaches.
 IN-ORDER ISSUE WITH OUT-OF-
ORDER COMPLETION Out-of-order
completion is used in scalar RISC
processors to improve the performance of
instructions that require multiple cycles.
 With out-of-order completion, any number
of instructions may be in the execution
stage at any one time, up to the maximum
degree of machine parallelism across all
functional units. Instruction issuing is
stalled by a resource conflict, a data
dependency, or a procedural dependency.
 OUT-OF-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION
With in-order issue, the processor will only decode instructions up to
the point of a dependency or conflict. No additional instructions are
decoded until the conflict is resolved. As a result, the processor
cannot look ahead of the point of conflict to subsequent instructions
that may be independent of those already in the pipeline and that
may be usefully introduced into the pipeline.
 To allow out-of-order issue, it is necessary to decouple the decode
and execute stages of the pipeline. This is done with a buffer
referred to as an instruction window. With this organization, after
a processor has finished decoding an instruction, it is placed in the
instruction window. As long as this buffer is not full, the processor
can continue to fetch and decode new instructions. When a
functional unit becomes available in the execute stage, an
instruction from the instruction window may be issued to the execute
stage. Any instruction may be issued, provided that
 (1) it needs the particular functional unit that is available, and
 (2) no conflicts or dependencies block this instruction.
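The effect of the instruction window can be sketched as follows. The three-instruction program, register names, and ready conditions are invented for illustration: i2 depends on i1, while i3 is independent, so i3 is issued from the window ahead of i2.

```python
# A minimal sketch of out-of-order issue from an instruction window.
# The program, register names, and dependencies are invented.
from collections import namedtuple

Instr = namedtuple("Instr", "name dest srcs")

program = [
    Instr("i1", "r3", ["r1", "r2"]),   # r3 = r1 op r2
    Instr("i2", "r4", ["r3"]),         # depends on i1's result
    Instr("i3", "r6", ["r5"]),         # independent of i1 and i2
]

window = list(program)                 # decoded instructions awaiting issue
ready_regs = {"r1", "r2", "r5"}        # register values available at start
issue_order = []

while window:
    ready_now = set(ready_regs)        # snapshot: results visible next cycle
    issuable = [i for i in window if all(s in ready_now for s in i.srcs)]
    if not issuable:                   # nothing can issue this cycle
        break
    for instr in issuable:             # issue all ready instructions,
        issue_order.append(instr.name) # regardless of program order
        ready_regs.add(instr.dest)
        window.remove(instr)

print(issue_order)  # -> ['i1', 'i3', 'i2']
```

Note that i3 overtakes i2 exactly because the window lets the processor look past the stalled dependency.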
 The result of this organization is that the
processor has a lookahead capability,
allowing it to identify independent
instructions that can be brought into the
execute stage. Instructions are issued
from the instruction window with little
regard for their original program order. As
before, the only constraint is that the
program execution behaves correctly.
 One common technique that is used to support
out-of-order completion is the reorder buffer. The
reorder buffer is temporary storage for results
completed out of order that are then committed
to the register file in program order. A related
concept is Tomasulo’s algorithm.
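The commit logic of a reorder buffer can be sketched in a few lines. The instruction tags, values, and completion order below are invented: i3 finishes first but is held in the buffer until i1 and i2, which precede it in program order, have committed.

```python
# A minimal sketch of a reorder buffer: results may arrive out of order,
# but are committed to the register file strictly in program order.
rob = ["i1", "i2", "i3", "i4"]   # entries allocated in program order
results = {}                      # results written as execution finishes
committed = []                    # commit order (i.e., program order)

def complete(tag, value):
    """Record a finished result, then commit from the head of the ROB."""
    results[tag] = value
    while rob and rob[0] in results:       # commit only when the head is done
        head = rob.pop(0)
        committed.append((head, results.pop(head)))

complete("i3", 30)   # finishes early: buffered, not committed
complete("i1", 10)   # head is done: i1 commits
complete("i2", 20)   # i2 commits, then the buffered i3 commits
print(committed)     # -> [('i1', 10), ('i2', 20), ('i3', 30)]
```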
 The term antidependency is used because the
constraint is similar to that of a true data
dependency, but reversed: Instead of the first
instruction producing a value that the second
instruction uses, the second instruction destroys
a value that the first instruction uses.
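The antidependency (write-after-read) constraint can be made concrete with a two-instruction example; the register names and values are invented. Here i1 reads r3 and i2 writes r3, so i2 must not complete before i1 has read its operand:

```python
# Antidependency: i1 reads r3, i2 writes r3.
#   i1: result = r3 + r5
#   i2: r3 = r5 + 1

# Program order: i1 reads r3 before i2 overwrites it.
regs = {"r3": 2, "r5": 10}
i1_result = regs["r3"] + regs["r5"]      # i1 sees r3 == 2  -> 12
regs["r3"] = regs["r5"] + 1              # i2 writes r3 afterwards

# If i2 were allowed to complete before i1 read r3:
regs = {"r3": 2, "r5": 10}
regs["r3"] = regs["r5"] + 1              # i2 writes r3 first
wrong_result = regs["r3"] + regs["r5"]   # i1 now sees r3 == 11 -> 21

print(i1_result, wrong_result)           # -> 12 21
```

The two results differ only because of the ordering of the write against the read, which is exactly the hazard the constraint protects against.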
Register Renaming
 One method for coping with these types of
storage conflicts is based on a traditional
resource-conflict solution: duplication of
resources. In this context, the technique is
referred to as register renaming. In
essence, registers are allocated
dynamically by the processor hardware,
and they are associated with the values
needed by instructions at various points in
time.
 When a new register value is created (i.e., when
an instruction executes that has a register as a
destination operand), a new register is allocated
for that value. Subsequent instructions that
access that value as a source operand in that
register must go through a renaming process:
the register references in those instructions must
be revised to refer to the register containing the
needed value. Thus, the same original register
reference in several different instructions may
refer to different actual registers, if different
values are intended.
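The renaming process described above can be sketched as a simple mapping table; the register names and allocation scheme are invented for illustration. Each write to an architectural register allocates a fresh physical register, and sources are redirected to the most recent mapping, so two writes to the same architectural register no longer conflict:

```python
# A minimal sketch of register renaming: each write allocates a fresh
# physical register; reads use the current mapping.
phys_count = 0
rename_map = {}          # architectural register -> physical register

def new_phys():
    global phys_count
    phys_count += 1
    return f"p{phys_count}"

def rename(dest, srcs):
    # Sources are renamed first, using the mapping in force *before*
    # this instruction's write takes effect.
    renamed_srcs = [rename_map.get(s, s) for s in srcs]
    rename_map[dest] = new_phys()        # fresh register for the new value
    return rename_map[dest], renamed_srcs

# i1: r3 = r3 + r5 ; i2: r3 = r5 + 1
a = rename("r3", ["r3", "r5"])
b = rename("r3", ["r5"])
print(a, b)   # -> ('p1', ['r3', 'r5']) ('p2', ['r5'])
```

Because the two writes to r3 now target distinct physical registers (p1 and p2), the antidependency between them disappears.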
 An alternative to register renaming is
scoreboarding. In essence, scoreboarding
is a bookkeeping technique that allows
instructions to execute whenever they are
not dependent on previous instructions
and no structural hazards are present.
Branch Prediction
 Any high-performance pipelined machine
must address the issue of dealing with
branches. For example, the Intel 80486
addressed the problem by fetching both
the next sequential instruction after a
branch and speculatively fetching the
branch target instruction. However,
because there are two pipeline stages
between prefetch and execution, this
strategy incurs a two-cycle delay when the
branch is taken.
 With the advent of RISC machines, the delayed branch
strategy was explored. With delayed branching, the processor
calculates the result of a conditional branch instruction
before any unusable instructions have been prefetched:
the processor always executes the single
instruction that immediately follows the branch. This
keeps the pipeline full while the processor fetches a new
instruction stream.
 With the development of superscalar machines, the
delayed branch strategy has less appeal. The reason is
that multiple instructions need to execute in the delay
slot, raising several problems relating to instruction
dependencies. Thus, superscalar machines have
returned to pre-RISC techniques of branch prediction.
Some, like the PowerPC 601, use a simple static branch
prediction technique. More sophisticated processors,
such as the PowerPC 620 and the Pentium 4, use
dynamic branch prediction based on branch history
analysis.
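One common form of branch-history-based dynamic prediction is a 2-bit saturating counter per branch; the sketch below shows the idea for a single branch, with an invented outcome stream. Two consecutive mispredictions are needed to flip the prediction, so a loop branch that is taken except on its last iteration is still predicted well:

```python
# A minimal sketch of dynamic branch prediction with a 2-bit saturating
# counter (states 0..3; predict "taken" when the counter is 2 or 3).
counter = 2   # start in "weakly taken"

def predict_and_update(taken):
    """Return the prediction, then update the counter with the outcome."""
    global counter
    prediction = counter >= 2
    counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return prediction

# e.g. a loop branch: taken, taken, not taken (loop exit), taken, taken
outcomes = [True, True, False, True, True]
predictions = [predict_and_update(t) for t in outcomes]
print(predictions)   # -> [True, True, True, True, True]
```

The single not-taken outcome only moves the counter from "strongly taken" to "weakly taken", so the predictor mispredicts once rather than twice when the loop is re-entered.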
Superscalar Implementation
 Based on our discussion so far, we can make some
general comments about the processor hardware
required for the superscalar approach. [SMIT95] lists the
following key elements:
• Instruction fetch strategies that simultaneously fetch
multiple instructions, often by predicting the outcomes of,
and fetching beyond, conditional branch instructions.
These functions require the use of multiple pipeline fetch
and decode stages, and branch prediction logic.
• Logic for determining true dependencies involving register
values, and mechanisms for communicating these
values to where they are needed during execution.
• Mechanisms for initiating, or issuing,
multiple instructions in parallel.
• Resources for parallel execution of multiple
instructions, including multiple pipelined
functional units and memory hierarchies
capable of simultaneously servicing
multiple memory references.
• Mechanisms for committing the process
state in correct order.
PENTIUM 4
 Although the concept of superscalar design is generally associated with the
RISC architecture, the same superscalar principles can be applied to a
CISC machine. Perhaps the most notable example of this is the Pentium.
The evolution of superscalar concepts in the Intel line is interesting to note.
The 386 is a traditional CISC nonpipelined machine.
 The 486 introduced the first pipelined x86 processor, reducing the average
latency of integer operations from between two and four cycles to one cycle,
but still limited to executing a single instruction each cycle, with no
superscalar elements. The original Pentium had a modest superscalar
component, consisting of the use of two separate integer execution units.
The Pentium Pro introduced a full-blown superscalar design. Subsequent
Pentium models have refined and enhanced the superscalar design.
 A general block diagram of the Pentium 4 was shown in figure below depicts
the same structure in a way more suitable for the pipeline discussion in this
section. The operation of the Pentium 4 can be summarized as follows:
1. The processor fetches instructions from memory in the order of the static
program.
2. Each instruction is translated into one or more fixed-length RISC instructions,
known as micro-operations, or micro-ops.
3. The processor executes the micro-ops on
a superscalar pipeline organization, so
that the micro-ops may be executed out of
order.
4. The processor commits the results of
each micro-op execution to the
processor’s register set in the order of the
original program flow.
Pentium 4 Block Diagram
 See pg. 538 of the source text for the block diagram.
Pentium 4 Pipeline
Front End
 GENERATION OF MICRO-OPS The Pentium 4
organization includes an in-order front end that
can be considered outside the scope of the
pipeline depicted in figure above. This front end
feeds into an L1 instruction cache, called the
trace cache, which is where the pipeline proper
begins. Usually, the processor operates from the
trace cache; when a trace cache miss occurs,
the in-order front end feeds new instructions into
the trace cache.
Semantic gap between HLLs and computer architecture

  • 2. The semantic gap is the difference between the operations provided in HLLs (High Level Languages) and those provided in computer architecture. Symptoms of this gap are alleged to include execution inefficiency, excessive machine program size, and compiler complexity. Designers responded with architectures intended to close this gap. Key features include large instruction sets, dozens of addressing modes, and various HLL statements implemented in hardware. An example of the latter is the CASE machine instruction on the VAX. Such complex instruction sets are intended to • Ease the task of the compiler writer. • Improve execution efficiency, because complex sequences of operations can be implemented in microcode. • Provide support for even more complex and sophisticated HLLs.
  • 3. Reduced instruction set computer (RISC) architecture  The RISC architecture is a dramatic departure from the historical trend in processor architecture. An analysis of the RISC architecture brings into focus many of the important issues in computer organization and architecture.  Although RISC systems have been defined and designed in a variety of ways by different groups, the key elements shared by most designs are these: • A large number of general-purpose registers, and/or the use of compiler technology to optimize register usage • A limited and simple instruction set • An emphasis on optimizing the instruction pipeline
  • 4. Semantic gap  In order to improve the efficiency of software development, new and powerful programming languages have been developed (Ada, C++, Java).  They provide: high level of abstraction, conciseness, power. • By this evolution the semantic gap grows
  • 5.  Problem: How should new HLL programs be compiled and executed efficiently on a processor architecture? Two possible answers: 1. The CISC approach: design very complex architectures including a large number of instructions and addressing modes; include also instructions close to those present in HLL. 2. The RISC approach: simplify the instruction set and adapt it to the real requirements of user programs
  • 6. Why need RISC  RISC architectures represent an important innovation in the area of computer organization. • The RISC architecture is an attempt to produce more CPU power by simplifying the instruction set of the CPU. • The opposed trend to RISC is that of complex instruction set computers (CISC).  Both RISC and CISC architectures have been developed as an attempt to cover the semantic gap.
  • 7. INSTRUCTION EXECUTION CHARACTERISTICS IN RISC  Operations performed: These determine the functions to be performed by the processor and its interaction with memory.  Operands used: The types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them.  Execution sequencing: This determines the control and pipeline organization.
  • 8. Evaluation of Program execution  Several studies have been conducted to determine the execution characteristics of machine instruction sequences generated from HLL programs. • Aspects of interest: 1. The frequency of operations performed. 2. The types of operands and their frequency of use. 3. Execution sequencing (frequency of jumps, loops, subprogram calls).
  • 9.  Frequency of Instructions Executed • Frequency distribution of executed machine instructions:  moves: 33%  conditional branch: 20%  Arithmetic/logic: 16%  others: Between 0.1% and 10% • Addressing modes: the overwhelming majority of instructions uses simple addressing modes, in which the address can be calculated in a single cycle (register, register indirect, displacement); complex addressing modes (memory indirect, indexed+indirect, displacement+indexed, stack) are used only by ~18% of the instructions.
  • 10.  Operand Types • 74 to 80% of the operands are scalars (integers, reals, characters, etc.) which can be hold in registers; • the rest (20-26%) are arrays/structures; 90% of them are global variables; • 80% of the scalars are local variables. NB:The majority of operands are local variables of scalar type, which can be stored in registers
  • 11.  Some statistics concerning procedure calls: • Only 1.25% of called procedures have more than six parameters. • Only 6.7% of called procedures have more than six local variables. • Chains of nested procedure calls are usually short and only very seldom longer than 6.
  • 12. Conclusions from Evaluation of Program Execution • An overwhelming preponderance of simple (ALU and move) operations over complex operations. • Preponderance of simple addressing modes. • Large frequency of operand accesses; on average each instruction references 1.9 operands. • Most of the referenced operands are scalars (so they can be stored in a register) and are local variables or parameters. • Optimizing the procedure CALL/RETURN mechanism promises large benefits in speed.  These conclusions have been at the starting point to the Reduced Instruction Set Computer (RISC) approach.
  • 13. Characteristics of Reduced Instruction Set Architectures  Although a variety of different approaches to reduced instruction set architecture have been taken, certain characteristics are common to all of them: • One instruction per cycle • Register-to-register operations • Simple addressing modes • Simple instruction formats
  • 14.  The first characteristic listed is that there is one machine instruction per machine cycle. A machine cycle is defined to be the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register. Thus, RISC machine instructions should be no more complicated than, and execute about as fast as, microinstructions on CISC machines. With simple, one-cycle instructions, there is little or no need for microcode; the machine instructions can be hardwired. Such instructions should execute faster than comparable machine instructions on other machines, because it is not necessary to access a microprogram control store during instruction execution.
  • 15.  The goal is to create an instruction set containing instructions that execute quickly; most of the RISC instructions are executed in a single machine cycle (after fetched and decoded). - RISC instructions, being simple, are hard-wired, while CISC architectures have to use microprogramming in order to implement complex instructions. - Having only simple instructions results in reduced complexity of the control unit and the data path; as a consequence, the processor can work at a high clock frequency. - The pipelines are used efficiently if instructions are simple and of similar execution time. - Complex operations on RISCs are executed as a sequence of simple RISC instructions. In the case of CISCs they are executed as one single or a few complex instruction.
  • 16. example  we have a program with 80% of executed instructions being simple and 20% complex; - on a CISC machine simple instructions take 4 cycles, complex instructions take 8 cycles; cycle time is 100 ns (10-7 s); - on a RISC machine simple instructions are executed in one cycle; complex operations are implemented as a sequence of instructions; we consider on average 14 instructions (14 cycles) for a complex operation; cycle time is 75 ns (0.75 * 10-7 s).
  • 17.  How much time takes a program of 1 000 000 instructions?  CISC: (10^6*0.80*4 + 10^6*0.20*8)*10-7 = 0.48 s  RISC: (10^6*0.80*1 + 10^6*0.20*14)*0.75*10-7 = 0.27 s • complex operations take more time on the RISC, but their number is small; • because of its simplicity, the RISC works at a smaller cycle time; with the CISC, simple instructions are slowed down because of the increased data path length and the increased control complexity.
  • 18.  A second characteristic is that most operations should be register to register, with only simple LOAD and STORE operations accessing memory. Only LOAD and STORE instructions reference data in memory; all other instructions operate only on registers (they are register-to-register instructions); thus, only the few instructions that access memory need more than one cycle to execute (after being fetched and decoded).
  • 19. Third characteristic: instructions use only a few addressing modes - The addressing modes are usually register, direct, register indirect, and displacement.  Almost all RISC instructions use simple register addressing. Fourth characteristic: instructions are of fixed length and uniform format - This makes the loading and decoding of instructions simple and fast; there is no need to wait until the length of an instruction is known in order to start decoding the following one; - Decoding is simplified because the opcode and address fields are located in the same position for all instructions.
  • 20.  Fifth characteristic: a large number of registers is available - Variables and intermediate results can be stored in registers and do not require repeated loads and stores from/to memory. - All local variables of procedures and the passed parameters can be stored in registers.
  • 21. What happens when a new procedure is called? - Normally the registers have to be saved in memory (they contain values of variables and parameters for the calling procedure); at return to the calling procedure, the values have to be again loaded from memory. This takes a lot of time. - If a large number of registers is available, a new set of registers can be allocated to the called procedure and the register set assigned to the calling one remains untouched.
  • 22. Is the strategy above realistic? - The strategy is realistic, because the number of local variables in procedures is not large. The chain of nested procedure calls only exceptionally grows deeper than 6. - If the chain of nested procedure calls becomes deep, at a certain call there will be no free register set to assign to the called procedure; in this case local variables and parameters have to be stored in memory.
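The register-set-per-procedure idea can be made concrete with a toy simulation (the class, the pool size of 6, and the call depth of 8 below are our own illustration, not a description of any particular processor): a fixed pool of register windows serves the call chain, and only calls deeper than the pool cause memory traffic.

```python
# Toy model: a CALL allocates a fresh register window; when all windows
# are in use, the oldest must be spilled to memory, and a RETURN past a
# spilled window must restore it from memory.
class WindowFile:
    def __init__(self, n_windows):
        self.n = n_windows
        self.depth = 0        # current procedure-call depth
        self.spills = 0       # saves to memory
        self.restores = 0     # reloads from memory

    def call(self):
        self.depth += 1
        if self.depth > self.n:   # no free window: spill the oldest
            self.spills += 1

    def ret(self):
        if self.depth > self.n:   # returning into a spilled window
            self.restores += 1
        self.depth -= 1

wf = WindowFile(n_windows=6)
for _ in range(8):                # a call chain 8 deep
    wf.call()
for _ in range(8):
    wf.ret()
print(wf.spills, wf.restores)     # 2 2: only the two deepest calls touch memory
```

With a call chain rarely deeper than 6, a pool of 6 windows keeps almost all procedure linkage out of memory, which is exactly the claim made on the slide.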
  • 23. Why is a large number of registers typical for RISC architectures? - Because of the reduced complexity of the processor there is enough space on the chip to be allocated to a large number of registers. This, usually, is not the case with CISCs.
  • 24. The delayed load problem  • LOAD instructions (like STORE) require a memory access, and their execution cannot be completed in a single clock cycle. However, in the next cycle the processor starts a new instruction. Two possible solutions: 1. The hardware delays the execution of the instruction following the LOAD if that instruction needs the loaded value. 2. A more efficient, compiler-based solution, which has similarities with delayed branching: the delayed load.
  • 25.  With delayed load the processor always executes the instruction following a LOAD, without a stall; it is the programmer's (compiler's) responsibility to ensure that this instruction does not need the loaded value.
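A minimal sketch of what such a compiler pass might do (the encoding of instructions as (opcode, destination, sources) tuples is our own, and the scheduler checks fewer hazards than a real one must): if the instruction after a LOAD reads the loaded register, try to move a later independent instruction into the slot, otherwise insert a NOOP.

```python
# Fill load delay slots: the instruction right after a LOAD must not
# read the loaded register.
def fill_load_delay(prog):
    out = list(prog)
    i = 0
    while i < len(out):
        op, dest, srcs = out[i]
        if op == "LOAD" and i + 1 < len(out) and dest in out[i + 1][2]:
            # look ahead for an instruction that neither reads the loaded
            # register nor writes a source of the stalled instruction
            for j in range(i + 2, len(out)):
                if dest not in out[j][2] and out[j][1] not in out[i + 1][2]:
                    out.insert(i + 1, out.pop(j))
                    break
            else:
                out.insert(i + 1, ("NOOP", None, ()))
        i += 1
    return out

prog = [("LOAD", "r1", ("mem",)),
        ("ADD",  "r2", ("r1", "r3")),   # needs r1: cannot follow the LOAD
        ("SUB",  "r4", ("r5", "r6"))]   # independent: moved into the slot
sched = fill_load_delay(prog)
print([ins[0] for ins in sched])        # ['LOAD', 'SUB', 'ADD']
```

If no independent instruction can be found, the pass falls back to a NOOP, which is exactly the hardware stall expressed in software.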
  • 26. CISC versus RISC Characteristics  After the initial enthusiasm for RISC machines, there has been a growing realization that (1) RISC designs may benefit from the inclusion of some CISC features (2) CISC designs may benefit from the inclusion of some RISC features.  The result is that the more recent RISC designs, notably the PowerPC, are no longer “pure” RISC and the more recent CISC designs, notably the Pentium II and later Pentium models, do incorporate some RISC characteristics.
  • 27. Typical RISC characteristics 1. A single instruction size. 2. That size is typically 4 bytes. 3. A small number of data addressing modes, typically less than five. This parameter is difficult to pin down. In the table, register and literal modes are not counted and different formats with different offset sizes are counted separately. 4. No indirect addressing that requires you to make one memory access to get the address of another operand in memory.  5. No operations that combine load/store with arithmetic (e.g., add from memory, add to memory).
  • 28. 6. No more than one memory-addressed operand per instruction. 7. Does not support arbitrary alignment of data for load/store operations. 8. No more than one use of the memory management unit (MMU) for a data address in an instruction. 9. Number of bits for the integer register specifier equal to five or more. This means that at least 32 integer registers can be explicitly referenced at a time. 10. Number of bits for the floating-point register specifier equal to four or more. This means that at least 16 floating-point registers can be explicitly referenced at a time.
  • 29. RISC PIPELINING Instruction pipelining is often used to enhance performance.  Let us reconsider this in the context of a RISC architecture. Most instructions are register to register, and an instruction cycle has the following two stages: • I: Instruction fetch. • E: Execute. Performs an ALU operation with register input and output. For load and store operations, three stages are required: • I: Instruction fetch. • E: Execute. Calculates memory address • D: Memory. Register-to-memory or memory-to-register operation
  • 30.  The two stages of the pipeline are an instruction fetch stage, and an execute/memory stage that executes the instruction, including register-to-memory and memory-to-register operations. Thus we see that the instruction fetch stage of the second instruction can be performed in parallel with the first part of the execute/memory stage. However, the execute/memory stage of the second instruction must be delayed until the first instruction clears the second stage of the pipeline. This scheme can yield up to twice the execution rate of a serial scheme. Two problems prevent the maximum speedup from being achieved. First, we assume that a single-port memory is used and that only one memory access is possible per stage. This requires the insertion of a wait state in some instructions. Second, a branch instruction interrupts the sequential flow of execution. To accommodate this with minimum circuitry, a NOOP instruction can be inserted into the instruction stream by the compiler or assembler.
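The "up to twice" figure follows from the standard pipeline speedup formula, assuming stages of equal length and no stalls (the ideal case, before the memory-port and branch problems mentioned above take their toll):

```python
# Ideal pipeline speedup: n instructions on a k-stage pipeline finish in
# n + (k - 1) stage times, versus n * k stage times executed serially.
def speedup(n_instructions, n_stages):
    serial = n_instructions * n_stages
    pipelined = n_instructions + (n_stages - 1)
    return serial / pipelined

print(round(speedup(1_000_000, 2), 3))   # ~2.0 for the two-stage pipeline
print(round(speedup(1_000_000, 4), 3))   # ~4.0 for a four-stage pipeline
```

For large instruction counts the ratio approaches the stage count, which is why the maximum potential speedup equals the pipeline depth.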
  • 31.  Pipelining can be improved further by permitting two memory accesses per stage. This yields a sequence in which up to three instructions can be overlapped, and the improvement is as much as a factor of 3. Again, branch instructions cause the speedup to fall short of the maximum possible. Also, note that data dependencies have an effect. If an instruction needs an operand that is altered by the preceding instruction, a delay is required. Again, this can be accomplished with a NOOP.
  • 32.  The pipelining discussed so far works best if the three stages are of approximately equal duration. Because the E stage usually involves an ALU operation, it may be longer. In this case, we can divide it into two substages:  • Register file read  • ALU operation and register write  Because of the simplicity and regularity of a RISC instruction set, the design of the phasing into three or four stages is easily accomplished. Figure 13.6d shows the result with a four-stage pipeline. Up to four instructions at a time can be under way, and the maximum potential speedup is a factor of 4. Note again the use of NOOPs to account for data and branch delays.
  • 33. Optimization of Pipelining  Because of the simple and regular nature of RISC instructions, pipelining schemes can be efficiently employed. There are few variations in instruction execution duration, and the pipeline can be tailored to reflect this. However, we have seen that data and branch dependencies reduce the overall execution rate.  DELAYED BRANCH To compensate for these dependencies, code reorganization techniques have been developed. First, let us consider branching instructions. Delayed branch, a way of increasing the efficiency of the pipeline, makes use of a branch that does not take effect until after execution of the following instruction (hence the term delayed).
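Delayed-branch semantics can be made concrete with a toy interpreter (the miniature instruction set below is invented purely for illustration): the instruction in the slot right after a taken jump still executes before control transfers.

```python
# Trace a program under delayed-branch semantics: on a JUMP, the
# following instruction (the delay slot) executes before the transfer.
def run(prog):
    trace, pc = [], 0
    while pc < len(prog):
        op, arg = prog[pc]
        trace.append(pc)
        if op == "JUMP":
            if pc + 1 < len(prog):     # delay slot: always executed
                trace.append(pc + 1)
            pc = arg
        else:
            pc += 1
    return trace

prog = [("ADD",  None),   # 0
        ("JUMP", 4),      # 1: branch, takes effect only after instr 2
        ("SUB",  None),   # 2: delay slot, executed even though the jump is taken
        ("MUL",  None),   # 3: skipped
        ("HALT", None)]   # 4
trace = run(prog)
print(trace)              # [0, 1, 2, 4]
```

The compiler's job is to put a useful instruction (rather than a NOOP) into position 2, which keeps the pipeline full across the branch.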
  • 34.  LOOP UNROLLING Another compiler technique to improve instruction parallelism is loop unrolling [BACO94]. Unrolling replicates the body of a loop a number of times, called the unrolling factor (u), and iterates by step u instead of step 1.  Unrolling can improve the performance by • reducing loop overhead • increasing instruction parallelism by improving pipeline performance • improving register, data cache, or TLB locality
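A hand-unrolled dot product with u = 4 illustrates the idea (this sketch assumes the vector length is a multiple of 4; a real compiler would add a cleanup loop for the remainder). The four independent partial sums are what expose instruction-level parallelism to the pipeline:

```python
# Straightforward loop: one multiply-add plus loop control per element.
def dot(a, b):
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

# Unrolled by u = 4: a quarter of the loop-control overhead, and four
# independent accumulators that can flow through the pipeline in parallel.
def dot_unrolled4(a, b):
    s0 = s1 = s2 = s3 = 0
    for i in range(0, len(a), 4):       # iterate by step u instead of step 1
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    return s0 + s1 + s2 + s3

a = list(range(8)); b = list(range(8, 16))
print(dot(a, b), dot_unrolled4(a, b))   # 364 364
```

Using separate accumulators rather than one running sum is the key design choice: it removes the chain of true dependencies between consecutive additions.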
  • 35. Instruction Set  The table below lists the basic instruction set for all MIPS R-series processors. All processor instructions are encoded in a single 32-bit word format. All data operations are register to register; the only memory references are pure load/store operations.  The R4000 makes no use of condition codes. If an instruction generates a condition, the corresponding flags are stored in a general-purpose register. This avoids the need for special logic to deal with condition codes as they affect the pipelining mechanism and the reordering of instructions by the compiler. Instead, the mechanisms already implemented to deal with register-value dependencies are employed. Further, conditions mapped onto the register file are subject to the same compile-time optimizations in allocation and reuse as other values stored in registers.
  • 36.  As with most RISC-based machines, the MIPS uses a single 32-bit instruction length. This single instruction length simplifies instruction fetch and decode, and it also simplifies the interaction of instruction fetch with the virtual memory management unit (i.e., instructions do not cross word or page boundaries). The three instruction formats share common formatting of opcodes and register references, simplifying instruction decode. The effect of more complex instructions can be synthesized at compile time.  Only the simplest and most frequently used memory-addressing mode is implemented in hardware. All memory references consist of a 16-bit offset from a 32-bit register.
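"A 16-bit offset from a 32-bit register" means the hardware sign-extends the 16-bit immediate and adds it to the base register. A small sketch of that computation (the register values used below are arbitrary examples):

```python
# MIPS-style effective address: base register + sign-extended 16-bit
# offset, wrapped to 32 bits.
def effective_address(base, offset16):
    if offset16 & 0x8000:          # high bit set: sign-extend to negative
        offset16 -= 0x10000
    return (base + offset16) & 0xFFFFFFFF

print(hex(effective_address(0x10000000, 0x0008)))  # 0x10000008
print(hex(effective_address(0x10000000, 0xFFFC)))  # 0xffffffc  (base - 4)
```

Because this single displacement mode subsumes register-indirect addressing (offset 0), it is the only data-addressing mode the hardware needs.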
  • 37. MIPS R-Series Instruction Set (OP & Description)
Load/Store Instructions
 LB Load Byte
 LBU Load Byte Unsigned
 LH Load Halfword
 LHU Load Halfword Unsigned
 LW Load Word
 LWL Load Word Left
 LWR Load Word Right
 SB Store Byte
 SH Store Halfword
 SW Store Word
 SWL Store Word Left
 SWR Store Word Right
  • 38. Arithmetic Instructions (3-operand, R-type)
 ADD Add
 ADDU Add Unsigned
 SUB Subtract
 SUBU Subtract Unsigned
 SLT Set on Less Than
 SLTU Set on Less Than Unsigned
 AND AND
 OR OR
 XOR Exclusive-OR
 NOR NOR
Arithmetic Instructions (ALU Immediate)
 ADDI Add Immediate
 ADDIU Add Immediate Unsigned
 SLTI Set on Less Than Immediate
 SLTIU Set on Less Than Immediate Unsigned
 ANDI AND Immediate
 ORI OR Immediate
 XORI Exclusive-OR Immediate
 LUI Load Upper Immediate
  • 39. Multiply/Divide Instructions
 MULT Multiply
 MULTU Multiply Unsigned
 DIV Divide
 DIVU Divide Unsigned
 MFHI Move From HI
 MTHI Move To HI
 MFLO Move From LO
 MTLO Move To LO
Shift Instructions
 SLL Shift Left Logical
 SRL Shift Right Logical
 SRA Shift Right Arithmetic
 SLLV Shift Left Logical Variable
 SRLV Shift Right Logical Variable
 SRAV Shift Right Arithmetic Variable
  • 40. Coprocessor Instructions
 LWCz Load Word to Coprocessor
 SWCz Store Word to Coprocessor
 MTCz Move To Coprocessor
 MFCz Move From Coprocessor
 CTCz Move Control To Coprocessor
 CFCz Move Control From Coprocessor
 COPz Coprocessor Operation
 BCzT Branch on Coprocessor z True
 BCzF Branch on Coprocessor z False
Special Instructions
 SYSCALL System Call
 BREAK Break
Jump and Branch Instructions
 J Jump
 JAL Jump and Link
 JR Jump to Register
 JALR Jump and Link Register
 BEQ Branch on Equal
 BNE Branch on Not Equal
 BLEZ Branch on Less Than or Equal to Zero
 BGTZ Branch on Greater Than Zero
 BLTZ Branch on Less Than Zero
 BGEZ Branch on Greater Than or Equal to Zero
 BLTZAL Branch on Less Than Zero And Link
 BGEZAL Branch on Greater Than or Equal to Zero And Link
  • 41. Instruction Pipeline  With its simplified instruction architecture, the MIPS can achieve very efficient pipelining. It is instructive to look at the evolution of the MIPS pipeline, as it illustrates the evolution of RISC pipelining in general.  The initial experimental RISC systems and the first generation of commercial RISC processors achieve execution speeds that approach one instruction per system clock cycle. To improve on this performance, two classes of processors have evolved to offer execution of multiple instructions per clock cycle: superscalar and superpipelined architectures. In essence, a superscalar architecture replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously.
  • 42.  A superpipelined architecture is one that makes use of more, and more fine-grained, pipeline stages. With more stages, more instructions can be in the pipeline at the same time, increasing parallelism.  Both approaches have limitations. With superscalar pipelining, dependencies between instructions in different pipelines can slow down the system. Also, overhead logic is required to coordinate these dependencies. With superpipelining, there is overhead associated with transferring instructions from one stage to the next.
  • 43. RISC VERSUS CISC CONTROVERSY  The work that has been done on assessing merits of the RISC approach can be grouped into two categories: • Quantitative: Attempts to compare program size and execution speed of programs on RISC and CISC machines that use comparable technology • Qualitative: Examines issues such as high-level language support and optimum use of VLSI real estate
  • 44.  Most of the work on quantitative assessment has been done by those working on RISC systems [PATT82b, HEAT84, PATT84], and it has been, by and large, favorable to the RISC approach. Others have examined the issue and come away unconvinced [COLW85a, FLYN87, DAVI87]. There are several problems with attempting such comparisons [SERL86]: • There is no pair of RISC and CISC machines that are comparable in life-cycle cost, level of technology, gate complexity, sophistication of compiler, operating system support, and so on. • No definitive test set of programs exists. Performance varies with the program. • It is difficult to sort out hardware effects from effects due to skill in compiler writing. • Most of the comparative analysis on RISC has been done on “toy” machines rather than commercial products. Furthermore, most commercially available machines advertised as RISC possess a mixture of RISC and CISC characteristics. Thus, a fair comparison with a commercial, “pure-play” CISC machine (e.g., VAX, Pentium) is difficult. The qualitative assessment is, almost by definition, subjective.
  • 46.  A superscalar processor is one in which multiple independent instruction pipelines are used. Each pipeline consists of multiple stages, so that each pipeline can handle multiple instructions at a time. Multiple pipelines introduce a new level of parallelism, enabling multiple streams of instructions to be processed at a time. A superscalar processor exploits what is known as instruction-level parallelism, which refers to the degree to which the instructions of a program can be executed in parallel
  • 47.  A superscalar processor typically fetches multiple instructions at a time and then attempts to find nearby instructions that are independent of one another and can therefore be executed in parallel. If the input to one instruction depends on the output of a preceding instruction, then the latter instruction cannot complete execution at the same time or before the former instruction. Once such dependencies have been identified, the processor may issue and complete instructions in an order that differs from that of the original machine code.  The processor may eliminate some unnecessary dependencies by the use of additional registers and the renaming of register references in the original code.  Whereas pure RISC processors often employ delayed branches to maximize the utilization of the instruction pipeline, this method is less appropriate to a superscalar machine. Instead, most superscalar machines use traditional branch prediction methods to improve efficiency
  • 48.  A superscalar implementation of a processor architecture is one in which common instructions—integer and floating-point arithmetic, loads, stores, and conditional branches—can be initiated simultaneously and executed independently. Such implementations raise a number of complex design issues related to the instruction pipeline
  • 49.  The term superscalar, first coined in 1987 [AGER87], refers to a machine that is designed to improve the performance of the execution of scalar instructions. In most applications, the bulk of the operations are on scalar quantities. Accordingly, the superscalar approach represents the next step in the evolution of high-performance general-purpose processors  The essence of the superscalar approach is the ability to execute instructions independently and concurrently in different pipelines
  • 50. Superscalar versus Superpipelined  An alternative approach to achieving greater performance is referred to as superpipelining. Superpipelining exploits the fact that many pipeline stages perform tasks that require less than half a clock cycle. Thus, a doubled internal clock speed allows the performance of two tasks in one external clock cycle.
  • 51.  The pipeline has four stages: instruction fetch, operation decode, operation execution, and result write back. The execution stage is crosshatched for clarity. Note that although several instructions are executing concurrently, only one instruction is in its execution stage at any one time.  Both the superpipeline and the superscalar implementations have the same number of instructions executing at the same time in the steady state. The superpipelined processor falls behind the superscalar processor at the start of the program and at each branch target
  • 52. Limitations  The superscalar approach depends on the ability to execute multiple instructions in parallel.  The term instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel. A combination of compiler-based optimization and hardware techniques can be used to maximize instruction-level parallelism.  Before examining the design techniques used in superscalar machines to increase instruction-level parallelism, we need to look at the fundamental limitations to parallelism with which the system must cope. [JOHN91] lists five limitations: • True data dependency • Procedural dependency • Resource conflicts • Output dependency • Antidependency
  • 53.  A typical RISC processor takes two or more cycles to perform a load from memory when the load is a cache hit. It can take tens or even hundreds of cycles for a cache miss on all cache levels, because of the delay of an off-chip memory access.  One way to compensate for this delay is for the compiler to reorder instructions so that one or more subsequent instructions that do not depend on the memory load can begin flowing through the pipeline. This scheme is less effective in the case of a superscalar pipeline: the independent instructions executed during the load are likely to be executed on the first cycle of the load, leaving the processor with nothing to do until the load completes.
  • 54. DESIGN ISSUES  Instruction-level parallelism exists when instructions in a sequence are independent and thus can be executed in parallel by overlapping.  As an example of the concept of instruction-level parallelism, consider the following two code fragments [JOUP89b]:

On the left (independent):        On the right (dependent):
Load  R1 ← R2                     Add   R3 ← R3, "1"
Add   R3 ← R3, "1"                Add   R4 ← R3, R2
Add   R4 ← R4, R2                 Store [R4] ← R0
  • 55.  The three instructions on the left are independent, and in theory all three could be executed in parallel. In contrast, the three instructions on the right cannot be executed in parallel because the second instruction uses the result of the first, and the third instruction uses the result of the second.  The degree of instruction-level parallelism is determined by the frequency of true data dependencies and procedural dependencies in the code. These factors, in turn, are dependent on the instruction set architecture and on the application.  Instruction-level parallelism is also determined by what [JOUP89a] refers to as operation latency: the time until the result of an instruction is available for use as an operand in a subsequent instruction. The latency determines how much of a delay a data or procedural dependency will cause
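The independence test can be written down directly: an instruction truly depends on an earlier one when it reads a register that the earlier one writes. A sketch over the two fragments above (the encoding of each instruction as a (destination, sources) pair is our own choice):

```python
# Read-after-write (true data dependency) detection within a fragment.
def has_true_dependency(seq):
    written = set()
    for dest, srcs in seq:
        if written & set(srcs):   # reads a register produced earlier
            return True
        written.add(dest)
    return False

left = [("R1", ("R2",)),          # Load  R1 <- R2
        ("R3", ("R3",)),          # Add   R3 <- R3, 1
        ("R4", ("R4", "R2"))]     # Add   R4 <- R4, R2
right = [("R3", ("R3",)),         # Add   R3 <- R3, 1
         ("R4", ("R3", "R2")),    # Add   R4 <- R3, R2  (reads the new R3)
         ("mem", ("R4", "R0"))]   # Store [R4] <- R0    (reads the new R4)
print(has_true_dependency(left), has_true_dependency(right))  # False True
```

This is essentially the check a superscalar issue unit performs in hardware, cycle by cycle, over the instructions it has fetched.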
  • 56.  Machine parallelism is a measure of the ability of the processor to take advantage of instruction-level parallelism. Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time (the number of parallel pipelines) and by the speed and sophistication of the mechanisms that the processor uses to find independent instructions.  Both instruction-level and machine parallelism are important factors in enhancing performance. A program may not have enough instruction-level parallelism to take full advantage of machine parallelism. The use of a fixed-length instruction set architecture, as in a RISC, enhances instruction-level parallelism. On the other hand, limited machine parallelism will limit performance no matter what the nature of the program
  • 57. Instruction Issue Policy  The processor must also be able to identify instruction-level parallelism and orchestrate the fetching, decoding, and execution of instructions in parallel.  We use the term instruction issue to refer to the process of initiating instruction execution in the processor’s functional units, and the term instruction issue policy to refer to the protocol used to issue instructions.  In general, we can say that instruction issue occurs when an instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline. In essence, the processor is trying to look ahead of the current point of execution to locate instructions that can be brought into the pipeline and executed. Three types of orderings are important in this regard: • The order in which instructions are fetched • The order in which instructions are executed • The order in which instructions update the contents of register and memory locations
  • 58.  In general terms, we can group superscalar instruction issue policies into the following categories: • In-order issue with in-order completion • In-order issue with out-of-order completion • Out-of-order issue with out-of-order completion
  • 59.  IN-ORDER ISSUE WITH IN-ORDER COMPLETION The simplest instruction issue policy is to issue instructions in the exact order that would be achieved by sequential execution (in-order issue) and to write results in that same order (in-order completion). Not even scalar pipelines follow such a simple-minded policy. However, it is useful to consider this policy as a baseline for comparing more sophisticated approaches.
  • 60.  IN-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION Out-of-order completion is used in scalar RISC processors to improve the performance of instructions that require multiple cycles.  With out-of-order completion, any number of instructions may be in the execution stage at any one time, up to the maximum degree of machine parallelism across all functional units. Instruction issuing is stalled by a resource conflict, a data dependency, or a procedural dependency.
  • 61.  OUT-OF-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION With in-order issue, the processor will only decode instructions up to the point of a dependency or conflict. No additional instructions are decoded until the conflict is resolved. As a result, the processor cannot look ahead of the point of conflict to subsequent instructions that may be independent of those already in the pipeline and that may be usefully introduced into the pipeline.  To allow out-of-order issue, it is necessary to decouple the decode and execute stages of the pipeline. This is done with a buffer referred to as an instruction window. With this organization, after the processor has finished decoding an instruction, that instruction is placed in the instruction window. As long as this buffer is not full, the processor can continue to fetch and decode new instructions. When a functional unit becomes available in the execute stage, an instruction from the instruction window may be issued to the execute stage. Any instruction may be issued, provided that  (1) it needs the particular functional unit that is available, and  (2) no conflicts or dependencies block this instruction.
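A toy scheduler makes the window's behaviour visible (the register names, instruction labels, and the one-cycle result latency below are our own simplifications): each cycle, every windowed instruction whose operands are ready issues, and its result becomes visible the following cycle.

```python
# Out-of-order issue from an instruction window: instructions are
# (name, destination, sources) tuples waiting in a buffer.
def issue_schedule(window):
    ready = {"R0", "R1", "R2"}       # registers valid at the start
    pending = list(window)
    cycles = []
    while pending:
        issued = [i for i in pending if set(i[2]) <= ready]
        if not issued:
            raise RuntimeError("unsatisfiable dependency")
        cycles.append([name for name, _, _ in issued])
        ready |= {dest for _, dest, _ in issued}   # results visible next cycle
        pending = [i for i in pending if i not in issued]
    return cycles

window = [("I1", "R3", ("R1",)),
          ("I2", "R4", ("R3",)),    # depends on I1: waits one cycle
          ("I3", "R5", ("R2",))]    # independent: issues alongside I1
print(issue_schedule(window))       # [['I1', 'I3'], ['I2']]
```

Note that I3 overtakes I2 even though it comes later in program order: that reordering is exactly the lookahead capability the instruction window provides.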
  • 62.  The result of this organization is that the processor has a lookahead capability, allowing it to identify independent instructions that can be brought into the execute stage. Instructions are issued from the instruction window with little regard for their original program order. As before, the only constraint is that the program execution behaves correctly
  • 63.  One common technique that is used to support out-of-order completion is the reorder buffer. The reorder buffer is temporary storage for results completed out of order that are then committed to the register file in program order. A related concept is Tomasulo’s algorithm.  The term antidependency is used because the constraint is similar to that of a true data dependency, but reversed: instead of the first instruction producing a value that the second instruction uses, the second instruction destroys a value that the first instruction uses.
  • 64. Register Renaming  One method for coping with these types of storage conflicts is based on a traditional resource-conflict solution: duplication of resources. In this context, the technique is referred to as register renaming. In essence, registers are allocated dynamically by the processor hardware, and they are associated with the values needed by instructions at various points in time.
  • 65.  When a new register value is created (i.e., when an instruction executes that has a register as a destination operand), a new register is allocated for that value. Subsequent instructions that access that value as a source operand in that register must go through a renaming process: the register references in those instructions must be revised to refer to the register containing the needed value. Thus, the same original register reference in several different instructions may refer to different actual registers, if different values are intended
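The renaming process described above can be sketched as a single pass over the instruction stream (the mapping-table representation and the P-register names are our own illustration): every write allocates a fresh physical register, and every read goes through the current map, so output and antidependencies on architectural names disappear.

```python
# Register renaming: map architectural registers (R*) to an unbounded
# supply of physical registers (P*).
def rename(prog, n_arch=8):
    table = {f"R{i}": f"P{i}" for i in range(n_arch)}   # current mapping
    next_phys = n_arch
    out = []
    for dest, srcs in prog:
        srcs = tuple(table[s] for s in srcs)   # reads use the current names
        table[dest] = f"P{next_phys}"          # each write gets a fresh register
        next_phys += 1
        out.append((table[dest], srcs))
    return out

# R3 is written twice (an output dependency) and read in between:
prog = [("R3", ("R1",)),       # R3 <- R1
        ("R4", ("R3",)),       # R4 <- R3   (must read the first R3)
        ("R3", ("R2",))]       # R3 <- R2   (independent once renamed)
print(rename(prog))
# [('P8', ('P1',)), ('P9', ('P8',)), ('P10', ('P2',))]
```

After renaming, the third instruction writes P10 rather than clobbering P8, so it can execute before or in parallel with the second instruction.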
  • 66.  An alternative to register renaming is scoreboarding. In essence, scoreboarding is a bookkeeping technique that allows instructions to execute whenever they are not dependent on previous instructions and no structural hazards are present.
  • 67. Branch Prediction  Any high-performance pipelined machine must address the issue of dealing with branches. For example, the Intel 80486 addressed the problem by fetching both the next sequential instruction after a branch and speculatively fetching the branch target instruction. However, because there are two pipeline stages between prefetch and execution, this strategy incurs a two-cycle delay when the branch gets taken
  • 68.  With the advent of RISC machines, the delayed branch strategy was explored. It allows the processor to calculate the result of a conditional branch instruction before any unusable instructions have been prefetched: the processor always executes the single instruction that immediately follows the branch. This keeps the pipeline full while the processor fetches a new instruction stream.  With the development of superscalar machines, the delayed branch strategy has less appeal. The reason is that multiple instructions would need to execute in the delay slot, raising several problems relating to instruction dependencies. Thus, superscalar machines have returned to pre-RISC techniques of branch prediction. Some, like the PowerPC 601, use a simple static branch prediction technique. More sophisticated processors, such as the PowerPC 620 and the Pentium 4, use dynamic branch prediction based on branch history analysis.
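The classic building block of history-based dynamic prediction is the 2-bit saturating counter (the state machine below is the textbook scheme; the branch-outcome sequence is an invented example): two consecutive mispredictions are needed before the prediction flips, so an occasional not-taken outcome in a loop does not disturb it.

```python
# 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # move one step toward the observed outcome, saturating at 0 and 3
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
# A loop-like branch: taken repeatedly, with one not-taken in the middle.
history = [True, True, True, False, True, True, True]
hits = 0
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(history))   # 4 of 7
```

After warming up, the single not-taken outcome costs one misprediction but does not flip the prediction, which is precisely the hysteresis the two bits buy.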
  • 69. Superscalar Implementation  Based on our discussion so far, we can make some general comments about the processor hardware required for the superscalar approach. [SMIT95] lists the following key elements: • Instruction fetch strategies that simultaneously fetch multiple instructions, often by predicting the outcomes of, and fetching beyond, conditional branch instructions. These functions require the use of multiple pipeline fetch and decode stages, and branch prediction logic. • Logic for determining true dependencies involving register values, and mechanisms for communicating these values to where they are needed during execution
  • 70.  Mechanisms for initiating, or issuing, multiple instructions in parallel. • Resources for parallel execution of multiple instructions, including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references. • Mechanisms for committing the process state in correct order.
  • 71. PENTIUM 4  Although the concept of superscalar design is generally associated with the RISC architecture, the same superscalar principles can be applied to a CISC machine. Perhaps the most notable example of this is the Pentium. The evolution of superscalar concepts in the Intel line is interesting to note. The 386 is a traditional CISC nonpipelined machine.  The 486 introduced the first pipelined x86 processor, reducing the average latency of integer operations from between two and four cycles to one cycle, but still limited to executing a single instruction each cycle, with no superscalar elements. The original Pentium had a modest superscalar component, consisting of the use of two separate integer execution units. The Pentium Pro introduced a full-blown superscalar design. Subsequent Pentium models have refined and enhanced the superscalar design.  A general block diagram of the Pentium 4 is shown in the figure below; it depicts the structure in a way suited to the pipeline discussion in this section. The operation of the Pentium 4 can be summarized as follows: 1. The processor fetches instructions from memory in the order of the static program. 2. Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations, or micro-ops.
  • 72. 3. The processor executes the micro-ops on a superscalar pipeline organization, so that the micro-ops may be executed out of order. 4. The processor commits the results of each micro-op execution to the processor’s register set in the order of the original program flow
  • 73. Pentium 4 Block Diagram (figure not reproduced; see textbook p. 538)
  • 75. Front End  GENERATION OF MICRO-OPS The Pentium 4 organization includes an in-order front end that can be considered outside the scope of the pipeline depicted in figure above. This front end feeds into an L1 instruction cache, called the trace cache, which is where the pipeline proper begins. Usually, the processor operates from the trace cache; when a trace cache miss occurs, the in-order front end feeds new instructions into the trace cache.