Faculty Development Programme on Performance Tuning

Dijesh P
27 July 2012

Quantitative Principles of Computer Design
• The most important principle of computer
design is to make the common case fast.
• That is, favor the frequent case over the
infrequent case.
• Improving the frequent event rather than the
infrequent event does more to improve
overall performance.
• Amdahl’s law can be used to quantify this
principle.

Amdahl’s Law
• States that “the performance improvement
to be gained from using some faster mode
of execution is limited by the fraction of
the time the faster mode can be used”.
• Amdahl’s law defines “Speedup” that can
be gained by using a particular feature.
• We can make an enhancement to a
machine that will improve the performance
when it is used.

• Speedup, in terms of performance:
$$\text{Speedup} = \frac{\text{Performance for entire task using enhancement when possible}}{\text{Performance for entire task without using enhancement}}$$
• Alternatively, in terms of execution time:
$$\text{Speedup} = \frac{\text{Execution time for entire task without using enhancement}}{\text{Execution time for entire task using enhancement when possible}}$$
• Speedup tells us how much faster a task will run
using the machine with the enhancement, as opposed
to the original machine.

• Amdahl’s law gives a quick way to find the
speedup from some enhancement, which
depends on two factors.
• The fraction of computation time in the
original machine that can be converted to
take advantage of the enhancement.
(Fraction will always be less than or equal
to 1).
• The improvement gained by the enhanced
mode of execution; that is how much faster
the task would run if the enhanced mode
were used for the entire program.

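$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$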

• The overall speedup is the ratio of the execution
times.

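$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$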
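
As a quick numerical check of Amdahl's law, the sketch below evaluates the overall speedup formula. The function name `amdahl_speedup` and the sample figures (an enhancement that is 10 times faster but usable for only 40% of the original execution time) are illustrative assumptions, not values from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S), following Amdahl's law."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical case: the enhanced mode is 10x faster and covers 40% of the time.
print(round(amdahl_speedup(0.4, 10.0), 4))  # 1 / (0.6 + 0.04) = 1.5625
```

Even with a tenfold faster mode, the overall gain here is only about 1.56, because 60% of the time is left untouched.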

The CPU performance Equation
• Essentially all computers are constructed
using a clock running at a constant rate.
• These discrete time events are called ticks,
clock ticks, clock periods, clocks, cycles, or
clock cycles.
• Designers refer to the time of a clock
period by its duration (e.g., 1 ns) or by its
rate (e.g., 1 GHz).

• CPU time for a program can be expressed
in two ways.
$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$
– OR
$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

• We can also count the number of
instructions executed: the instruction count
(IC).
• If we know the number of clock cycles and
the instruction count, we can calculate the
average number of clock cycles per
instruction (CPI):
$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$

• CPU time = Instruction count × Cycles per
instruction × Clock cycle time.
• CPU time is dependent on three
characteristics:
– Clock cycle time (clock rate): hardware
technology and organization.
– Clock cycles per instruction (CPI):
organization and ISA.
– Instruction count: ISA and compiler
technology.

Principle of Locality
• Programs tend to reuse data and
instructions they have used recently.
– Temporal locality: recently accessed items
are likely to be accessed in the near future.
– Spatial locality: items whose addresses are
near one another tend to be referenced close
together in time.
• We can predict what instructions and data
a program will use in the near future
based on its accesses in the recent past.
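
One way to see locality at work is to replay an address trace through a tiny cache model. The sketch below is an assumption-laden toy, not something from the slides: it models a small fully associative cache with least-recently-used replacement, and `hit_rate`, `block_size`, and `num_blocks` are hypothetical names.

```python
from collections import OrderedDict


def hit_rate(addresses, block_size=8, num_blocks=4):
    """Fraction of accesses that hit in a small fully associative LRU cache (toy model)."""
    addresses = list(addresses)
    if not addresses:
        return 0.0
    cache = OrderedDict()  # block number -> None, ordered least to most recently used
    hits = 0
    for address in addresses:
        block = address // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = None            # miss: bring the whole block in
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(addresses)


print(hit_rate(range(64)))               # sequential scan (spatial locality): 0.875
print(hit_rate([0, 64] * 4))             # two addresses reused (temporal locality): 0.75
print(hit_rate(range(0, 64 * 64, 64)))   # large stride, no reuse: 0.0
```

The sequential and reuse-heavy traces hit most of the time, while the strided trace with no reuse never hits in this model.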

Instruction Set Architecture
• The type of internal storage in a processor
is the most basic difference between
instruction set architectures.
• The major choices are: Stack, Accumulator
(AC) or a set of registers.
• The operands in a stack architecture are
implicitly on the top of the stack.
• In an AC architecture, one operand is
implicitly the AC.

• A general-purpose register architecture
has only explicit operands: either
registers or memory locations.
• Consider the code sequence for C = A + B
in each class of architecture:

Stack      Accumulator   Register            Register
                         (reg-mem)           (load-store)
PUSH A     LOAD A        LOAD R1,A           LOAD R1,A
PUSH B     ADD B         ADD R3,R1,B         LOAD R2,B
ADD        STORE C       STORE R3,C          ADD R3,R1,R2
POP C                                        STORE R3,C
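
To make the difference between implicit and explicit operands concrete, the sketch below interprets the stack and load-store sequences for C = A + B. The two tiny interpreters (`run_stack`, `run_load_store`) and the tuple encoding of instructions are hypothetical, written only to mirror the code sequences above.

```python
def run_stack(program, memory):
    """Run a stack-machine program; ADD takes its operands implicitly from the stack."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(memory[args[0]])
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "POP":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack instruction: {op}")
    return memory


def run_load_store(program, memory):
    """Run a load-store program; only LOAD and STORE touch memory."""
    regs = {}
    for op, *args in program:
        if op == "LOAD":
            regs[args[0]] = memory[args[1]]
        elif op == "ADD":
            dest, src1, src2 = args
            regs[dest] = regs[src1] + regs[src2]
        elif op == "STORE":
            memory[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown load-store instruction: {op}")
    return memory


STACK_PROGRAM = [("PUSH", "A"), ("PUSH", "B"), ("ADD",), ("POP", "C")]
LOAD_STORE_PROGRAM = [
    ("LOAD", "R1", "A"),
    ("LOAD", "R2", "B"),
    ("ADD", "R3", "R1", "R2"),
    ("STORE", "R3", "C"),
]

print(run_stack(STACK_PROGRAM, {"A": 2, "B": 3})["C"])            # 5
print(run_load_store(LOAD_STORE_PROGRAM, {"A": 2, "B": 3})["C"])  # 5
```

Both programs compute the same result; the stack version names no operands in ADD, while the load-store version names every register explicitly.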

• There are two classes of register
computers: register-memory architecture
and register-register (load-store)
architecture.
• A register-memory architecture can access
memory as part of any instruction, whereas
a register-register architecture can access
memory only with load and store
instructions.
• A third class, no longer used nowadays, is
the memory-memory architecture, which
keeps all the operands in memory.

• Some ISAs have more registers than a
single accumulator, but place restrictions
on the uses of these special-purpose registers.
• Such an architecture is called an extended
accumulator or special-purpose register
computer.
• Almost all computers nowadays use a
load-store register architecture. There
are two reasons for this: registers are
faster than memory, and compilers can use
registers efficiently.

• Registers can be used to hold variables.
• When variables are allocated to registers,
the memory traffic reduces, and the
program speeds up.
• Two major concerns in the ISA are:
– Whether an ALU instruction has two or three
operands.
– How many of the operands may be memory
addresses in an ALU instruction.
• These choices divide GPR architectures into
different sub-categories.

No. of memory   Max no. of   Architecture           Examples
addresses       operands
0               3            Reg-Reg (load-store)   MIPS, ARM, PowerPC
1               2            Reg-Mem                IBM 360/370, Intel 80x86
2               2            Mem-Mem                VAX
3               3            Mem-Mem                VAX

VAX is a 32-bit computing architecture that supports
virtual addressing. It was developed in the mid-1970s
by Digital Equipment Corporation (DEC).
DEC was later purchased by Compaq, which in
turn was purchased by Hewlett-Packard.

Register- Register
• Advantages:
– Simple, fixed length instructions.
– Simple Code generation model
– Instructions take similar number of clock
cycles.
• Disadvantages:
– Higher instruction count than architectures
with memory references in instructions.
– More instructions lead to larger programs.

Register - Memory
• Advantages:
– Data can be accessed without a separate load
instruction first.
– Instruction format tends to be easy to encode.
• Disadvantages:
– Operands are not equivalent, since a source
operand in a binary operation is destroyed.
– Encoding a register number and a memory address in
each instruction may restrict the number of
registers.
– Clocks per instruction vary by operand location.

Memory – Memory(2,2) or (3,3)
• Advantages:
– Most compact.
– Doesn’t waste registers for temporary data.
• Disadvantages:
– Large variation in instruction size, especially
for three-operand instructions.
– Large variation in work per instruction.
– Memory accesses create a memory bottleneck.

Memory Addressing
• An architecture must specify how
memory addresses are interpreted,
irrespective of whether it is
register-register (load-store) or allows
memory operands in any instruction.
• The measurements presented here are
largely computer-independent.
• In some cases the measurements are
affected by the compiler technology.

Interpreting Memory Addresses
• All instruction sets are assumed to be byte
addressed and provide access for bytes (8
bits), half words (16 bits), words (32 bits),
and most computers provide access for
double words (64 bits).
• There are two conventions for ordering the
bytes within a large object: Little endian
and Big endian.

• Little endian byte order puts the byte whose
address is “x….x000” at the least-significant
position in the double word.
• The bytes are numbered
7 6 5 4 3 2 1 0

• Big endian byte order puts the byte whose
address is “x….x000” at the most-significant
position in the double word.
• The bytes are numbered
0 1 2 3 4 5 6 7
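
The two byte orders can be inspected directly with Python's built-in `int.to_bytes` and `int.from_bytes`. In this sketch the sample value is an assumption chosen so that byte number i of the double word (counting from the least-significant end) holds the value i; the variable names are illustrative.

```python
# Double word in which byte i (from the least-significant end) holds the value i.
value = 0x0706050403020100

# Each list shows memory contents starting at the lowest address ("x...x000").
little = list(value.to_bytes(8, byteorder="little"))
big = list(value.to_bytes(8, byteorder="big"))

print(little)  # [0, 1, 2, 3, 4, 5, 6, 7]: lowest address holds the least-significant byte
print(big)     # [7, 6, 5, 4, 3, 2, 1, 0]: lowest address holds the most-significant byte

# Loading the characters "ABCD" as a little-endian word puts 'A' (0x41) in the
# least-significant position, so the register reads 44 43 42 41, i.e. "DCBA".
word = int.from_bytes(b"ABCD", byteorder="little")
print(hex(word))  # 0x44434241
```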

Issues in interpreting memory addresses
• Little endian ordering fails to match the
normal ordering of characters when strings
are compared as words.
• Strings appear backwards (“SDRAWKCAB”) in
the registers.
• In many computers, accesses to objects
larger than a byte must be aligned.
• An access to an object of size s bytes at
byte address A is aligned if A mod s = 0.

• Misalignment causes hardware
complications, since memory is usually
aligned on a multiple of word or double
word boundary.
• A misaligned memory access may take
multiple aligned memory references.
• In computers that allow misaligned access,
programs with aligned access run faster.
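
The alignment rule A mod s = 0 is easy to check in code. The sketch below also counts how many aligned references a single access needs, under the assumption (not stated in the slides) that memory is read in aligned chunks of `width` bytes; both function names are hypothetical.

```python
def is_aligned(address: int, size: int) -> bool:
    """An access of `size` bytes at byte address `address` is aligned if address mod size == 0."""
    return address % size == 0


def aligned_references(address: int, size: int, width: int = 8) -> int:
    """Number of aligned `width`-byte memory references that cover the access (toy model)."""
    first_chunk = address // width
    last_chunk = (address + size - 1) // width
    return last_chunk - first_chunk + 1


print(is_aligned(8, 4), aligned_references(8, 4))  # True 1
print(is_aligned(6, 4), aligned_references(6, 4))  # False 2
```

The misaligned 4-byte access at address 6 straddles two 8-byte chunks, so it costs two aligned references in this model.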

Addressing modes
• Given an address, we know which bytes of
memory to access. Addressing modes are how
an instruction specifies that address.
• Addressing modes can specify constants and
registers in addition to locations in
memory.
• When a memory location is used, the
actual memory address specified by the
addressing mode is called effective
address.

Categories
Addressing mode     Example instruction     Meaning
Register            Add R4, R3              Reg[R4] ← Reg[R4] + Reg[R3]
Immediate           Add R4, #3              Reg[R4] ← Reg[R4] + 3
Displacement        Add R4, 100(R1)         Reg[R4] ← Reg[R4] + Mem[100 + Reg[R1]]
Register indirect   Add R4, (R1)            Reg[R4] ← Reg[R4] + Mem[Reg[R1]]
Indexed             Add R3, (R1+R2)         Reg[R3] ← Reg[R3] + Mem[Reg[R1] + Reg[R2]]
Direct              Add R1, (1001)          Reg[R1] ← Reg[R1] + Mem[1001]
Memory indirect     Add R1, @(R3)           Reg[R1] ← Reg[R1] + Mem[Mem[Reg[R3]]]
Autoincrement       Add R1, (R2)+           Reg[R1] ← Reg[R1] + Mem[Reg[R2]]
                                            Reg[R2] ← Reg[R2] + d
Autodecrement       Add R1, -(R2)           Reg[R2] ← Reg[R2] – d
                                            Reg[R1] ← Reg[R1] + Mem[Reg[R2]]
Scaled              Add R1, 100(R2)[R3]     Reg[R1] ← Reg[R1] + Mem[100 + Reg[R2] + Reg[R3] * d]

(d is the size of the data element being accessed.)
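
The effective-address calculations in the table above can be sketched as a small function. Everything here is illustrative: the mode names, the `operands` tuple layout, and the register and memory contents are assumptions. Autoincrement and autodecrement are left out because they also modify a register.

```python
def effective_address(mode, reg, mem, operands, d=4):
    """Return the memory address an operand refers to, for the memory-referencing modes."""
    if mode == "displacement":        # e.g. 100(R1)
        disp, base = operands
        return disp + reg[base]
    if mode == "register_indirect":   # e.g. (R1)
        (base,) = operands
        return reg[base]
    if mode == "indexed":             # e.g. (R1+R2)
        base, index = operands
        return reg[base] + reg[index]
    if mode == "direct":              # e.g. (1001)
        (address,) = operands
        return address
    if mode == "memory_indirect":     # e.g. @(R3): the address is itself read from memory
        (base,) = operands
        return mem[reg[base]]
    if mode == "scaled":              # e.g. 100(R2)[R3], index scaled by element size d
        disp, base, index = operands
        return disp + reg[base] + reg[index] * d
    raise ValueError(f"mode does not reference memory or is not modelled: {mode}")


# Hypothetical machine state.
reg = {"R1": 1000, "R2": 16, "R3": 2}
mem = {2: 500}

print(effective_address("displacement", reg, mem, (100, "R1")))       # 1100
print(effective_address("indexed", reg, mem, ("R1", "R2")))           # 1016
print(effective_address("memory_indirect", reg, mem, ("R3",)))        # 500
print(effective_address("scaled", reg, mem, (100, "R2", "R3"), d=4))  # 124
```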

Usage of different addressing modes

• Register When a value is in register.
• Immediate For Constants
• Displacement Accessing Local
variables.
• Reg Indirect Accessing using a pointer
or an address.
• Indexed Array addressing. (R1= base of
array and R2=index amount)

• Direct Sometimes useful for accessing
static data.
• Mem Indirect If R3 is the address of a
pointer p, then mode yields *p;
• Autoincrement Stepping through arrays
within a loop. R2 points to start of an
array; each reference increment R2 by
size of an element d.

Operations in the Instruction Set
Operator type            Examples
Arithmetic and logical   add, subtract, and, or
Data transfer            loads, stores
Control                  branch, jump, procedure call
System                   OS call, virtual memory management instructions
Floating point           add, multiply, divide, compare
Decimal                  add, multiply, decimal-to-character conversion
String                   move, search, compare
Graphics                 pixel and vertex operations,
                         compression, decompression

Instructions for control flow
• Four different types of control flow
changes:

– Conditional branches
– Jumps
– Procedure calls
– Procedure returns

Encoding an instruction set
• Several factors affect how the instructions are
encoded into a binary representation.
• The representation affects the size of the
compiled program and the implementation of the
processor (which must decode the representation
to quickly find the operation and its operands).
• The operation is specified in one field, called
the opcode.
• The important decision is how to encode the
addressing modes with the operations.

• This depends on the range of addressing
modes.
• Some older computers have one to five
operands with 10 addressing modes for
each operand.
• For such a large number of combinations, a
separate address specifier is needed for
each operand.
• When encoding an instruction, the number of
registers and the number of addressing modes
both have an impact on the size of the
instruction.
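
A rough feel for how register count and addressing modes affect instruction size comes from counting field bits. The model below is a deliberate simplification and an assumption of this sketch: each operand carries one addressing-mode specifier and one register number, and displacement or immediate fields are ignored. The function names and the 8-bit opcode are hypothetical.

```python
def field_bits(choices: int) -> int:
    """Minimum number of bits needed to select one of `choices` alternatives."""
    if choices < 1:
        raise ValueError("choices must be at least 1")
    return (choices - 1).bit_length()


def instruction_bits(opcode_bits: int, operands: int, registers: int, addressing_modes: int) -> int:
    """Opcode plus, per operand, a mode specifier field and a register field (toy model)."""
    per_operand = field_bits(addressing_modes) + field_bits(registers)
    return opcode_bits + operands * per_operand


# Three operands, 10 addressing modes each, hypothetical 8-bit opcode.
print(instruction_bits(8, 3, 16, 10))  # 16 registers: 8 + 3 * (4 + 4) = 32
print(instruction_bits(8, 3, 32, 10))  # 32 registers: 8 + 3 * (4 + 5) = 35
```

Doubling the register count adds one bit per operand, which in this model pushes a three-operand instruction past 32 bits.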

Competing forces – instruction encoding

1. Desire to have as many registers and
addressing modes as possible.
2. Impact of the size of the register and
addressing mode fields on the average
instruction size, and hence on the average
program size.
3. Desire to have instructions encoded into
lengths that will be easy to handle in a
pipelined implementation.

• Three choices for encoding an instruction
set are:
– Fixed Combines the operation and the
addressing mode into the opcode. Have only
a single size for all instructions.
– Variable Allows all addressing modes to be
with all operations. This style is best when
there are many addressing modes and
operations.
– Hybrid Has multiple formats.
