ARM
• ARM: Advanced RISC Machines
• Most widely used 32-bit RISC instruction set
architecture
• The relative simplicity makes it suitable for low power
devices
• ARM7, ARM9, ARM11 and Cortex
• Approximately 90% of all embedded 32-bit RISC
processors
• Used extensively in consumer electronics,
including PDAs, mobile phones, digital media and music
players, hand-held game consoles, calculators and
computer peripherals such as hard drives and routers.
Product Code Description
• M: Multiplier
The processor has a hardware multiplier unit that performs
multiplication
• I: EmbeddedICE macrocell
On-chip debug hardware (breakpoints and watchpoints). Used in
advanced debugging.
• E: Enhanced Instruction Set
• J: Java Acceleration by Jazelle mode
Hardware circuit used for running JAVA byte code
• F: Vector Floating point
Hardware implementation of floating-point operations.
• S: Synthesizable Version
The core is supplied as a soft (synthesizable) processor core,
so it can be modified.
Example
• ARM7TDMI
This is the ARM7 family processor which has T = Thumb
instruction set, D = Debug unit, M = enhanced Multiplier,
I = EmbeddedICE macrocell.
• ARM946E-S
1. ARM9xx core
2. Enhanced Instruction set
3. Synthesizable
ARM
• ARM has 3 instruction set states
1. 32-bit ARM instruction set
2. 16-bit Thumb instruction set
3. 8-bit Jazelle instruction set
• ARM – 32 bit Load/Store architecture with every instruction
being conditional.
• Thumb- 16 bit with only branch instructions being conditional
and only half of the registers used
• Jazelle- Allows Java byte code to be directly executed in ARM
architecture. Improves performance by 5x-10x
ARM- Processor Modes
• Seven basic operating modes exist:
1. User: Unprivileged mode under which most tasks run
2. FIQ: Entered when a high priority interrupt is raised
3. IRQ: Entered when a low priority interrupt is raised
4. Supervisor: Entered on reset and when a Software
Interrupt (SWI) instruction is executed
5. Abort: Used to handle memory access violations
6. Undef: Used to handle undefined instructions
7. System: Privileged mode using the same registers as user
mode.
Register Organization Summary
Register    User   FIQ      IRQ      SVC      Undef    Abort
r0-r7       own    User     User     User     User     User
r8-r12      own    banked   User     User     User     User
r13 (sp)    own    banked   banked   banked   banked   banked
r14 (lr)    own    banked   banked   banked   banked   banked
r15 (pc)    own    User     User     User     User     User
cpsr        own    User     User     User     User     User
spsr        -      banked   banked   banked   banked   banked
User = the mode uses the User mode register; banked = the mode has its own copy.
Thumb state low registers: r0-r7. Thumb state high registers: r8-r12.
Note: System mode uses the User mode register set
ARM- The Registers
• ARM has 37 registers, all of which are 32 bits long.
– 1 dedicated program counter
– 1 dedicated current program status register
– 5 dedicated saved program status registers
– 30 general purpose registers
• The current processor mode governs which of several banks is
accessible. Each mode can access
– a particular set of r0-r12 registers
– a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
– the program counter, r15(pc)
– the current program status register, cpsr
Privileged modes (except System) can also access
– a particular spsr (saved program status register)
Program Status Registers
Bit:    31 30 29 28 27 | 24 | 23 ... 8  | 7 6 5 | 4 ... 0
Field:  N  Z  C  V  Q  | J  | undefined | I F T | mode
Byte fields: f = [31:24], s = [23:16], x = [15:8], c = [7:0]
• Condition code flags
– N = Negative result from ALU
– Z = Zero result from ALU
– C = ALU operation Carried out
– V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
– Architecture 5TE/J only
– Indicates if saturation has occurred
• J bit
– Architecture 5TEJ only
– J = 1: Processor in Jazelle state
• Interrupt Disable bits
– I = 1: Disables the IRQ.
– F = 1: Disables the FIQ.
• T Bit
– Architecture xT only
– T = 0: Processor in ARM state
– T = 1: Processor in Thumb state
• Mode bits
– Specify the processor mode
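The bit layout of the status register can be illustrated with a small decoder. This is only a sketch in Python: the function name `decode_psr` and the `MODES` table are illustrative, and the five-bit mode encodings are an assumption taken from the standard ARM architecture documentation rather than from this slide.

```python
# Mode-bit encodings (bits [4:0]); assumed from the ARM architecture
# documentation, they are not listed on the slide itself.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_psr(psr):
    """Split a 32-bit CPSR/SPSR value into its named fields."""
    return {
        "N": (psr >> 31) & 1,   # negative
        "Z": (psr >> 30) & 1,   # zero
        "C": (psr >> 29) & 1,   # carry
        "V": (psr >> 28) & 1,   # overflow
        "Q": (psr >> 27) & 1,   # sticky overflow (5TE/J only)
        "J": (psr >> 24) & 1,   # Jazelle state (5TEJ only)
        "I": (psr >> 7) & 1,    # 1 = IRQ disabled
        "F": (psr >> 6) & 1,    # 1 = FIQ disabled
        "T": (psr >> 5) & 1,    # 1 = Thumb state
        "mode": MODES.get(psr & 0x1F, "unknown"),
    }
```

For example, `decode_psr(0x600000D3)` reports Z and C set, both interrupts disabled, ARM state, and Supervisor mode.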
Program Counter (r15)
• When the processor is executing in ARM state:
– All instructions are 32 bits wide
– All instructions must be word aligned
– Therefore the PC value is stored in bits [31:2] with bits [1:0] undefined (as
instruction cannot be halfword or byte aligned).
• When the processor is executing in Thumb state:
– All instructions are 16 bits wide
– All instructions must be halfword aligned
– Therefore the PC value is stored in bits [31:1] with bit [0] undefined (as
instruction cannot be byte aligned).
• When the processor is executing in Jazelle state:
– All instructions are 8 bits wide
– Processor performs a word access to read 4 instructions at once
Exception Handling
• When an exception occurs, the ARM:
– Copies CPSR into SPSR_<mode>
– Sets appropriate CPSR bits
• Change to ARM state
• Change to exception mode
• Disable interrupts (if appropriate)
– Stores the return address in LR_<mode>
– Sets PC to vector address
• To return, exception handler needs to:
– Restore CPSR from SPSR_<mode>
– Restore PC from LR_<mode>
This can only be done in ARM state.
Vector Table
0x1C FIQ
0x18 IRQ
0x14 (Reserved)
0x10 Data Abort
0x0C Prefetch Abort
0x08 Software Interrupt
0x04 Undefined Instruction
0x00 Reset
Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices
Development of the
ARM Architecture
• Versions 1, 2, 3: Early ARM architectures
• Version 4: Halfword and signed halfword / byte support, System mode
– Cores: SA-110, SA-1110
• Version 4T: Thumb instruction set
– Cores: ARM7TDMI, ARM9TDMI, ARM720T, ARM940T
• Version 5TE: Improved ARM/Thumb Interworking, CLZ, Saturated maths,
DSP multiply-accumulate instructions
– Cores: ARM1020E, XScale, ARM9E-S, ARM966E-S
• Version 5TEJ: Jazelle, Java bytecode execution
– Cores: ARM9EJ-S, ARM926EJ-S, ARM7EJ-S, ARM1026EJ-S
• Version 6: SIMD Instructions, Multi-processing, V6 Memory
architecture (VMSA), Unaligned data support
– Cores: ARM1136EJ-S
Main features of the
ARM Instruction Set
• All instructions are 32 bits long.
• Most instructions execute in a single cycle.
• Every instruction can be conditionally executed.
• A load/store architecture
– Data processing instructions act only on registers
– Specific memory access instructions with powerful
auto-indexing addressing modes.
• Three operand format
• Combined ALU and shifter for high speed bit manipulation
Conditional Execution
• Most instruction sets only allow branches to be executed
conditionally, by postfixing them with the appropriate condition
code field.
• However, by reusing the condition evaluation hardware, ARM
effectively increases the number of available instructions.
– All instructions contain a condition field which determines whether
the CPU will execute them.
– Non-executed instructions soak up 1 cycle.
• Still have to complete cycle so as to allow fetching and decoding of following
instructions.
• This removes the need for many branches, which stall the pipeline
(3 cycles to refill).
– Allows very dense in-line code, without branches.
– The Time penalty of not executing several conditional instructions is
frequently less than overhead of the branch
or subroutine call that would otherwise be needed.
The Condition Field
Cond occupies bits [31:28] of every instruction.
0000 = EQ - Z set (equal)
0001 = NE - Z clear (not equal)
0010 = HS / CS - C set (unsigned higher or same)
0011 = LO / CC - C clear (unsigned lower)
0100 = MI - N set (negative)
0101 = PL - N clear (positive or zero)
0110 = VS - V set (overflow)
0111 = VC - V clear (no overflow)
1000 = HI - C set and Z clear (unsigned higher)
1001 = LS - C clear or Z set (unsigned lower or same)
1010 = GE - N set and V set, or N clear and V clear (> or =)
1011 = LT - N set and V clear, or N clear and V set (<)
1100 = GT - Z clear, and either N set and V set, or N clear and V clear (>)
1101 = LE - Z set, or N set and V clear, or N clear and V set (< or =)
1110 = AL - always
1111 = NV - reserved.
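The condition field can be read as a lookup from a 4-bit code to a test on the N, Z, C and V flags. The sketch below is a Python model only (the name `condition_passed` is illustrative); the flag tests follow the standard ARM definitions, where the signed comparisons reduce to comparing N with V.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with 4-bit condition `cond` would
    execute, given the current N, Z, C, V flags (each 0 or 1)."""
    checks = {
        0b0000: z,                     # EQ: equal
        0b0001: not z,                 # NE: not equal
        0b0010: c,                     # HS/CS: unsigned higher or same
        0b0011: not c,                 # LO/CC: unsigned lower
        0b0100: n,                     # MI: negative
        0b0101: not n,                 # PL: positive or zero
        0b0110: v,                     # VS: overflow
        0b0111: not v,                 # VC: no overflow
        0b1000: c and not z,           # HI: unsigned higher
        0b1001: (not c) or z,          # LS: unsigned lower or same
        0b1010: n == v,                # GE: signed >=
        0b1011: n != v,                # LT: signed <
        0b1100: (not z) and n == v,    # GT: signed >
        0b1101: z or n != v,           # LE: signed <=
        0b1110: True,                  # AL: always
    }
    if cond not in checks:
        raise ValueError("condition code 1111 (NV) is reserved")
    return bool(checks[cond])
```

After a comparison that leaves N = 1 and V = 0 (signed less-than), `condition_passed(0b1011, 1, 0, 0, 0)` is true while the GE code `0b1010` fails.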
Using and updating the Condition Field
• To execute an instruction conditionally, simply postfix it with the
appropriate condition:
– For example an add instruction takes the form:
• ADD r0,r1,r2 ; r0 = r1 + r2 (ADDAL)
– To execute this only if the zero flag is set:
• ADDEQ r0,r1,r2 ; If zero flag set then…
; ... r0 = r1 + r2
• By default, data processing operations do not affect the condition
flags (apart from the comparisons where this is the only effect). To
cause the condition flags to be updated, the S bit of the instruction
needs to be set by postfixing the instruction (and any condition
code) with an “S”.
– For example to add two numbers and set the condition flags:
• ADDS r0,r1,r2 ; r0 = r1 + r2
; ... and set flags
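What the S suffix does to the flags can be modelled in a few lines. This is a sketch in Python, not processor code: `adds` is an illustrative name, and it models only the 32-bit addition case of ADDS.

```python
MASK32 = 0xFFFFFFFF


def adds(rn, op2):
    """Model ADDS Rd, Rn, Operand2: return (result, flags) for a 32-bit
    addition that also updates the N, Z, C and V condition flags."""
    rn &= MASK32
    op2 &= MASK32
    full = rn + op2                 # unbounded sum, used to detect carry
    result = full & MASK32          # value written to Rd
    flags = {
        "N": result >> 31,                      # top bit of the result
        "Z": int(result == 0),                  # result is zero
        "C": int(full > MASK32),                # carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "V": (((rn ^ result) & (op2 ^ result)) >> 31) & 1,
    }
    return result, flags
```

Adding 1 to 0x7FFFFFFF sets N and V (signed overflow), while adding 1 to 0xFFFFFFFF sets Z and C (unsigned wrap-around). A plain ADD would produce the same result but leave the flags untouched.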
Data processing Instructions
• Largest family of ARM instructions, all sharing the same
instruction format.
• Contains:
– Arithmetic operations
– Comparisons (no results - just set condition codes)
– Logical operations
– Data movement between registers
• Remember, this is a load / store architecture
– These instructions only work on registers, NOT memory.
• They each perform a specific operation on one or two
operands.
– First operand always a register - Rn
– Second operand sent to the ALU via barrel shifter.
Data Movement
• Operations are:
– MOV operand2
– MVN NOT operand2
Note that these make no use of operand1, i.e. operand1
is ignored.
• Syntax:
– <Operation>{<cond>}{S} Rd, Operand2
• Examples:
– MOV r0, r1
– MOVS r2, #10
– MVNEQ r1,#0
Arithmetic Operations
• Operations are:
– ADD operand1 + operand2
– ADC operand1 + operand2 + carry
– SUB operand1 - operand2
– SBC operand1 - operand2 + carry -1
– RSB operand2 - operand1
– RSC operand2 - operand1 + carry - 1
• Syntax:
– <Operation>{<cond>}{S} Rd, Rn, Operand2
• Examples
– ADD r0, r1, r2
– SUBGT r3, r3, #1
– RSBLES r4, r5, #5
– SUB r4,r5,r7,LSR r2 ; Logical right shift R7 by the number in
; the bottom byte of R2, subtract result
; from R5, and put the answer into R4.
Logical Operations
• Operations are:
– AND operand1 AND operand2
– EOR operand1 EOR operand2
– ORR operand1 OR operand2
– BIC operand1 AND NOT operand2 [ie bit clear]
• Syntax:
– <Operation>{<cond>}{S} Rd, Rn, Operand2
• Examples:
– AND r0, r1, r2
– BICEQ r2, r3, #7
– EORS r1,r3,r0
Multiplication Instructions
• The Basic ARM provides two multiplication instructions.
• Multiply
– MUL{<cond>}{S} Rd, Rm, Rs ; Rd = Rm * Rs
• Multiply Accumulate - does addition for free
– MLA{<cond>}{S} Rd, Rm, Rs,Rn ; Rd = (Rm * Rs) + Rn
• Restrictions on use:
– Rd and Rm cannot be the same register
• Can be avoided by swapping Rm and Rs around. This works because
multiplication is commutative.
– Cannot use PC.
These will be picked up by the assembler if overlooked.
• Operands can be considered signed or unsigned
– Up to user to interpret correctly.
• The multiply form of the instruction gives Rd:=Rm*Rs. Rn is
ignored, and should be set to zero for compatibility with
possible future upgrades to the instruction set.
Multiplication Implementation
• The ARM makes use of Booth’s Algorithm to perform integer
multiplication.
• On non-M ARMs this operates on 2 bits of Rs at a time.
– For each pair of bits this takes 1 cycle (plus 1 cycle to start with).
– However when there are no more 1’s left in Rs, the multiplication will
early-terminate.
• Example: Multiply 18 and -1 : Rd = Rm * Rs
– Rm = 18, Rs = -1 (0xFFFFFFFF, all 32 bits set): 17 cycles
– Rm = -1, Rs = 18 (0x00000012, only the low 5 bits used): 4 cycles
• Note: Compiler does not use early termination criteria to
decide on which order to place operands.
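The two cycle counts in the example follow directly from the stated rule (one start-up cycle, then one cycle per pair of Rs bits until no 1s remain). The Python sketch below models just that rule; `mul_cycles_non_m` is an illustrative name, and the result for Rs = 0 is an assumption of this simple model.

```python
def mul_cycles_non_m(rs):
    """Estimate multiply cycles on a non-M ARM: 1 start-up cycle plus
    1 cycle per 2 bits of Rs, stopping early once Rs has no 1s left."""
    rs &= 0xFFFFFFFF        # treat Rs as a 32-bit pattern
    cycles = 1              # the initial start-up cycle
    while rs != 0:
        rs >>= 2            # consume one pair of bits per cycle
        cycles += 1
    return cycles
```

With Rs = -1 every pair of bits must be processed, giving 17 cycles; with Rs = 18 only three pairs contain set bits, giving 4 cycles. This is why placing the smaller-magnitude operand in Rs is faster.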
Extended Multiply Instructions
• M variants of ARM cores contain extended multiplication
hardware. This provides three enhancements:
– An 8 bit Booth’s Algorithm is used
• Multiplication is carried out faster (maximum for standard instructions
is now 5 cycles).
– Early termination method improved so that now completes
multiplication when all remaining bit sets contain
• all zeroes (as with non-M ARMs), or
• all ones.
Thus the previous example would early terminate in 2 cycles in
both cases.
– 64 bit results can now be produced from two 32bit operands
• Higher accuracy.
• Pair of registers used to store result.
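The improved early termination can be sketched the same way. This Python model assumes one start-up cycle plus one cycle per 8 bits of Rs, stopping as soon as the remaining upper bits are all zeroes or all ones; `mul_cycles_m` is an illustrative name and the model covers only the standard (32-bit result) multiply.

```python
def mul_cycles_m(rs):
    """Estimate MUL cycles on an M-variant core: 1 start-up cycle plus
    m cycles, where m is the number of 8-bit groups of Rs that must be
    processed before the rest is all zeroes or all ones."""
    rs &= 0xFFFFFFFF
    for m in range(1, 4):
        upper = rs >> (8 * m)                    # bits still unprocessed
        all_ones = (1 << (32 - 8 * m)) - 1       # same width, every bit set
        if upper == 0 or upper == all_ones:
            return 1 + m                         # early termination
    return 5                                     # all four groups needed
```

Both operands of the earlier example (18 and -1) terminate after the first group, so either ordering takes 2 cycles, and the worst case is 5 cycles.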
Multiply-Long and
Multiply-Accumulate Long
• Instructions are
– MULL which gives RdHi,RdLo:=Rm*Rs
– MLAL which gives RdHi,RdLo:=(Rm*Rs)+RdHi,RdLo
• However, the full 64 bits of the result now matter (the lower-precision
multiply instructions simply throw the top 32 bits away)
– Need to specify whether operands are signed or unsigned
• Therefore the syntax of the new instructions is:
– UMULL{<cond>}{S} RdLo,RdHi,Rm,Rs
– UMLAL{<cond>}{S} RdLo,RdHi,Rm,Rs
– SMULL{<cond>}{S} RdLo, RdHi, Rm, Rs
– SMLAL{<cond>}{S} RdLo, RdHi, Rm, Rs
• Not generated by the compiler.
Warning : Unpredictable on non-M ARMs.
Operand restrictions
• R15 must not be used as an operand or as a destination
register.
• RdHi, RdLo, and Rm must all specify different registers.
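Why the long multiplies need signed and unsigned forms is easiest to see with concrete values. The following Python sketch models the arithmetic only; the function names mirror the mnemonics but are otherwise illustrative, and each function returns the pair (RdLo, RdHi).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def umull(rm, rs):
    """Unsigned 32 x 32 -> 64-bit multiply; returns (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32


def smull(rm, rs):
    """Signed 32 x 32 -> 64-bit multiply; returns (RdLo, RdHi)."""
    product = (to_signed32(rm) * to_signed32(rs)) & MASK64
    return product & MASK32, product >> 32


def umlal(rdlo, rdhi, rm, rs):
    """Unsigned multiply-accumulate into the 64-bit value RdHi:RdLo."""
    acc = ((rdhi & MASK32) << 32) | (rdlo & MASK32)
    total = (acc + (rm & MASK32) * (rs & MASK32)) & MASK64
    return total & MASK32, total >> 32
```

For Rm = 0xFFFFFFFF and Rs = 2 the low word is 0xFFFFFFFE in both cases, but the high word is 1 for UMULL (4294967295 * 2) and 0xFFFFFFFF for SMULL (-1 * 2). The 32-bit MUL only keeps the low word, which is why it does not need to know the signedness.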
Data Transfer
• ARM is a load/store architecture
• Involves
-Load data from memory to register
-Store data from register into memory
• ARM has three types of load/store instructions
-LDR/STR
-LDM/STM
-SWP
Types of load/store instructions
Simple load/store has options like the following
• LDR/STR involved in storing/loading words(32 bits)
• LDRB/STRB involved with a byte transfer
• In ARM v4 we also have support for halfwords (16 bits) and signed loads
LDRH/STRH halfword transfer without sign extension
LDRSB/LDRSH load byte/halfword with sign extension (loads only;
there is no signed store)
• Condition codes can also be suffixed
LDREQB/STREQB
• General syntax looks somewhat like..
<LDR|STR>{<cond>}{<size>} Rd, <address>
Base Register
• STR r0,[r1] Stores the contents of r0 at the address contained in r1
LDR r2,[r1] Loads the contents of the address contained in r1 into r2
Example: r0 (source register for STR) = 0x5, r1 (base register) = 0x200
After STR r0,[r1]: memory at 0x200 = 0x5
After LDR r2,[r1]: r2 (destination register for LDR) = 0x5
Off set from the base register
• ARM also supports accessing locations pointed out as an
offset from the base register
• The offset can be
An unsigned 12-bit immediate value (0-4095)
A register, optionally shifted
• Option exists for ‘+’ or ‘-’ from base register
• Offset can be applied
- before the transfer is made (pre-indexed)
optionally auto-increments the base register by using ‘!’
- after the transfer is made (post-indexed)
base register is always auto-incremented
Pre-Indexed Addressing
• Example: STR r0,[r1,#12]
r0 (source register for STR) = 0x5, r1 (base register) = 0x200, offset = 12
The value 0x5 is stored at address 0x20c; r1 remains 0x200
• Offset value can as well be -12 (STR r0,[r1,#-12])
• To perform auto increment on base reg: STR r0,[r1,#12]!
- updates base register to value 0x20C
• If r2 contains 3 then this will yield the same result
STR r0,[r1,r2,LSL #2]
• Useful if only a particular element is to be accessed
Post Indexed Addressing
• Example: STR r0,[r1],#12
r0 (source register for STR) = 0x5, original r1 (base register) = 0x200, offset = 12
The value 0x5 is stored at address 0x200; r1 is then updated to 0x20c
• If r2 contains 3 then this will also yield the same result
STR r0,[r1],r2,LSL #2
• Useful if traversal is required through elements
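The difference between pre-indexed and post-indexed stores can be sketched with a toy model. In this Python sketch, memory is a dictionary keyed by address and registers are a dictionary keyed by name; `store_word` and its parameters are illustrative, not part of any ARM tool.

```python
def store_word(memory, regs, rd, rn, offset, pre_index=True, writeback=False):
    """Model STR rd,[rn,#offset]{!} (pre-indexed) or STR rd,[rn],#offset
    (post-indexed). `memory` and `regs` are updated in place."""
    base = regs[rn]
    if pre_index:
        address = (base + offset) & 0xFFFFFFFF   # offset applied first
        memory[address] = regs[rd]
        if writeback:                            # the '!' suffix
            regs[rn] = address
    else:
        memory[base] = regs[rd]                  # store at the old base
        regs[rn] = (base + offset) & 0xFFFFFFFF  # base always updated
    return memory, regs
```

Starting from r0 = 0x5 and r1 = 0x200, the pre-indexed form with offset 12 writes to 0x20c and leaves r1 alone unless write-back is requested, whereas the post-indexed form writes to 0x200 and then moves r1 to 0x20c. A register offset such as `r2, LSL #2` with r2 = 3 corresponds to passing `regs["r2"] << 2`, which is also 12.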
For half words/signed byte access
• Instructions can be used in much the same way except
- the offset value is restricted to 8 bits(0-255)
- the registers cannot be shifted
LDM/STM (Block data transfer)
• Allow for transfer between 1-16 registers to or from memory
• The transferred registers can be:
- Any subset of the current bank of registers (default).
- Any subset of the user mode bank of registers when in a
privileged mode (postfix instruction with a ‘^’).
Block Data Transfer
• Base register determines where memory access can occur
• Base register can be updated after data transfer by suffixing a
‘!’
• These instructions are useful for
- Saving and restoring context
- moving large chunks of data to/from memory
Block Data Transfer
• One use of stacks is to temporarily create register space for
subroutines
STMFD sp!,{r0-r12, lr} ; stack all registers
........ ; and the return address
........
LDMFD sp!,{r0-r12, pc} ; load all the registers
; and return automatically
• If the pop instruction also had the ‘S’ bit set (using ‘^’) then
the transfer of the PC when in a privileged mode would also
cause the SPSR to be copied into the CPSR (see exception
handling module).
Direct functionality Of Block Data Transfer
• When not being used for a stack operation these instructions
can also be used in a generic way
• The LDM/STM support a further set of instructions
– STMIA / LDMIA : Increment After
– STMIB / LDMIB : Increment Before
– STMDA / LDMDA : Decrement After
– STMDB / LDMDB : Decrement Before
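The four addressing modes differ only in which addresses are touched and where the base ends up after write-back. The Python sketch below computes both; `block_addresses` is an illustrative helper, and it relies on the architectural rule that the lowest-numbered register always uses the lowest address.

```python
def block_addresses(base, count, mode):
    """Return (addresses, new_base) for an LDM/STM of `count` registers.
    Addresses are listed lowest first, matching the lowest-numbered
    register; new_base is the value written back when '!' is used."""
    size = 4 * count
    if mode == "IA":        # Increment After
        start, new_base = base, base + size
    elif mode == "IB":      # Increment Before
        start, new_base = base + 4, base + size
    elif mode == "DA":      # Decrement After
        start, new_base = base - size + 4, base - size
    elif mode == "DB":      # Decrement Before
        start, new_base = base - size, base - size
    else:
        raise ValueError("mode must be IA, IB, DA or DB")
    return [start + 4 * i for i in range(count)], new_base
```

For a base of 0x1000 and three registers, IA uses 0x1000 to 0x1008 and DB uses 0xFF4 to 0xFFC. The stack forms used earlier map onto these: STMFD is STMDB and LDMFD is LDMIA, so a push followed by a pop returns the stack pointer to its starting value.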
Swap Instruction
• The instruction is used to swap data between a register and a
memory location
• This instruction is atomic (cannot be interrupted)
• The swap address is determined by the contents of the base
register (Rn).
• The processor first reads the contents of the swap address.
Then it writes the contents of the source register (Rm) to the
swap address, and stores the old memory contents in the
destination register (Rd).
• The same register may be specified as both the source and
destination
Branch and Branch with Link
• Branch instructions contain a signed 2’s complement 24 bit offset.
• This is shifted left two bits, sign extended to 32 bits, and added to
the PC.
• The instruction can therefore specify a branch of +/- 32Mbytes.
• The branch offset must take account of the prefetch operation,
which causes the PC to be 2 words (8 bytes) ahead of the current
instruction.
• Branches beyond +/- 32Mbytes must use an offset or absolute
destination which has been previously loaded into a register. In this
case the PC should be manually saved in R14 if a Branch with Link
type operation is required.
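The offset arithmetic described above (PC is 8 bytes ahead, offset stored in words as a signed 24-bit field) can be checked with a short model. This is a Python sketch; `encode_branch_offset` and `decode_branch_target` are illustrative names and only the 24-bit offset field is modelled, not the full instruction word.

```python
def encode_branch_offset(branch_addr, target_addr):
    """Return the 24-bit offset field for a B/BL at branch_addr that
    jumps to target_addr."""
    offset = target_addr - (branch_addr + 8)     # PC is 2 words ahead
    if offset % 4 != 0:
        raise ValueError("target must be word aligned")
    words = offset // 4                          # field counts words
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("target is beyond +/- 32 Mbytes")
    return words & 0xFFFFFF                      # two's complement field


def decode_branch_target(branch_addr, field):
    """Recover the branch target from the 24-bit offset field."""
    if field & 0x800000:                         # sign extend 24 bits
        field -= 1 << 24
    return (branch_addr + 8 + (field << 2)) & 0xFFFFFFFF
```

A branch to itself needs an offset of -8 bytes, which encodes as 0xFFFFFE, and a branch two instructions ahead encodes as 0 because the prefetch already accounts for those 8 bytes.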
Link Bit
• Branch with Link (BL) writes the old PC into the link register
(R14) of the current bank.
• The PC value written into R14 is adjusted to allow for the
prefetch, and contains the address of the instruction following
the branch and link instruction.
• The CPSR is not saved with the PC
Barrel Shifter
• A barrel shifter is a digital circuit that can shift a data word by
a specified number of bits in one clock cycle.
• It can be implemented as a sequence of multiplexers (mux.),
and in such an implementation the output of one mux is
connected to the input of the next mux in a way that depends
on the shift distance.
• A barrel shifter is often implemented as a cascade of parallel
2×1 multiplexers.
Using the Barrel Shifter
• There are 2 options for shifting
- shift amount held in the bottom byte of a register
- shift amount given as a 5-bit unsigned integer
Shift Operations
• Logical Shift Right (LSR)
• Shifts right without preserving sign bit
• Zeros are shifted in at the top: 0 -> Destination -> CF
• Arithmetic Shift Right (ASR)
• Preserves the sign bit
• Sign bit shifted in at the top: sign bit -> Destination -> CF
Rotate
• Rotate Right (ROR)
Same as ASR but the bits wrap around as they rotate
The rotated bit also used as carry flag
Destination -> CF, with the bit leaving the LSB re-entering at the MSB
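The three right-shift operations differ only in what is fed into the top of the word. The Python sketch below models them for shift amounts from 1 to 31 and returns the shifted value together with the carry out (the last bit shifted out); the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def _check(amount):
    if not 1 <= amount <= 31:
        raise ValueError("this sketch handles shift amounts 1..31 only")


def lsr(value, amount):
    """Logical shift right: zeros enter at the top."""
    _check(amount)
    value &= MASK32
    carry = (value >> (amount - 1)) & 1      # last bit shifted out
    return value >> amount, carry


def asr(value, amount):
    """Arithmetic shift right: copies of the sign bit enter at the top."""
    _check(amount)
    value &= MASK32
    carry = (value >> (amount - 1)) & 1
    result = value >> amount
    if value >> 31:                          # negative: fill with ones
        result |= (MASK32 << (32 - amount)) & MASK32
    return result, carry


def ror(value, amount):
    """Rotate right: bits leaving the LSB re-enter at the MSB."""
    _check(amount)
    value &= MASK32
    result = ((value >> amount) | (value << (32 - amount))) & MASK32
    return result, result >> 31              # carry = last bit rotated
```

Shifting 0x80000001 right by one gives 0x40000000 with LSR but 0xC0000000 with both ASR (sign preserved) and ROR (bit 0 wraps to bit 31); all three set the carry to 1.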
Comparison
• The only effect of the comparisons is to
– UPDATE THE CONDITION FLAGS. Thus no need to set S bit.
• Operations are:
– CMP operand1 - operand2, but result not written
– CMN operand1 + operand2, but result not written
– TST operand1 AND operand2, but result not written
– TEQ operand1 EOR operand2, but result not written
• Syntax:
– <Operation>{<cond>} Rn, Operand2
• Examples:
– CMP r0, r1
– TSTEQ r2, #5
• 3-stage pipeline organization
– Principal components
• The register bank
• The barrel shifter
– Can shift or rotate one operand by any number of bits
• The ALU
• The address register and incrementer
– Select and hold all memory addresses and generate
sequential addresses
• The data registers
• The instruction decoder and associated control logic
• Fetch - The instruction is fetched from memory and placed in the
instruction pipeline
• Decode - The instruction is decoded and the datapath control signals
prepared for the next cycle
• Execute - The register bank is read, an operand shifted, the ALU
result generated and written back into destination register
• At any time slice, 3 different instructions may occupy
each of these stages, so the hardware in each stage has
to be capable of independent operations
• When the processor is executing data processing
instructions, the latency = 3 cycles and the throughput
= 1 instruction/cycle
• Drawback: Every data transfer instruction causes a
pipeline “stall”. (Single memory for data and
instruction- next instruction cannot be fetched while
data is being read)
5-stage Pipeline Organization
• Implemented in ARM9TDMI
• Tprog = Ninst * CPI / fclk
– Tprog: the time taken to execute a given program
– Ninst: the number of ARM instructions executed in
the program (compiler dependent)
– CPI: average number of clock cycles per
instruction (hazards cause pipeline stalls, which raise CPI)
– fclk: frequency
• Fetch
– The instruction is fetched from memory and placed in the
instruction pipeline
• Decode
– The instruction is decoded and register operands read from the
register files. There are 3 operand read ports in the register file
so most ARM instructions can source all their operands in one cycle
• Execute
– An operand is shifted and the ALU result generated. If the
instruction is a load or store, the memory address is computed
in the ALU
• Buffer/Data
– Data memory is accessed if required. Otherwise the ALU result
is simply buffered for one cycle.
• Write back
– The results generated by the instruction are written back to the
register file, including any data loaded from memory.
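The relation Tprog = Ninst * CPI / fclk given for this pipeline can be evaluated directly to see how stalls affect run time. The Python sketch below uses made-up numbers purely for illustration; `program_time` is an illustrative name.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Return the execution time in seconds for n_inst instructions at an
    average of cpi clock cycles each, on a clock of f_clk_hz hertz."""
    return n_inst * cpi / f_clk_hz


# Hypothetical figures: one million instructions on a 100 MHz clock.
ideal = program_time(1_000_000, 1.0, 100e6)      # no stalls
stalled = program_time(1_000_000, 1.5, 100e6)    # stalls raise CPI to 1.5
```

With these assumed values the ideal run takes 0.010 s and the stalled run 0.015 s, so reducing stalls (lower CPI) or raising the clock rate are the two hardware routes to a shorter Tprog.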
5-stage pipeline organization
• Moved the register read step from the execute
stage to the decode stage
• Execute stage was split into 3 stages: ALU,
memory access, and write back.
• Result: Better balanced pipeline with
minimized latencies between stages, which
can run at a faster clock speed.
Pipeline Hazards
• There are situations, called hazards, that prevent the
next instruction in the instruction stream from being
executed during its designated clock cycle. Hazards
reduce the performance from the ideal speedup
gained by pipelining.
• There are three classes of hazards:
– Structural Hazards
– Data Hazards
– Control Hazards
Structural Hazards
• When a machine is pipelined, the overlapped
execution of instructions requires pipelining of
functional units and duplication of resources
to allow all possible combinations of
instructions in the pipeline.
• If some combination of instructions cannot be
accommodated because of a resource conflict,
the machine is said to have a structural
hazard.
• Ex. A machine has a single memory shared by data and
instructions. As a result, when an instruction contains a
data-memory reference (load), it will conflict with the
instruction fetch for a later instruction (instr 3):
Solution
• To resolve this, we stall the pipeline for one clock
cycle when a data-memory access occurs. The effect
of the stall is actually to occupy the resources for
that instruction slot. The following table shows how
the stall is actually implemented.
Solution
• Another solution is to use separate instruction
and data memories.
• ARM has moved from the von-Neumann
architecture to the Harvard architecture in
ARM9.
– Implemented a 5-stage pipeline and separate data
and instruction memory.
– Doesn’t suffer from this hazard.
Data Hazards
• They arise when an instruction depends on the result of a
previous instruction in a way that is exposed by the
overlapping of instructions in the pipeline.
• The problem with data hazards can be solved with a
hardware technique called data forwarding (by making
use of feedback paths).
• Without forwarding, the pipeline would have to be
stalled to get the results from the respective registers
• Example:
Data Hazards
• The first forwarding is for value of R1 from EXadd to EXsub.
• The second forwarding is also for value of R1 from MEMadd to EXand.
• This code now can be executed without stalls.
• Forwarding can be generalized to include passing the result directly
to the functional unit that requires it: a result is forwarded from the
output of one unit to the input of another, rather than just from the
result of a unit to the input of the same unit.
Control Hazards
• They arise from the pipelining of branches and other
instructions that change the PC.