1. Advanced Pipelining in ARM
Processors
Prof. J.K.Das
School of Electronics Engineering
KIIT Deemed to be University
2. Pipelining: Review of the Basics
• The ARM processor is a RISC design but differs from pure RISC in some features
(variable execution cycles for certain instructions, an inline barrel shifter, the
Thumb instruction set, conditional execution, DSP instructions)
• Regular ARM Architecture:
Load/Store Architecture
Uniform Register Array
Fixed Length 32-bit instructions
3-address Instructions.
• System Speed: Latency and Throughput
Latency: Time required for a single instruction to pass through the system from start to
end.
Throughput: No. of instructions that can be completed in 1 machine cycle.
Speed rises with throughput and falls with latency (Speed ∝ Throughput, Speed ∝ 1/Latency)
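As a back-of-the-envelope illustration of latency versus throughput, the following Python sketch compares an unpipelined machine with an idealized k-stage pipeline; the function name and the 4-stage/100-instruction numbers are illustrative assumptions, not ARM figures:

```python
# Idealized timing for a k-stage pipeline (illustrative numbers, not ARM-specific).

def pipeline_cycles(n_instructions, k_stages):
    """Cycles to finish n instructions on a k-stage pipeline with no hazards:
    k cycles to fill the pipe, then one instruction completes per cycle."""
    return k_stages + n_instructions - 1

unpipelined = 100 * 4                      # 100 instructions, 4 cycles each, one at a time
pipelined = pipeline_cycles(100, 4)
print(unpipelined, pipelined)              # 400 vs 103 cycles
print(round(unpipelined / pipelined, 2))   # speedup ~ 3.88
```

Note that each instruction still takes 4 cycles of latency; only the completion rate (throughput) improves.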
3. Pipelining: Overview
• Mechanism to speed up the regular execution by fetching the next
instruction while the present instruction is being decoded and executed.
• Induces Parallelism- executing several instructions at a time.
• Pipelining (Temporal Parallelism) should ideally increase throughput
without any penalty in latency.
• Pipelining divides the instruction cycle into multiple stages:
Fetch → Decode → Execute → Memory operations
Disadvantage: If the level of pipelining is increased and an instruction
spends more time in the pipeline, data dependencies start to surface.
*(Dependency: the execution of the present instruction depends on intermediate results from some previous instruction.)
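The stage overlap described above can be visualized with a small sketch; the stage letters F/D/E/M and the `diagram` helper are illustrative assumptions:

```python
STAGES = ["F", "D", "E", "M"]   # Fetch, Decode, Execute, Memory

def diagram(n_instructions):
    """Return one row per instruction showing which cycle it enters each stage.
    Each instruction starts one cycle after the previous one."""
    rows = []
    for i in range(n_instructions):
        rows.append("  " * i + "".join(f"{s} " for s in STAGES))
    return rows

for row in diagram(3):
    print(row)
# F D E M
#   F D E M
#     F D E M
```

Reading a column top to bottom shows several instructions in flight in the same cycle, each in a different stage.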
4. Pipelining in ARM
• ARM implements different pipeline stages in its architectures.
ARM 7- 3 stage Pipeline
ARM 9- 5 stage Pipeline
ARM 10- 6 stage Pipeline
ARM 11- 7 stage Pipeline
5. Problems in the 5-Stage Pipeline
• Ideally IPC = 1 when pipelining is
implemented
• In the case of complex branching instructions:
Control Hazards: the PC value is modified,
resulting in a pipeline FLUSH
Data Hazards: NOPs are inserted and new
instructions are loaded from the branch
target because of the FLUSH
Interrupt execution leads to replacing the
instructions already present in the pipeline
with those fetched via the IVT (Interrupt
Vector Table)
Solutions:
1) Data Forwarding: passing a result from a later pipeline stage directly back to an earlier one, so the next instruction has the data it needs without waiting for write-back
2) Branch Prediction: predicting the outcome of branching instructions
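Solution 1 can be sketched as a hazard check: detect when one instruction reads a register that the previous instruction writes, which is exactly the read-after-write case forwarding resolves. The tuple encoding and the `needs_forwarding` helper are illustrative assumptions, not an ARM interface:

```python
# Minimal sketch of RAW-hazard detection between adjacent instructions.
# An instruction is modeled as (dest_reg, src_regs).

def needs_forwarding(producer, consumer):
    """True if the consumer reads the register the producer writes (RAW hazard):
    the ALU result must be forwarded so the consumer need not stall for WB."""
    dest, _ = producer
    _, srcs = consumer
    return dest is not None and dest in srcs

add = ("r1", ["r2", "r3"])    # ADD r1, r2, r3
sub = ("r4", ["r1", "r5"])    # SUB r4, r1, r5 -- reads r1 before WB completes
print(needs_forwarding(add, sub))   # True: forward the ALU result, no stall
```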
6. 6-Stage Pipelining
• Usually present in the ARM 10 architecture
• An additional ISSUE stage is added – an instruction takes a total of 6 cycles to complete
• The Issue stage checks whether the instruction is ready to be decoded in the current cycle or not
• If the instruction is not ready, the next instruction in the pipeline is allowed to
start processing in the available time gap (a limited form of out-of-order execution)
• A branch prediction mechanism has been introduced to improve throughput
• Reduces processor stalls by resolving hazards
• Throughput is almost double that of ARM 7, but latency is compromised (trade-off)
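The slides do not specify which prediction scheme ARM 10 uses; as a generic illustration, a classic 2-bit saturating-counter predictor can be sketched as follows (all names are illustrative assumptions):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 1                       # start in "weakly not-taken"

    def predict(self):
        return self.state >= 2               # True = predict taken

    def update(self, taken):
        # Saturate at 0 and 3 so a single unusual outcome does not flip the prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for taken in [True, True, True]:             # a loop branch, taken repeatedly
    p.update(taken)
print(p.predict())                           # True: saturated at "strongly taken"
p.update(False)                              # one not-taken outcome (loop exit)
print(p.predict())                           # still True: hysteresis absorbs it
```

The two-bit hysteresis is what lets a loop branch stay predicted taken across the single not-taken outcome at loop exit.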
7. Details of 6-Stage Pipelining:
Branch Target Buffer
• Conditional branch instructions delay operation because it takes time to
evaluate the condition and determine the branch address
• Solution: predict the branch statically
• Prediction requires branch target calculation, which can itself
induce delays
• BTBs (Branch Target Buffers) are used to reduce delay in the
pipeline and make it efficient
• A BTB is essentially a simple cache memory, which should be
large enough to maintain the throughput of the pipeline
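A minimal model of a BTB as a small direct-mapped cache, assuming PC-indexed entries that hold a tag and a predicted target address; organization details vary between real cores, and all names here are illustrative:

```python
class BTB:
    """Branch Target Buffer sketch: direct-mapped, indexed by low PC bits,
    tagged with the full PC so aliasing entries are rejected."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}                       # index -> (tag, target)

    def lookup(self, pc):
        entry = self.table.get(pc % self.entries)
        if entry and entry[0] == pc:
            return entry[1]                   # hit: fetch from the predicted target
        return None                           # miss: keep fetching sequentially

    def update(self, pc, target):
        # Record the resolved target once the branch has executed.
        self.table[pc % self.entries] = (pc, target)

btb = BTB()
btb.update(0x8000, 0x9000)                    # branch at 0x8000 jumps to 0x9000
print(hex(btb.lookup(0x8000)))                # 0x9000: target known next time
print(btb.lookup(0x8004))                     # None: not a known branch
```

On a hit, the fetch stage can redirect to the cached target immediately instead of waiting for the target to be computed, which is how the BTB removes the delay mentioned above.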
8. 7-Stage Pipelining
• State-of-the-art pipelining mechanism used in ARM 11 and above.
• Implements data forwarding along with branch prediction.
• Stage 1: IT The Instruction Translate (IT) stage uses the TLB to translate the (virtual) PC address into a physical instruction
address. A TLB miss can occasionally occur, but you can safely ignore this possibility here.
• Stage 2: IF The Instruction Fetch (IF) stage uses the physical instruction address computed in the IT stage to access the cache, and
fetches the 32-bit instruction stored at that address. You can safely ignore the possibility of a cache miss.
• Stage 3: ID The Instruction Decode (ID) stage first decodes the 32-bit instruction (e.g., identifying the opcode field, the rs, rt, and
rd fields, etc.). In the second half of the clock cycle, it reads the register file. It also computes the (virtual) target address, if the
instruction is a jump (j) or a branch (beq/bne).
• Stage 4: EX The Execute (EX) stage does the necessary ALU operations for the instruction. For branches, this includes resolving
the branch decision (taken or not-taken). For lw/sw instructions, the ALU computes the (virtual) data address from/to which data
is to be read/written.
• Stage 5: MT The Memory Translate (MT) stage translates the virtual data address into a physical data address using the TLB, if
the instruction is a lw or sw. As in the IT stage, you can safely ignore the possibility of a TLB miss.
• Stage 6: MM The Memory (MM) stage uses the physical data address computed in the MT stage to access the cache if the
instruction is a lw (data is read from the cache into the rt register) or a sw (data in the rt register is written to the cache). You can
safely ignore the possibility of a cache miss.
• Stage 7: WB The Write Back (WB) stage updates the register file (if necessary) in the first half of the clock cycle.
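The seven stages above can be laid out cycle by cycle; this sketch (assuming one instruction issued per cycle and no stalls, with the `trace` helper as an illustrative name) shows that after the pipeline fills, back-to-back instructions retire one cycle apart:

```python
STAGES = ["IT", "IF", "ID", "EX", "MT", "MM", "WB"]

def trace(n_instructions):
    """Map instruction i to the cycle number in which it occupies each stage,
    with instruction i entering IT in cycle i + 1."""
    return {i: {s: i + c + 1 for c, s in enumerate(STAGES)}
            for i in range(n_instructions)}

t = trace(2)
print(t[0]["IT"], t[0]["WB"])   # 1 7: the first instruction fills the pipeline
print(t[1]["WB"])               # 8: the next instruction retires one cycle later
```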