The document summarizes key aspects of the P6 microarchitecture used in processors like the Pentium Pro, Pentium II, and Pentium III. It describes the system architecture with separate front-side and back-side buses. It then details the instruction fetch, decode, register renaming, out-of-order execution, memory handling, and retirement stages of the processor pipeline. Diagrams illustrate the branch prediction, reservation stations, reorder buffer, and memory order buffer components that enable speculative and out-of-order execution in the P6.
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, NetBurst, Pentium 4
1. ECE 4100/6100
Advanced Computer Architecture
Lecture 12 P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
3. 3
P6 Microarchitecture
[Block diagram. Components: BTB/BAC, Instruction Fetch Unit, Bus interface unit, Instruction Decoders, Microcode Sequencer, Register Alias Table, Allocator, Reservation Station, ROB & Retire RF, AGU, MMX, IEU/JEU, FEU, MIU, Memory Order Buffer, Data Cache Unit (L1), External bus, Chip boundary. Arrows: Control Flow and (Restricted) Data Flow. Clusters: Instruction Fetch Cluster, Issue Cluster, Out-of-order Cluster, Memory Cluster, Bus Cluster.]
4. 4
Pentium III Die Map
EBL/BBL – External/Backside Bus logic
MOB - Memory Order Buffer
Packed FPU - Floating Point Unit for SSE
IEU - Integer Execution Unit
FAU - Floating Point Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit (L1)
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Floating Point unit
RS - Reservation Station
BTB - Branch Target Buffer
TAP – Test Access Port
IFU - Instruction Fetch Unit and L1 I-Cache
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer
5. 5
P6 Basics
• One implementation of IA32 architecture
• Deeply pipelined processor
• In-order front-end and back-end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
– Pentium Pro
– Pentium II (PPro + MMX + 2x caches)
– Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
– Pentium 4 (Not P6, will be discussed separately)
– Pentium M (+SSE2, SSE3, µop fusion)
– Core (PM + SSSE3, SSE4, Intel 64 (EM64T), MacroOp fusion, 4 µop retire rate vs. 3 in earlier proliferations)
7. 7
Instruction Fetching Unit
• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder, mark instruction boundaries, BTB makes prediction (2 cycles)
• IFU3: Align instructions to 3 decoders in 4-1-1 format
[Figure: IFU datapath. Components: Next PC mux (inputs include other fetch requests), linear address, Branch Target Buffer, Instruction Cache, Victim Cache, Streaming Buffer, Instruction TLB (physical address), select mux, ILD (length marks), prediction marks, instruction rotator, instruction buffer, #bytes consumed by ID.]
9. 9
Dynamic Branch Prediction
• Similar to a 2-level PAs design
• Associated with each BTB entry
• W/ 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to 16-byte fetch per cycle)
• Speculative update (2 copies of BHR)
• Static prediction provided by Branch Address Calculator when BTB misses (see prior slide)
[Figure: a 512-entry BTB (ways W0 to W3) whose entries hold a Branch History Register (BHR); the BHR indexes Pattern History Tables (PHT, entries 0000 to 1111) of 2-bit saturating counters to produce the prediction. The branch result Rc is shifted into the BHR as a speculative update, e.g. new (spec) history 1101.]
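The two-level scheme on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: a 4-bit per-branch history register indexing one 16-entry table of 2-bit saturating counters, with hypothetical names (`TwoLevelEntry`, `run_pattern`). The BTB sets and ways, the return stack and the second (speculative) BHR copy are not modeled.

```python
class TwoLevelEntry:
    """One hypothetical BTB entry: a 4-bit BHR indexing 16 two-bit counters."""

    HISTORY_BITS = 4

    def __init__(self):
        self.bhr = 0                                 # branch history register
        self.pht = [1] * (1 << self.HISTORY_BITS)    # counters start weakly not-taken

    def predict(self):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        # Train the counter selected by the current history,
        # then shift the branch result into the history register.
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, counter + 1) if taken else max(0, counter - 1)
        mask = (1 << self.HISTORY_BITS) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask


def run_pattern(pattern, repeats):
    """Replay a repeating taken/not-taken pattern; return mispredictions in the last pass."""
    entry = TwoLevelEntry()
    misses = 0
    for _ in range(repeats):
        misses = 0
        for taken in pattern:
            if entry.predict() != taken:
                misses += 1
            entry.update(taken)
    return misses
```

For a loop branch that is taken three times and then falls through, `run_pattern([True, True, True, False], 20)` reports no mispredictions in the final pass, because each 4-bit history maps to exactly one outcome once the counters are trained.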
10. 10
X86 Instruction Decode
• 4-1-1 decoder
• Decode rate depends on instruction alignment
• DEC1: translate x86 into micro-operations (µops)
• DEC2: move decoded µops to ID queue
• MS performs translation in one of two ways
– Generate entire µop sequence from the “microcode ROM”
– Receive 4 µops from complex decoder, and the rest from microcode ROM
• Instructions that follow an instruction needing the MS are flushed
[Figure: a 16-byte instruction buffer feeds one complex decoder (1-4 µops) and two simple decoders (1 µop each), backed by the micro-instruction sequencer (MS); decoded µops enter the instruction decoder queue (6 µops) and go on to RAT/ALLOC.]
Next 3 inst    #Inst to dec
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1
S: Simple, C: Complex
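The decode-rate table above follows from one rule: decoder 0 accepts any instruction, while decoders 1 and 2 accept only simple ones. A minimal Python sketch of that rule follows; the function name `decodable_count` and the string encoding are assumptions made for illustration.

```python
from itertools import product


def decodable_count(kinds):
    """kinds: 'S'/'C' markers for the next three instructions, in program order.

    Returns how many of them a 4-1-1 decoder template can take this cycle.
    """
    count = 1                    # decoder 0 takes a simple or a complex instruction
    for kind in kinds[1:3]:      # decoders 1 and 2 take simple instructions only
        if kind != "S":
            break                # a complex instruction must wait for decoder 0
        count += 1
    return count


# Rebuild the whole table, keyed by strings such as "SCS".
TABLE = {"".join(k): decodable_count(k) for k in product("SC", repeat=3)}
```

`TABLE` reproduces the eight rows of the slide, e.g. `TABLE["CSS"]` is 3 and `TABLE["SCS"]` is 1.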
11. 11
Register Alias Table (RAT)
• Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 µops per cycle
• 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify PSrc)
• RAT looks up physical ROB locations for renamed sources based on RRF bit
• Override logic is for dependent µops decoded in the same cycle
• Misprediction will revert all pointers to point to Retirement Register File (RRF)
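[Figure: RAT datapath: in-order queue, FP TOS adjust, FP RAT array, integer RAT array, logical sources in, int and FP overrides, physical ROB pointers from the Allocator, physical sources (PSrc) out.]
[Renaming example: a RAT with entries for EAX, EBX, ECX and EDX, each holding an RRF bit and a PSrc that points either into the ROB (e.g. entries 25, 2, 15) or to the RRF.]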
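The renaming flow on this slide can be sketched in Python as follows. This is a simplified model, not the P6 circuit: only the integer registers are covered, the class and method names are hypothetical, and the override logic is implied by renaming the µops of a group one after another.

```python
class RAT:
    """Toy integer register alias table with an RRF bit per entry."""

    REGS = ("EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP")
    ROB_SIZE = 40    # 40 ROB entries, so a PSrc fits in 6 bits

    def __init__(self):
        # Entry = (rrf_bit, psrc). rrf_bit 1: the value lives in the RRF.
        self.table = {reg: (1, None) for reg in self.REGS}
        self.next_pdst = 0           # next ROB slot (stands in for the Allocator)

    def lookup(self, reg):
        rrf_bit, psrc = self.table[reg]
        return ("RRF", reg) if rrf_bit else ("ROB", psrc)

    def rename(self, dst, srcs):
        # Sources are looked up before the destination mapping is replaced.
        psrcs = [self.lookup(src) for src in srcs]
        pdst = self.next_pdst
        self.next_pdst = (self.next_pdst + 1) % self.ROB_SIZE
        self.table[dst] = (0, pdst)
        return pdst, psrcs

    def retire(self, reg, pdst):
        # Point back at the RRF only if no younger µop has renamed reg since.
        if self.table[reg] == (0, pdst):
            self.table[reg] = (1, None)

    def flush(self):
        # Misprediction: every pointer reverts to the RRF.
        for reg in self.REGS:
            self.table[reg] = (1, None)
```

Renaming `EAX <- EBX` and then `ECX <- EAX, EDX` makes the second µop read EAX from ROB entry 0 and EDX from the RRF, which is the dependency the override logic has to resolve when both µops are renamed in the same cycle.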
12. 12
Partial Stalls due to RAT
• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register
– Because the read needs different partial pieces from multiple physical registers
• Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches
[Figure: EAX with its AX sub-register; AX is written, then EAX is read.]
Partial register stalls
    MOV AX, m8 ;
    ADD EAX, m32 ; stall
Idiom Fix (1)
    XOR EAX, EAX
    MOV AL, m8 ;
    ADD EAX, m32 ; no stall
Idiom Fix (2)
    SUB EAX, EAX
    MOV AL, m8 ;
    ADD EAX, m32 ; no stall
Partial flag stalls (1)
    CMP EAX, EBX
    INC ECX
    JBE XX ; stall
JBE reads both ZF and CF while INC affects (ZF, OF, SF, AF, PF), i.e. only ZF
Partial flag stalls (2)
    TEST EBX, EBX
    LAHF ; stall
LAHF loads low byte of EFLAGS while TEST writes partial of them
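The partial register stall rule and the two zeroing idioms can be expressed as a small checker. This is a rough sketch under strong assumptions: it tracks only the EAX/AX/AL family, treats every µop as still in flight, and uses a hypothetical instruction encoding `(mnemonic, dst, src)`. It illustrates the rule stated on the slide and is not a model of the real RAT.

```python
BITS = {"AL": 8, "AX": 16, "EAX": 32}


def partial_register_stalls(program):
    """Return the indices of instructions that would hit a partial register stall."""
    stalls = []
    pending = None     # width of the latest in-flight partial write, if any
    zeroed = False     # True after XOR/SUB reg,reg cleared the full register
    for index, (op, dst, src) in enumerate(program):
        if op in ("XOR", "SUB") and dst == src and BITS.get(dst) == 32:
            pending, zeroed = None, True      # zeroing idiom: no real source read
            continue
        reads = [src] if op == "MOV" else [dst, src]
        if pending is not None and any(BITS.get(r, 0) > pending for r in reads):
            stalls.append(index)              # wider read after a narrower write
        if BITS.get(dst) == 32:
            pending, zeroed = None, False     # full-width write
        elif dst in BITS and not zeroed:
            pending = BITS[dst]               # narrow write leaves a partial value
    return stalls
```

With the slide's sequences, `MOV AX, m8` followed by `ADD EAX, m32` reports a stall at index 1, while both idiom fixes report none.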
13. 13
Partial Register Width Renaming
• 32/16-bit accesses:
– Read from low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
– Write to both banks (AH/BH/CH/DH)
• 8-bit RAT accesses: update only the bank that is being written
[Figure: RAT datapath (in-order queue, FP TOS adjust, FP RAT array, integer RAT array, int and FP overrides, physical ROB pointers from the Allocator, physical sources out).]
Integer RAT array: INT Low Bank (32b/16b/L): 8 entries; INT High Bank (H): 4 entries
Entry format: Size(2) RRF(1) PSrc(6)
µop0: MOV AL = (a)
µop1: MOV AH = (b)
µop2: ADD AL = (c)
µop3: ADD AH = (d)
14. 14
Allocator (ALLOC)
• The interface between in-order and out-of-order pipelines
• Allocates into ROB, MOB and RS
– “3-or-none” µops per cycle into ROB and RS
• Must have 3 free ROB entries or no allocation
– “all-or-none” policy for MOB
• Stall allocation when not all the valid MOB µops can be allocated
• Generate physical destination token Pdst from the ROB and pass it to the Register Alias Table (RAT) and RS
• Stalls upon shortage of resources
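The two allocation policies can be written as a single check. The sketch below is an assumption-laden illustration: the function name, the µop kind labels and the idea that the RS needs the same three free entries as the ROB are all hypothetical, and each store is counted once through its STA µop.

```python
GROUP_SIZE = 3   # µops handled per allocation cycle


def can_allocate(uops, free_rob, free_rs, free_load, free_store):
    """uops: up to three µop kinds, each one of 'alu', 'load', 'sta' or 'std'."""
    # "3-or-none": the ROB (and, by assumption, the RS) must have room
    # for a full group, even if fewer µops are valid this cycle.
    if free_rob < GROUP_SIZE or free_rs < GROUP_SIZE:
        return False
    # "all-or-none" for the MOB: every valid memory µop needs a buffer entry.
    loads = uops.count("load")
    stores = uops.count("sta")
    return loads <= free_load and stores <= free_store
```

For example, a group with two loads stalls when only one load buffer entry is free, even though the ROB and RS have plenty of room.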
15. 15
Reservation Stations (RS)
• Gateway to execution: dispatches up to 5 µops per cycle, one to each port
• Port binding at dispatch time (certain µops can only be bound to one port)
• 20 µop entry buffer bridging the In-order and Out-of-order engine (32 entries in Core)
• RS fields include µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest first FIFO scheduling when multiple µops are ready at the same cycle
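[Figure: RS dispatch ports and write-back paths]
Port 0: IEU0, Fadd, Fmul, Imul, Div
Port 1: IEU1, JEU
Port 2: AGU0 (Ld addr, LDA)
Port 3: AGU1 (St addr, STA)
Port 4: St data (STD)
Packed FP units: Pfadd, Pfmul, Pfshuf
Results return to the RS and ROB over WB bus 0 and WB bus 1; loaded data comes back from the MOB/DCU; retired data moves from the ROB to the RRF.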
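Oldest-first selection with one µop per port can be sketched as below. The tuple layout `(uop_id, port, ready)` and the function name are assumptions for illustration; entries are assumed to be listed in allocation order, oldest first.

```python
def dispatch(entries):
    """Pick at most one ready µop per port, preferring the oldest.

    entries: list of (uop_id, port, ready) tuples, oldest first.
    Returns a dict mapping port number to the chosen uop_id.
    """
    chosen = {}
    for uop_id, port, ready in entries:
        # The first ready µop seen for a port is the oldest one bound to it.
        if ready and port not in chosen:
            chosen[port] = uop_id
    return chosen
```

With five ports, the returned dictionary never holds more than five µops, which matches the per-cycle dispatch limit on the slide.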
16. 16
ReOrder Buffer (ROB)
• A 40-entry circular buffer (96-entry in Core)
– 157-bit wide
– Provide 40 alias physical registers
• Out-of-order completion
• Deposit exception in each entry
• Retirement (or de-allocation)
– After resolving prior speculation
– Handle exceptions thru MS
– Clear OOO state when a mis-predicted branch or exception is detected
– 3 µops per cycle in program order
– For multi-µop x86 instructions: none or all (atomic)
[Figure: ALLOC, RAT, RS, ROB and RRF; exceptions (exp) and µcode assists are routed to the MS.]
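In-order retirement from a circular buffer with out-of-order completion can be sketched as follows. This is a simplified model with hypothetical field names (`id`, `done`); exception handling and the atomic retirement of multi-µop instructions are deliberately left out.

```python
from collections import deque


def retire(rob, width=3):
    """Retire up to `width` completed µops from the head of the ROB, in program order."""
    retired = []
    # Stop at the first µop that has not completed, even if younger ones have.
    while rob and len(retired) < width and rob[0]["done"]:
        retired.append(rob.popleft()["id"])
    return retired
```

A completed µop behind an unfinished one stays in the buffer until the older µop completes, which is what keeps the architectural state in program order.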
17. 17
Memory Execution Cluster
• Manage data memory accesses
• Address Translation
• Detect violation of access ordering
• Fill buffers (FB) in DCU, similar to MSHR for non-blocking cache support
[Figure: Memory Cluster. The RS / ROB issue LD, STA and STD µops; the cluster contains the DTLB, the DCU with its fill buffers (FB), the Load Buffer, the Store Buffer and the EBL.]
movl ecx, edi
addl ecx, 8
movl -4(edi), ebx
movl eax, 4(ecx)
RS cannot detect this and could dispatch them at the same time
18. 18
Memory Order Buffer (MOB)
• Allocated by ALLOC
• A second-order RS for memory operations
• 1 µop for load; 2 µops for store: Store Address (STA) and Store Data (STD)
• MOB
– 16-entry load buffer (LB) (32-entry in Core, 64 in SandyBridge)
– 12-entry store address buffer (SAB) (20-entry in Core, 36 in SandyBridge)
– SAB works in unison with
• Store data buffer (SDB) in MIU
• Physical Address Buffer (PAB) in DCU
– Store Buffer (SB): SAB + SDB + PAB
• Senior Stores
– Upon STD/STA retired from ROB
– SB marks the store “senior”
– Senior stores are committed back in program order to memory when bus idle or SB full
• Prefetch instructions in P-III
– Senior load behavior
– Due to no explicit architectural destination
• New memory dependency predictor in Core to predict store-to-load dependencies
19. 19
Store Coloring
• ALLOC assigns Store Buffer ID (SBID) in program order
• ALLOC tags loads with the most recent SBID
• Check loads against stores with equal or older SBIDs for potential address conflicts
• SDB forwards data if conflict detected
x86 Instructions      µops          store color
mov (0x1220), ebx     std ebx       2
                      sta 0x1220    2
mov (0x1110), eax     std eax       3
                      sta 0x1100    3
mov ecx, (0x1220)     ld 0x1220     3
mov edx, (0x1280)     ld 0x1280     3
mov (0x1400), edx     std edx       4
                      sta 0x1400    4
mov edx, (0x1380)     ld 0x1380     4
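The color column of the table can be reproduced with a few lines of Python. The sketch assumes a hypothetical encoding of the instruction stream as `('st' | 'ld', address)` pairs and a starting SBID of 1 so that the first store receives color 2 as in the table. The conflict check compares a load only with stores that precede it in program order.

```python
def assign_colors(ops, last_sbid=1):
    """ops: list of (kind, address) with kind 'st' or 'ld', in program order.

    Returns a list of (kind, address, color).
    """
    colored = []
    sbid = last_sbid
    for kind, address in ops:
        if kind == "st":
            sbid += 1                    # each store gets the next SBID
        colored.append((kind, address, sbid))   # a load inherits the latest SBID
    return colored


def conflicting_stores(colored, load_index):
    """Stores earlier in program order that write the address the load reads."""
    _, load_addr, load_color = colored[load_index]
    return [
        (addr, color)
        for kind, addr, color in colored[:load_index]
        if kind == "st" and color <= load_color and addr == load_addr
    ]
```

For the slide's sequence, the load from 0x1220 carries color 3 and matches the store with color 2, so that store's data would be the forwarding candidate.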
20. 20
Memory Type Range Registers (MTRR)
• Control registers written by the system (OS)
• Supporting Memory Types
– UnCacheable (UC)
– Uncacheable Speculative Write-combining (USWC or WC)
• Use a fill buffer entry as WC buffer
– WriteBack (WB)
– Write-Through (WT)
– Write-Protected (WP)
• E.g. support copy-on-write in UNIX: save memory space by allowing child processes to share pages with their parents, and only create new memory pages when child processes attempt to write.
• Page Miss Handler (PMH)
– Look up MTRR while supplying physical addresses
– Return memory types and physical address to DTLB
21. 21
Intel NetBurst Microarchitecture
• Pentium 4’s microarchitecture
• Original target market: Graphics workstations, but …
• Design Goals:
– Performance, performance, performance, …
– Unprecedented multimedia/floating-point performance
• Streaming SIMD Extensions 2 (SSE2)
• SSE3 introduced in Prescott Pentium 4 (90nm)
– Reduced CPI
• Low latency instructions
• High bandwidth instruction fetching
• Rapid Execution of Arithmetic & Logic operations
– Reduced clock period
• New pipeline designed for scalability
23. 23
Pentium 4 Fact Sheet
• IA-32 fully backward compatible
• Available at speeds ranging from 1.3 to ~3.8 GHz
• Hyperpipelined (20+ stages)
• 125 million transistors in Prescott (1.328 billion in 16MB on-die L3 Tulsa, 65nm)
• 0.18 μ for 1.3 to 2GHz; 0.13μ for 1.8 to 3.4GHz; 90nm for 2.8GHz to 3.6GHz
• Die size of 122 mm² (Prescott 90nm), 435 mm² (Tulsa 65nm)
• Consumes 115 watts of power at 3.6GHz
• 1066MHz system bus
• Prescott L1 16KB, 8-way vs. previous P4’s 8KB 4-way
• 1MB, 512KB or 256KB 8-way full-speed on-die L2 (B/W example: 89.6 GB/s @2.8GHz to L1)
• 2MB L3 cache (in P4 HT Extreme edition, 0.13μ only), 16MB in Tulsa
• 144 new 128 bit SIMD instructions (SSE2), and 13 SSE3 instructions in Prescott
• HyperThreading Technology (Not in all versions)
24. 24
Building Blocks of Netburst
[Block diagram. System bus and Bus Unit; Memory subsystem: Level 2 Cache and L1 Data Cache; Front-end: Fetch/Dec, ETC (Execution Trace Cache), μROM, BTB / Br Pred.; Out-of-Order Engine: OOO logic and Retire; Execution Units: INT and FP Exec. Unit; branch history update feeds back to the BTB / Br Pred.]
27. 27
Execution Trace Cache
• Primary first level I-cache to replace conventional L1
– Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
– Branch misprediction penalty is considerable
• Advantages
– Cache post-decode µops (think about fill unit)
– High bandwidth instruction fetching
– Eliminate x86 decoding overheads
– Reduce branch recovery time if TC hits
• Hold up to 12,000 µops
– 6 µops per trace line
– Many (?) trace lines in a single trace
28. 28
Execution Trace Cache
• Deliver 3 µops per cycle to OOO engine if br pred is good
• X86 instructions read from L2 when TC misses (7+ cycle latency)
• TC hit rate is comparable to that of an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
– Only one complex instruction per cycle
– Instructions of more than 4 µops are executed by the micro-code ROM (P6’s MS)
• Perform branch prediction in TC
– 512-entry BTB + 16-entry RAS
– Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
– Intel did not disclose the details of BP algorithms used in TC and x86 IFU (Dynamic + Static)
29. 29
Out-Of-Order Engine
• Similar design philosophy to P6; uses
– Allocator
– Register Alias Table
– 128 physical registers
– 126-entry ReOrder Buffer
– 48-entry load buffer
– 24-entry store buffer
31. 31
Micro-op Scheduling
• µop FIFO queues
– Memory queue for loads and stores
– Non-memory queue
• µop schedulers
– Several schedulers fire instructions from 2 µop queues to execution (P6’s RS)
– 4 distinct dispatch ports
– Maximum dispatch: 6 µops per cycle (2 fast ALU µops each from Ports 0 and 1 per cycle; 1 each from the ld/st ports)
Exec Port 0
    Fast ALU (2x pumped): Add/sub, Logic, Store Data, Branches
    FP Move: FP/SSE Move, FP/SSE Store, FXCH
Exec Port 1
    Fast ALU (2x pumped): Add/sub
    INT Exec: Shift, Rotate
    FP Exec: FP/SSE Add, FP/SSE Mul, FP/SSE Div, MMX
Load Port
    Memory Load: Loads, LEA, Prefetch
Store Port
    Memory Store: Stores
32. 32
Data Memory Accesses
• Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line
• Load-to-use speculation
– Dependent instruction dispatched before load finishes
• Due to the high frequency and deep pipeline depth
• From load scheduler to execution is longer than execution itself
– Scheduler assumes loads always hit L1
– If L1 misses, dependent instructions that have already left the scheduler temporarily receive incorrect data (mis-speculation)
– Replay logic
• Re-execute the load when mis-speculated
• Mis-speculated operations are placed into a replay queue to be re-dispatched
– All trailing independent instructions are allowed to proceed
– Tornado breaker
• Up to 4 outstanding load misses (= 4 fill buffers in original P6)
• Store-to-load forwarding buffer
– 24 entries
– Have the same starting physical address
– Load data size <= store data size
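The two forwarding conditions listed for the store-to-load forwarding buffer amount to a simple predicate. The sketch below uses hypothetical parameter names and sizes in bytes; it checks only those two conditions and ignores everything else a real design would consider.

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """True when a buffered store can supply the data for a later load."""
    same_start = store_addr == load_addr      # same starting physical address
    fits = load_size <= store_size            # the load reads no more than was stored
    return same_start and fits
```

A 2-byte load from the start of a 4-byte store can be forwarded, while a 4-byte load after a 2-byte store to the same address cannot.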
33. 33
Fast Staggered ALU
• For frequent ALU instructions (no multiply, no shift, no rotate, no branch processing)
• Double pumped clocks
• Each operation finishes in 3 fast cycles
– Lower-order 16-bit and bypass
– Higher-order 16-bit and bypass
– ALU flags generation
[Figure: three staggered stages producing Bit[15:0], Bit[31:16] and Flags.]
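The three fast cycles can be mimicked in software for a 32-bit add. This is only an arithmetic illustration with a hypothetical function name; it shows why the upper half must wait for the carry out of the lower half, and it computes just three of the flags.

```python
def staggered_add(a, b):
    """Add two 32-bit values in two 16-bit halves, then derive a few flags."""
    # Fast cycle 1: lower-order 16 bits.
    low = (a & 0xFFFF) + (b & 0xFFFF)
    carry = low >> 16
    # Fast cycle 2: higher-order 16 bits, consuming the carry from cycle 1.
    high = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry
    result = ((high & 0xFFFF) << 16) | (low & 0xFFFF)
    # Fast cycle 3: flag generation from the complete result.
    flags = {"CF": high >> 16, "ZF": int(result == 0), "SF": result >> 31}
    return result, flags
```

Because the low half is available one fast cycle early, a dependent add can start on its own low half while the high half of the first add is still being computed.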
34. 34
Branch Predictor
• P4 uses the same hybrid predictor as Pentium M
[Figure: Bimodal, Local and Global Predictors produce Pred_B, Pred_L and Pred_G; two MUXes controlled by L_hit and G_hit select the final prediction.]
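The two-MUX selection in the figure can be sketched as a small function. The priority order used here (global over local over bimodal) is an assumption read off the signal names, not something the slide states, and the function name is hypothetical.

```python
def hybrid_predict(pred_b, pred_l, l_hit, pred_g, g_hit):
    """Select among bimodal, local and global predictions (assumed priority)."""
    # First MUX: a local-predictor hit overrides the bimodal default.
    after_local = pred_l if l_hit else pred_b
    # Second MUX: a global-predictor hit overrides whatever the first MUX chose.
    return pred_g if g_hit else after_local
```

When neither tagged predictor hits, the bimodal prediction is used as the fallback.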
35. 35
Indirect Branch Predictor
• In Pentium M and Prescott Pentium 4
• Prediction based on global history
36. 36
New Instructions over Pentium
• CMOVcc / FCMOVcc r, r/m
– Conditional move (predicated move) instructions
– Based on condition code (cc)
• FCOMI/P : compare FP stack and set integer flags
• RDPMC/RDTSC instructions
– PMC: P6 has 2, Netburst (P4) has 18
• Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
37. 37
New Instructions
• SSE2 in Pentium 4 (not P6 microarchitecture)
– Double precision SIMD FP
• SSSE3 in Core 2
– Supplemental instructions for shuffle, align, add, subtract.
• Intel 64 (EM64T)
– 64 bit support, new registers (8 more on top of 8)
– In Celeron D, Core 2 (and P4 Prescott, Pentium D)
– Almost compatible with AMD64
– AMD’s NX bit or Intel’s XD bit for preventing buffer overflow attacks
38. 38
Streaming SIMD Extension 2
• P-III SSE (Katmai New Instructions: KNI)
– Eight 128-bit wide xmm registers (new architecture state)
– Single-precision 128-bit SIMD FP
• Four 32-bit FP operations in one instruction
• Broken down into 2 µops for execution (only 80-bit data in ROB)
– 64-bit SIMD MMX (use 8 mm registers, mapped to FP stack)
– Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
– Support Double-precision 128-bit SIMD FP
• Two 64-bit FP operations in one instruction
• Throughput: 2 cycles for most of SSE2 operations (exceptional examples: DIVPD and SQRTPD: 69 cycles, non-pipelined.)
– Enhanced 128-bit SIMD MMX using xmm registers
41. 41
HyperThreading
• Intel Xeon Processor and Intel Xeon MP Processor
• Enable Simultaneous Multi-Threading (SMT)
– Exploit ILP through TLP (Thread-Level Parallelism)
– Issuing and executing multiple threads at the same snapshot
• Single P4 w/ HT appears to be 2 logical processors
• Share the same execution resources
– dTLB shared with logical processor ID
– Some other shared resources are partitioned (next slide)
• Architectural states and some microarchitectural states are duplicated
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table
43. 43
HyperThreading Resource Partitioning
• TC (or UROM) is accessed alternately per cycle for each logical processor unless one is stalled due to a TC miss
• µop queue (split in ½) after fetch from TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General µop queue and memory µop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternating between 2 logical processors