Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, NetBurst, Pentium 4

ECE 4100/6100
Advanced Computer Architecture
Lecture 12 P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

2
P6 System Architecture
[Figure: P6 system block diagram. The host processor (P6 core with L1 cache (SRAM), plus an L2 cache (SRAM) on the back-side bus) connects over the front-side bus to the MCH. The MCH attaches system memory (DRAM) and, via AGP or PCI Express, the graphics processor (GPU with local frame buffer). The ICH provides PCI, USB and I/O. MCH and ICH together form the chipset.]

3
P6 Microarchitecture
[Figure: P6 block diagram.
 Blocks: BTB/BAC, Instruction Fetch Unit, Bus interface unit, two Instruction Decoders, Microcode Sequencer, Register Alias Table, Allocator, Reservation Station, ROB & Retire RF, AGU, MMX, two IEU/JEU, FEU, MIU, Memory Order Buffer, Data Cache Unit (L1)
 Clusters: Instruction Fetch Cluster, Issue Cluster, Out-of-order Cluster, Memory Cluster, Bus Cluster
 The external bus crosses the chip boundary
 Legend: control flow; (restricted) data flow]

4
Pentium III Die Map
EBL/BBL – External/Backside Bus logic
MOB - Memory Order Buffer
Packed FPU - Floating Point Unit for SSE
IEU - Integer Execution Unit
FAU - Floating Point Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit (L1)
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Floating Point unit
RS - Reservation Station
BTB - Branch Target Buffer
TAP – Test Access Port
IFU - Instruction Fetch Unit and L1 I-Cache
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer

5
P6 Basics
• One implementation of IA32 architecture
• Deeply pipelined processor
• In-order front-end and back-end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
– Pentium Pro
– Pentium II (PPro + MMX + 2x caches)
– Pentium III (P-II + SSE + enhanced MMX, e.g. PSADBW)
– Pentium 4 (Not P6, will be discussed separately)
– Pentium M (+SSE2, SSE3, µop fusion)
– Core (PM + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion, retire rate of 4 µops per cycle vs. 3 in earlier P6 derivatives)

6
P6 Pipelining
[Figure: P6 pipeline timing diagram.
 In-order front end (stages 11-17, 20-22): NextIP, I-Cache, ILD, Rotate, Dec1, Dec2, BrDec, IDQ, RAT, RS write
 Single-cycle pipeline (31-33, 82-83): RS schedule, RS dispatch, Exec/WB
 Multi-cycle pipeline (31-33, 81-83): Exec2 ... Execn
 Non-blocking memory pipeline (31-33, 42-43, 81-83): AGU, DCache1, DCache2
 Blocking memory pipeline (31-33, 40-43, 81-83): AGU, MOB blk, MOB wr, MOB disp, DCache1, DCache2, MOB wakeup
 Writeback: 81 Mem/FP WB, 82 Int WB schedule, 83 Data WB
 In-order retirement (91-93): Ret ptr wr, Ret ROB rd, RRF wr
 RS, MOB and ROB scheduling delays separate the groups; the FE in-order boundary and the retirement in-order boundary bracket the out-of-order section
 Simplified view: IFU1 IFU2 IFU3 DEC1 DEC2 RAT ROB DIS EX RET1 RET2]

7
Instruction Fetching Unit
• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder, mark instruction boundaries, BTB makes prediction (2 cycles)
• IFU3: Align instructions to 3 decoders in 4-1-1 format
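[Figure: IFU datapath. Blocks: Next PC mux (inputs from the Branch Target Buffer and other fetch requests, producing the linear address), instruction TLB (physical address), instruction cache, victim cache, streaming buffer, select mux, ILD (length marks), BTB (prediction marks), instruction rotator, instruction buffer. The rotator is steered by the number of bytes consumed by ID.]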

8
Static Branch Prediction (stage 17 Br. Dec of pg. 6)
[Figure: static prediction flowchart.
 Decision nodes: BTB miss?, PC-relative?, Conditional?, Backwards?, Return?
 Leaf cases: unconditional PC-relative, indirect jump
 Outcomes: BTB dynamic predictor's decision (no BTB miss), Taken, Not Taken]

9
Dynamic Branch Prediction
• Similar to a 2-level PAs design
• Associated with each BTB entry
• With a 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to 16-byte fetch per cycle)
• Speculative update (2 copies of BHR)
• Static prediction provided by Branch Address Calculator when BTB misses (see prior slide)
[Figure: each entry of the 512-entry BTB (4 ways, W0-W3) holds a Branch History Register (BHR), shown as 1 1 0 with the newest outcome 1 shifted in. The BHR indexes the Pattern History Tables (PHT, entries 0000 to 1111) of 2-bit saturating counters, which supply the prediction. The branch result Rc updates the counter, and the speculative update writes back the new (speculative) history 1101.]
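The two-level scheme on this slide can be sketched in a few lines. This is a minimal illustration, not Intel's design: the class name `TwoLevelPredictor`, the 4-bit history length and the dictionary-based tables are assumptions, each branch gets its own pattern table (a real PAs design shares pattern tables among sets of branches), and speculative update is left out.

```python
class TwoLevelPredictor:
    """Per-branch history register indexing 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = {}  # branch address -> recent outcomes as a bit string
        self.pht = {}  # (branch address, history) -> counter in 0..3

    def predict(self, pc):
        history = self.bhr.get(pc, 0)
        counter = self.pht.get((pc, history), 1)  # start weakly not-taken
        return counter >= 2                       # True means "taken"

    def update(self, pc, taken):
        history = self.bhr.get(pc, 0)
        counter = self.pht.get((pc, history), 1)
        # saturating increment on taken, decrement on not-taken
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.pht[(pc, history)] = counter
        # shift the outcome into the history register
        self.bhr[pc] = ((history << 1) | int(taken)) & self.mask
```

Trained on a repeating taken, taken, not-taken pattern, the 4-bit history pins down the position within the pattern, so the predictor stops mispredicting after a short warm-up.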

10
X86 Instruction Decode
• 4-1-1 decoder
• Decode rate depends on instruction alignment
• DEC1: translate x86 instructions into micro-operations (µops)
• DEC2: move decoded µops to ID queue
• MS performs translation in one of two ways
– Generate the entire µop sequence from the microcode ROM
– Receive 4 µops from the complex decoder, and the rest from the microcode ROM
• Instructions following an instruction that needs the MS are flushed
[Figure: the instruction buffer (16 bytes) feeds one complex decoder (1-4 µops) and two simple decoders (1 µop each), alongside the micro-instruction sequencer (MS). All write into the instruction decoder queue (6 µops), which feeds RAT/ALLOC.]
Next 3 inst    # inst to decode
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1
S: Simple, C: Complex
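The decode table follows from one rule: decoder 0 accepts any instruction, while decoders 1 and 2 accept only simple ones. A minimal sketch of that rule (the function name `decodable_count` is illustrative, not from the lecture):

```python
def decodable_count(kinds):
    """How many of the next three instructions decode this cycle.

    kinds: three letters in program order, "S" (simple) or "C" (complex).
    """
    count = 1  # decoder 0 (the complex decoder) takes either kind
    for kind in kinds[1:3]:
        if kind != "S":
            break  # a complex instruction must wait for decoder 0
        count += 1
    return count
```

For example, `decodable_count("CSC")` gives 2 and `decodable_count("SCS")` gives 1, matching the table rows.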

11
Register Alias Table (RAT)
• Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 µops per cycle
• 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
• RAT looks up physical ROB locations for renamed sources based on the RRF bit
• Override logic handles dependent µops decoded in the same cycle
• A misprediction reverts all pointers to point to the Retirement Register File (RRF)
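[Figure: RAT datapath. Logical sources from the in-order queue index the integer RAT array and the FP RAT array (with FP TOS adjust). The int and FP overrides array combines the RAT PSrc's with the physical ROB pointers from the allocator to produce the physical sources (PSrc).
 Renaming example: the RAT entries for EAX, EBX, ECX and EDX each hold an RRF bit and a PSrc. PSrc values 25, 2 and 15 point into the ROB; the RRF bits shown are 0, 0, 0, 1, and a set RRF bit selects the RRF instead. ECX is the logical source being looked up.]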

12
Partial Stalls due to RAT
• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register that contains it
– The read would need to combine partial pieces from multiple physical registers
• Partial flag stalls: occur when an instruction reads more flags than the prior unretired flag-writing instruction updates
[Figure: EAX with its low half AX; a write to AX followed by a read of EAX]
Partial register stall:
  MOV AX, m8 ;
  ADD EAX, m32 ; stall
Idiom fix (1):
  XOR EAX, EAX
  MOV AL, m8 ;
  ADD EAX, m32 ; no stall
Idiom fix (2):
  SUB EAX, EAX
  MOV AL, m8 ;
  ADD EAX, m32 ; no stall
Partial flag stall (1):
  CMP EAX, EBX
  INC ECX
  JBE XX ; stall
  JBE reads both ZF and CF, while INC writes ZF, OF, SF, AF and PF but not CF, i.e. only ZF of the two
Partial flag stall (2):
  TEST EBX, EBX
  LAHF ; stall
  LAHF loads the low byte of EFLAGS, while TEST writes only some of those flags

13
Partial Register Width Renaming
• 32/16-bit accesses:
– Read from low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
– Write to both banks (AH/BH/CH/DH)
• 8-bit RAT accesses: update only the bank (low or high) that is being written
[Figure: RAT datapath as on slide 11 (in-order queue, FP TOS adjust, FP RAT array, int and FP overrides array, physical ROB pointers from the allocator), with the integer RAT array split into an INT low bank (32b/16b/L, 8 entries) and an INT high bank (H, 4 entries). Each entry holds Size (2 bits), RRF (1 bit) and PSrc (6 bits).
 Example:
 µop0: MOV AL = (a)
 µop1: MOV AH = (b)
 µop2: ADD AL = (c)
 µop3: ADD AH = (d)]

14
Allocator (ALLOC)
• The interface between in-order and out-of-order
pipelines
• Allocates into ROB, MOB and RS
– “3-or-none” µops per cycle into ROB and RS
• Must have 3 free ROB entries or no allocation
– “all-or-none” policy for MOB
• Stall allocation when not all the valid MOB µops can be allocated
• Generates the physical destination token Pdst from the ROB and passes it to the Register Alias Table (RAT) and RS
• Stalls upon shortage of resources
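The allocation policies on this slide can be written as a single check. This is a sketch under stated assumptions: the function name `can_allocate` and the µop kind strings are hypothetical, and it assumes the "3-or-none" rule needs 3 free entries in the RS as well as in the ROB.

```python
def can_allocate(uops, free_rob, free_rs, free_lb, free_sb):
    """Decide whether this cycle's µops may be allocated.

    uops: kinds of the (up to 3) µops offered, each "load", "store" or "other".
    """
    # "3-or-none": without 3 free ROB and RS entries, nothing is allocated
    if free_rob < 3 or free_rs < 3:
        return False
    # "all-or-none" for the MOB: every memory µop needs a buffer entry
    loads = uops.count("load")
    stores = uops.count("store")
    return loads <= free_lb and stores <= free_sb
```

A `False` result corresponds to an allocation stall for that cycle.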

15
Reservation Stations (RS)
• Gateway to execution: dispatches up to 5 µops per cycle, at most one per port
• Port binding at dispatch time (certain µops can only be bound to one port)
• 20-entry µop buffer bridging the in-order and out-of-order engines (32 entries in Core)
• RS fields include µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
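[Figure: RS dispatch ports.
 Port 0: IEU0, Fadd, Fmul, Imul, Div
 Port 1: IEU1, JEU
 Port 2: AGU0 (load address, LDA)
 Port 3: AGU1 (store address, STA)
 Port 4: store data (STD)
 The packed FP units Pfadd, Pfmul and Pfshuf are also shown. Results return to the RS and ROB over WB bus 0 and WB bus 1; the MOB and DCU return loaded data; retired data moves from the ROB to the RRF.]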

16
ReOrder Buffer (ROB)
• A 40-entry circular buffer (96-entry in Core)
– 157-bit wide
– Provide 40 alias physical registers
• Out-of-order completion
• Deposit exception in each entry
• Retirement (or de-allocation)
– After resolving prior speculation
– Handle exceptions through MS
– Clear OOO state when a mispredicted branch or exception is detected
– 3 µops per cycle in program order
– For multi-µop x86 instructions: none or all (atomic)
[Figure: ALLOC, RAT and RS interact with the ROB; retired results move from the ROB to the RRF; exceptions (exp) go to the MS for a µcode assist.]

17
Memory Execution Cluster
• Manage data memory accesses
• Address Translation
• Detect violation of access ordering
• Fill buffers (FB) in DCU, similar to MSHR for non-blocking cache support
[Figure: memory cluster. The RS / ROB dispatch LD, STA and STD µops; LD and STA pass through the DTLB to the DCU. The load buffer, store buffer, fill buffers (FB) and EBL complete the cluster.]
Example:
  movl ecx, edi
  addl ecx, 8
  movl -4(edi), ebx
  movl eax, 4(ecx)
The RS cannot detect a memory dependence between these memory operations and could dispatch them at the same time

18
Memory Order Buffer (MOB)
• Allocated by ALLOC
• A second order RS for memory operations
• 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
• MOB
– 16-entry load buffer (LB) (32-entry in Core, 64 in Sandy Bridge)
– 12-entry store address buffer (SAB) (20-entry in Core, 36 in Sandy Bridge)
– SAB works in unison with
• Store data buffer (SDB) in MIU
• Physical Address Buffer (PAB) in DCU
– Store Buffer (SB): SAB + SDB + PAB
• Senior Stores
– Upon STD/STA retired from ROB
– SB marks the store “senior”
– Senior stores are committed back to memory in program order when the bus is idle or the SB is full
• Prefetch instructions in P-III
– Senior load behavior
– Due to no explicit architectural destination
• New memory dependency predictor in Core to predict store-to-load dependencies

19
Store Coloring
• ALLOC assigns Store Buffer ID (SBID) in program order
• ALLOC tags loads with the most recent SBID
• Check loads against stores with equal or older SBIDs for potential address conflicts
• SDB forwards data if conflict detected
x86 instruction       µops          store color
mov (0x1220), ebx     std ebx       2
                      sta 0x1220    2
mov (0x1110), eax     std eax       3
                      sta 0x1100    3
mov ecx, (0x1220)     ld 0x1220     3
mov edx, (0x1280)     ld 0x1280     3
mov (0x1400), edx     std edx       4
                      sta 0x1400    4
mov edx, (0x1380)     ld 0x1380     4
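The coloring in the table can be reproduced with a short sketch. The function names, the tuple layout and the starting SBID of 1 are assumptions chosen to match the table; addresses are taken from the µop column, and the exact-address comparison is a simplification of the hardware's overlap check.

```python
def color_uops(uops, last_sbid=1):
    """Tag each (kind, address) memory operation with its store color.

    kind is "store" or "load"; uops are listed in program order.
    """
    colored = []
    for kind, addr in uops:
        if kind == "store":
            last_sbid += 1  # each store takes the next SBID
        # a load inherits the SBID of the most recent store
        colored.append((kind, addr, last_sbid))
    return colored


def conflicting_stores(colored, load):
    """Stores no younger than the load that write the load's address."""
    _, load_addr, load_color = load
    return [u for u in colored
            if u[0] == "store" and u[2] <= load_color and u[1] == load_addr]


program = [("store", 0x1220), ("store", 0x1100), ("load", 0x1220),
           ("load", 0x1280), ("store", 0x1400), ("load", 0x1380)]
colored = color_uops(program)
```

Because SBIDs grow in program order, comparing colors is enough to rule out every store that comes after the load; only `ld 0x1220` finds a candidate store to forward from.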

20
Memory Type Range Registers (MTRR)
• Control registers written by the system (OS)
• Supported memory types
– UnCacheable (UC)
– Uncacheable Speculative Write-combining (USWC or WC)
• Use a fill buffer entry as WC buffer
– WriteBack (WB)
– Write-Through (WT)
– Write-Protected (WP)
• E.g. Support copy-on-write in UNIX, save memory space by allowing
child processes to share with their parents. Only create new memory
pages when child processes attempt to write.
• Page Miss Handler (PMH)
– Look up MTRR while supplying physical addresses
– Return memory types and physical address to DTLB

21
Intel NetBurst Microarchitecture
• Pentium 4’s microarchitecture
• Original target market: Graphics workstations, but …
• Design Goals:
– Performance, performance, performance, …
– Unprecedented multimedia/floating-point performance
• Streaming SIMD Extensions 2 (SSE2)
• SSE3 introduced in Prescott Pentium 4 (90nm)
– Reduced CPI
• Low latency instructions
• High bandwidth instruction fetching
• Rapid Execution of Arithmetic & Logic operations
– Reduced clock period
• New pipeline designed for scalability

22
Innovations Beyond P6
• Hyperpipelined technology
• Streaming SIMD Extension 2
• Hyper-threading Technology (HT)
• Execution trace cache
• Rapid execution engine
• Staggered adder unit
• Enhanced branch predictor
• Indirect branch predictor (also in Banias Pentium M)
• Load speculation and replay

23
Pentium 4 Fact Sheet
• IA-32 fully backward compatible
• Available at speeds ranging from 1.3 to ~3.8 GHz
• Hyperpipelined (20+ stages)
• 125 million transistors in Prescott (1.328 billion in Tulsa with 16MB on-die L3, 65nm)
• 0.18µm for 1.3 to 2GHz; 0.13µm for 1.8 to 3.4GHz; 90nm for 2.8GHz to 3.6GHz
• Die size of 122mm² (Prescott, 90nm), 435mm² (Tulsa, 65nm)
• Consumes 115 watts of power at 3.6GHz
• 1066MHz system bus
• Prescott L1 16KB, 8-way vs. previous P4’s 8KB 4-way
• 1MB, 512KB or 256KB 8-way full-speed on-die L2 (B/W example: 89.6 GB/s to L1 @2.8GHz)
• 2MB L3 cache (in P4 HT Extreme Edition, 0.13µm only), 16MB in Tulsa
• 144 new 128-bit SIMD instructions (SSE2), and 13 SSE3 instructions in Prescott
• HyperThreading Technology (Not in all versions)

24
Building Blocks of Netburst
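[Figure: NetBurst building blocks.
 Memory subsystem: bus unit (system bus), level 2 cache, L1 data cache
 Front-end: fetch/decode, execution trace cache (ETC), µROM, BTB / branch predictor
 Out-of-order engine: OOO logic, retire
 Execution units: INT and FP execution units
 Branch history update feeds back to the branch predictor]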

25
Pentium 4 Microarchitecture (Prescott)
[Figure: Prescott block diagram.
 Front end: BTB (4k entries), I-TLB/prefetcher, IA32 decoder, execution trace cache (12K µops), trace cache BTB (2k entries), µcode ROM, µop queue
 Allocator / register renamer, feeding the memory µop queue and the INT / FP µop queue
 Schedulers: memory, fast, slow/general FP, simple FP
 INT register file / bypass network: 2 AGUs (load address, store address), two 2x ALUs (simple instructions), slow ALU (complex instructions)
 FP register file / bypass network: FP/MMX/SSE/SSE2/SSE3 unit, FP move unit
 L1 data cache: 16KB 8-way, 64-byte line, WT, 1 read + 1 write port
 Unified L2 cache: 1MB 8-way, 128B line, WB; 256-bit path, 108 GB/s
 BIU: 64-bit system bus, quad pumped, 800MHz, 6.4 GB/sec]

26
Pipeline Depth Evolution
P5 microarchitecture: PREF, DEC, DEC, EXEC, WB
P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
NetBurst microarchitecture (Willamette), 20 stages: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
NetBurst microarchitecture (Prescott): > 30 stages

27
Execution Trace Cache
• Primary first-level I-cache, replacing the conventional L1 I-cache
– Decoding several x86 instructions per cycle at high frequency is difficult and takes several pipeline stages
– Branch misprediction penalty is considerable
• Advantages
– Cache post-decode µops (think about fill unit)
– High bandwidth instruction fetching
– Eliminate x86 decoding overheads
– Reduce branch recovery time if TC hits
• Hold up to 12,000 µops
– 6 µops per trace line
– Many (?) trace lines in a single trace

28
Execution Trace Cache
• Delivers 3 µops per cycle to the OOO engine when branch prediction is correct
• x86 instructions are read from L2 when the TC misses (7+ cycle latency)
• TC hit rate comparable to an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
– Only one complex instruction per cycle
– Instructions of more than 4 µops are executed from the microcode ROM (P6’s MS)
• Performs branch prediction in the TC
– 512-entry BTB + 16-entry RAS
– Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
– Intel did not disclose the details of the BP algorithms used in the TC and x86 IFU (dynamic + static)

29
Out-Of-Order Engine
• Similar design philosophy to P6; uses
– Allocator
– Register Alias Table
– 128 physical registers
– 126-entry ReOrder Buffer
– 48-entry load buffer
– 24-entry store buffer

30
Register Renaming Schemes
[Figure: register renaming schemes.
 P6 register renaming: the RAT maps the eight architectural registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) to entries of the 40-entry ROB, allocated sequentially, each holding data and status; the RRF holds the retired values of the same eight registers.
 NetBurst register renaming: a front-end RAT and a retirement RAT each map the same eight registers into a separate 128-entry register file (RF) that holds the data; the 126-entry ROB, allocated sequentially, holds only status.]

31
Micro-op Scheduling
• µop FIFO queues
– Memory queue for loads and stores
– Non-memory queue
• µop schedulers
– Several schedulers fire instructions from the 2 µop queues to execution (P6’s RS)
– 4 distinct dispatch ports
– Maximum dispatch: 6 µops per cycle (2 fast ALU µops each from ports 0 and 1, plus 1 each from the load and store ports)
Port          Unit                    Operations
Exec Port 0   Fast ALU (2x pumped)    Add/sub, logic, store data, branches
              FP Move                 FP/SSE move, FP/SSE store, FXCH
Exec Port 1   Fast ALU (2x pumped)    Add/sub
              INT Exec                Shift, rotate
              FP Exec                 FP/SSE add, FP/SSE mul, FP/SSE div, MMX
Load Port     Memory Load             Loads, LEA, prefetch
Store Port    Memory Store            Stores

32
Data Memory Accesses
• Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line
• Load-to-use speculation
– Dependent instructions are dispatched before the load finishes
• Due to the high frequency and deep pipeline
• The path from load scheduling to execution is longer than the execution itself
– Scheduler assumes loads always hit L1
– On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data (mis-speculation)
– Replay logic
• Re-execute the load when mis-speculated
• Mis-speculated operations are placed into a replay queue to be re-dispatched
– All trailing independent instructions are allowed to proceed
– Tornado breaker
• Up to 4 outstanding load misses (= 4 fill buffers in original P6)
• Store-to-load forwarding buffer
– 24 entries
– Have the same starting physical address
– Load data size <= store data size
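The two forwarding conditions listed above can be expressed directly. A minimal sketch, assuming a hypothetical `ForwardingBuffer` class, little-endian data (so a narrower load takes the low-order bytes), and a buffer that silently drops its oldest entry when full; it ignores partially overlapping stores, which real hardware must treat as a hazard.

```python
from collections import deque


class ForwardingBuffer:
    """Forward store data to a later load with the same start address."""

    def __init__(self, capacity=24):
        self.entries = deque(maxlen=capacity)  # oldest entry drops off when full

    def record_store(self, addr, size, data):
        self.entries.append((addr, size, data))

    def lookup(self, addr, size):
        """Return forwarded data, or None if no store qualifies."""
        for s_addr, s_size, data in reversed(self.entries):  # youngest first
            # same starting address, and load size <= store size
            if s_addr == addr and size <= s_size:
                return data & ((1 << (8 * size)) - 1)  # low-order bytes
        return None
```

For instance, after `record_store(0x1000, 4, 0xAABBCCDD)`, a 2-byte load at 0x1000 is forwarded 0xCCDD, while a 2-byte load at 0x1002 or an 8-byte load at 0x1000 is not forwarded.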

33
Fast Staggered ALU
• For frequent ALU instructions (No multiply, no shift, no rotate, no branch
processing)
• Double pumped clocks
• Each operation finishes in 3 fast cycles
– Lower-order 16-bit and bypass
– Higher-order 16-bit and bypass
– ALU flags generation
[Figure: staggered add. Bit[15:0] is computed in the first fast cycle, Bit[31:16] in the second, and the flags in the third.]
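The three fast cycles can be mimicked in software to show why the split works: the high half needs only the carry out of the low half. This is an illustrative sketch; the function name and the choice of flags reported (CF, ZF, SF) are assumptions.

```python
def staggered_add(a, b):
    """32-bit add performed as two 16-bit halves, then flags."""
    # Fast cycle 1: low-order 16 bits
    low = (a & 0xFFFF) + (b & 0xFFFF)
    carry = low >> 16
    low &= 0xFFFF
    # Fast cycle 2: high-order 16 bits plus the carry from the low half
    high = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry
    carry_out = high >> 16
    high &= 0xFFFF
    result = (high << 16) | low
    # Fast cycle 3: flags
    flags = {"CF": bool(carry_out), "ZF": result == 0, "SF": bool(result >> 31)}
    return result, flags
```

A dependent add that consumes only the low 16 bits could start after the first fast cycle, which is the point of the bypass in each step.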

34
Branch Predictor
• P4 uses the same hybrid predictor as Pentium M
[Figure: bimodal, local and global predictors produce Pred_B, Pred_L and Pred_G; two muxes controlled by L_hit and G_hit select the final prediction.]

35
Indirect Branch Predictor
• In Pentium M and Prescott Pentium 4
• Prediction based on global history

36
New Instructions over Pentium
• CMOVcc / FCMOVcc r, r/m
– Conditional move (predicated move) instructions
– Based on condition code (cc)
• FCOMI/P: compare FP stack and set integer flags
• RDPMC/RDTSC instructions
– PMC: P6 has 2, NetBurst (P4) has 18
• Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory

37
New Instructions
• SSE2 in Pentium 4 (not P6 microarchitecture)
– Double-precision SIMD FP
• SSSE3 in Core 2
– Supplemental instructions for shuffle, align, add, subtract
• Intel 64 (EM64T)
– 64 bit support, new registers (8 more on top of 8)
– In Celeron D, Core 2 (and P4 Prescott, Pentium D)
– Almost compatible with AMD64
– AMD’s NX bit or Intel’s XD bit for preventing buffer overflow attacks

38
Streaming SIMD Extension 2
• P-III SSE (Katmai New Instructions: KNI)
– Eight 128-bit wide xmm registers (new architecture state)
– Single-precision 128-bit SIMD FP
• Four 32-bit FP operations in one instruction
• Broken down into 2 µops for execution (only 80-bit data in ROB)
– 64-bit SIMD MMX (uses 8 mm registers, mapped to the FP stack)
– Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
– Supports double-precision 128-bit SIMD FP
• Two 64-bit FP operations in one instruction
• Throughput: 2 cycles for most SSE2 operations (exceptions include DIVPD and SQRTPD: 69 cycles, non-pipelined)
– Enhanced 128-bit SIMD MMX using xmm registers

39
Examples of Using SSE
Packed SP FP operation (e.g. ADDPS xmm1, xmm2)
  xmm1 (in):   X3         X2         X1         X0
  xmm2 (in):   Y3         Y2         Y1         Y0
  xmm1 (out):  X3 op Y3   X2 op Y2   X1 op Y1   X0 op Y0
Scalar SP FP operation (e.g. ADDSS xmm1, xmm2)
  xmm1 (in):   X3         X2         X1         X0
  xmm2 (in):   Y3         Y2         Y1         Y0
  xmm1 (out):  X3         X2         X1         X0 op Y0
Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, imm8)
  xmm1 (in):   X3         X2         X1         X0
  xmm2 (in):   Y3         Y2         Y1         Y0
  xmm1 (out):  Y3 .. Y0   Y3 .. Y0   X3 .. X0   X3 .. X0
Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, 0xf1)
  xmm1 (in):   X3         X2         X1         X0
  xmm2 (in):   Y3         Y2         Y1         Y0
  xmm1 (out):  Y3         Y3         X0         X1
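The SHUFPS selection rule can be checked with a small emulation: each 2-bit field of imm8 picks one element, the two low result lanes come from the destination and the two high lanes from the source. The function name and the list representation (index 0 is the lowest lane) are choices made for this sketch.

```python
def shufps(dest, src, imm8):
    """Emulate SHUFPS dest, src, imm8 on 4-element lists (index 0 = lowest lane)."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]  # four 2-bit selectors
    # low two lanes select from dest, high two lanes select from src
    return [dest[sel[0]], dest[sel[1]], src[sel[2]], src[sel[3]]]


x = ["X0", "X1", "X2", "X3"]
y = ["Y0", "Y1", "Y2", "Y3"]
result = shufps(x, y, 0xF1)  # selectors, low lane first: 1, 0, 3, 3
```

Here `result` is `["X1", "X0", "Y3", "Y3"]`, which written from the high lane down reads Y3 Y3 X0 X1, as in the last example on the slide.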

40
Examples of Using SSE and SSE2
SSE: packed SP (ADDPS), scalar SP (ADDSS) and shuffle (SHUFPS) examples, as on the previous slide
SSE2:
Packed DP FP operation (e.g. ADDPD xmm1, xmm2)
  xmm1 (in):   X1         X0
  xmm2 (in):   Y1         Y0
  xmm1 (out):  X1 op Y1   X0 op Y0
Scalar DP FP operation (e.g. ADDSD xmm1, xmm2)
  xmm1 (in):   X1         X0
  xmm2 (in):   Y1         Y0
  xmm1 (out):  X1         X0 op Y0
Shuffle DP operation (2-bit imm) (e.g. SHUFPD xmm1, xmm2, imm2)
  xmm1 (in):   X1         X0
  xmm2 (in):   Y1         Y0
  xmm1 (out):  Y1 or Y0   X1 or X0

41
HyperThreading
• Intel Xeon Processor and Intel Xeon MP Processor
• Enable Simultaneous Multi-Threading (SMT)
– Exploit ILP through TLP (Thread-Level Parallelism)
– Issue and execute µops from multiple threads in the same cycle
• A single P4 with HT appears as 2 logical processors
• Share the same execution resources
– dTLB shared with logical processor ID
– Some other shared resources are partitioned (next slide)
• Architectural states and some microarchitectural states are duplicated
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table

42
Multithreading (MT) Paradigms
[Figure: functional-unit occupancy (FU1-FU4 across, execution time down) for threads 1-5 and unused slots under five paradigms: conventional single-threaded superscalar; fine-grained multithreading (cycle-by-cycle interleaving); coarse-grained multithreading (block interleaving); chip multiprocessor (CMP), today called multi-core processors; and simultaneous multithreading (or Intel’s HT).]

43
HyperThreading Resource Partitioning
• TC (or UROM) is accessed alternately, cycle by cycle, by each logical processor unless one is stalled on a TC miss
• µop queue (split in ½) after fetch from TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General µop queue and memory µop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternates between the 2 logical processors
