ECE 4100/6100
Advanced Computer Architecture
Lecture 12 P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2
P6 System Architecture
[Figure: P6 system architecture. The host processor contains the P6 core, the L1 cache (SRAM) and, over the back-side bus, the L2 cache (SRAM). The front-side bus connects the host processor to the chipset: the MCH links system memory (DRAM) and, via AGP or PCI Express, the graphics processor (GPU) with its local frame buffer; the ICH provides PCI, USB and I/O.]
3
P6 Microarchitecture
[Figure: P6 block diagram, organized into five clusters.
Instruction fetch cluster: BTB/BAC, instruction fetch unit.
Issue cluster: instruction decoders, microcode sequencer, register alias table, allocator.
Out-of-order cluster: reservation station, ROB & retire RF, execution units (IEU/JEU, FEU, MMX, AGU, MIU).
Memory cluster: memory order buffer, data cache unit (L1).
Bus cluster: bus interface unit, crossing the chip boundary to the external bus.
Legend: control flow vs. (restricted) data flow.]
4
Pentium III Die Map
EBL/BBL – External/Backside Bus logic
MOB - Memory Order Buffer
Packed FPU - Floating Point Unit for SSE
IEU - Integer Execution Unit
FAU - Floating Point Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit (L1)
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Floating Point unit
RS - Reservation Station
BTB - Branch Target Buffer
TAP – Test Access Port
IFU - Instruction Fetch Unit and L1 I-Cache
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer
5
P6 Basics
• One implementation of IA32 architecture
• Deeply pipelined processor
• In-order front-end and back-end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
– Pentium Pro
– Pentium II (PPro + MMX + 2x caches)
– Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
– Pentium 4 (Not P6, will be discussed separately)
– Pentium M (+SSE2, SSE3, µop fusion)
– Core (Pentium M + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion; retires 4 µops per cycle vs. 3 in earlier family members)
6
P6 Pipelining
[Figure: P6 pipeline stages, with stage numbers as labeled on the slide.
In-order front end (FE), stages 11-17: NextIP, I-Cache, ILD, Rotate, Dec1, Dec2, BrDec; stages 20-22: IDQ, RAT, RS write. The FE in-order boundary follows.
Single-cycle pipeline, stages 31-33: RS schedule, RS dispatch, Exec/WB; writeback in stages 82-83.
Multi-cycle pipeline, stages 31-33 followed by Exec2 ... Execn; writeback in stages 81-83.
Non-blocking memory pipeline, stages 31-33 and 42-43: AGU, DCache1, DCache2; writeback in stages 81-83.
Blocking memory pipeline, stages 31-33: AGU, MOBblk, MOBwr; stages 40-43: MOBdisp, DCache1, DCache2, MOBwakeup; writeback in stages 81-83.
Writeback stages: 81 Mem/FP WB, 82 Int WB schedule, 83 Data WB.
In-order retirement (after the retirement in-order boundary), stages 91-93: Ret ROB rd, Ret ptr wr, RRF wr.
RS, ROB and MOB scheduling delays separate the segments.
Simplified stage names: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2.]
7
Instruction Fetching Unit
• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder, mark instruction boundaries, BTB makes prediction (2 cycles)
• IFU3: Align instructions to 3 decoders in 4-1-1 format
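[Figure: instruction fetch unit. The Next PC mux chooses between the Branch Target Buffer and other fetch requests. The selected linear address accesses the instruction cache, victim cache, streaming buffer and instruction TLB (which supplies the physical address). A select mux passes the fetched data to the ILD, which produces length marks; the BTB produces prediction marks. The instruction rotator fills the instruction buffer, steered by the number of bytes consumed by ID.]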
8
Static Branch Prediction (stage 17 Br. Dec of pg. 6)
[Figure: static prediction flowchart.
BTB miss? No: use the BTB dynamic predictor's decision.
BTB miss? Yes: PC-relative?
– Not PC-relative: Return? Yes: Taken. No (indirect jump): Taken.
– PC-relative: Conditional? No (unconditional PC-relative): Taken.
– PC-relative and conditional: Backwards? Yes: Taken. No: Not Taken.]
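The flowchart on this slide can be read as a short decision function. The following is a minimal sketch, assuming a hypothetical `Branch` record whose field names are invented here for illustration; it is meant to be consulted only after a BTB miss and is not a description of the actual hardware logic.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """Hypothetical decoded-branch record; field names are illustrative."""
    pc_relative: bool
    conditional: bool = False
    target_offset: int = 0  # signed displacement; negative means backwards


def static_predict(branch: Branch) -> bool:
    """Return True for predicted taken, False for predicted not taken.

    Models the static rules applied when the BTB misses.
    """
    if not branch.pc_relative:
        # Returns and indirect jumps are both predicted taken.
        return True
    if not branch.conditional:
        # Unconditional PC-relative jump: always taken.
        return True
    # Conditional PC-relative: backwards (loop-closing) taken, forwards not.
    return branch.target_offset < 0
```

For example, `static_predict(Branch(pc_relative=True, conditional=True, target_offset=-16))` models a loop-closing branch and yields a taken prediction.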
9
Dynamic Branch Prediction
• Similar to a 2-level PAs design
• Associated with each BTB entry
• W/ 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to
16-byte fetch per cycle)
• Speculative update (2 copies of BHR)
• Static prediction provided by the Branch Address Calculator when the BTB misses (see prior slide)
[Figure: two-level predictor. Each entry of the 512-entry, 4-way (W0-W3) BTB carries a Branch History Register (BHR). The BHR value (e.g. 1101) indexes the Pattern History Tables (PHT, entries 0000 ... 1111) of 2-bit saturating counters, which supply the prediction. The branch result Rc updates the counter, and the new (speculative) history is written back to the BHR as a speculative update.]
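A two-level predictor of this kind can be sketched in a few lines. This is a simplified illustration, assuming a 4-bit history (matching the 0000 ... 1111 pattern indices in the figure) and, for brevity, one private pattern table per branch rather than the set-shared tables of a true PAs design; the class and its reset value are assumptions, not the P6 implementation.

```python
class LocalTwoLevelPredictor:
    """Per-branch history register indexing a table of 2-bit counters."""

    def __init__(self, history_bits: int = 4):
        self.mask = (1 << history_bits) - 1
        self.bhr = {}  # branch address -> recent outcomes (one BHR per branch)
        self.pht = {}  # (branch address, history) -> 2-bit counter, 0..3

    def predict(self, pc: int) -> bool:
        history = self.bhr.get(pc, 0)
        # Assumed reset state: 1 (weakly not taken).
        return self.pht.get((pc, history), 1) >= 2

    def update(self, pc: int, taken: bool) -> None:
        history = self.bhr.get(pc, 0)
        counter = self.pht.get((pc, history), 1)
        # Saturating increment on taken, saturating decrement on not taken.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.pht[(pc, history)] = counter
        # Shift the outcome into the history register.
        self.bhr[pc] = ((history << 1) | int(taken)) & self.mask
```

After a short warm-up, such a predictor follows a repeating taken-taken-taken-not-taken loop pattern exactly, because each 4-bit history uniquely determines the next outcome.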
10
X86 Instruction Decode
• 4-1-1 decoder
• Decode rate depends on instruction alignment
• DEC1: translate x86 instructions into micro-operations (µops)
• DEC2: move decoded µops to the ID queue
• MS performs translation in one of two ways:
– Generates the entire µop sequence from the microcode ROM
– Receives 4 µops from the complex decoder, and the rest from the microcode ROM
• Instructions that follow an instruction needing the MS are flushed
[Figure: the 16-byte instruction buffer feeds one complex decoder (1-4 µops) and two simple decoders (1 µop each), with the micro-instruction sequencer (MS) alongside. Decoded µops enter the instruction decoder queue (6 µops) and go on to RAT/ALLOC.]
Next 3 inst    # inst to decode
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1
S: Simple, C: Complex
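The decode table on this slide follows from one rule: the first slot can take any instruction, while the other two slots only take simple ones. A minimal sketch, assuming instructions are encoded as the letters "S" and "C" (an encoding chosen here for illustration):

```python
def instructions_decoded(next_three: str) -> int:
    """Count how many of the next three instructions decode this cycle.

    `next_three` is a string such as "SCS", oldest instruction first,
    where "S" is simple (1 µop) and "C" is complex (up to 4 µops).
    """
    if len(next_three) != 3 or set(next_three) - {"S", "C"}:
        raise ValueError("expected three characters drawn from 'S' and 'C'")
    # Slot 0 is the complex decoder, so the first instruction always decodes.
    decoded = 1
    # Slots 1 and 2 are simple decoders: stop at the first complex instruction.
    for kind in next_three[1:]:
        if kind != "S":
            break
        decoded += 1
    return decoded
```

This reproduces every row of the table, which is why code scheduled in a 4-1-1 pattern sustains the full decode rate.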
11
Register Alias Table (RAT)
• Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 µops per cycle
• 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
• RAT looks up physical ROB locations for renamed sources based on the RRF bit
• Override logic handles dependent µops decoded in the same cycle
• A misprediction reverts all pointers to point to the Retirement Register File (RRF)
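[Figure: RAT datapath. Logical sources from the in-order queue index the integer RAT array and the FP RAT array (with FP TOS adjust); the Int and FP overrides array then produces the physical sources (PSrc). Physical ROB pointers arrive from the allocator.
Renaming example: RAT entries for EAX, EBX, ECX and EDX each hold an RRF bit and a PSrc. Three entries point into the ROB (PSrc 25, 2 and 15, RRF = 0) and one points to the RRF (RRF = 1).]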
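The role of the RRF bit can be illustrated with a toy model. This is a sketch under simplifying assumptions: one µop renamed at a time (so no override logic), integer registers only, and method names invented for illustration.

```python
class RegisterAliasTable:
    """Toy RAT: each logical register has an RRF bit and a ROB pointer."""

    def __init__(self, registers=("EAX", "EBX", "ECX", "EDX")):
        # RRF bit set: the value lives in the retirement register file.
        self.entries = {r: {"rrf": True, "psrc": None} for r in registers}

    def rename(self, dst, srcs, pdst):
        """Rename one µop; returns where each source should be read from."""
        sources = []
        for s in srcs:
            e = self.entries[s]
            sources.append(("RRF", s) if e["rrf"] else ("ROB", e["psrc"]))
        # The destination now lives in the ROB entry handed out by ALLOC.
        self.entries[dst] = {"rrf": False, "psrc": pdst}
        return sources

    def flush(self):
        """On a misprediction, point every register back at the RRF."""
        for e in self.entries.values():
            e["rrf"], e["psrc"] = True, None
```

Renaming a µop that writes EAX into ROB entry 25 makes later readers of EAX fetch from the ROB until `flush()` restores the RRF mapping.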
12
Partial Stalls due to RAT
• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register
– The read must combine partial pieces from multiple physical registers
• Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction writes
Partial register stall (write AX, then read EAX):
    MOV AX, m16 ;
    ADD EAX, m32 ; stall
Idiom fix (1):
    XOR EAX, EAX
    MOV AL, m8 ;
    ADD EAX, m32 ; no stall
Idiom fix (2):
    SUB EAX, EAX
    MOV AL, m8 ;
    ADD EAX, m32 ; no stall
Partial flag stall (1):
    CMP EAX, EBX
    INC ECX
    JBE XX ; stall
JBE reads both ZF and CF, while INC writes ZF, OF, SF, AF and PF, i.e. only ZF of the two
Partial flag stall (2):
    TEST EBX, EBX
    LAHF ; stall
LAHF loads the low byte of EFLAGS, while TEST writes only part of it
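The partial flag stall rule amounts to a subset check between the flags a reader needs and the flags the most recent writer produced. A minimal sketch covering only the instructions on this slide; treating TEST as not writing AF is a modelling assumption made here (AF is architecturally undefined after TEST).

```python
# Flags written or read by the instructions used on the slide.
FLAGS_WRITTEN = {
    "CMP": {"CF", "ZF", "OF", "SF", "AF", "PF"},
    "INC": {"ZF", "OF", "SF", "AF", "PF"},   # INC leaves CF untouched
    "TEST": {"CF", "ZF", "OF", "SF", "PF"},  # assumption: AF not counted
}
FLAGS_READ = {
    "JBE": {"CF", "ZF"},
    "LAHF": {"SF", "ZF", "AF", "PF", "CF"},
}


def partial_flag_stall(writer: str, reader: str) -> bool:
    """True if `reader` needs a flag the most recent `writer` did not set."""
    return not FLAGS_READ[reader] <= FLAGS_WRITTEN[writer]
```

With these tables, INC followed by JBE stalls (CF is missing), whereas CMP followed directly by JBE does not.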
13
Partial Register Width Renaming
• 32/16-bit accesses:
– Read from low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
– Write to both banks (AH/BH/CH/DH)
• 8-bit RAT accesses: update only the bank being written (low or high)
[Figure: RAT datapath as on the earlier RAT slide (in-order queue, FP TOS adjust, FP RAT array, integer RAT array, Int and FP overrides array, physical ROB pointers from the allocator), with the integer RAT array split into two banks.
INT low bank (32b/16b/L): 8 entries. INT high bank (H): 4 entries.
Entry format: Size (2 bits), RRF (1 bit), PSrc (6 bits).
Example µops:
µop0: MOV AL = (a)
µop1: MOV AH = (b)
µop2: ADD AL = (c)
µop3: ADD AH = (d)]
14
Allocator (ALLOC)
• The interface between in-order and out-of-order
pipelines
• Allocates into ROB, MOB and RS
– “3-or-none” µops per cycle into ROB and RS
• Must have 3 free ROB entries or no allocation
– “all-or-none” policy for MOB
• Stall allocation when not all the valid MOB µops can be allocated
• Generates the physical destination token Pdst from the ROB and passes it to the Register Alias Table (RAT) and RS
• Stalls upon shortage of resources
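The two allocation policies can be expressed as a single check per cycle. A minimal sketch, assuming each µop is a dict with a hypothetical boolean `"mem"` key marking µops that also need a MOB entry; the function name and interface are invented for illustration.

```python
def allocate(uops, rob_free, rs_free, mob_free):
    """Decide how many of this cycle's µops (at most 3) can be allocated.

    Returns 0 (stall) or len(uops); partial allocation never happens.
    """
    if len(uops) > 3:
        raise ValueError("at most 3 µops arrive per cycle")
    # "3-or-none": require three free ROB and RS entries, regardless of
    # how many µops are actually valid this cycle.
    if rob_free < 3 or rs_free < 3:
        return 0
    # "all-or-none": every memory µop must get a MOB entry, or nothing goes.
    if sum(1 for u in uops if u["mem"]) > mob_free:
        return 0
    return len(uops)
```

Note that two free ROB entries stall the allocator even when only one µop is waiting, which is the cost of the simpler "3-or-none" check.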
15
Reservation Stations (RS)
• Gateway to execution: dispatches up to 5 µops per cycle, one to each port
• Port binding at dispatch time (certain µops can only be bound to one port)
• 20-entry µop buffer bridging the in-order and out-of-order engines (32 entries in Core)
• RS fields include µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
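[Figure: RS dispatch ports.
Port 0: IEU0, Fadd, Fmul, Imul, Div
Port 1: IEU1, JEU
Port 2: AGU0, load address (LDA)
Port 3: AGU1, store address (STA)
Port 4: store data (STD)
Packed FP units: Pfadd, Pfmul, Pfshuf.
Results return on WB bus 0 and WB bus 1 to the RS and ROB. Load and store µops go to the MOB and DCU, which returns loaded data. Retired data moves from the ROB to the RRF.]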
16
ReOrder Buffer (ROB)
• A 40-entry circular buffer (96-entry in Core)
– 157-bit wide
– Provide 40 alias physical registers
• Out-of-order completion
• Exceptions are recorded in each entry
• Retirement (or de-allocation)
– After resolving prior speculation
– Handle exceptions through the MS
– Clear OOO state when a mispredicted branch or exception is detected
– 3 µops per cycle in program order
– For multi-µop x86 instructions: none or all (atomic)
[Figure: the ROB and its connections to ALLOC, RAT, RS, RRF and the MS (exceptions and µcode assists).]
17
Memory Execution Cluster
• Manage data memory accesses
• Address Translation
• Detect violation of access ordering
• Fill buffers (FB) in DCU, similar to MSHR for non-blocking cache support
[Figure: memory cluster. The RS/ROB issues LD, STA and STD µops. LD and STA pass through the DTLB to the DCU, which holds the fill buffers (FB); the load buffer and store buffer track outstanding operations, and the DCU connects to the EBL.]
movl ecx, edi
addl ecx, 8
movl -4(edi), ebx
movl eax, 4(ecx)
RS cannot detect this and could dispatch them at the same time
18
Memory Order Buffer (MOB)
• Allocated by ALLOC
• A second order RS for memory operations
• 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
• MOB
– 16-entry load buffer (LB) (32-entry in Core, 64 in Sandy Bridge)
– 12-entry store address buffer (SAB) (20-entry in Core, 36 in Sandy Bridge)
– SAB works in unison with
• Store data buffer (SDB) in MIU
• Physical Address Buffer (PAB) in DCU
– Store Buffer (SB): SAB + SDB + PAB
• Senior Stores
– Once the STD/STA µops retire from the ROB, the SB marks the store “senior”
– Senior stores are committed to memory in program order when the bus is idle or the SB is full
• Prefetch instructions in P-III
– Senior load behavior
– Due to no explicit architectural destination
• New memory dependency predictor in Core to predict store-to-load dependencies
19
Store Coloring
• ALLOC assigns Store Buffer ID (SBID) in program order
• ALLOC tags loads with the most recent SBID
• Check loads against stores with equal or older SBIDs for potential address conflicts
• SDB forwards data if conflict detected
x86 instruction        µops          store color
mov (0x1220), ebx      std ebx       2
                       sta 0x1220    2
mov (0x1110), eax      std eax       3
                       sta 0x1100    3
mov ecx, (0x1220)      ld 0x1220     3
mov edx, (0x1280)      ld 0x1280     3
mov (0x1400), edx      std edx       4
                       sta 0x1400    4
mov edx, (0x1380)      ld 0x1380     4
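Store coloring can be sketched as two small functions. This is an illustrative model, assuming that a load only needs checking against stores at or before its own color, that addresses must match exactly, and that each store is represented once by the address in its sta µop; the function names are invented here.

```python
def color_uops(program, first_sbid):
    """Tag each memory operation with a store color.

    `program` is a list of ("st" | "ld", address) in program order.
    Stores receive consecutive SBIDs; a load inherits the most recent one.
    """
    colored, last = [], first_sbid - 1
    for kind, addr in program:
        if kind == "st":
            last += 1
        colored.append((kind, addr, last))
    return colored


def conflicting_stores(colored, load_index):
    """SBIDs of stores at or before the load's color with a matching address."""
    _, addr, color = colored[load_index]
    return [c for k, a, c in colored if k == "st" and c <= color and a == addr]


# The slide's example, using the addresses from the sta and ld µops.
example = color_uops(
    [("st", 0x1220), ("st", 0x1100), ("ld", 0x1220),
     ("ld", 0x1280), ("st", 0x1400), ("ld", 0x1380)],
    first_sbid=2,
)
```

In this example the load from 0x1220 carries color 3 and matches the store with SBID 2, so the SDB would forward that store's data; the other two loads have no conflict.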
20
Memory Type Range Registers (MTRR)
• Control registers written by the system (OS)
• Supported memory types
– UnCacheable (UC)
– Uncacheable Speculative Write-combining (USWC or WC)
• Use a fill buffer entry as WC buffer
– WriteBack (WB)
– Write-Through (WT)
– Write-Protected (WP)
• E.g. Support copy-on-write in UNIX, save memory space by allowing
child processes to share with their parents. Only create new memory
pages when child processes attempt to write.
• Page Miss Handler (PMH)
– Look up MTRR while supplying physical addresses
– Return memory types and physical address to DTLB
21
Intel NetBurst Microarchitecture
• Pentium 4’s microarchitecture
• Original target market: Graphics workstations, but …
• Design Goals:
– Performance, performance, performance, …
– Unprecedented multimedia/floating-point performance
• Streaming SIMD Extensions 2 (SSE2)
• SSE3 introduced in Prescott Pentium 4 (90nm)
– Reduced CPI
• Low latency instructions
• High bandwidth instruction fetching
• Rapid Execution of Arithmetic & Logic operations
– Reduced clock period
• New pipeline designed for scalability
22
Innovations Beyond P6
• Hyperpipelined technology
• Streaming SIMD Extension 2
• Hyper-threading Technology (HT)
• Execution trace cache
• Rapid execution engine
• Staggered adder unit
• Enhanced branch predictor
• Indirect branch predictor (also in Banias Pentium M)
• Load speculation and replay
23
Pentium 4 Fact Sheet
• IA-32 fully backward compatible
• Available at speeds ranging from 1.3 to ~3.8 GHz
• Hyperpipelined (20+ stages)
• 125 million transistors in Prescott (1.328 billion in Tulsa with its 16MB on-die L3, 65nm)
• 0.18µm for 1.3 to 2GHz; 0.13µm for 1.8 to 3.4GHz; 90nm for 2.8GHz to 3.6GHz
• Die size of 122mm² (Prescott, 90nm), 435mm² (Tulsa, 65nm)
• Consumes 115 watts of power at 3.6GHz
• 1066MHz system bus
• Prescott L1 16KB, 8-way vs. previous P4’s 8KB 4-way
• 1MB, 512KB or 256KB 8-way full-speed on-die L2 (B/W example: 89.6 GB/s @2.8GHz to L1)
• 2MB L3 cache (in P4 HT Extreme Edition, 0.13µm only), 16MB in Tulsa
• 144 new 128-bit SIMD instructions (SSE2), and 13 SSE3 instructions in Prescott
• HyperThreading Technology (Not in all versions)
24
Building Blocks of Netburst
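[Figure: NetBurst building blocks. The bus unit connects the system bus to the memory subsystem (level 2 cache and L1 data cache). Front-end: fetch/decode, execution trace cache (ETC), µROM, BTB / branch predictor. Out-of-order engine: OOO logic and retire, driving the execution units (INT and FP). Retirement sends branch history updates back to the front-end.]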
25
Pentium 4 Microarchitecture (Prescott)
[Figure: Prescott block diagram.
Front end: BTB (4K entries), I-TLB/prefetcher, IA32 decoder, execution trace cache (12K µops), trace cache BTB (2K entries), µcode ROM, µop queue.
Allocator / register renamer, feeding the memory µop queue and the INT/FP µop queue.
Schedulers: memory, fast, slow/general FP, simple FP.
INT register file / bypass network: AGU (load address), AGU (store address), two 2x ALUs (simple instructions), slow ALU (complex instructions).
FP register file / bypass network: FP/MMX/SSE/SSE2/SSE3 unit and FP move.
L1 data cache: 16KB 8-way, 64-byte line, WT, 1 read + 1 write port.
Unified L2 cache: 1MB 8-way, 128B line, WB; 256-bit interface, 108 GB/s.
BIU: 64-bit system bus, quad pumped, 800MHz, 6.4 GB/s.]
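The bandwidth figures quoted on these slides follow from width times transfer rate. A small worked sketch; the function name is invented here, and the last line derives the clock implied by the 108 GB/s figure rather than assuming one, since the diagram does not state it.

```python
def bandwidth_gb_per_s(width_bits: int, transfers_mhz: int) -> float:
    """Peak bandwidth in GB/s (10**9 bytes) for a given width and rate."""
    return width_bits // 8 * transfers_mhz / 1000


# 256-bit L2-to-L1 path at 2.8 GHz, as on the fact sheet.
l2_to_l1 = bandwidth_gb_per_s(256, 2800)
# 64-bit system bus at 800 million transfers per second (quad pumped).
system_bus = bandwidth_gb_per_s(64, 800)
# Core clock, in GHz, implied by 108 GB/s on a 256-bit (32-byte) path.
implied_clock_ghz = 108 / (256 // 8)
```

The first two values reproduce the 89.6 GB/s and 6.4 GB/s figures on the slides.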
26
Pipeline Depth Evolution
P5 microarchitecture: PREF, DEC, DEC, EXEC, WB
P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
NetBurst microarchitecture (Willamette), 20 stages: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
NetBurst microarchitecture (Prescott): > 30 stages
27
Execution Trace Cache
• Primary first level I-cache to replace conventional L1
– Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
– Branch misprediction penalty is considerable
• Advantages
– Cache post-decode µops (think about fill unit)
– High bandwidth instruction fetching
– Eliminate x86 decoding overheads
– Reduce branch recovery time if TC hits
• Hold up to 12,000 µops
– 6 µops per trace line
– Many (?) trace lines in a single trace
28
Execution Trace Cache
• Delivers 3 µops per cycle to the OOO engine if branch prediction is good
• x86 instructions are read from L2 when the TC misses (7+ cycle latency)
• TC hit rate is comparable to that of an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
– Only one complex instruction per cycle
– Instructions of more than 4 µops are executed from the microcode ROM (P6’s MS)
• Performs branch prediction in the TC
– 512-entry BTB + 16-entry RAS
– Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
– Intel did not disclose the details of BP algorithms used in TC and x86 IFU
(Dynamic + Static)
29
Out-Of-Order Engine
• Design philosophy similar to P6; uses
– Allocator
– Register Alias Table
– 128 physical registers
– 126-entry ReOrder Buffer
– 48-entry load buffer
– 24-entry store buffer
30
Register Renaming Schemes
[Figure: register renaming schemes compared.
P6 register renaming: the RAT maps each architectural register (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) either to an entry of the 40-entry ROB, which is allocated sequentially and holds both data and status, or to the RRF.
NetBurst register renaming: a front-end RAT and a retirement RAT each map the architectural registers into a separate 128-entry register file (RF) that holds the data; the 126-entry ROB is allocated sequentially and holds only status.]
31
Micro-op Scheduling
• µop FIFO queues
– Memory queue for loads and stores
– Non-memory queue
• µop schedulers
– Several schedulers fire instructions from the 2 µop queues to execution (P6’s RS)
– 4 distinct dispatch ports
– Maximum dispatch: 6 µops per cycle (2 from each fast ALU on ports 0 and 1; 1 each from the load and store ports)
Exec Port 0
– Fast ALU (2x pumped): add/sub, logic, store data, branches
– FP Move: FP/SSE move, FP/SSE store, FXCH
Exec Port 1
– Fast ALU (2x pumped): add/sub
– INT Exec: shift, rotate
– FP Exec: FP/SSE add, FP/SSE mul, FP/SSE div, MMX
Load Port
– Memory Load: loads, LEA, prefetch
Store Port
– Memory Store: stores
32
Data Memory Accesses
• Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line
• Load-to-use speculation
– Dependent instruction dispatched before load finishes
• Due to the high frequency and deep pipeline depth
• From load scheduler to execution is longer than execution itself
– Scheduler assumes loads always hit L1
– If L1 misses, dependent instructions that have already left the scheduler temporarily receive incorrect data: mis-speculation
– Replay logic
• Re-execute the load when mis-speculated
• Mis-speculated operations are placed into a replay queue to be re-dispatched
– All trailing independent instructions are allowed to proceed
– Tornado breaker
• Up to 4 outstanding load misses (= 4 fill buffers in original P6)
• Store-to-load forwarding buffer
– 24 entries
– Forwards only if the load and store have the same starting physical address
– and the load data size <= store data size
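The two forwarding conditions listed here translate directly into a predicate. A minimal sketch; the function and parameter names are chosen for illustration only.

```python
def can_forward(store_addr: int, store_size: int,
                load_addr: int, load_size: int) -> bool:
    """Check the two store-to-load forwarding conditions.

    Addresses are physical byte addresses and sizes are in bytes.
    """
    return store_addr == load_addr and load_size <= store_size
```

A 2-byte load from the start of a 4-byte store can forward under these rules, while a 4-byte load that begins one byte into the same store cannot.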
33
Fast Staggered ALU
• For frequent ALU instructions (No multiply, no shift, no rotate, no branch
processing)
• Double pumped clocks
• Each operation finishes in 3 fast cycles
– Lower-order 16-bit and bypass
– Higher-order 16-bit and bypass
– ALU flags generation
[Figure: staggered add across three fast cycles: Bit[15:0], then Bit[31:16], then Flags.]
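The three-step staggering can be mimicked in software to show why a dependent operation can start on the low half early. This is an illustrative sketch for a 32-bit add only; the function name is invented here, and only the CF, ZF and SF flags are modelled.

```python
def staggered_add(a: int, b: int):
    """Add two 32-bit values in three steps, one per fast clock."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    # Fast cycle 1: low 16 bits; the carry out is passed to the next step.
    low = (a & 0xFFFF) + (b & 0xFFFF)
    carry = low >> 16
    low &= 0xFFFF
    # Fast cycle 2: high 16 bits plus the carry from the low half.
    high = (a >> 16) + (b >> 16) + carry
    carry_out = high >> 16
    high &= 0xFFFF
    result = (high << 16) | low
    # Fast cycle 3: flag generation.
    flags = {"CF": bool(carry_out), "ZF": result == 0,
             "SF": bool(result >> 31)}
    return result, flags
```

Because the low half is final after the first step, a dependent add can begin its own low half one fast cycle later instead of waiting for all three steps.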
34
Branch Predictor
• P4 uses the same hybrid predictor as Pentium M
[Figure: hybrid predictor. The bimodal, local and global predictors produce Pred_B, Pred_L and Pred_G; two cascaded muxes, controlled by L_hit and G_hit, select the final prediction.]
35
Indirect Branch Predictor
• In Pentium M and Prescott Pentium 4
• Prediction based on global history
36
New Instructions over Pentium
• CMOVcc / FCMOVcc r, r/m
– Conditional move (predicated move) instructions
– Based on condition code (cc)
• FCOMI/P: compare FP stack and set integer flags
• RDPMC/RDTSC instructions
– PMC: P6 has 2, NetBurst (P4) has 18
• Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
37
New Instructions
• SSE2 in Pentium 4 (not P6 microarchitecture)
– Double precision SIMD FP
• SSSE3 in Core 2
– Supplemental instructions for shuffle, align, add, subtract
• Intel 64 (EM64T)
– 64 bit support, new registers (8 more on top of 8)
– In Celeron D, Core 2 (and P4 Prescott, Pentium D)
– Almost compatible with AMD64
– AMD’s NX bit or Intel’s XD bit for preventing buffer overflow attacks
38
Streaming SIMD Extension 2
• P-III SSE (Katmai New Instructions: KNI)
– Eight 128-bit wide xmm registers (new architectural state)
– Single-precision 128-bit SIMD FP
• Four 32-bit FP operations in one instruction
• Broken down into 2 µops for execution (only 80-bit data in ROB)
– 64-bit SIMD MMX (uses 8 mm registers, mapped onto the FP stack)
– Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
– Supports double-precision 128-bit SIMD FP
• Two 64-bit FP operations in one instruction
• Throughput: 2 cycles for most SSE2 operations (exceptions include DIVPD and SQRTPD: 69 cycles, non-pipelined)
– Enhanced 128-bit SIMD MMX using xmm registers
39
Examples of Using SSE
[Figure: SSE operations on xmm1 = (X3, X2, X1, X0) and xmm2 = (Y3, Y2, Y1, Y0), written from high lane to low lane.
Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = (X3, X2, X1, X0 op Y0)
Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = (Y3 .. Y0, Y3 .. Y0, X3 .. X0, X3 .. X0)
Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, 0xf1): xmm1 = (Y3, Y3, X0, X1)]
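The lane selection performed by SHUFPS can be modelled on plain lists to check the 0xf1 example. A minimal sketch, assuming lists indexed from the least significant lane (so index 0 holds X0 or Y0); the Python function is an illustration, not an intrinsic.

```python
def shufps(xmm1, xmm2, imm8):
    """Model SHUFPS xmm1, xmm2, imm8 on 4-element lists.

    Each 2-bit field of imm8 selects one lane; the two low result lanes
    come from xmm1 (the destination) and the two high ones from xmm2.
    """
    return [
        xmm1[imm8 & 3],         # result lane 0
        xmm1[(imm8 >> 2) & 3],  # result lane 1
        xmm2[(imm8 >> 4) & 3],  # result lane 2
        xmm2[(imm8 >> 6) & 3],  # result lane 3
    ]


result = shufps(["X0", "X1", "X2", "X3"], ["Y0", "Y1", "Y2", "Y3"], 0xF1)
```

Read from high lane to low lane, `result` is Y3, Y3, X0, X1, matching the slide.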
40
Examples of Using SSE and SSE2
[Figure: SSE and SSE2 operations side by side, written from high lane to low lane.
SSE, with xmm1 = (X3, X2, X1, X0) and xmm2 = (Y3, Y2, Y1, Y0):
Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = (X3, X2, X1, X0 op Y0)
Shuffle FP operation (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = (Y3 .. Y0, Y3 .. Y0, X3 .. X0, X3 .. X0)
Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, 0xf1): xmm1 = (Y3, Y3, X0, X1)
SSE2, with xmm1 = (X1, X0) and xmm2 = (Y1, Y0):
Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 = (X1 op Y1, X0 op Y0)
Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 = (X1, X0 op Y0)
Shuffle DP operation (2-bit imm) (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 = (Y1 or Y0, X1 or X0)]
41
HyperThreading
• Intel Xeon Processor and Intel Xeon MP Processor
• Enable Simultaneous Multi-Threading (SMT)
– Exploit ILP through TLP (Thread-Level Parallelism)
– Issue and execute multiple threads in the same cycle
• A single P4 w/ HT appears as 2 logical processors
• Share the same execution resources
– dTLB shared with logical processor ID
– Some other shared resources are partitioned (next slide)
• Architectural states and some microarchitectural states are duplicated
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table
42
Multithreading (MT) Paradigms
[Figure: issue slots over execution time (vertical) across functional units FU1-FU4 (horizontal), shaded by thread (Thread 1 to Thread 5) or left unused, for five paradigms:
conventional single-threaded superscalar;
fine-grained multithreading (cycle-by-cycle interleaving);
coarse-grained multithreading (block interleaving);
chip multiprocessor (CMP), today called multi-core processors;
simultaneous multithreading (or Intel’s HT).]
43
HyperThreading Resource Partitioning
• TC (or UROM) is accessed in alternate cycles by each logical processor, unless one is stalled due to a TC miss
• µop queue (split in ½) after fetch from TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General µop queue and memory µop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternates between the 2 logical processors

Weitere ähnliche Inhalte

Was ist angesagt?

Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessor
Kishan Panara
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
Prashant Ahire
 

Was ist angesagt? (20)

Computer architecture the pentium architecture
Computer architecture the pentium architectureComputer architecture the pentium architecture
Computer architecture the pentium architecture
 
Full Custom IC Design Implementation of Priority Encoder
Full Custom IC Design Implementation of Priority EncoderFull Custom IC Design Implementation of Priority Encoder
Full Custom IC Design Implementation of Priority Encoder
 
Riscv 20160507-patterson
Riscv 20160507-pattersonRiscv 20160507-patterson
Riscv 20160507-patterson
 
Direct Memory Access
Direct Memory AccessDirect Memory Access
Direct Memory Access
 
486 or 80486 DX Architecture
486 or 80486 DX Architecture486 or 80486 DX Architecture
486 or 80486 DX Architecture
 
Soc architecture and design
Soc architecture and designSoc architecture and design
Soc architecture and design
 
DDR SDRAMs
DDR SDRAMsDDR SDRAMs
DDR SDRAMs
 
Library Characterization Flow
Library Characterization FlowLibrary Characterization Flow
Library Characterization Flow
 
Arm modes
Arm modesArm modes
Arm modes
 
Q4.11: ARM Architecture
Q4.11: ARM ArchitectureQ4.11: ARM Architecture
Q4.11: ARM Architecture
 
Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessor
 
AMBA 5 COHERENT HUB INTERFACE.pptx
AMBA 5 COHERENT HUB INTERFACE.pptxAMBA 5 COHERENT HUB INTERFACE.pptx
AMBA 5 COHERENT HUB INTERFACE.pptx
 
High Bandwidth Memory(HBM)
High Bandwidth Memory(HBM)High Bandwidth Memory(HBM)
High Bandwidth Memory(HBM)
 
80486 and pentium
80486 and pentium80486 and pentium
80486 and pentium
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
DDR3
DDR3DDR3
DDR3
 
Direct Memory Access & Interrrupts
Direct Memory Access & InterrruptsDirect Memory Access & Interrrupts
Direct Memory Access & Interrrupts
 
Embedded C - Optimization techniques
Embedded C - Optimization techniquesEmbedded C - Optimization techniques
Embedded C - Optimization techniques
 
Arm instruction set
Arm instruction setArm instruction set
Arm instruction set
 

Andere mochten auch

2.3 sequantial logic circuit
2.3 sequantial logic circuit2.3 sequantial logic circuit
2.3 sequantial logic circuit
Wan Afirah
 
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
Hsien-Hsin Sean Lee, Ph.D.
 

Andere mochten auch (20)

Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIWLec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- SMP
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- SMPLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- SMP
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- SMP
 
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
 
Lec14 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech --- Coherence
Lec14 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech --- CoherenceLec14 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech --- Coherence
Lec14 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech --- Coherence
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
 
Semiconductor
SemiconductorSemiconductor
Semiconductor
 
Shift Register
Shift RegisterShift Register
Shift Register
 
B sc cs i bo-de u-iii counters & registers
B sc cs i bo-de u-iii counters & registersB sc cs i bo-de u-iii counters & registers
B sc cs i bo-de u-iii counters & registers
 
Digital 9 16
Digital 9 16Digital 9 16
Digital 9 16
 
digital Counter
digital Counterdigital Counter
digital Counter
 
Counter And Sequencer Design- Student
Counter And Sequencer Design- StudentCounter And Sequencer Design- Student
Counter And Sequencer Design- Student
 
14827 shift registers
14827 shift registers14827 shift registers
14827 shift registers
 
2.3 sequantial logic circuit
2.3 sequantial logic circuit2.3 sequantial logic circuit
2.3 sequantial logic circuit
 
Overview of Shift register and applications
Overview of Shift register and applicationsOverview of Shift register and applications
Overview of Shift register and applications
 
Shift Registers
Shift RegistersShift Registers
Shift Registers
 
Shift registers
Shift registersShift registers
Shift registers
 
Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...
Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...
Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...
 
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...
 

Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, NetBurst, Pentium 4

  • 1. ECE 4100/6100 Advanced Computer Architecture Lecture 12 P6 and NetBurst Microarchitecture Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology
  • 2. P6 System Architecture
      [Block diagram. Labels: host processor containing the P6 core with L1 cache (SRAM) and, over the back-side bus, the L2 cache (SRAM); front-side bus; chipset (MCH and ICH); system memory (DRAM); graphics processor (GPU) with local frame buffer, attached via AGP or PCI Express; PCI, USB, and I/O.]
  • 3. P6 Microarchitecture
      [Block diagram. Clusters: Instruction Fetch Cluster, Issue Cluster, Out-of-order Cluster, Memory Cluster, Bus Cluster. Blocks: BTB/BAC, Instruction Fetch Unit, bus interface unit, two instruction decoders, Register Alias Table, Allocator, Microcode Sequencer, Reservation Station, ROB & Retire RF, AGU, MMX, two IEU/JEU units, FEU, MIU, Memory Order Buffer, Data Cache Unit (L1), external bus, chip boundary. Legend: control flow; (restricted) data flow.]
  • 4. Pentium III Die Map
      • EBL/BBL - External/Backside Bus Logic
      • MOB - Memory Order Buffer
      • Packed FPU - Floating Point Unit for SSE
      • IEU - Integer Execution Unit
      • FAU - Floating Point Arithmetic Unit
      • MIU - Memory Interface Unit
      • DCU - Data Cache Unit (L1)
      • PMH - Page Miss Handler
      • DTLB - Data TLB
      • BAC - Branch Address Calculator
      • RAT - Register Alias Table
      • SIMD - Packed Floating Point unit
      • RS - Reservation Station
      • BTB - Branch Target Buffer
      • TAP - Test Access Port
      • IFU - Instruction Fetch Unit and L1 I-Cache
      • ID - Instruction Decode
      • ROB - Reorder Buffer
      • MS - Micro-instruction Sequencer
  • 5. P6 Basics
      • One implementation of the IA32 architecture
      • Deeply pipelined processor
      • In-order front-end and back-end
      • Dynamic execution engine (restricted dataflow)
      • Speculative execution
      • P6 microarchitecture family processors include
          – Pentium Pro
          – Pentium II (PPro + MMX + 2x caches)
          – Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
          – Pentium 4 (not P6, discussed separately)
          – Pentium M (+SSE2, SSE3, µop fusion)
          – Core (PM + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion, 4 µops retired per cycle vs. 3 in previous proliferations)
  • 6. P6 Pipelining
      [Pipeline diagram. Recoverable labels:]
      • In-order front end (stages 11-17, 20-22): NextIP, I-Cache, ILD, Rotate, Dec1, Dec2, BrDec, RSWrite, RAT, IDQ
      • Single-cycle pipeline (stages 31, 32, 33, 82, 83): RSschd, RSDisp, Exec/WB
      • Multi-cycle pipeline (stages 31, 32, 33, 81, 82, ..., 83): Exec2 ... Execn
      • Non-blocking memory pipeline (stages 31, 32, 33, 42, 43, 81, 82, 83): AGU, DCache1, DCache2
      • Blocking memory pipeline (stages 31, 32, 33, 40, 41, 42, 43, 81, 82, 83): AGU, MOBblk, MOBwr, MOBdisp, DCache1, DCache2, MOBwakeup
      • Retirement (stages 91, 92, 93): Retptrwr, RetROBrd, RRFwr
      • Writeback stages: 81 = Mem/FP WB; 82 = Int WB schedule; 83 = Data WB
      • Boundaries and delays: FE in-order boundary; retirement in-order boundary; RS scheduling delay; ROB scheduling delay; MOB scheduling delay
      • Summary pipeline: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
  • 7. Instruction Fetching Unit
      • IFU1: Initiate fetch, requesting 16 bytes at a time
      • IFU2: Instruction length decoder, mark instruction boundaries, BTB makes prediction (2 cycles)
      • IFU3: Align instructions to 3 decoders in 4-1-1 format
      [Block diagram. Labels: Next PC mux, other fetch requests, select mux, linear address, streaming buffer, instruction cache, victim cache, instruction TLB (physical address), Branch Target Buffer (prediction marks), ILD (length marks), instruction rotator, instruction buffer, #bytes consumed by ID.]
  • 8. Static Branch Prediction (stage 17, Br. Dec, of slide 6)
      [Decision flowchart. Decision nodes: BTB miss? PC-relative? Conditional? Backwards? Return? Outcomes: BTB dynamic predictor's decision; unconditional PC-relative: Taken; indirect jump: Taken; Taken (further branches of the chart); Not Taken.]
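The flowchart's edges did not survive extraction, so the following Python sketch is only one plausible reading of it, assuming the usual "backward taken, forward not taken" rule for conditional branches and "taken" for every other branch type that misses the BTB. The function name and parameters are hypothetical.

```python
def static_predict(btb_hit, pc_relative, conditional, backward, btb_decision=None):
    """Return True for a 'taken' prediction, False for 'not taken'.

    Assumed reading of the slide's flowchart (edges reconstructed, not confirmed):
      - BTB hit: defer to the dynamic predictor's decision.
      - Not PC-relative (returns and indirect jumps): predict taken.
      - Unconditional PC-relative: predict taken.
      - Conditional PC-relative: taken if backward, otherwise not taken.
    """
    if btb_hit:
        return btb_decision
    if not pc_relative:
        return True
    if not conditional:
        return True
    return backward
```

Under this reading the only "Not Taken" outcome is a forward conditional branch that misses the BTB, which matches the single "Not Taken" box on the slide.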
  • 9. Dynamic Branch Prediction
      • Similar to a 2-level PAs design
      • Associated with each BTB entry
      • With a 16-entry Return Stack Buffer
      • 4 branch predictions per cycle (due to 16-byte fetch per cycle)
      • Speculative update (2 copies of BHR)
      • Static prediction provided by the Branch Address Calculator when the BTB misses (see prior slide)
      [Diagram. Labels: 512-entry BTB with ways W0-W3; Branch History Register (BHR), e.g. 1 1 0; Pattern History Tables (PHT) indexed 0000 ... 1111, each entry a 2-bit saturating counter; Prediction; Rc: branch result; speculative update; new (speculative) history 1101.]
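A minimal Python sketch of the two-level idea on this slide: a per-branch history register selects a 2-bit saturating counter in a pattern history table. The 4-bit history width is taken from the diagram's 0000..1111 PHT indices. Everything else is an assumption: the class and method names are hypothetical, the PHT is private to one entry for simplicity (a PAs design shares PHTs per set), and speculative history update is not modeled.

```python
class TwoLevelEntry:
    """One BTB entry's predictor state: a BHR plus a table of 2-bit counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.bhr = 0                           # last outcomes, newest in bit 0
        self.pht = [1] * (1 << history_bits)   # counters start weakly not-taken

    def predict(self):
        # Counter values 2 and 3 mean "taken".
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        # Train the counter selected by the current history, then shift history.
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, counter + 1) if taken else max(0, counter - 1)
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask
```

Because each history pattern gets its own counter, a short repeating pattern such as taken, taken, not-taken becomes fully predictable once every history value has been seen.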
  • 10. X86 Instruction Decode
      • 4-1-1 decoder: one complex decoder (1-4 µops) and two simple decoders (1 µop each), fed from the 16-byte instruction buffer
      • Decode rate depends on instruction alignment
      • DEC1: translate x86 instructions into micro-operations (µops)
      • DEC2: move decoded µops to the instruction decoder queue (6 µops), which feeds RAT/ALLOC
      • The micro-instruction sequencer (MS) performs translation in one of two ways
          – Generate the entire µop sequence from the microcode ROM
          – Receive 4 µops from the complex decoder, and the rest from the microcode ROM
      • Instructions that follow an instruction needing the MS are flushed
      Decode throughput by instruction mix (S: simple, C: complex):
      Next 3 inst   # inst to decode
      S,S,S         3
      S,S,C         First 2
      S,C,S         First 1
      S,C,C         First 1
      C,S,S         3
      C,S,C         First 2
      C,C,S         First 1
      C,C,C         First 1
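The decode-throughput table follows from one rule: the first slot can take any instruction, while the second and third slots accept only simple ones, so decoding stops at the first complex instruction found in a later slot. A small Python sketch of that rule (the function name is hypothetical):

```python
def decodable_count(window):
    """Count how many of the next instructions a 4-1-1 decoder takes this cycle.

    window: sequence of up to three 'S' (simple) or 'C' (complex) markers,
    in program order. Slot 0 is the complex-capable decoder.
    """
    count = 0
    for slot, kind in enumerate(window[:3]):
        if slot > 0 and kind == "C":
            break  # a complex instruction must wait for slot 0 next cycle
        count += 1
    return count
```

For example, `decodable_count("CSC")` stops before the third instruction, which corresponds to the "First 2" row of the table.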
  • 11. Register Alias Table (RAT)
      • Register renaming for 8 integer registers, 8 floating-point (stack) registers, and flags: 3 µops per cycle
      • 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
      • RAT looks up physical ROB locations for renamed sources based on the RRF bit
      • Override logic handles dependent µops decoded in the same cycle
      • A misprediction reverts all pointers to point to the Retirement Register File (RRF)
      [Block diagram. Labels: in-order queue, FP TOS adjust, FP RAT array, integer RAT array, logical Src, Int and FP overrides, array physical Src (PSrc), RAT PSrcs, physical ROB pointers from the Allocator. Renaming example: RAT entries for EAX, EBX, ECX, EDX, each with an RRF bit and a PSrc; values shown are 25, 2, ECX, 15 with RRF bits 0, 0, 0, 1.]
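A rough Python sketch of the renaming step described here, under stated assumptions: the RAT is modeled as a dictionary from logical register to either a ROB entry or the RRF, µops are `(dst, [srcs])` tuples, and processing the group sequentially stands in for the hardware override logic that handles same-cycle dependences. The names `rename_group`, `"ROB"`, and `"RRF"` are illustrative only.

```python
ROB_ENTRIES = 40  # from the slide: 40 physical registers live in the ROB

def rename_group(rat, next_rob, uops):
    """Rename up to 3 µops; return (renamed µops, next free ROB index).

    rat: dict logical register -> ("ROB", index); missing means value is in the RRF.
    uops: list of (dst, [srcs]) in program order.
    """
    renamed = []
    for dst, srcs in uops[:3]:
        # Sources read the newest mapping, including one written earlier in
        # this same group (the job of the override logic in hardware).
        psrcs = [rat.get(src, ("RRF", src)) for src in srcs]
        pdst = next_rob
        next_rob = (next_rob + 1) % ROB_ENTRIES
        rat[dst] = ("ROB", pdst)
        renamed.append((pdst, psrcs))
    return renamed, next_rob

def flush_on_mispredict(rat):
    """Point every logical register back at the RRF, as the slide describes."""
    rat.clear()
```

With an empty RAT, renaming `[("EAX", ["EBX"]), ("ECX", ["EAX"])]` gives the second µop a ROB source for EAX, while EBX is still read from the RRF.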
  • 12. Partial Stalls due to RAT
      • Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register
          – The read would need to combine partial pieces from multiple physical registers
      • Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction writes
      Partial register stall (write AX, then read EAX):
          MOV AX, m16
          ADD EAX, m32   ; stall
      Idiom fix (1):
          XOR EAX, EAX
          MOV AL, m8
          ADD EAX, m32   ; no stall
      Idiom fix (2):
          SUB EAX, EAX
          MOV AL, m8
          ADD EAX, m32   ; no stall
      Partial flag stall (1):
          CMP EAX, EBX
          INC ECX
          JBE XX         ; stall
      JBE reads both ZF and CF, while INC writes ZF, OF, SF, AF, and PF but not CF, so of the flags JBE needs INC supplies only ZF.
      Partial flag stall (2):
          TEST EBX, EBX
          LAHF           ; stall
      LAHF loads the low byte of EFLAGS, while TEST writes only some of those flags.
  • 13. Partial Register Width Renaming
      • 32/16-bit accesses:
          – Read from low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
          – Write to both banks (AH/BH/CH/DH)
      • 8-bit RAT accesses: depend on which bank is being written, and update only that bank
      • Integer RAT array
          – INT low bank (32b/16b/L): 8 entries
          – INT high bank (H): 4 entries
          – Entry fields: Size (2 bits), RRF (1 bit), PSrc (6 bits)
      • Example µops:
          µop0: MOV AL = (a)
          µop1: MOV AH = (b)
          µop2: ADD AL = (c)
          µop3: ADD AH = (d)
      [Block diagram. Labels: in-order queue, FP TOS adjust, FP RAT array, logical Src, Int and FP overrides, array physical Src, RAT physical Src, physical ROB pointers from Allocator.]
  • 14. Allocator (ALLOC)
      • The interface between the in-order and out-of-order pipelines
      • Allocates into ROB, MOB and RS
          – "3-or-none" µops per cycle into ROB and RS
              • Must have 3 free ROB entries or no allocation
          – "All-or-none" policy for MOB
              • Stall allocation when not all the valid MOB µops can be allocated
      • Generates the physical destination token Pdst from the ROB and passes it to the Register Alias Table (RAT) and RS
      • Stalls upon shortage of resources
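The two allocation policies can be sketched as a single admission check in Python. This is only an illustration: the function name, the µop kind strings, and the choice to charge one store-buffer entry per STA µop are assumptions, not details given on the slide.

```python
GROUP_SIZE = 3  # "3-or-none": the allocator works on groups of three µops

def can_allocate(free_rob, free_rs, free_load_buf, free_store_buf, group):
    """Decide whether this cycle's group of µops may be allocated.

    group: list of up to 3 µop kinds, each 'alu', 'load', 'sta', or 'std'.
    """
    # 3-or-none: ROB and RS must each have room for a full group.
    if free_rob < GROUP_SIZE or free_rs < GROUP_SIZE:
        return False
    # All-or-none for the MOB: every memory µop in the group must fit.
    loads = sum(kind == "load" for kind in group)
    stores = sum(kind == "sta" for kind in group)  # assumed: one entry per STA
    if loads > free_load_buf or stores > free_store_buf:
        return False
    return True
```

Note that a single ALU µop is still refused when only two ROB entries are free, which is the cost of the simpler 3-or-none rule.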
  • 15. Reservation Stations (RS)
      • Gateway to execution: binds up to 5 µops per cycle, one to each port
      • Port binding at dispatch time (certain µops can only be bound to one port)
      • 20-entry µop buffer bridging the in-order and out-of-order engines (32 entries in Core)
      • RS fields include µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
      • Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
      [Diagram. Labels: RS with Port 0 through Port 4; execution units IEU0, Fadd, Fmul, Imul, Div, IEU1, JEU, Pfadd, Pfmul, Pfshuf, AGU0 (load address, LDA), AGU1 (store address, STA), STD (store data); MOB; DCU (loaded data); WB bus 0 and WB bus 1; ROB and RRF (retired data).]
  • 16. ReOrder Buffer (ROB)
      • A 40-entry circular buffer (96-entry in Core)
          – 157 bits wide
          – Provides 40 alias physical registers
      • Out-of-order completion
      • Deposits exceptions in each entry
      • Retirement (or de-allocation)
          – After resolving prior speculation
          – Handles exceptions through the MS
          – Clears OOO state when a mispredicted branch or exception is detected
          – 3 µops per cycle in program order
          – For multi-µop x86 instructions: none or all (atomic)
      [Diagram. Labels: ALLOC, RAT, RS, ROB, RRF, MS (exception, µcode assist).]
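In-order retirement from a buffer that completes out of order can be sketched in a few lines of Python. This is a simplified illustration: the ROB is modeled as a `collections.deque` of dictionaries with hypothetical `id` and `done` fields, and neither exception handling nor the all-or-none rule for multi-µop instructions is modeled.

```python
from collections import deque

RETIRE_WIDTH = 3  # from the slide: 3 µops per cycle, in program order

def retire_cycle(rob, width=RETIRE_WIDTH):
    """Retire completed µops from the head of the ROB for one cycle.

    rob: deque of {'id': int, 'done': bool}, oldest entry at the left.
    Returns the ids retired this cycle.
    """
    retired = []
    # Stop at the first unfinished µop: younger completed µops must wait.
    while rob and len(retired) < width and rob[0]["done"]:
        retired.append(rob.popleft()["id"])
    return retired
```

A completed µop that sits behind an unfinished one is not retired, which is what keeps architectural state updates in program order.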
  • 17. Memory Execution Cluster
      • Manages data memory accesses
      • Address translation
      • Detects violations of access ordering
      • Fill buffers (FB) in the DCU, similar to MSHRs, for non-blocking cache support
      Example:
          movl ecx, edi
          addl ecx, 8
          movl -4(edi), ebx
          movl eax, 4(ecx)
      The RS tracks only register dependences, so it cannot detect a possible dependence through memory between these accesses and could dispatch them at the same time.
      [Diagram. Labels: RS / ROB issuing LD, STA, STD; memory cluster with DTLB, DCU, FB, load buffer, store buffer, EBL.]
  • 18. Memory Order Buffer (MOB)
      • Allocated by ALLOC
      • A second-order RS for memory operations
      • 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
      • MOB
          – 16-entry load buffer (LB) (32-entry in Core, 64 in Sandy Bridge)
          – 12-entry store address buffer (SAB) (20-entry in Core, 36 in Sandy Bridge)
          – SAB works in unison with
              • Store data buffer (SDB) in MIU
              • Physical Address Buffer (PAB) in DCU
          – Store Buffer (SB): SAB + SDB + PAB
      • Senior stores
          – Upon STD/STA retirement from the ROB
          – SB marks the store "senior"
          – Senior stores are committed back to memory in program order when the bus is idle or the SB is full
      • Prefetch instructions in P-III
          – Senior load behavior
          – Due to no explicit architectural destination
      • New memory dependency predictor in Core to predict store-to-load dependencies
  • 19. Store Coloring
      • ALLOC assigns Store Buffer IDs (SBIDs) in program order
      • ALLOC tags loads with the most recent SBID
      • Check loads against stores with equal or older SBIDs for potential address conflicts
      • SDB forwards data if a conflict is detected
      x86 instruction        µops          store color
      mov (0x1220), ebx      std ebx       2
                             sta 0x1220    2
      mov (0x1110), eax      std eax       3
                             sta 0x1100    3
      mov ecx, (0x1220)      ld 0x1220     3
      mov edx, (0x1280)      ld 0x1280     3
      mov (0x1400), edx      std edx       4
                             sta 0x1400    4
      mov edx, (0x1380)      ld 0x1380     4
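A minimal Python sketch of store coloring, assuming a simplified model: each store gets the next SBID, each load inherits the SBID of the most recent store, and a load is compared only against stores that precede it in program order, meaning stores whose SBID is at most the load's color. The function names and the starting SBID are hypothetical, SBID wrap-around is ignored, and the second store uses the address from the instruction operand (0x1110), since the slide shows two different values for it.

```python
def color_uops(ops, first_sbid=2):
    """Assign store colors. ops: list of ('st' | 'ld', address) in program order."""
    colored = []
    last_sbid = first_sbid - 1
    for kind, addr in ops:
        if kind == "st":
            last_sbid += 1          # each store gets a fresh SBID
        colored.append((kind, addr, last_sbid))
    return colored

def forwarding_store(store_buffer, load_addr, load_color):
    """Return the SBID of the youngest older store to the same address, or None.

    store_buffer: dict SBID -> store address.
    """
    matches = [sbid for sbid, addr in store_buffer.items()
               if sbid <= load_color and addr == load_addr]
    return max(matches, default=None)

# The slide's instruction sequence.
ops = [("st", 0x1220), ("st", 0x1110), ("ld", 0x1220),
       ("ld", 0x1280), ("st", 0x1400), ("ld", 0x1380)]
colored = color_uops(ops)
store_buffer = {color: addr for kind, addr, color in colored if kind == "st"}
```

In this model the load from 0x1220 carries color 3 and matches the store with SBID 2, so that store's data would be forwarded; the other two loads find no conflicting store.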
  • 20. Memory Type Range Registers (MTRR)
      • Control registers written by the system (OS)
      • Supported memory types
          – UnCacheable (UC)
          – Uncacheable Speculative Write-Combining (USWC or WC)
              • Uses a fill buffer entry as the WC buffer
          – WriteBack (WB)
          – Write-Through (WT)
          – Write-Protected (WP)
              • E.g. supports copy-on-write in UNIX: saves memory by letting child processes share pages with their parents, creating new memory pages only when a child process attempts to write
      • Page Miss Handler (PMH)
          – Looks up the MTRRs while supplying physical addresses
          – Returns memory types and physical address to the DTLB
  • 21. Intel NetBurst Microarchitecture
      • Pentium 4's microarchitecture
      • Original target market: graphics workstations, but …
      • Design goals:
          – Performance, performance, performance, …
          – Unprecedented multimedia/floating-point performance
              • Streaming SIMD Extensions 2 (SSE2)
              • SSE3 introduced in Prescott Pentium 4 (90nm)
          – Reduced CPI
              • Low-latency instructions
              • High-bandwidth instruction fetching
              • Rapid execution of arithmetic and logic operations
          – Reduced clock period
              • New pipeline designed for scalability
  • 22. Innovations Beyond P6
      • Hyperpipelined technology
      • Streaming SIMD Extensions 2
      • Hyper-Threading Technology (HT)
      • Execution trace cache
      • Rapid execution engine
      • Staggered adder unit
      • Enhanced branch predictor
      • Indirect branch predictor (also in Banias Pentium M)
      • Load speculation and replay
  • 23. Pentium 4 Fact Sheet
      • IA-32, fully backward compatible
      • Available at speeds ranging from 1.3 to ~3.8 GHz
      • Hyperpipelined (20+ stages)
      • 125 million transistors in Prescott (1.328 billion in Tulsa, 65nm, with 16MB on-die L3)
      • 0.18 µm for 1.3 to 2 GHz; 0.13 µm for 1.8 to 3.4 GHz; 90nm for 2.8 to 3.6 GHz
      • Die size of 122 mm² (Prescott, 90nm), 435 mm² (Tulsa, 65nm)
      • Consumes 115 watts of power at 3.6 GHz
      • 1066 MHz system bus
      • Prescott L1: 16KB, 8-way vs. previous P4's 8KB, 4-way
      • 1MB, 512KB or 256KB 8-way full-speed on-die L2 (bandwidth example: 89.6 GB/s to L1 at 2.8 GHz)
      • 2MB L3 cache (in P4 HT Extreme Edition, 0.13 µm only), 16MB in Tulsa
      • 144 new 128-bit SIMD instructions (SSE2), and 13 SSE3 instructions in Prescott
      • Hyper-Threading Technology (not in all versions)
  • 24. 24 Building Blocks of NetBurst • [Figure: block diagram] • System bus and Bus Unit • Memory subsystem: Level 2 Cache, L1 Data Cache • Front-end: Fetch/Decode, Execution Trace Cache (ETC), μROM, BTB / Branch Predictor • Out-of-Order Engine: OOO logic, Retire • Execution Units: INT and FP • Branch history update fed back to the front-end
  • 25. 25 Pentium 4 Microarchitecture (Prescott) • [Figure: block diagram] • Front-end: BTB (4k entries), I-TLB/Prefetcher, IA32 Decoder, Execution Trace Cache (12K µops), Trace Cache BTB (2k entries), µCode ROM, µop Queue • Allocator / Register Renamer • Queues: INT/FP µop Queue, Memory µop Queue • Schedulers: Memory scheduler, Fast, Slow/General FP scheduler, Simple FP • INT Register File / Bypass Network; FP RF / Bypass Network • Execution units: AGU (Ld addr), AGU (St addr), 2x ALU (simple inst.), 2x ALU (simple inst.), Slow ALU (complex inst.), FP/MMX/SSE/2/3, FP Move • L1 Data Cache (16KB 8-way, 64-byte line, WT, 1 rd + 1 wr port) • U-L2 Cache: 1MB 8-way, 128B line, WB, 256-bit path to L1, 108 GB/s • BIU: 64-bit system bus, quad-pumped 800 MHz, 6.4 GB/sec
  • 26. 26 Pipeline Depth Evolution • [Figure: pipeline stage diagrams] • P5 Microarchitecture: PREF, DEC, DEC, EXEC, WB • P6 Microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2 • NetBurst Microarchitecture (Willamette), 20 stages: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive • NetBurst Microarchitecture (Prescott): > 30 stages
  • 27. 27 Execution Trace Cache • Primary first-level I-cache, replacing the conventional L1 I-cache – Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages – Branch misprediction penalty is considerable • Advantages – Caches post-decode µops (think about the fill unit) – High-bandwidth instruction fetching – Eliminates x86 decoding overheads – Reduces branch recovery time if the TC hits • Holds up to 12,000 µops – 6 µops per trace line – Many (?) trace lines in a single trace
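As a rough illustration of the capacity numbers above, the sketch below packs a decoded µop stream into 6-µop trace lines and counts how many lines the cache would hold. It assumes "12K" means 12 × 1024 µops and ignores the rules a real fill unit uses to end a line early (for example at certain branches); `pack_into_trace_lines` is a hypothetical helper, not Intel's algorithm.

```python
UOPS_PER_LINE = 6  # from the slide


def pack_into_trace_lines(uops, line_size=UOPS_PER_LINE):
    """Split a decoded µop stream into fixed-size trace lines.

    The last line may be partially filled.
    """
    return [uops[i:i + line_size] for i in range(0, len(uops), line_size)]


# Assumption: "12K µops" means 12 * 1024 µops of capacity.
capacity_uops = 12 * 1024
num_lines = capacity_uops // UOPS_PER_LINE

# A 14-µop stream occupies three lines: 6 + 6 + 2.
trace = pack_into_trace_lines([f"uop{i}" for i in range(14)])
print(num_lines, [len(line) for line in trace])
```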
  • 28. 28 Execution Trace Cache • Delivers 3 µops per cycle to the OOO engine if branch prediction is good • x86 instructions are read from L2 when the TC misses (7+ cycle latency) • TC hit rate is comparable to that of an 8KB to 16KB conventional I-cache • Simplified x86 decoder – Only one complex instruction per cycle – Instructions of more than 4 µops are executed from the microcode ROM (P6’s MS) • Performs branch prediction in the TC – 512-entry BTB + 16-entry RAS – Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6 – Intel did not disclose the details of the BP algorithms used in the TC and x86 IFU (Dynamic + Static)
  • 29. 29 Out-Of-Order Engine • Similar design philosophy to P6; uses – Allocator – Register Alias Table – 128 physical registers – 126-entry ReOrder Buffer – 48-entry load buffer – 24-entry store buffer
  • 30. 30 Register Renaming Schemes • [Figure: two renaming schemes side by side] • P6 Register Renaming: the RAT maps EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP to entries of the ROB (40-entry), which holds both Data and Status and is allocated sequentially; retired values live in the RRF • NetBurst Register Renaming: a Front-end RAT and a Retirement RAT map EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP to entries of the RF (128-entry), which holds Data; the ROB (126) holds only Status and is allocated sequentially
  • 31. 31 Micro-op Scheduling • µop FIFO queues – Memory queue for loads and stores – Non-memory queue • µop schedulers – Several schedulers fire instructions from the 2 µop queues to execution (the counterpart of P6’s RS) – 4 distinct dispatch ports – Maximum dispatch: 6 µops per cycle (2 fast-ALU µops each from Ports 0 and 1, plus 1 each from the load and store ports) • Exec Port 0 – Fast ALU (2x pumped): add/sub, logic, store data, branches – FP Move: FP/SSE move, FP/SSE store, FXCH • Exec Port 1 – Fast ALU (2x pumped): add/sub – INT Exec: shift, rotate – FP Exec: FP/SSE add, FP/SSE mul, FP/SSE div, MMX • Load Port – Memory Load: loads, LEA, prefetch • Store Port – Memory Store: stores
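The 6-µops-per-cycle peak follows directly from the port capacities. The sketch below is a simplified model, assuming each execution port can accept two µops per core clock only because its fast ALU is double-pumped; it does not distinguish fast-ALU µops from the other units on the same port, and the names `PORT_CAPACITY` and `dispatch` are illustrative.

```python
# Per-cycle dispatch capacity of each port, as read from the slide.
PORT_CAPACITY = {
    "exec_port_0": 2,  # fast ALU is double-pumped: up to 2 µops per core clock
    "exec_port_1": 2,  # same for the second fast ALU
    "load_port": 1,
    "store_port": 1,
}


def dispatch(ready):
    """Given the number of ready µops per port, return how many go this cycle."""
    return {port: min(ready.get(port, 0), cap)
            for port, cap in PORT_CAPACITY.items()}


max_dispatch = sum(PORT_CAPACITY.values())
issued = dispatch({"exec_port_0": 3, "exec_port_1": 1, "load_port": 2})
print(max_dispatch, issued)
```

In this example one µop on Port 0 and one load have to wait for the next cycle, because their ports are already saturated.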
  • 32. 32 Data Memory Accesses • Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line • Load-to-use speculation – Dependent instructions are dispatched before the load finishes • Needed because of the high frequency and deep pipeline • The path from load scheduler to execution takes longer than the execution itself – Scheduler assumes loads always hit L1 – On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data (mis-speculation) – Replay logic • Re-executes the load when mis-speculated • Mis-speculated operations are placed into a replay queue to be re-dispatched – All trailing independent instructions are allowed to proceed – Tornado breaker • Up to 4 outstanding load misses (= 4 fill buffers in original P6) • Store-to-load forwarding buffer – 24 entries – Store and load must have the same starting physical address – Load data size <= store data size
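The two store-to-load forwarding conditions on this slide (same starting physical address, load no wider than the store) can be written down as a small check. This is a toy model under stated assumptions: it searches youngest-first, ignores partially overlapping stores at other addresses, and silently drops the oldest store when full, whereas real hardware would stall allocation. The class and method names are hypothetical.

```python
from collections import deque


class StoreForwardingBuffer:
    """Toy store buffer; the 24-entry size comes from the slide."""

    def __init__(self, capacity=24):
        # Assumption: the oldest entry is simply dropped when the buffer is full.
        self.entries = deque(maxlen=capacity)

    def store(self, addr, size, data: bytes):
        if len(data) != size:
            raise ValueError("data length must equal store size")
        self.entries.append((addr, size, data))

    def try_forward(self, addr, size):
        """Return forwarded bytes, or None if the load must read the cache."""
        for st_addr, st_size, st_data in reversed(self.entries):  # youngest first
            if st_addr == addr:
                # Forward only if the load is no wider than the store.
                return st_data[:size] if size <= st_size else None
        return None


buf = StoreForwardingBuffer()
buf.store(0x1000, 4, b"\x11\x22\x33\x44")
print(buf.try_forward(0x1000, 2))  # same address, narrower load
print(buf.try_forward(0x1000, 8))  # load wider than the store
print(buf.try_forward(0x1002, 2))  # different starting address
```

Only the first load is forwarded; the other two fall back to the cache in this model.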
  • 33. 33 Fast Staggered ALU • For frequent ALU instructions (no multiply, no shift, no rotate, no branch processing) • Double-pumped clock • Each operation finishes in 3 fast cycles – Lower-order 16 bits and bypass – Higher-order 16 bits and bypass – ALU flags generation • [Figure: three staggered stages: Bit[15:0], Bit[31:16], Flags]
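The three fast cycles of the staggered adder can be mimicked in software. The sketch below is a functional model only, assuming a 32-bit add whose low half is computed first, whose carry then feeds the high half, and whose flags come last; the flag set (CF, ZF, SF) and the name `staggered_add` are choices made for illustration.

```python
MASK16 = 0xFFFF


def staggered_add(a: int, b: int):
    """32-bit add done as two 16-bit halves plus a flags step."""
    # Fast cycle 1: lower-order 16 bits (result can be bypassed to a dependent op).
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16
    lo &= MASK16

    # Fast cycle 2: higher-order 16 bits, consuming the carry from cycle 1.
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    carry_out = hi >> 16
    hi &= MASK16

    # Fast cycle 3: flags generation on the full 32-bit result.
    result = (hi << 16) | lo
    flags = {"CF": bool(carry_out), "ZF": result == 0, "SF": bool(result >> 31)}
    return result, flags


print(staggered_add(0x0000FFFF, 1))
print(staggered_add(0xFFFFFFFF, 1))
```

Because the low half is available after the first fast cycle, a dependent add can start its own low half one fast cycle later, which is what makes back-to-back dependent ALU operations cheap.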
  • 34. 34 Branch Predictor • P4 uses the same hybrid predictor as Pentium M • [Figure: Bimodal, Local and Global predictors produce Pred_B, Pred_L and Pred_G; two MUXes controlled by L_hit and G_hit select the final prediction]
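The two-MUX selection in the figure can be sketched as a priority chain. The ordering used here (a global hit overrides a local hit, which overrides the bimodal default) is an assumption read off the mux labels L_hit and G_hit; the slide does not state the priority explicitly, and `hybrid_predict` is a hypothetical name.

```python
def hybrid_predict(pred_b: bool, pred_l: bool, l_hit: bool,
                   pred_g: bool, g_hit: bool) -> bool:
    """Select the final taken/not-taken prediction from three predictors."""
    # First MUX: the local predictor overrides the bimodal one on a hit.
    chosen = pred_l if l_hit else pred_b
    # Second MUX (assumed priority): the global predictor overrides on a hit.
    return pred_g if g_hit else chosen


# Bimodal says not-taken, local hits with taken, global misses.
print(hybrid_predict(pred_b=False, pred_l=True, l_hit=True,
                     pred_g=False, g_hit=False))
```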
  • 35. 35 Indirect Branch Predictor • In Pentium M and Prescott Pentium 4 • Prediction based on global history
  • 36. 36 New Instructions over Pentium • CMOVcc / FCMOVcc r, r/m – Conditional move (predicated move) instructions – Based on the condition code (cc) • FCOMI/P: compare FP stack and set integer flags • RDPMC/RDTSC instructions – PMC: P6 has 2, NetBurst (P4) has 18 • Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
  • 37. 37 New Instructions • SSE2 in Pentium 4 (not in the P6 microarchitecture) – Double-precision SIMD FP • SSSE3 in Core 2 – Supplemental instructions for shuffle, align, add, subtract • Intel 64 (EM64T) – 64-bit support, new registers (8 more on top of 8) – In Celeron D, Core 2 (and P4 Prescott, Pentium D) – Almost compatible with AMD64 – AMD’s NX bit or Intel’s XD bit for preventing buffer overflow attacks
  • 38. 38 Streaming SIMD Extension 2 • P-III SSE (Katmai New Instructions: KNI) – Eight 128-bit wide xmm registers (new architectural state) – Single-precision 128-bit SIMD FP • Four 32-bit FP operations in one instruction • Broken down into 2 µops for execution (only 80-bit data in ROB) – 64-bit SIMD MMX (uses 8 mm registers, mapped to the FP stack) – Prefetch (nta, t0, t1, t2) and sfence • P4 SSE2 (Willamette New Instructions: WNI) – Supports double-precision 128-bit SIMD FP • Two 64-bit FP operations in one instruction • Throughput: 2 cycles for most SSE2 operations (exceptions include DIVPD and SQRTPD: 69 cycles, non-pipelined) – Enhanced 128-bit SIMD MMX using xmm registers
  • 39. 39 Examples of Using SSE • [Figure: xmm1 = {X3, X2, X1, X0}, xmm2 = {Y3, Y2, Y1, Y0}, elements listed high to low] • Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = {X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0} • Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = {X3, X2, X1, X0 op Y0} • Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = {Y3..Y0, Y3..Y0, X3..X0, X3..X0} • Shuffle example (SHUFPS xmm1, xmm2, 0xf1): xmm1 = {Y3, Y3, X0, X1}
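The three single-precision patterns on this slide can be modelled on plain Python lists. In the sketch below index 0 is the lowest element (X0 or Y0), so a list reads low to high while the slide draws registers high to low; the function names mirror the mnemonics but are ordinary Python helpers, not intrinsics.

```python
def addps(x, y):
    """Packed add: all four lanes are combined element-wise."""
    return [a + b for a, b in zip(x, y)]


def addss(x, y):
    """Scalar add: only lane 0 is combined; lanes 1..3 come from x unchanged."""
    return [x[0] + y[0]] + x[1:]


def shufps(x, y, imm8):
    """Shuffle: the two low lanes select from x, the two high lanes from y."""
    return [
        x[imm8 & 3],         # bits 1:0 pick the new lane 0 from x
        x[(imm8 >> 2) & 3],  # bits 3:2 pick the new lane 1 from x
        y[(imm8 >> 4) & 3],  # bits 5:4 pick the new lane 2 from y
        y[(imm8 >> 6) & 3],  # bits 7:6 pick the new lane 3 from y
    ]


X = ["X0", "X1", "X2", "X3"]
Y = ["Y0", "Y1", "Y2", "Y3"]
print(shufps(X, Y, 0xF1))
print(addps([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
print(addss([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

With immediate 0xf1 the shuffle yields X1, X0, Y3, Y3 from low to high, which is the {Y3, Y3, X0, X1} result shown on the slide when read high to low.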
  • 40. 40 Examples of Using SSE and SSE2 • [Figure: SSE (four SP lanes, xmm1 = {X3, X2, X1, X0}, xmm2 = {Y3, Y2, Y1, Y0}) next to SSE2 (two DP lanes, xmm1 = {X1, X0}, xmm2 = {Y1, Y0}), elements listed high to low] • SSE – Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = {X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0} – Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = {X3, X2, X1, X0 op Y0} – Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = {Y3..Y0, Y3..Y0, X3..X0, X3..X0}; with 0xf1: xmm1 = {Y3, Y3, X0, X1} • SSE2 – Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 = {X1 op Y1, X0 op Y0} – Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 = {X1, X0 op Y0} – Shuffle DP operation with 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 = {Y1 or Y0, X1 or X0}
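The double-precision variants follow the same pattern with two lanes instead of four. A minimal sketch, again using lists with index 0 as the low element and hypothetical helper names modelled on the mnemonics:

```python
def addpd(x, y):
    """Packed double-precision add: both 64-bit lanes are combined."""
    return [x[0] + y[0], x[1] + y[1]]


def addsd(x, y):
    """Scalar double-precision add: only lane 0 is combined."""
    return [x[0] + y[0], x[1]]


def shufpd(x, y, imm2):
    """Shuffle: bit 0 picks the low lane from x, bit 1 picks the high lane from y."""
    return [x[imm2 & 1], y[(imm2 >> 1) & 1]]


print(addpd([1.5, 2.5], [10.0, 20.0]))
print(addsd([1.5, 2.5], [10.0, 20.0]))
print(shufpd(["X0", "X1"], ["Y0", "Y1"], 0b01))
```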
  • 41. 41 HyperThreading • Intel Xeon Processor and Intel Xeon MP Processor • Enables Simultaneous Multi-Threading (SMT) – Exploits ILP through TLP (Thread-Level Parallelism) – Issues and executes instructions from multiple threads in the same cycle • A single P4 with HT appears as 2 logical processors • The logical processors share the same execution resources – dTLB is shared, tagged with the logical processor ID – Some other shared resources are partitioned (next slide) • Architectural state and some microarchitectural state are duplicated – IPs, iTLB, streaming buffer – Architectural register file – Return stack buffer – Branch history buffer – Register Alias Table
  • 42. 42 Multithreading (MT) Paradigms • [Figure: occupancy of functional units FU1 to FU4 over execution time, shaded by Thread 1 to Thread 5 or Unused, one panel per paradigm] • Conventional Superscalar, Single Threaded • Fine-grained Multithreading (cycle-by-cycle interleaving) • Coarse-grained Multithreading (block interleaving) • Chip Multiprocessor (CMP), today called Multi-Core Processors • Simultaneous Multithreading (or Intel’s HT)
  • 43. 43 HyperThreading Resource Partitioning • TC (or UROM) is accessed alternately, one cycle per logical processor, unless one is stalled due to a TC miss • µop queue (split in ½) after fetch from the TC • ROB (126/2) • LB (48/2) • SB (24/2) (32/2 for Prescott) • General µop queue and memory µop queue (1/2) • TLB (½?) as there is no PID • Retirement: alternates between the 2 logical processors
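The static split and the alternating trace-cache access can be sketched in a few lines. This is an illustration under assumptions: the buffers are divided evenly between two logical processors, and a stalled logical processor simply gives up its turn; `per_thread_share` and `next_tc_thread` are hypothetical names, and real arbitration is more involved.

```python
# Total entries of the partitioned structures, from the slide.
SHARED = {"ROB": 126, "load_buffer": 48, "store_buffer": 24}


def per_thread_share(total, threads=2):
    """Entries each logical processor gets under an even static split."""
    return total // threads


def next_tc_thread(last, stalled):
    """Pick the logical processor (0 or 1) that accesses the TC this cycle.

    Alternates between the two unless the other one is stalled on a TC miss.
    Returns None if both are stalled.
    """
    other = 1 - last
    if other not in stalled:
        return other
    if last not in stalled:
        return last
    return None


shares = {name: per_thread_share(total) for name, total in SHARED.items()}
print(shares)
print(next_tc_thread(last=0, stalled=set()))  # normal alternation
print(next_tc_thread(last=0, stalled={1}))    # thread 1 stalled, thread 0 goes again
```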

Editor's notes

  1. Latest DDR2-667: 10.7 GB/sec • PCI Express x16: 8 GB/sec (AGP 8x: 2.1 GB/s) • FSB: 8.5 GB/s (1067 MHz × 8 bytes)