1. IBM Power Systems
© 2008 IBM Corporation
SMT Verification of the POWER5 and POWER6
High-Performance Processors
John Ludden
Senior Technical Staff Member
Hardware Verification
IBM Systems & Technology Group
2. IBM System p
Introduction to Simultaneous Multi-Threading (SMT)
1. What is a multi-threaded processor?
• Essentially a processor core that executes multiple instruction streams simultaneously
• Each thread appears to software as a “virtual” processor core
2. What are the advantages of SMT?
• More efficient utilization of silicon real estate and power: small die size increase compared to adding another core
• Increased system throughput by utilizing processor resources that would otherwise be idle
3. What are the disadvantages of SMT?
• Increased complexity -> makes verification state space MUCH larger
• SMT verification much harder than SMP
• Possibly degrades performance of some applications
3. IBM System p
Examples of SMT microprocessors
1. Video game systems:
• Sony PlayStation 3: IBM Cell processor
• Xbox 360: IBM Xenon processor
2. Personal computers:
• Intel Pentium 4 Hyper-Threading (HT) processors
3. Servers:
• Sun UltraSPARC systems: T1 (4 threads) and T2 (8 threads)
• HP Superdome systems: Intel Itanium 2
• IBM Power Systems: POWER5 and POWER6 processors
4. IBM System p
Overview
1. Context: POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
5. IBM System p
IBM POWER systems: consistent predictable delivery
[Figure: processor roadmap timeline: POWER4 (2001), POWER4+ (2003), POWER5 (2004), POWER5+ (2006), POWER6 (2007)]
6. IBM System p
[Figure: POWER5 and POWER6 chip block diagrams]
POWER5 chip
– Two high-frequency POWER5 SMT2 cores
– ~2 MB L2
– 36 MB L3 controller, with a 36 MB L3 chip
– SMP interconnect fabric
– Memory controller, with buffer chips
POWER6 chip
– Two ultra-high-frequency POWER6 SMT2 cores
– 4 MB L2 per core
– 32 MB L3 controller, with 32 MB L3 chip(s)
– SMP interconnect fabric
– Two memory controllers, each with buffer chips
7. IBM System p
POWER5 Pipeline
[Figure: POWER5 pipeline diagram. Instruction fetch stages (IF, IC, BP) feed group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), followed by out-of-order processing in four pipelines: BR (MP, ISS, RF, EX, WB, Xfer), LD/ST (MP, ISS, RF, EA, DC, Fmt, WB, Xfer), FX (MP, ISS, RF, EX, WB, Xfer), and FP (MP, ISS, RF, six F6 stages, WB, Xfer), ending in completion (CP). Branch redirects, interrupts, and flushes feed back to the front of the pipeline.]
[Figure: POWER5 SMT resource diagram. Legend: shared by two threads; resource used by thread 0; resource used by thread 1. Labeled blocks: program counter, branch prediction, branch history tables, return stack, alternate, target cache, instruction cache, instruction translation, instruction buffer 0, instruction buffer 1, thread priority, dynamic instruction selection, group formation / instruction decode / dispatch, shared register mappers, shared issue queues, read shared register files, shared execution units (LSU0, LSU1, FXU0, FXU1, FPU0, FPU1, BXU, CRL), write shared register files, data cache, data translation, store queue, L2 cache, group completion.]
8. IBM System p
High-end server: New POWER6 microprocessor
Topology
– Two cores on chip, a 2-way SMP
– Core private L1s (64KB I, 64KB D)
– Superscalar, SMT cores
– Chip private 8 MB L2 cache
– L3 32 MB off chip
– Two-tier SMP fabric
Technology
– 65 nm SOI
– 341 mm² die size
– 10 Layers of metal
– 790 million transistors on chip
– Frequency : 3.5, 4.2, 4.7, 5.0 GHz
Custom & semi-custom design style
– High frequency constraints
3.3 M Lines of VHDL
9. IBM System p
POWER6 core pipeline
[Figure: POWER6 core pipeline diagram showing the instruction fetch pipeline, instruction dispatch pipeline, BR/FX/Load pipeline (BR/CR, FX, LOAD), floating point pipeline, and check point recovery pipeline.
Stage labels: P1, P2, P3, P4, IFAR, BHT, IC0, IC1, ROT, IB0, IB1, PD, DISP, ISS, AG, FMT, RF, EX, EX1 through EX7, DC0, DC1, ECC.
Legend: pre-decode stage; ifetch/branch stage; delayed/transmit stage; instruction decode stage; instruction dispatch/issue stage; operand access/execution stage; cache access stage; write back stage; completion stage; check point stage; FX result bypass; load result bypass; float result bypass.]
10. IBM System p
POWER6 core
POWER6 processor is ~2X frequency of POWER5 (4 – 5 GHz)
POWER6 instruction pipeline depth equivalent to POWER5
– Minimize power
– Scale performance with frequency
[Figure: pipeline depth comparison with stage groups instruction fetch, instruction buffer/decode, instruction dispatch/issue, and data fetch/execute; annotated with FXU-dependent execution and load-dependent execution]
POWER6 extends functionality of POWER5 core
– 64K I cache, 64K D cache, 2 FXU, 2 Binary FPU, 1 branch execution unit
– Two-way SMT with up to 7 instructions dispatched per cycle from 2 threads (maximum of 5 instructions per thread)
– Decimal Floating Point Unit
– VMX Unit (PowerPC’s SIMD ISA)
– Recovery Unit
[Figure annotations: ~6 ns/instr, ~3 ns/instr]
11. IBM System p
Bullet-proof computing
System reliability with recovery unit
– Every measure possible taken to preserve application execution
– Retry soft errors
– Change hardware for hard errors
[Figure: recovery flowchart]
– Processor architected state checkpointed every cycle
– ECC and non-ECC protected circuitry checked every cycle
– No error found: continue
– Error found, soft error case: processor restarts from last saved checkpoint
– Error found, hard error case: processor workload moved to another CPU
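The checkpoint-and-retry flow on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the POWER6 recovery unit: the `Core` class, the `run` function, the single-counter "architected state", and the rule that an error persisting past `max_retries` is treated as a hard error are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    """Toy core: the architected state is a single counter (hypothetical)."""
    state: int = 0
    checkpoint: int = 0
    log: list = field(default_factory=list)

def run(core, cycles, soft_errors=(), hard_errors=(), max_retries=3):
    """Run `cycles` of work, checkpointing after every error-free cycle.

    soft_errors: cycles at which a transient fault occurs once.
    hard_errors: cycles at which the fault repeats on every retry.
    Returns "done", or "moved" when the workload must go to another CPU.
    """
    pending_soft = set(soft_errors)
    cycle = 0
    retries = 0
    while cycle < cycles:
        core.state = core.checkpoint + 1        # one cycle of work
        faulty = cycle in hard_errors or cycle in pending_soft
        pending_soft.discard(cycle)             # a soft fault does not repeat
        if not faulty:
            core.checkpoint = core.state        # no error found: checkpoint
            cycle += 1
            retries = 0
            continue
        core.state = core.checkpoint            # error found: restart from checkpoint
        retries += 1
        core.log.append(("retry", cycle))
        if retries > max_retries:               # assumption: persistent means hard
            core.log.append(("moved", cycle))
            return "moved"
    return "done"
```

With a soft error injected at cycle 2, the run retries that cycle once and still finishes with the same final state as an error-free run. A hard error exhausts the retries and hands the workload off.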
12. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
13. IBM System p
POWER4/5/6 RTL verification technology
[Figure: RTL verification tool flow]
– RTL (VHDL, Verilog): language compile and model build produce a cycle-based model
– Physical VLSI design tools / custom design
– Formal verification: Boolean equivalence check (Verity)
– Software simulator (MESA)
– Hardware accelerator (Awan)
– Test program generator (GPRO, X-Gen)
– C++ testbench: driver/checker
– Constraint random unit testbench
– Assertions (PSL et al.)
– (Semi) formal verification (SixthSense, RuleBase)
14. IBM System p
Single threaded uniprocessor verification for POWER4
Unit level: methodology inherited from POWER4
– Driven by a combination of instruction level test cases (AVPs) created by the Genesys-Pro
(GPRO) pseudo-random test generator and random C++ driven irritation
– Instruction-By-Instruction (IBI) checking against AVP results
– Low level microarchitecture checkers written in C++
Processor core (aka “core”) level
– Mixture of GPRO pseudo-random and directed random instruction level test cases
– IBI checking against AVP results
– Low level microarchitecture checkers written in C++
– Irritation from random C++ drivers
– Highly deterministic and architected state easily verifiable against test
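Instruction-by-instruction (IBI) checking can be illustrated as follows. This is a minimal sketch, not the GPRO or AVP format: the two-instruction toy ISA and the names `execute`, `build_expected`, and `ibi_check` are hypothetical. The test generator's role is played by `build_expected`, which records the expected architected state after each instruction. The checker then compares the design under test against those results as each instruction completes.

```python
def execute(regs, instr):
    """Toy reference semantics for two hypothetical instructions."""
    op, dst, a, b = instr
    if op == "addi":              # dst = regs[a] + immediate b
        regs[dst] = regs[a] + b
    elif op == "add":             # dst = regs[a] + regs[b]
        regs[dst] = regs[a] + regs[b]
    else:
        raise ValueError(f"unknown op {op}")

def build_expected(program, init_regs):
    """Record the expected register state after every instruction."""
    regs = dict(init_regs)
    expected = []
    for instr in program:
        execute(regs, instr)
        expected.append(dict(regs))
    return expected

def ibi_check(program, init_regs, expected, dut_execute):
    """Compare the design under test to the expected results per instruction.

    Returns the index of the first mismatching instruction, or None.
    """
    regs = dict(init_regs)
    for i, instr in enumerate(program):
        dut_execute(regs, instr)
        if regs != expected[i]:
            return i
    return None
```

Because a single-threaded stream is deterministic, the first mismatch points directly at the failing instruction, which is what makes this style of checking attractive at unit and core level.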
15. IBM System p
Symmetric multi-processor (SMP) verification for POWER4
Chip (dual-core) level
– Test generation similar to uniprocessor via GPRO for false-sharing
or non-sharing tests
• IBI checking against AVP results for two independent instruction streams
contained within a single test
• Low level microarchitecture checkers written in C++
• L1/L2 interactions primary focus
– True-sharing scenarios, lock testing and storage access (“weak”)
ordering checked
• GPRO employed, but…
– IBI checking of these accesses is limited or not possible:
› Non-unique or non-deterministic results
› CML (architecture level coherency monitor) employed to detect
the “right answer” as a post-simulation rule check
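When two streams truly share a location, a load can legitimately return more than one value, so there is no single expected result to compare against per instruction. A post-simulation rule check instead validates the recorded trace against rules. The sketch below is not CML. It shows one deliberately weak, hypothetical rule over a hypothetical trace format, and it assumes events are recorded in the order they became globally visible.

```python
def check_loads(trace, initial):
    """Post-simulation rule check over a list of memory events.

    trace:   list of (thread, kind, addr, value) in observed order,
             where kind is "st" or "ld".
    initial: dict mapping addr to its initial value.
    Rule (deliberately weak): a load must return the initial value or a
    value that some thread stored to the same address earlier in the trace.
    Returns the list of violating trace indices.
    """
    written = {}                  # addr -> set of values stored so far
    violations = []
    for i, (thread, kind, addr, value) in enumerate(trace):
        if kind == "st":
            written.setdefault(addr, set()).add(value)
        elif kind == "ld":
            legal = {initial.get(addr, 0)} | written.get(addr, set())
            if value not in legal:
                violations.append(i)
        else:
            raise ValueError(f"unknown event kind {kind}")
    return violations
```

A real coherency monitor would apply much stricter ordering rules. The point here is only that the check runs over the whole trace after simulation and accepts any result the rules allow.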
16. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
17. IBM System p
POWER5 SMT verification methodology
Evolutionary: based on single-thread uniprocessor and SMP
approaches
– Traditional SMP scenarios now self-contained in a single core simulation model
• Downward migration of dual-core methodology to single core model
New SMT verification scenario categories
– Shared resource and priority conflicts:
• SMT resource types:
– Equally shared between threads: Queue full conditions easier to hit
– Dynamically shared / tagged: Either thread can consume most/all of the
resource
– Replicated: Not shared…same as single thread
– Dynamic thread mode switching: SMT->ST; ST->SMT
• Some applications attain better performance in ST mode
• Shared resources re-allocated on each mode switch
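The three resource types and the reallocation on a mode switch can be sketched as a per-thread limit calculation. This is an illustrative model only: the `Sharing` enum, the function names, and any structure sizes passed in are hypothetical, not POWER5 figures.

```python
from enum import Enum

class Sharing(Enum):
    EQUAL = "equally shared"        # split evenly between threads in SMT mode
    DYNAMIC = "dynamically shared"  # either thread may take most or all entries
    REPLICATED = "replicated"       # one private copy per thread

def per_thread_limit(total, sharing, smt_mode, threads=2):
    """Maximum entries one thread can occupy.

    total: entries in the structure (for REPLICATED, entries per copy).
    """
    if not smt_mode or sharing is not Sharing.EQUAL:
        return total
    return total // threads

def reallocate(resources, smt_mode):
    """Recompute every per-thread limit, as a mode switch would require."""
    return {name: per_thread_limit(total, sharing, smt_mode)
            for name, (total, sharing) in resources.items()}
```

In this model only the equally shared structures change their limit between ST and SMT mode, and the halved limit is why their queue-full conditions are easier to hit in SMT mode.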
18. IBM System p
Traditional SMP approach applied to SMT verification
[Figure: test generation flow. The test template SMP.def feeds test generation, which outputs the test case SMT.tst. The test case contains a random instruction stream for t0 and one for t1, t0 registers, t1 registers, and core level registers common to both threads. Real memory is common to both threads, with the test generator managing some potential overlap.]
19. IBM System p
Shared resource and priority conflicts
Approach was similar to SMP verification
– Testing largely consisted of “symmetric” instruction streams
on each thread
• A particular resource targeted (e.g., GPR rename registers)
– 100 load instructions on each thread
– Coverage and lab feedback validated this approach
• Good enough: “Got the job done”
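A symmetric test of this kind could be generated along the following lines. This is a sketch only: `symmetric_test`, the tuple instruction encoding, and the random 8-byte-aligned addresses are hypothetical stand-ins for what a real test template would specify.

```python
import random

def symmetric_test(opcode="load", count=100, threads=2, seed=0):
    """Build one instruction stream per thread, all of the same kind.

    Each thread gets `count` instructions with randomized (hypothetical)
    addresses, so both threads press on the same shared resource.
    """
    rng = random.Random(seed)
    return {
        f"t{t}": [(opcode, rng.randrange(0, 1 << 16, 8)) for _ in range(count)]
        for t in range(threads)
    }
```

Both threads run the same kind of work for the same length, which is the symmetry this approach relies on.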
20. IBM System p
POWER5 dynamic thread mode switching
[Figure: flow diagram with columns for thread 0, thread 1, and the sim driver, and rows for initial state, run state, and final state]
– Initial state (both threads): all architected states initialized, thread enabled
– Run state: random instructions on both threads
– Labels in the switching sequence: thread 0 terminates itself (thread kills itself); save architected state; shared resources reallocated; restart thread 0; other thread interrupt; wake up thread; partition resources; restore architected state
– Final state (both threads): normal finish, thread enabled
21. IBM System p
POWER5 shared resource re-allocation on mode switch
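[Figure: four bar charts of per-thread resource limits in SMT mode and ST mode]
– GPR/FPR rename registers per thread (axis 0 to 200; bars labeled SMT mode, ST mode, and max)
– Load miss queue entries per thread (axis 0 to 10): split in half in SMT mode
– Branch queue (BIQ) entries per thread (axis 0 to 20): split in half in SMT mode
– Max LRQ/SRQ entries per thread (axis 0 to 40): dynamically shared (bars labeled SMT mode, ST mode, and max)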
22. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
23. IBM System p
POWER5: centralized complexity
POWER5
– Out-of-order design: Even in single thread mode,
complex events naturally occur simultaneously
– Started from POWER4+: Known working
design that was modified incrementally
– 23 FO4 design: Isolated complexity in
Instruction Sequencing Unit (ISU):
• Every unit communicated back to ISU
• ISU resolved all exceptions and
out-of-order conflicts
– ST and SMT modes both supported:
• Alternating dispatch cycles per thread
• Resources re-allocated on mode switch
[Figure: unit block diagram with FXU, FPU, LSU, IFU, and ISU]
24. IBM System p
POWER6 distributed complexity
POWER6
– From-scratch mostly in-order design
• Normally, design is well behaved
• Cross-thread interaction necessary for “tough
bugs”
– 13 FO4 design: Distributed complexity needed to
achieve high performance goals
– Recovery unit (RU):
• Must resolve out-of-order FP with in-order
pipelines
• Checkpoints machine state
• Recovers processor from soft errors
– Design is inherently in SMT mode all the time
(almost)
• Dispatch to both threads in same cycle
• Most resources dynamically shared / tagged
• No resource reallocation on mode switch
[Figure: unit block diagram with IFU, IDU, FPU, LSU, RU, and FXU]
25. IBM System p
POWER6 verification process
The different verification engines have different strengths related to the
verification tasks
Software simulation
– Slow, but low penalty for highly intrusive checking of model internals. Total model visibility.
– Hundreds of AIX workstations running 24x7x365
– New enhancements helped keep pace with design complexity
– 2x number of simulation cycles of POWER5 design
Hardware-accelerated simulation
– 10x to 1,000x faster than software simulation, but needs less intrusive driving/checking so as not to slow down the hardware box.
– New usage: Mainline function verification
– Yields additional 3x simulation cycle advantage over POWER5 (5x cycle advantage overall)
(Semi)-formal verification
– (High to) Exhaustive coverage, but higher skill needed to drive. Scaling problems w/ model size.
– Extensively used: Proved extremely valuable for complex SMT bugs
Hardware bring-up
– Ideal speed, very limited visibility/controllability
26. IBM System p
Software simulation enhancements
Random command driven unit simulation for most core units
– Yielded >1 Million lines of C++ code
– More control over generation for low level events
– More efficient test generation
Irritator threads at “core model” level
– “Symmetric” instruction stream approach employed on POWER5 proved inadequate
“S” in SMT is for “Simultaneous”, not “Symmetric”
– Target cross-thread interactions at the microarchitecture level
– ~2x test generation efficiency
– Ensures both threads run for the same length (self-adjusting)
27. IBM System p
Irritator thread example
[Figure: test generation flow. The test template SMT_Irritator.def feeds test generation, which outputs the test case SMT_Irritator.tst. The test case contains a long random stream for t0, a short irritator stream for t1, t0 registers, t1 registers, and core level registers common to both threads. Real memory is shown with the test generator managing some potential overlap.]
Irritator thread restrictions
• Cannot cause unexpected exceptions
• Cannot modify memory read by random thread
• Cannot modify registers shared with other threads
• Architected results may be undefined
28. IBM System p
Irritator thread example
Long Random Thread
SEQUENCE
  REPEAT 100
    SELECT
      Group_All
  stw nop, A
Generated Instr: 101
Simulated Instr: 101

Irritator Thread
SEQUENCE
  LB0: fdiv
  A: b to LB0
Generated Instr: 2
Simulated Instr: Infinite

Kill Irritator Thread
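One way to read this example is that the long random thread's final `stw nop, A` overwrites the irritator's branch at label `A` with a nop, so the irritator falls out of its loop when the random thread finishes. The simulation below is a sketch of that reading, not the actual test generator or simulator: the function name, the string opcodes, and the fixed interleaving ratio between the two threads are all assumptions.

```python
def run_irritator_test(main_len=100, t1_steps_per_t0_step=3):
    """Simulate a long random thread (t0) and a two-instruction irritator (t1).

    t1 loops over `LB0: fdiv; A: b LB0`. The last t0 instruction stores a
    nop over the branch at A, so t1 falls through and stops.
    Returns (t0 instructions run, t1 instructions run, t1 still running).
    """
    irritator = ["fdiv", "b LB0"]           # writable instruction memory of t1
    t0_program = ["random"] * main_len + ["stw nop, A"]
    t1 = {"pc": 0, "executed": 0, "running": True}

    def step_t1():
        op = irritator[t1["pc"]]
        t1["executed"] += 1
        if op == "b LB0":
            t1["pc"] = 0                    # branch back to LB0
        elif t1["pc"] == len(irritator) - 1:
            t1["running"] = False           # fell off the end: thread finished
        else:
            t1["pc"] += 1

    t0_executed = 0
    for instr in t0_program:
        if instr == "stw nop, A":
            irritator[1] = "nop"            # kill the irritator loop
        t0_executed += 1
        for _ in range(t1_steps_per_t0_step):
            if t1["running"]:
                step_t1()
    while t1["running"]:                    # let t1 reach the patched slot
        step_t1()
    return t0_executed, t1["executed"], t1["running"]
```

The irritator's two generated instructions are executed many times, and the number of iterations is set by how long the random thread runs rather than by the template, which matches the "self adjusting" property claimed for the approach.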
29. IBM System p
Simulation acceleration usage on POWER6
Extensively used on POWER6
– Run lab exercisers prior to tape-out
• Found additional bugs missed by software simulation
• Debug new exerciser functionality prior to lab
• Error injection and recovery testing
• Reproducibility of lab bugs in “simulation-like” environment for rapid debug of root cause
• Rapid testing of bug fixes and collateral damage testing
– Linux boot prior to tape-out
– Not employed on POWER5 for “mainline” functional
verification
30. IBM System p
(Semi) Formal methods
Formal methods are a vital complement to simulation flow
– Lab bring-up bug re-creation
• Often faster reproduction than simulation-based approaches
• Aids in root cause analysis
• High-coverage / proof of side-effect-free fixes
31. IBM System p
Error detection and soft error recovery
Biggest challenge on POWER6
– Why so hard?
• Myriads of injection points coupled with large SMT state space
– Often needed multiple “rare” combinations of “asymmetric”
events on both threads while specific error was injected
• End-to-end recovery testing difficult at unit level
– Really a “core” effort
– Verification strategy:
– Error injection and recovery on hardware accelerated simulation
platform
– Dynamic on-the-fly error injection combined with “irritator threads”
needed to cover large SMT recovery state space
32. IBM System p
Summary
1. SMT verification has four key pieces
– Traditional SMP-like effort
– Thread starvation and priority
– Starting and stopping threads
– Asymmetric “irritator thread” approach to verify often unforeseen cross-thread interactions at
the microarchitecture level
2. “From-scratch in-order” SMT design was more difficult to verify than the
“out-of-order retrofitted” SMT design
– Complex events only occurred due to cross thread interaction
– Even though team had experience
– Required more “weapons” in the arsenal
3. High frequency design drove distributed complexity
– Makes verification job harder
– Increased dependency on formal verification for difficult bugs
4. “Mainframe”-like RAS on POWER6 drove a huge amount of work that was
difficult to attack at the unit level
33. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
34. IBM System p
Future directions
Predictions
– RAS features will be an increasingly important feature of server
systems
• POWER6 design has set the “bar” to a new high standard to which future
processors will have to measure up
- Power Systems Revenue up 29% in 2Q08 (from 2Q07)
• Verification methods employed on POWER6 to attack nearly infinite state
space created by the combination of SMT and processor recovery features will
become standard practice
– A migration of “pre-silicon” verification techniques into “post-silicon”
hardware lab verification effort
• Hardware is the fastest “simulator” available and the state space is getting
bigger with SMT