Ludden q3 2008_boston
- 1. IBM Power Systems
SMT Verification of the POWER5 and POWER6
High-Performance Processors
John Ludden
Senior Technical Staff Member
Hardware Verification
IBM Systems & Technology Group
© 2008 IBM Corporation
- 2. IBM System p
SMT Verification of the POWER5 and POWER6 High-Performance Processors
Introduction to Simultaneous Multi-Threading (SMT)
1. What is a multi-threaded processor?
• Essentially a processor core that executes multiple instruction streams simultaneously
• Each thread appears to software as a “virtual” processor core
2. What are the advantages of SMT?
• More efficient utilization of silicon real estate and power: small die size increase compared to adding another core
• Increased system throughput by utilizing processor resources that would otherwise be idle
3. What are the disadvantages of SMT?
• Increased complexity -> Makes verification state space MUCH larger
• SMT verification much harder than SMP
• Possibly degrades performance of some applications
2 IBM Systems
© 2006 IBM Corporation
IBM Systems & Technology DRAFT: IBM Confidential © 2008 IBM Corporation
- 3. IBM System p
Examples of SMT microprocessors
1. Video Game Systems
• Sony PlayStation 3: IBM Cell processor
• Xbox 360: IBM Xenon processor
2. Personal Computers:
• Intel Pentium 4 Hyper-Threading (HT) processors
3. Servers:
• Sun UltraSPARC systems: T1 (4 threads per core) and T2 (8 threads per core)
• HP Superdome Systems: Intel Itanium 2
• IBM Power Systems: POWER5 and POWER6 processors
- 4. IBM System p
Overview
1. Context : POWER5 vs. POWER6 Microarchitecture Comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
- 5. IBM System p
IBM POWER systems
Consistent predictable delivery
[Figure: roadmap timeline of POWER4, POWER4+, POWER5, POWER5+ and POWER6 deliveries, with year markers 2001, 2003, 2004, 2006 and 2007]
- 6. IBM System p
[Figure: block diagrams of the POWER5 chip and the POWER6 chip. POWER5 chip: two high-frequency POWER5 SMT2 cores, ~2 MB L2, L3 controller with 36 MB L3 chip, SMP interconnect fabric, memory controller, buffer chips. POWER6 chip: two ultra-high-frequency POWER6 SMT2 cores, two 4 MB L2 caches, L3 controller with 32 MB L3 chip(s), SMP interconnect fabric, memory controllers, buffer chips.]
- 7. IBM System p
POWER5 Pipeline
[Figure: POWER5 pipeline stages with out-of-order processing. Instruction fetch (IF, IC, BP) feeds group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), followed by the branch (BR), load/store (LD/ST), fixed-point (FX) and floating-point (FP) pipelines, each running through MP, ISS and RF into execution, WB, Xfer and completion (CP). Branch redirects and interrupts & flushes feed back to instruction fetch.]
[Figure: POWER5 instruction flow and resource sharing. Blocks shown: program counter, branch prediction (branch history tables, return stack, target cache), instruction cache, instruction translation, instruction buffer 0, instruction buffer 1, thread priority, group formation / instruction decode / dispatch, shared register mappers, dynamic instruction selection, shared issue queues, read shared register files, shared execution units (LSU0, LSU1, FXU0, FXU1, FPU0, FPU1, BXU, CRL), write shared register files, group completion, store queue, data translation, data cache, L2 cache. Legend: shared by two threads; resource used by thread 0; resource used by thread 1.]
- 8. IBM System p
High-end server: New POWER6 microprocessor
Topology
– Two cores on chip, a 2-way SMP
– Core private L1s (64KB I, 64KB D)
– Superscalar, SMT cores
– Chip private 8 MB L2 cache
– L3 32 MB off chip
– Two-tier SMP fabric
Technology
– 65 nm SOI
– 341 mm² die size
– 10 Layers of metal
– 790 million transistors on chip
– Frequency : 3.5, 4.2, 4.7, 5.0 GHz
Custom & semi-custom design style
– High frequency constraints
3.3 M Lines of VHDL
- 9. IBM System p
POWER6 core pipeline
[Figure: POWER6 core pipeline. Instruction fetch pipeline (IFAR, P1 to P4, IC0, IC1, BHT, ROT); instruction dispatch pipeline (IB0, IB1, PD, DISP, ISS); BR/FX/Load pipeline (RF, AG, EX, DC0, DC1, FMT, ECC); floating point pipeline (RF, EX1 to EX7, ECC); check point recovery pipeline. Legend: pre-decode stage, ifetch/branch stage, delayed/transmit stage, instruction decode stage, instruction dispatch/issue stage, operand access/execution stage, write back stage, completion stage, check point stage, cache access stage, FX result bypass, load result bypass, float result bypass.]
- 10. IBM System p
POWER6 core
POWER6 processor is ~2X frequency of POWER5 (4 – 5 GHz)
POWER6 instruction pipeline depth equivalent to POWER5
– Minimize power
– Scale performance with frequency
[Figure: pipeline depth comparison across the instruction fetch, instruction buffer/decode, instruction dispatch/issue and data fetch/execute phases, showing FXU dependent execution and load dependent execution, with the labels ~6ns/instr and ~3ns/instr]
POWER6 extends functionality of POWER5 core
– 64K I cache, 64K D cache, 2 FXU, 2 Binary FPU, 1 branch execution unit
– Two-way SMT with up to 7 instructions dispatched per cycle from 2 threads (maximum of 5 instructions per thread)
– Decimal Floating Point Unit
– VMX Unit (PowerPC’s SIMD ISA)
– Recovery Unit
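The dispatch limits quoted above (7 instructions in total, at most 5 from one thread) can be explored with a small enumeration. This is only a sketch of the arithmetic: the constant and function names are hypothetical, and it assumes every split that satisfies the two limits is dispatchable, which real hardware may restrict further.

```python
from itertools import product

MAX_TOTAL = 7       # from the slide: 7-instruction dispatch from 2 threads
MAX_PER_THREAD = 5  # from the slide: maximum of 5 instructions per thread


def legal_dispatch_splits(max_total=MAX_TOTAL, max_per_thread=MAX_PER_THREAD):
    """Return every (t0, t1) instruction-count pair the two limits allow."""
    return [
        (t0, t1)
        for t0, t1 in product(range(max_per_thread + 1), repeat=2)
        if t0 + t1 <= max_total
    ]


splits = legal_dispatch_splits()
# Splits that use the full dispatch width of 7.
full_width = [s for s in splits if sum(s) == MAX_TOTAL]
print(len(splits), "legal splits; full-width splits:", full_width)
```

Even this simplified model gives a verification team dozens of per-cycle dispatch combinations to cover, before any resource conflicts are considered.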
- 11. IBM System p
Bullet-proof computing
System reliability with recovery unit
– Every measure possible taken to preserve application execution
– Retry soft errors
– Change hardware for hard errors
Recovery flow
– Processor architected state checkpointed every cycle
– ECC & non-ECC protected circuitry checked every cycle
– No error found: execution continues
– Error found: processor restarts from last saved checkpoint
• No error found after restart: soft error case
• Error found again: processor workload moved to another CPU (hard error case)
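The soft error / hard error decision on this slide can be sketched as a small classification function. The names below are hypothetical, and the sketch assumes a single retry before the error is treated as hard; the slide does not say how many retries the hardware actually attempts.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE = "no error: keep executing"
    SOFT_RECOVERED = "soft error: retried from last checkpoint"
    MOVED = "hard error: workload moved to another CPU"


def recover(error_now, error_after_retry):
    """Classify one checked cycle following the flow on the slide."""
    if not error_now:
        return Outcome.CONTINUE
    # Error found: restart from the last saved checkpoint and check again.
    if not error_after_retry:
        return Outcome.SOFT_RECOVERED
    # The error persists after the retry (assumed: one retry only).
    return Outcome.MOVED


for case in [(False, False), (True, False), (True, True)]:
    print(case, "->", recover(*case).value)
```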
- 12. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
- 13. IBM System p
POWER4/5/6 RTL verification technology
[Figure: verification tool flow. RTL (VHDL, Verilog), PSL et al. assertions and driver/checker code go through language compile / model build into a cycle-based model. Test program generators (GPRO, X-Gen), the C++ testbench and constraint random unit testbenches drive that model on the software simulator (MESA) and the hardware accelerator (Awan). The RTL also feeds (semi) formal verification (SixthSense, RuleBase) and, together with the physical VLSI design tools / custom design, the Boolean equivalence check (Verity).]
- 14. IBM System p
Single threaded uniprocessor verification for POWER4
Unit level: methodology inherited from POWER4
– Driven by a combination of instruction level test cases (AVPs) created by the Genesys-Pro (GPRO) pseudo-random test generator and random C++ driven irritation
– Instruction-By-Instruction (IBI) checking against AVP results
– Low level microarchitecture checkers written in C++
Processor core (aka “core”) level
– Mixture of GPRO pseudo-random and directed random instruction level test cases
– IBI checking against AVP results
– Low level microarchitecture checkers written in C++
- Irritation from random C++ drivers
- Highly deterministic and architected state easily verifiable against test
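Instruction-By-Instruction (IBI) checking, as described above, compares the architected state after each instruction against the results predicted in the test case. A minimal sketch follows; the data layout (one dict of register values per instruction) and all names are assumptions for illustration, not the format used by GPRO or the IBM testbench.

```python
def ibi_check(expected_steps, simulated_steps):
    """Compare architected state after each instruction.

    Each step maps register name -> value after that instruction.
    Returns None if all steps match, else (index, register, expected, actual).
    """
    if len(expected_steps) != len(simulated_steps):
        raise ValueError("instruction counts differ")
    for i, (exp, act) in enumerate(zip(expected_steps, simulated_steps)):
        for reg in sorted(exp):
            if act.get(reg) != exp[reg]:
                return (i, reg, exp[reg], act.get(reg))
    return None


# Hypothetical three-instruction test case: predicted results vs. model runs.
expected = [{"r1": 5}, {"r1": 5, "r2": 7}, {"r1": 5, "r2": 7, "r3": 12}]
good_run = [dict(step) for step in expected]
bad_run = [dict(step) for step in expected]
bad_run[2]["r3"] = 13  # mismatch planted on the third instruction

print(ibi_check(expected, good_run))
print(ibi_check(expected, bad_run))
```

This style of check only works when the expected result is unique, which is why it suits the deterministic single-thread case.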
- 15. IBM System p
Symmetric multi-processor (SMP) verification for POWER4
Chip (dual-core) level
– Test generation similar to uniprocessor via GPRO for false-sharing or non-sharing tests
• IBI checking against AVP results for two independent instruction streams contained within a single test
• Low level microarchitecture checkers written in C++
• L1/L2 interactions primary focus
– True-sharing scenarios, lock testing and storage access (“weak”) ordering checked
• GPRO employed but…
– IBI checking of these accesses is limited or not possible:
› Non-unique or non-deterministic results
› CML (architecture level coherency monitor) employed to detect the “right answer” as a post-simulation rule check
- 16. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
- 17. IBM System p
POWER5 SMT verification methodology
Evolutionary: based on single-thread uniprocessor and SMP approaches
– Traditional SMP scenarios now self-contained in a single core simulation model
• Downward migration of dual-core methodology to single core model
New SMT verification scenario categories
– Shared resource and priority conflicts:
• SMT resource types:
– Equally shared between threads: Queue full conditions easier to hit
– Dynamically shared / tagged: Either thread can consume most/all of the resource
– Replicated: Not shared…same as single thread
– Dynamic thread mode switching: SMT->ST; ST->SMT
• Some applications attain better performance in ST mode
• Shared resources re-allocated on each mode switch
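The three resource types and the re-allocation on a mode switch can be modeled in a few lines. This is a sketch only: the resource names and sizes are hypothetical, and treating "equally shared" as an exact split in half in SMT mode is an assumption.

```python
from enum import Enum


class Sharing(Enum):
    SPLIT = "equally shared between threads"
    DYNAMIC = "dynamically shared / tagged"
    REPLICATED = "replicated per thread"


def per_thread_limit(entries, sharing, smt_mode):
    """Maximum entries one thread may occupy under each sharing policy."""
    if sharing is Sharing.REPLICATED:
        return entries       # each thread owns a private copy of this size
    if sharing is Sharing.DYNAMIC or not smt_mode:
        return entries       # one thread may consume most or all entries
    return entries // 2      # SPLIT in SMT mode: assumed half per thread


def mode_switch(resources, smt_mode):
    """Recompute every per-thread limit, as on an SMT<->ST mode switch."""
    return {
        name: per_thread_limit(size, sharing, smt_mode)
        for name, (size, sharing) in resources.items()
    }


# Hypothetical resources and sizes, for illustration only.
resources = {
    "queue_a": (20, Sharing.SPLIT),
    "queue_b": (32, Sharing.DYNAMIC),
    "regs_c": (16, Sharing.REPLICATED),
}
print("SMT:", mode_switch(resources, smt_mode=True))
print("ST: ", mode_switch(resources, smt_mode=False))
```

Only the equally shared resource changes its limit across the switch in this model, which is why queue-full conditions on such resources are easier to hit in SMT mode.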
- 18. IBM System p
Traditional SMP approach applied to SMT verification
[Figure: SMP.def (test template) -> test generation -> output test case SMT.tst. The test case contains core level registers common to both threads, t0 registers with a random t0 instruction stream, and t1 registers with a random t1 instruction stream.]
Real memory is common to both threads with test generator managing some potential overlap
- 19. IBM System p
Shared resource and priority conflicts
Approach was similar to SMP verification
– Testing largely consisted of “symmetric” instruction streams on each thread
• A particular resource targeted (e.g., GPR rename registers)
– 100 load instructions on each thread
– Coverage and lab feedback validated this approach
• Good enough: “Got the job done”
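A "symmetric" test of this kind can be sketched as a generator that emits the same instruction type and count on every thread, randomizing only the operands. The function name, operand format and register/address ranges below are hypothetical; the real tests were produced by the test generator from templates.

```python
import random


def symmetric_streams(opcode="load", count=100, threads=2, seed=0):
    """Build one instruction stream per thread, all targeting one resource."""
    rng = random.Random(seed)
    streams = []
    for _ in range(threads):
        # Same opcode and length on every thread; only operands differ.
        stream = [
            (opcode, f"r{rng.randrange(32)}", hex(rng.randrange(0, 1 << 16, 8)))
            for _ in range(count)
        ]
        streams.append(stream)
    return streams


t0, t1 = symmetric_streams()
print(len(t0), len(t1), t0[0][0], t1[0][0])
```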
- 20. IBM System p
POWER5 dynamic thread mode switching
[Figure: flow chart of a dynamic thread mode switching test, with one column for thread 0 and one for thread 1]
Initial state: all architected states initialized and thread enabled, on both threads
1. Both threads run random instructions
2. Thread 0 terminates itself (thread kills itself); architected state is saved
3. Shared resources are reallocated; thread 1 continues with random instructions
4. Thread 0 is restarted (diagram labels: other thread, sim driver, wake up thread, interrupt): resources are partitioned, architected state is restored, and the thread returns to run state
5. Normal finish on both threads
Final state: thread enabled on both threads
- 21. IBM System p
POWER5 shared resource re-allocation on mode switch
[Figure: four bar charts comparing per-thread resource limits in SMT mode and ST mode: rename registers per thread (GPR, FPR), load miss queue entries per thread, branch queue (BIQ) entries per thread, and max LRQ/SRQ entries per thread. Chart annotations: “Split in half” and “Dynamically shared”.]
- 22. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
- 23. IBM System p
POWER5: centralized complexity
POWER5
– Out-of-order design: Even in single thread mode, complex events naturally occur simultaneously
– Started from POWER4+: Known working design that was modified incrementally
– 23 FO4 design: Isolated complexity in Instruction Sequencing Unit (ISU):
• Every unit communicated back to ISU
• ISU resolved all exceptions and out-of-order conflicts
– ST and SMT modes both supported:
• Alternating dispatch cycles per thread
• Resources re-allocated on mode switch
[Figure: core block diagram showing the IFU, FXU, ISU, LSU and FPU]
- 24. IBM System p
POWER6 distributed complexity
POWER6
– From-scratch mostly in-order design
• Normally, design is well behaved
• Cross-thread interaction necessary for “tough bugs”
– 13 FO4 design: Distributed complexity needed to achieve high performance goals
– Recovery unit (RU):
• Must resolve out-of-order FP with in-order pipelines
• Checkpoints machine state
• Recovers processor from soft errors
– Design is inherently in SMT mode all the time (almost)
• Dispatch to both threads in same cycle
• Most resources dynamically shared / tagged
• No resource reallocation on mode switch
[Figure: core block diagram showing the IFU, FXU, IDU, RU, FPU and LSU]
- 25. IBM System p
POWER6 verification process
The different verification engines have different strengths related to the verification tasks
Software simulation
– Slow, but low penalty for highly intrusive checking of model internals. Total model visibility.
– Hundreds of AIX workstations running 24x7x365
– New enhancements helped keep pace with design complexity
– 2x number of simulation cycles of POWER5 design
Hardware-accelerated simulation
– 10x to 1000x faster than software simulation, but needs less intrusive driving/checking to avoid slowing down the hardware box.
– New usage: Mainline function verification
– Yields additional 3x simulation cycle advantage over POWER5 (5x cycle advantage overall)
(Semi)-formal verification
– (High to) Exhaustive coverage, but higher skill needed to drive. Scaling problems w/ model size.
– Extensively used: Proved extremely valuable for complex SMT bugs
Hardware bring-up
– Ideal speed, very limited visibility/controllability
- 26. IBM System p
Software simulation enhancements
Random command driven unit simulation for most core units
– Yielded >1 Million lines of C++ code
– More control over generation for low level events
– More efficient test generation
Irritator threads at “core model” level
– “Symmetric” instruction stream approach employed on POWER5 proved inadequate
• The “S” in SMT is for “Simultaneous”, not “Symmetric”
– Target cross-thread interactions at the microarchitecture level
– ~2x test generation efficiency
– Ensures both threads run for the same length (self-adjusting)
- 27. IBM System p
Irritator thread example
[Figure: SMT_Irritator.def (test template) -> test generation -> output test case SMT_Irritator.tst. The test case contains core level registers common to both threads, t0 registers with a long random t0 instruction stream, and t1 registers with a short irritator t1 instruction stream.]
Irritator thread restrictions
• Cannot cause unexpected exceptions
• Cannot modify memory read by random thread
• Cannot modify registers shared with other threads
• Architected results may be undefined
Real memory with test generator managing some potential overlap
- 28. IBM System p
Irritator thread example
Long Random Thread
  SEQUENCE
    REPEAT 100
      SELECT Group_All
    stw nop, A    (Kill Irritator Thread)
  Generated Instr: 101
  Simulated Instr: 101
Irritator Thread
  SEQUENCE
    LB0: fdiv
    A: b to LB0
  Generated Instr: 2
  Simulated Instr: Infinite
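The behaviour of this example can be imitated with a toy two-thread model: the irritator loops over two generated instructions until the long random thread overwrites the branch at label A with a nop. Everything here is a simplified assumption for illustration, in particular the coin-flip scheduling and the reading of the final stw as the store that kills the irritator loop.

```python
import random

NOP, FDIV, BRANCH_LB0 = "nop", "fdiv", "b LB0"


def run_irritator_test(random_len=100, seed=1):
    """Interleave a long random thread with a two-instruction irritator loop."""
    rng = random.Random(seed)
    irritator = [FDIV, BRANCH_LB0]            # LB0: fdiv ; A: b LB0
    main = [f"rand{i}" for i in range(random_len)] + ["kill"]
    main_pc = irr_pc = 0
    simulated = {"main": 0, "irritator": 0}
    while main_pc < len(main) or irr_pc < len(irritator):
        main_ready = main_pc < len(main)
        irr_ready = irr_pc < len(irritator)
        # Arbitrary scheduling model: pick which thread advances this step.
        run_main = main_ready and (not irr_ready or rng.random() < 0.5)
        if run_main:
            if main[main_pc] == "kill":
                irritator[1] = NOP            # store a nop over the branch at A
            main_pc += 1
            simulated["main"] += 1
        else:
            op = irritator[irr_pc]
            irr_pc = 0 if op == BRANCH_LB0 else irr_pc + 1
            simulated["irritator"] += 1
    return len(main), simulated


generated_main, simulated = run_irritator_test()
print("generated on long thread:", generated_main, "simulated:", simulated)
```

The long thread simulates exactly as many instructions as were generated, while the irritator's simulated count depends on how long the other thread takes, which is what makes the two threads self-adjusting in length.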
- 29. IBM System p
Simulation acceleration usage on POWER6
Extensively used on POWER6
– Run lab exercisers prior to tape-out
• Found additional bugs missed by software simulation
• Debug new exerciser functionality prior to lab
• Error injection and recovery testing
• Reproducibility of lab bugs in “simulation-like” environment for rapid debug of root cause
• Rapid testing of bug fixes and collateral damage testing
– Linux boot prior to tape-out
– Not employed on POWER5 for “mainline” functional verification
- 30. IBM System p
(Semi) Formal methods
Formal methods are a vital complement to simulation flow
– Lab bring-up bug re-creation
• Often faster reproduction than simulation based approaches
• Aids in root cause analysis
• High-coverage / proof of side-effect-free fixes
- 31. IBM System p
Biggest challenge on POWER6
Error detection and soft error recovery
– Why so hard?
• Myriad injection points coupled with large SMT state space
– Often needed multiple “rare” combinations of “asymmetric” events on both threads while specific error was injected
• End-to-end recovery testing difficult at unit level
– Really a “core” effort
– Verification strategy:
• Error injection and recovery on hardware accelerated simulation platform
• Dynamic on-the-fly error injection combined with “irritator threads” needed to cover large SMT recovery state space
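The size of the recovery state space can be illustrated with a toy random injection campaign that pairs an injection point with one event on each thread and tracks which combinations were hit. The injection point and event names are invented for illustration, and uniform random selection is an assumption; the slides do not describe how injections were actually scheduled.

```python
import random


def injection_campaign(injection_points, thread_events, runs, seed=0):
    """Randomly pair an injection point with one event per thread."""
    rng = random.Random(seed)
    covered = set()
    for _ in range(runs):
        point = rng.choice(injection_points)
        t0_event = rng.choice(thread_events)   # what thread 0 is doing
        t1_event = rng.choice(thread_events)   # what thread 1 is doing
        covered.add((point, t0_event, t1_event))
    total = len(injection_points) * len(thread_events) ** 2
    return covered, total


# Hypothetical injection points and per-thread events, for illustration only.
points = ["latch_a", "latch_b", "array_c"]
events = ["idle", "load_miss", "fp_op", "flush"]
covered, total = injection_campaign(points, events, runs=200)
print(f"{len(covered)}/{total} combinations hit")
```

With only 3 points and 4 events the space already has 48 combinations; it grows with the square of the per-thread event count, which is why asymmetric two-thread scenarios are expensive to cover.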
- 32. IBM System p
Summary
1. SMT verification has four key pieces
– Traditional SMP-like effort
– Thread starvation and priority
– Starting and stopping threads
– Asymmetric “irritator thread” approach to verify often unforeseen cross-thread interactions at the microarchitecture level
2. “From-scratch in-order” SMT design was more difficult to verify than the “out-of-order retrofitted” SMT design
– Complex events only occurred due to cross thread interaction
– Even though team had experience
– Required more “weapons” in the arsenal
3. High frequency design drove distributed complexity
– Makes verification job harder
– Increased dependency on formal verification for difficult bugs
4. “Mainframe”-like RAS on POWER6 drove a huge amount of work that was difficult to attack at the unit level
- 33. IBM System p
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
- 34. IBM System p
Future directions
Predictions
– RAS will be an increasingly important feature of server systems
• POWER6 design has set the “bar” to a new high standard to which future processors will have to measure up
- Power Systems Revenue up 29% in 2Q08 (from 2Q07)
• Verification methods employed on POWER6 to attack the nearly infinite state space created by the combination of SMT and processor recovery features will become standard practice
– A migration of “pre-silicon” verification techniques into the “post-silicon” hardware lab verification effort
• Hardware is the fastest “simulator” available and the state space is getting bigger with SMT