Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Analysis of Multithreading Capabilities
of Current High-Performance
Processors
Kamil Kędzierski, Francisco J. Cazorla, Mateo Valero
kkedzier@ac.upc.edu, francisco.cazorla@bsc.es, mateo@ac.upc.edu

Outline
• Introduction
• Implementations
• Performance
• Conclusions
• References
Introduction
Implementation
Performance
Conclusions

Do we need multithreading?
• Problem:
resources not used efficiently
in a super-scalar processor
• Goal:
increase efficiency – do more
computation with similar
resource amount
• One possible solution:
SMT – Simultaneous
Multithreading
[ Francisco J. Cazorla, “QoS for SMT Processors”, PhD Thesis, DAC, UPC, 2005 ]
Introduction
Implementation
Performance
Conclusions
Motivation
History

How did we get to SMT?
• Multithreading first appeared in
the 1950s, and Simultaneous
Multithreading was investigated
by IBM in 1968
• Super-scalar architectures: only
one thread is running at a given
time
• Multiprocessor: each thread uses
a different set of resources
[ http://www.cs.clemson.edu/~mark/multithreading.html ]
Introduction
Implementation
Performance
Conclusions
Motivation
History

How did we get to SMT?
• Coarse-grain MT: thread switch on a long-latency operations
– a.k.a. "concurrent multithreading (CMT)“, "switch-on-event multithreading (SOEMT)“,
"dynamic blocked multithreading (BMT)"
• Fine-grain MT: thread switch
not only on a long-latency ops
– a.k.a. "vertical multithreading“,
"interleaved multithreading (IMT)“,
"time-sharing multithreading“,
"hardware multiprogramming" (1958)
• Simultaneous MT:
ops issue from different threads
at the same clock cycle
[ http://www.cs.clemson.edu/~mark/multithreading.html ]
Introduction
Implementation
Performance
Conclusions
Motivation
History

Outline
• Introduction
• Implementations
• Performance
• Conclusions
• References
Introduction
Implementations
Performance
Conclusions

Studied implementations
• IBM’s POWER5
• Intel’s Xeon (Pentium 4)
POWER5 die photo
[B. Sinharoy et al,“POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

SMT implementation in POWER5
• SMT added to super-scalar architecture
– Second PC
– Return stack in branch prediction logic duplicated
– GPR/FPR rename mapper expanded for second set of registers
– Completion logic duplicated
– Thread bit added to most address/tag busses
[Ron Kalla et al,”Simultaneous Multi-threading Implementation in POWER5”, HotChips, 2003 ]
[B. Sinharoy et al “POWER5 system microarchitecture”, IBM Journal of Research and Development 2005 ]
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

POWER5 architecture
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

SMT implementation in Xeon
• SMT added to super-scalar architecture
– Second PC
– ITLB duplicated
– Microcode queue duplicated
– Return stack buffer duplicated
– Rename logic duplicated
– Thread bit added to most address/tag busses
[Deborah T. Marr et al “Hyper-Threading Technology Arch. and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002 ]
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

Xeon architecture
Additional stages in case of
cache miss – used to fill the
Trace Cache
[David Burns “Pre-Silicon Validation of Hyper-Threading Technology”, Intel Technology Journal, Volume 6, Issue 1, 2002 ]
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

Simplified pipeline’s views
• IBM’s POWER5
• Intel’s Xeon (Pentium 4)
[ Kamil Kedzierski et al, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”,
ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
2006 ]
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

SMT implementations comparison
[ Kamil Kedzierski et al, “Analysis of Simultaneous Multithreading Implementations in Current High-Performance Processors”,
ACACES Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
2006 ]
Resource POWER5 Intel Xeon
PC duplicated duplicated
Instruction Cache shared shared
ITLB shared duplicated
DTLB shared shared
BHT shared shared
Return Stack Buffer duplicated duplicated
Decode shared shared
Instruction Buffer/
uCode Queue
duplicated duplicated
Group Formation shared -
Mapping/Rename shared duplicated
IQ shared partitioned
Scheduler - shared
Register Read shared shared
FUs shared shared
Store Queues duplicated partitioned
Register Write shared shared
GCT/Commit duplicated partitioned
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

POWER5 performance
0,97 + 0,97 = 1,94
[ R. Kalla et al ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO 2004 ]
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

POWER5 performance
0,85 + 1 = 1,85
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

POWER5 performance
0,85 + 1 = 1,85
Shows only similar
trends!
Pictures for two different
architectures
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

Xeon performance
Varying processors in a system Varying server workloads
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

Average performance comparison
• POWER5’s performance is reported to increase up to 41%
• Xeon’s performance is reported to increase up to 28%
• BUT POWER5 is dual-core architecture 
Introduction
Implementations
Performance
Conclusions
IBM’s POWER5
Intel’s Xeon
Comparison

Conclusions
• Two architectures have been studied (POWER5 and Xeon)
• Adding second thread usually increases performance
– Heavy depends on a given workload
• Most of the resources are shared
• Only the crucial ones are duplicated/partitioned:
– Return Stack Buffer, Instruction Buffer/uCode Queue, Store Queues
and GCT/Commit logic
• In addition Xeon more often duplicates/partitiones than shares:
– ITLB, Rename and IQ
• The thread level parallelism is a good answer on how to increase
performance without drastically increasing the resources
Introduction
Implementations
Performance
Conclusions
Conclusions
References

References
1. R. Kalla, B. Sinharoy, J. M. Tendler, ”IBM POWER5 Chip: A Dual-Core Multithreaded Processor”, IEEE MICRO
2004, pages 40-47
2. B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner “POWER5 system
microarchitecture”, IBM Journal of Research and Development 2005, pages 505-521
3. H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel “Characterization of
simultaneous multithreading (SMT) efficiency in POWER5”, IBM Journal of Research and Development 2005,
pages 555-564
4. Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, Michael Upton
“Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Volume 6, Issue 1,
2002
5. David Burns “Pre-Silicon Validation of Hyper-Threading Technology”, Intel Technology Journal, Volume 6, Issue
1, 2002
6. Ron Kalla, BalaramSinharoy, Joel Tendler, ”Simultaneous Multi-threading Implementation in POWER5”, A
Symposium on High Performance Chips (HotChips), 19th August 2003,
7. G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. “The microarchitecture of the
Pentium 4 processor”, Intel Technology Journal, Volume 5, Issue 1, 2001
8. Joel Emer, “EV8:The Post-Ultimate Alpha”, International Conference on Parallel Architectures and Compilation
Techniques, page 155, 2001
9. Shubu Mukherjee, “The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond
Y2K”, 1999, http://www.cs.wisc.edu/~shubu/talks/alpha.ppt
10. Rick Merritt, “Designers cut fresh paths to parallelism”, EETimes
http://www.eetimes.com/story/OEG19991008S0014
11. Francisco J. Cazorla Almeida, “Quality of Service for Simultaneous Multithreading Processors (QoS for SMT
Processors)”, PhD Thesis, DAC, UPC, 2005
12. Kamil Kędzierski, Francisco J. Cazorla and Mateo Valero, “Analysis of Simultaneous Multithreading
Implementations in Current High-Performance Processors”, ACACES Second International Summer School on
Advanced Computer Architecture and Compilation for Embedded Systems, Poster Session, July 26th
, 2006
pages 113-116
13. http://www.cs.clemson.edu/~mark/multithreading.html
Introduction
Implementations
Performance
Conclusions
Conclusions
References

Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie Analysis of Multithreading Capabilities of Current High–Performance Processors 2005

Ähnlich wie Analysis of Multithreading Capabilities of Current High–Performance Processors 2005 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Analysis of Multithreading Capabilities of Current High–Performance Processors 2005