SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
1
Autumn 2006 CSE P548 - Multitheading 1
Motivation for Multithreaded Architectures
Processors not executing code at their hardware potential
• late 70’s: performance lost to memory latency
• 90’s: performance not in line with the increasingly complex parallel
hardware as well
• increase in instruction issue bandwidth
• increase in number of functional units
• out-of-order execution
• techniques for decreasing/hiding branch & memory latencies
• Still, processor utilization was decreasing & instruction
throughput not increasing in proportion to the issue width
Autumn 2006 CSE P548 - Multitheading 2
Motivation for Multithreaded Architectures
2
Autumn 2006 CSE P548 - Multitheading 3
Motivation for Multithreaded Architectures
Major cause is the lack of instruction-level parallelism in a single executing
thread
Therefore the solution has to be more general than building a smarter
cache or a more accurate branch predictor
Autumn 2006 CSE P548 - Multitheading 4
Multithreaded Processors
Multithreaded processors can increase the pool of independent
instructions & consequently address multiple causes of processor
stalling
• holds processor state for more than one thread of execution
• registers
• PC
• each thread’s state is a hardware context
• execute the instruction stream from multiple threads without
software context switching
• utilize thread-level parallelism (TLP) to compensate for a lack in ILP
3
Autumn 2006 CSE P548 - Multitheading 5
Multithreading
Traditional multithreaded processors hardware switch to a different context
to avoid processor stalls
Two styles of traditional multithreading
1. coarse-grain multithreading
• switch on a long-latency operation (e.g., L2 cache miss)
• another thread executes while the miss is handled
• modest increase in instruction throughput
• doesn’t hide latency of short-latency operations
• no switch if no long-latency operations
• need to fill the pipeline on a switch
• potentially no slowdown to the thread with the miss
• if stall is long & switch back fairly promptly
• HEP, IBM RS64 III
Autumn 2006 CSE P548 - Multitheading 6
Traditional Multithreading
Two styles of traditional multithreading
2. fine-grain multithreading
• can switch to a different thread each cycle (usually round robin)
• hides latencies of all kinds
• larger increase in instruction throughput but slows down the
execution of each thread
• Cray (Tera) MTA
4
Autumn 2006 CSE P548 - Multitheading 7
Comparison of Issue Capabilities
Autumn 2006 CSE P548 - Multitheading 8
Simultaneous Multithreading (SMT)
Third style of multithreading, different concept
3. simultaneous multithreading (SMT)
• issues multiple instructions from multiple threads each cycle
• no hardware context switching
• same-cycle multithreading
• huge boost in instruction throughput with less degradation to
individual threads
5
Autumn 2006 CSE P548 - Multitheading 9
Comparison of Issue Capabilities
Autumn 2006 CSE P548 - Multitheading 10
Cray (Tera) MTA
Goals
• the appearance of uniform memory access
• lightweight synchronization
• heterogeneous parallelism
6
Autumn 2006 CSE P548 - Multitheading 11
Cray (Tera) MTA
Fine-grain multithreaded processor
• can switch to a different thread each cycle
• switches to ready threads only
• up to 128 hardware contexts
• lots of latency to hide, mostly from the multi-hop interconnection
network
• average instruction latency for computation: 22 cycles
(i.e., 22 instruction streams needed to keep functional units
busy)
• average instruction latency including memory: 120 to 200-
cycles
(i.e., 120 to 200 instruction streams needed to hide all latency,
on average)
• processor state for all 128 contexts
• GPRs (total of 4K registers!)
• status registers (includes the PC)
• branch target registers/stream
Autumn 2006 CSE P548 - Multitheading 12
Cray (Tera) MTA
Interesting features
• No processor-side data caches
• increases the latency for data accesses but reduces the
variation between ops
• to avoid having to keep caches coherent
• memory-side buffers instead
• L1 & L2 instruction caches
• instruction accesses are more predictable & have no coherency
problem
• prefetch fall-through & target code
7
Autumn 2006 CSE P548 - Multitheading 13
Cray (Tera) MTA
Interesting features
• Trade-off between avoiding memory bank conflicts &
exploiting spatial locality for data
• conflicts:
• memory distributed among hardware contexts
• memory addresses are randomized to avoid conflicts
• want to fully utilize all memory bandwidth
• locality:
• run-time system can confine consecutive virtual addresses to a
single (close-by) memory unit
• used mainly for the stack
Autumn 2006 CSE P548 - Multitheading 14
Cray (Tera) MTA
Interesting features
• tagged memory
• indirectly set full/empty bits to prevent data races
• prevents a consumer/producer from loading/overwriting a
value before a producer/consumer has written/read it
• example for the consumer:
• set to empty when producer instruction starts
executing
• consumer instructions block if try to read the producer
value
• set to full when producer writes value
• consumers can now read a valid value
• explicitly set full/empty bits for thread synchronization
• primarily used accessing shared data
• lock: read memory location & set to empty
• other readers are blocked
• unlock: write & set to full
8
Autumn 2006 CSE P548 - Multitheading 15
Cray (Tera) MTA
Interesting features
• no paging
• want pages pinned down in memory
• page size is 256MB
• forward bit
• memory contents interpreted as a pointer & dereferenced
• used for GC & null reference checking
• user-mode trap handlers
• lighter weight
• used for fatal exceptions, overflow, normalizing floating point
numbers
• not used for protection - user might override the RT
• designed for user-written trap handlers, but too complicated for
users
Autumn 2006 CSE P548 - Multitheading 16
Cray (Tera) MTA
Compiler support
• VLIW instructions
• memory/arithmetic/branch
• load/store architecture
• need a good code scheduler
• memory dependence look-ahead
• field in a memory instruction that specifies the number of
independent memory ops that follow
• guarantees nonstalling instruction choice
• improves memory parallelism
• handling branches
• special instruction to store a branch target in a register before
the branch is executed
• can start prefetching the target code
9
Autumn 2006 CSE P548 - Multitheading 17
Cray (Tera) MTA
Run-time support
• number of executing threads
• protection domain: group of threads executing in the same
virtual address space
• RT sets the maximum number of thread contexts (instruction
streams) a domain is allowed (compiler estimate)
• domain can create & kill threads within that limit, depending on
its need for them
Autumn 2006 CSE P548 - Multitheading 18
SMT: The Executive Summary
Simultaneous multithreaded (SMT) processors combine designs from:
• out-of-order superscalar processors
• traditional multithreaded processors
The combination enables a processor
• that issues & executes instructions from multiple threads
simultaneously
=> converting TLP to ILP
• in which threads share almost all hardware resources
10
Autumn 2006 CSE P548 - Multitheading 19
Performance Implications
Multiprogramming workload
• 2.5X on SPEC95, 4X on SPEC2000
Parallel programs
• ~1.7X on SPLASH2
Commercial databases
• 2-3X on TPC B; 1.5X on TPC D
Web servers & OS
• 4X on Apache and Digital Unix
Autumn 2006 CSE P548 - Multitheading 20
Does this Processor Sound Familiar?
Technology transfer =>
• 2-context Intel Hyperthreading
• 4-context IBM Power5
• 2-context Sun UltraSPARC on a 4-processor CMP
• 4-context Compaq 21464
• network processor & mobile device start-ups
• others in the wings
11
Autumn 2006 CSE P548 - Multitheading 21
An SMT Architecture
Three primary goals for this architecture:
1. Achieve significant throughput gains with multiple threads
2. Minimize the performance impact on a single thread executing
alone
3. Minimize the microarchitectural impact on a conventional out-of-
order superscalar design
Autumn 2006 CSE P548 - Multitheading 22
Implementing SMT
12
Autumn 2006 CSE P548 - Multitheading 23
Implementing SMT
No special hardware for scheduling instructions from multiple
threads
• use the out-of-order renaming & instruction scheduling mechanisms
• physical register pool model
• renaming hardware eliminates false dependences both within a
thread (just like a superscalar) & between threads
How it works:
• map thread-specific architectural registers onto a pool of thread-
independent physical registers
• operands are thereafter called by their physical names
• an instruction is issued when its operands become available & a
functional unit is free
• instruction scheduler not consider thread IDs when dispatching
instructions to functional units
(unless threads have different priorities)
Autumn 2006 CSE P548 - Multitheading 24
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
• 8 threads * 32 registers + renaming registers
SMT instruction fetcher (ICOUNT)
• fetch from 2 threads each cycle
• count the number of instructions for each thread in the pre-
execution stages
• pick the 2 threads with the lowest number
• in essence fetching from the two highest throughput threads
13
Autumn 2006 CSE P548 - Multitheading 25
From Superscalar to SMT
Per-thread hardware
• small stuff
• all part of current out-of-order processors
• none endangers the cycle time
• other per-thread processor state, e.g.,
• program counters
• return stacks
• thread identifiers, e.g., with BTB entries, TLB entries
• per-thread bookkeeping for, e.g.,
• instruction queue flush
• instruction retirement
• trapping
This is why there is only a 15% increase to Alpha 21464 chip area.
Autumn 2006 CSE P548 - Multitheading 26
Implementing SMT
Thread-shared hardware:
• fetch buffers
• branch prediction structures
• instruction queues
• functional units
• active list
• all caches & TLBs
• store buffers & MSHRs
This is why there is little single-thread performance degradation (~1.5%).
14
Autumn 2006 CSE P548 - Multitheading 27
Architecture Research
Concept & potential of Simultaneous Multithreading
Designing the microarchitecture
• straightforward extension of out-of-order superscalars
I-fetch thread chooser
• 40% faster than round-robin
The lockbox for cheap synchronization
• orders of magnitude faster
• can parallelize previously unparallelizable codes
Autumn 2006 CSE P548 - Multitheading 28
Architecture Research
Software-directed register deallocation
• large register-file performance w. small register file
Mini-threads
• large SMT performance w. small SMTs
SMT instruction speculation
• don’t execute as far down a wrong path
• speculative instructions don’t get as far down the pipeline
• speculation keeps a good thread mix in the IQ
• most important factor for performance
15
Autumn 2006 CSE P548 - Multitheading 29
Compiler Research
Tuning compiler optimizations for SMT
• data decomposition: use cyclic iteration scheduling
• tiling: use cyclic tiling; no tile size sweet spot
Communicate last-use information to HW for early register deallocation
• now need fewer renaming registers
Compiling for fewer registers/thread
• surprisingly little additional spill code (avg. 3%)
Autumn 2006 CSE P548 - Multitheading 30
OS Research
Analysis of OS behavior on SMT
• Kernel-kernel conflicts in I$ & D$ & branch mispredictions
ameliorated by SMT instruction issue + thread-sharing in HW
OS/runtime support for mini-threads
• dedicated server: recompile OS for fewer registers
• multiprogrammed environment: multiple versions
OS/runtime support for executing threaded programs
• page mapping , stack offsetting, dynamic memory allocation,
synchronization
16
Autumn 2006 CSE P548 - Multitheading 31
Others are Now Carrying the Ball
Fault detection & recovery
Thread-level speculation
Instruction & data prefetching
Instruction issue hardware design
Thread scheduling & thread priority
Single-thread execution
Profiling executing threads
SMT-CMP hybrids
Power considerations
Autumn 2006 CSE P548 - Multitheading 32
SMT Collaborators
UW
Hank Levy
Steve Gribble
Dean Tullsen (UC San Diego)
Jack Lo (VMWare)
Sujay Parekh (IBM Yorktown)
Brian Dewey (Microsoft)
Manu Thambi (Microsoft)
Josh Redstone (Google)
Mike Swift (Wisconsin)
Luke McDowell (Naval Academy)
Steve Swanson (UC San Diego)
Aaron Eakin (HP)
Dimitriy Portnov (Google)
DEC/Compaq
Joel Emer (now Intel)
Rebecca Stamm
Luiz Barroso (now Google)
Kourosh Gharachorloo (now Google)
For more info on SMT:
http://www.cs.washington.edu/research/smt

Weitere ähnliche Inhalte

Was ist angesagt?

Monitoring and Tuning Oracle FMW 11g
Monitoring and Tuning Oracle FMW 11gMonitoring and Tuning Oracle FMW 11g
Monitoring and Tuning Oracle FMW 11gMatthias Furrer
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool ManagementBIOVIA
 
weblogic perfomence tuning
weblogic perfomence tuningweblogic perfomence tuning
weblogic perfomence tuningprathap kumar
 
WEPA - Webdriver Enhanced Platform for Automation - WEPATest
WEPA - Webdriver Enhanced Platform for Automation - WEPATestWEPA - Webdriver Enhanced Platform for Automation - WEPATest
WEPA - Webdriver Enhanced Platform for Automation - WEPATestFreddy Vega
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentationrcastain
 
The Meteor Framework
The Meteor FrameworkThe Meteor Framework
The Meteor FrameworkDamien Magoni
 
Learn Oracle WebLogic Server 12c Administration
Learn Oracle WebLogic Server 12c AdministrationLearn Oracle WebLogic Server 12c Administration
Learn Oracle WebLogic Server 12c AdministrationRevelation Technologies
 
Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Simon Jia - The Kohana Framework
Simon Jia - The Kohana FrameworkSimon Jia - The Kohana Framework
Simon Jia - The Kohana FrameworkCaroline_Rose
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repositoryJukka Zitting
 
Collaborate 2011-tuning-ebusiness-416502
Collaborate 2011-tuning-ebusiness-416502Collaborate 2011-tuning-ebusiness-416502
Collaborate 2011-tuning-ebusiness-416502kaziul Islam Bulbul
 
OSI Model Layers and Internet Protocol Stack
OSI Model Layers and Internet Protocol StackOSI Model Layers and Internet Protocol Stack
OSI Model Layers and Internet Protocol StackShivam Mitra
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuningJukka Zitting
 
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsInstruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsSyed Zaid Irshad
 
End-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL ServerEnd-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL ServerKevin Kline
 
Reducing Downtime Using Incremental Backups X-Platform TTS
Reducing Downtime Using Incremental Backups X-Platform TTSReducing Downtime Using Incremental Backups X-Platform TTS
Reducing Downtime Using Incremental Backups X-Platform TTSEnkitec
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of JackrabbitJukka Zitting
 
IBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the CloudIBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the CloudAndrew Coleman
 

Was ist angesagt? (20)

Monitoring and Tuning Oracle FMW 11g
Monitoring and Tuning Oracle FMW 11gMonitoring and Tuning Oracle FMW 11g
Monitoring and Tuning Oracle FMW 11g
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management
 
weblogic perfomence tuning
weblogic perfomence tuningweblogic perfomence tuning
weblogic perfomence tuning
 
WEPA - Webdriver Enhanced Platform for Automation - WEPATest
WEPA - Webdriver Enhanced Platform for Automation - WEPATestWEPA - Webdriver Enhanced Platform for Automation - WEPATest
WEPA - Webdriver Enhanced Platform for Automation - WEPATest
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentation
 
The Meteor Framework
The Meteor FrameworkThe Meteor Framework
The Meteor Framework
 
Learn Oracle WebLogic Server 12c Administration
Learn Oracle WebLogic Server 12c AdministrationLearn Oracle WebLogic Server 12c Administration
Learn Oracle WebLogic Server 12c Administration
 
13 superscalar
13 superscalar13 superscalar
13 superscalar
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Simon Jia - The Kohana Framework
Simon Jia - The Kohana FrameworkSimon Jia - The Kohana Framework
Simon Jia - The Kohana Framework
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
 
Collaborate 2011-tuning-ebusiness-416502
Collaborate 2011-tuning-ebusiness-416502Collaborate 2011-tuning-ebusiness-416502
Collaborate 2011-tuning-ebusiness-416502
 
OSI Model Layers and Internet Protocol Stack
OSI Model Layers and Internet Protocol StackOSI Model Layers and Internet Protocol Stack
OSI Model Layers and Internet Protocol Stack
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuning
 
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsInstruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar Processors
 
End-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL ServerEnd-to-end Troubleshooting Checklist for Microsoft SQL Server
End-to-end Troubleshooting Checklist for Microsoft SQL Server
 
Reducing Downtime Using Incremental Backups X-Platform TTS
Reducing Downtime Using Incremental Backups X-Platform TTSReducing Downtime Using Incremental Backups X-Platform TTS
Reducing Downtime Using Incremental Backups X-Platform TTS
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
IBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the CloudIBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the Cloud
 
Weblogic
WeblogicWeblogic
Weblogic
 

Andere mochten auch

Andere mochten auch (20)

Text categorization as a graph
Text categorization as a graphText categorization as a graph
Text categorization as a graph
 
Data preparation
Data preparationData preparation
Data preparation
 
Java
JavaJava
Java
 
Czego pragna klienci
Czego pragna klienciCzego pragna klienci
Czego pragna klienci
 
Api crash
Api crashApi crash
Api crash
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Test driven development
Test driven developmentTest driven development
Test driven development
 
Maven
MavenMaven
Maven
 
List in webpage
List in webpageList in webpage
List in webpage
 
List and iterator
List and iteratorList and iterator
List and iterator
 
Crypto theory practice
Crypto theory practiceCrypto theory practice
Crypto theory practice
 
Text classification
Text classificationText classification
Text classification
 
Hash mac algorithms
Hash mac algorithmsHash mac algorithms
Hash mac algorithms
 
Exception handling
Exception handlingException handling
Exception handling
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Memory caching
Memory cachingMemory caching
Memory caching
 
Datamining with nb
Datamining with nbDatamining with nb
Datamining with nb
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Reflection
ReflectionReflection
Reflection
 

Ähnlich wie Motivation for multithreaded architectures

Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors pptSiddhartha Anand
 
Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technologyAmirali Sharifian
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture Haris456
 
Reduced instruction set computers
Reduced instruction set computersReduced instruction set computers
Reduced instruction set computersSyed Zaid Irshad
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Sarwan ali
 
IT209 Cpu Structure Report
IT209 Cpu Structure ReportIT209 Cpu Structure Report
IT209 Cpu Structure ReportBis Aquino
 
pipeline and pipeline hazards
pipeline and pipeline hazards pipeline and pipeline hazards
pipeline and pipeline hazards Bharti Khemani
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerAmrutaMehata
 
chap 18 multicore computers
chap 18 multicore computers chap 18 multicore computers
chap 18 multicore computers Sher Shah Merkhel
 
12 processor structure and function
12 processor structure and function12 processor structure and function
12 processor structure and functiondilip kumar
 
12 processor structure and function
12 processor structure and function12 processor structure and function
12 processor structure and functionSher Shah Merkhel
 

Ähnlich wie Motivation for multithreaded architectures (20)

Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors ppt
 
Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technology
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Reduced instruction set computers
Reduced instruction set computersReduced instruction set computers
Reduced instruction set computers
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
 
cs-procstruc.ppt
cs-procstruc.pptcs-procstruc.ppt
cs-procstruc.ppt
 
RISC.ppt
RISC.pptRISC.ppt
RISC.ppt
 
13 risc
13 risc13 risc
13 risc
 
IT209 Cpu Structure Report
IT209 Cpu Structure ReportIT209 Cpu Structure Report
IT209 Cpu Structure Report
 
13 risc
13 risc13 risc
13 risc
 
pipeline and pipeline hazards
pipeline and pipeline hazards pipeline and pipeline hazards
pipeline and pipeline hazards
 
Multicore computers
Multicore computersMulticore computers
Multicore computers
 
Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and Microcontroller
 
chap 18 multicore computers
chap 18 multicore computers chap 18 multicore computers
chap 18 multicore computers
 
12 processor structure and function
12 processor structure and function12 processor structure and function
12 processor structure and function
 
lect13_programmable_dp.pptx
lect13_programmable_dp.pptxlect13_programmable_dp.pptx
lect13_programmable_dp.pptx
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
12 processor structure and function
12 processor structure and function12 processor structure and function
12 processor structure and function
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 

Mehr von Young Alista

Serialization/deserialization
Serialization/deserializationSerialization/deserialization
Serialization/deserializationYoung Alista
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data miningYoung Alista
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningYoung Alista
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceYoung Alista
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksYoung Alista
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsYoung Alista
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaYoung Alista
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsYoung Alista
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonYoung Alista
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisYoung Alista
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in pythonYoung Alista
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with pythonYoung Alista
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friendYoung Alista
 

Mehr von Young Alista (20)

Serialization/deserialization
Serialization/deserializationSerialization/deserialization
Serialization/deserialization
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Cache recap
Cache recapCache recap
Cache recap
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Object model
Object modelObject model
Object model
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Abstract class
Abstract classAbstract class
Abstract class
 
Inheritance
InheritanceInheritance
Inheritance
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 
Learning python
Learning pythonLearning python
Learning python
 
Python basics
Python basicsPython basics
Python basics
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friend
 

Kürzlich hochgeladen

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Kürzlich hochgeladen (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Motivation for multithreaded architectures

  • 1. 1 Autumn 2006 CSE P548 - Multitheading 1 Motivation for Multithreaded Architectures Processors not executing code at their hardware potential • late 70’s: performance lost to memory latency • 90’s: performance not in line with the increasingly complex parallel hardware as well • increase in instruction issue bandwidth • increase in number of functional units • out-of-order execution • techniques for decreasing/hiding branch & memory latencies • Still, processor utilization was decreasing & instruction throughput not increasing in proportion to the issue width Autumn 2006 CSE P548 - Multitheading 2 Motivation for Multithreaded Architectures
  • 2. 2 Autumn 2006 CSE P548 - Multitheading 3 Motivation for Multithreaded Architectures Major cause is the lack of instruction-level parallelism in a single executing thread Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor Autumn 2006 CSE P548 - Multitheading 4 Multithreaded Processors Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling • holds processor state for more than one thread of execution • registers • PC • each thread’s state is a hardware context • execute the instruction stream from multiple threads without software context switching • utilize thread-level parallelism (TLP) to compensate for a lack in ILP
  • 3. 3 Autumn 2006 CSE P548 - Multitheading 5 Multithreading Traditional multithreaded processors hardware switch to a different context to avoid processor stalls Two styles of traditional multithreading 1. coarse-grain multithreading • switch on a long-latency operation (e.g., L2 cache miss) • another thread executes while the miss is handled • modest increase in instruction throughput • doesn’t hide latency of short-latency operations • no switch if no long-latency operations • need to fill the pipeline on a switch • potentially no slowdown to the thread with the miss • if stall is long & switch back fairly promptly • HEP, IBM RS64 III Autumn 2006 CSE P548 - Multitheading 6 Traditional Multithreading Two styles of traditional multithreading 2. fine-grain multithreading • can switch to a different thread each cycle (usually round robin) • hides latencies of all kinds • larger increase in instruction throughput but slows down the execution of each thread • Cray (Tera) MTA
  • 4. 4 Autumn 2006 CSE P548 - Multitheading 7 Comparison of Issue Capabilities Autumn 2006 CSE P548 - Multitheading 8 Simultaneous Multithreading (SMT) Third style of multithreading, different concept 3. simultaneous multithreading (SMT) • issues multiple instructions from multiple threads each cycle • no hardware context switching • same-cycle multithreading • huge boost in instruction throughput with less degradation to individual threads
  • 5. 5 Autumn 2006 CSE P548 - Multitheading 9 Comparison of Issue Capabilities Autumn 2006 CSE P548 - Multitheading 10 Cray (Tera) MTA Goals • the appearance of uniform memory access • lightweight synchronization • heterogeneous parallelism
  • 6. 6 Autumn 2006 CSE P548 - Multitheading 11 Cray (Tera) MTA Fine-grain multithreaded processor • can switch to a different thread each cycle • switches to ready threads only • up to 128 hardware contexts • lots of latency to hide, mostly from the multi-hop interconnection network • average instruction latency for computation: 22 cycles (i.e., 22 instruction streams needed to keep functional units busy) • average instruction latency including memory: 120 to 200- cycles (i.e., 120 to 200 instruction streams needed to hide all latency, on average) • processor state for all 128 contexts • GPRs (total of 4K registers!) • status registers (includes the PC) • branch target registers/stream Autumn 2006 CSE P548 - Multitheading 12 Cray (Tera) MTA Interesting features • No processor-side data caches • increases the latency for data accesses but reduces the variation between ops • to avoid having to keep caches coherent • memory-side buffers instead • L1 & L2 instruction caches • instruction accesses are more predictable & have no coherency problem • prefetch fall-through & target code
  • 7. 7 Autumn 2006 CSE P548 - Multitheading 13 Cray (Tera) MTA Interesting features • Trade-off between avoiding memory bank conflicts & exploiting spatial locality for data • conflicts: • memory distributed among hardware contexts • memory addresses are randomized to avoid conflicts • want to fully utilize all memory bandwidth • locality: • run-time system can confine consecutive virtual addresses to a single (close-by) memory unit • used mainly for the stack Autumn 2006 CSE P548 - Multitheading 14 Cray (Tera) MTA Interesting features • tagged memory • indirectly set full/empty bits to prevent data races • prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it • example for the consumer: • set to empty when producer instruction starts executing • consumer instructions block if try to read the producer value • set to full when producer writes value • consumers can now read a valid value • explicitly set full/empty bits for thread synchronization • primarily used accessing shared data • lock: read memory location & set to empty • other readers are blocked • unlock: write & set to full
  • 8. 8 Autumn 2006 CSE P548 - Multitheading 15 Cray (Tera) MTA Interesting features • no paging • want pages pinned down in memory • page size is 256MB • forward bit • memory contents interpreted as a pointer & dereferenced • used for GC & null reference checking • user-mode trap handlers • lighter weight • used for fatal exceptions, overflow, normalizing floating point numbers • not used for protection - user might override the RT • designed for user-written trap handlers, but too complicated for users Autumn 2006 CSE P548 - Multitheading 16 Cray (Tera) MTA Compiler support • VLIW instructions • memory/arithmetic/branch • load/store architecture • need a good code scheduler • memory dependence look-ahead • field in a memory instruction that specifies the number of independent memory ops that follow • guarantees nonstalling instruction choice • improves memory parallelism • handling branches • special instruction to store a branch target in a register before the branch is executed • can start prefetching the target code
  • 9. 9 Autumn 2006 CSE P548 - Multitheading 17 Cray (Tera) MTA Run-time support • number of executing threads • protection domain: group of threads executing in the same virtual address space • RT sets the maximum number of thread contexts (instruction streams) a domain is allowed (compiler estimate) • domain can create & kill threads within that limit, depending on its need for them Autumn 2006 CSE P548 - Multitheading 18 SMT: The Executive Summary Simultaneous multithreaded (SMT) processors combine designs from: • out-of-order superscalar processors • traditional multithreaded processors The combination enables a processor • that issues & executes instructions from multiple threads simultaneously => converting TLP to ILP • in which threads share almost all hardware resources
  • 10. 10 Autumn 2006 CSE P548 - Multitheading 19 Performance Implications Multiprogramming workload • 2.5X on SPEC95, 4X on SPEC2000 Parallel programs • ~1.7X on SPLASH2 Commercial databases • 2-3X on TPC B; 1.5X on TPC D Web servers & OS • 4X on Apache and Digital Unix Autumn 2006 CSE P548 - Multitheading 20 Does this Processor Sound Familiar? Technology transfer => • 2-context Intel Hyperthreading • 4-context IBM Power5 • 2-context Sun UltraSPARC on a 4-processor CMP • 4-context Compaq 21464 • network processor & mobile device start-ups • others in the wings
  • 11. 11 Autumn 2006 CSE P548 - Multitheading 21 An SMT Architecture Three primary goals for this architecture: 1. Achieve significant throughput gains with multiple threads 2. Minimize the performance impact on a single thread executing alone 3. Minimize the microarchitectural impact on a conventional out-of- order superscalar design Autumn 2006 CSE P548 - Multitheading 22 Implementing SMT
  • 12. 12 Autumn 2006 CSE P548 - Multitheading 23 Implementing SMT No special hardware for scheduling instructions from multiple threads • use the out-of-order renaming & instruction scheduling mechanisms • physical register pool model • renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads How it works: • map thread-specific architectural registers onto a pool of thread- independent physical registers • operands are thereafter called by their physical names • an instruction is issued when its operands become available & a functional unit is free • instruction scheduler not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities) Autumn 2006 CSE P548 - Multitheading 24 From Superscalar to SMT Extra pipeline stages for accessing thread-shared register files • 8 threads * 32 registers + renaming registers SMT instruction fetcher (ICOUNT) • fetch from 2 threads each cycle • count the number of instructions for each thread in the pre- execution stages • pick the 2 threads with the lowest number • in essence fetching from the two highest throughput threads
  • 13. 13 Autumn 2006 CSE P548 - Multitheading 25 From Superscalar to SMT Per-thread hardware • small stuff • all part of current out-of-order processors • none endangers the cycle time • other per-thread processor state, e.g., • program counters • return stacks • thread identifiers, e.g., with BTB entries, TLB entries • per-thread bookkeeping for, e.g., • instruction queue flush • instruction retirement • trapping This is why there is only a 15% increase to Alpha 21464 chip area. Autumn 2006 CSE P548 - Multitheading 26 Implementing SMT Thread-shared hardware: • fetch buffers • branch prediction structures • instruction queues • functional units • active list • all caches & TLBs • store buffers & MSHRs This is why there is little single-thread performance degradation (~1.5%).
  • 14. 14 Autumn 2006 CSE P548 - Multitheading 27 Architecture Research Concept & potential of Simultaneous Multithreading Designing the microarchitecture • straightforward extension of out-of-order superscalars I-fetch thread chooser • 40% faster than round-robin The lockbox for cheap synchronization • orders of magnitude faster • can parallelize previously unparallelizable codes Autumn 2006 CSE P548 - Multitheading 28 Architecture Research Software-directed register deallocation • large register-file performance w. small register file Mini-threads • large SMT performance w. small SMTs SMT instruction speculation • don’t execute as far down a wrong path • speculative instructions don’t get as far down the pipeline • speculation keeps a good thread mix in the IQ • most important factor for performance
  • 15. 15 Autumn 2006 CSE P548 - Multitheading 29 Compiler Research Tuning compiler optimizations for SMT • data decomposition: use cyclic iteration scheduling • tiling: use cyclic tiling; no tile size sweet spot Communicate last-use information to HW for early register deallocation • now need fewer renaming registers Compiling for fewer registers/thread • surprisingly little additional spill code (avg. 3%) Autumn 2006 CSE P548 - Multitheading 30 OS Research Analysis of OS behavior on SMT • Kernel-kernel conflicts in I$ & D$ & branch mispredictions ameliorated by SMT instruction issue + thread-sharing in HW OS/runtime support for mini-threads • dedicated server: recompile OS for fewer registers • multiprogrammed environment: multiple versions OS/runtime support for executing threaded programs • page mapping , stack offsetting, dynamic memory allocation, synchronization
  • 16. 16 Autumn 2006 CSE P548 - Multitheading 31 Others are Now Carrying the Ball Fault detection & recovery Thread-level speculation Instruction & data prefetching Instruction issue hardware design Thread scheduling & thread priority Single-thread execution Profiling executing threads SMT-CMP hybrids Power considerations Autumn 2006 CSE P548 - Multitheading 32 SMT Collaborators UW Hank Levy Steve Gribble Dean Tullsen (UC San Diego) Jack Lo (VMWare) Sujay Parekh (IBM Yorktown) Brian Dewey (Microsoft) Manu Thambi (Microsoft) Josh Redstone (Google) Mike Swift (Wisconsin) Luke McDowell (Naval Academy) Steve Swanson (UC San Diego) Aaron Eakin (HP) Dimitriy Portnov (Google) DEC/Compaq Joel Emer (now Intel) Rebecca Stamm Luiz Barroso (now Google) Kourosh Gharachorloo (now Google) For more info on SMT: http://www.cs.washington.edu/research/smt