SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Taming Non-blocking Caches to Improve
Isolation in Multicore Real-Time Systems
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi
University of Kansas
1
High-Performance Multicores
for Real-Time Systems
• Why?
– Intelligence  more performance
– Space, weight, power (SWaP), cost
2
Time Predictability Challenge
3
Core1 Core2 Core3 Core4
Memory Controller (MC)
Shared Cache
• Shared hardware resource contention can cause
significant interference delays
• Shared cache is a major shared resource
DRAM
Task 1 Task 2 Task 3 Task 4
Cache Partitioning
• Can be done in either software or hardware
– Software: page coloring
– Hardware: way partitioning
• Eliminate unwanted cache-line evictions
• Common assumption
– cache partitioning  performance isolation
• If working-set fits in the cache partition
• Not necessarily true on out-of-order cores using
non-blocking caches
4
Outline
• Evaluate the isolation effect of cache partitioning
– On four real COTS multicore architectures
– Observed up to 21X slowdown of a task running on a
dedicated core accessing a dedicated cache partition
• Understand the source of contention
– Using a cycle-accurate full system simulator
• Propose an OS/architecture collaborative solution
– For better cache performance isolation
5
Memory-Level Parallelism (MLP)
• Broadly defined as the number of concurrent
memory requests that a given architecture
can handle at a time
6
Non-blocking Cache(*)
• Can serve cache hits under multiple cache misses
– Essential for an out-of-order core and any multicore
• Miss-Status-Holding Registers (MSHRs)
– On a miss, allocate a MSHR entry to track the req.
– On receiving the data, clear the MSHR entry
7
cpu cpu
miss hit miss
Miss penalty
Miss penalty
stall only when
result is needed
Multiple outstanding misses
(*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
Non-blocking Cache
• #of cache MSHRs  memory-level parallelism
(MLP) of the cache
• What happens if all MSHRs are occupied?
– The cache is locked up
– Subsequent accesses---including cache hits---to
the cache stall
– We will see the impact of this in later experiments
8
COTS Multicore Platforms
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Core 4core @
1.4GHz
In-order
4core @
1.7GHz
Out-of-order
4core @
2.0GHz
Out-of-order
4core @
2.8GHz
Out-of-order
LLC (shared) 512KB 1MB 2MB 8MB
9
• COTS multicore platforms
– Odroid-XU4: 4x Cortex-A7 and 4x Cortex-A15
– Odroid-U3: 4x Cortex-A9
– Dell desktop: Intel Xeon quad-core (Nehalem)
Identified MLP
• Local MLP
– MLP of a core-private cache
• Global MLP
– MLP of the shared cache (and DRAM)
10
(See paper for our experimental identification method)
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
Cache Interference Experiments
• Measure the performance of the ‘subject’
– (1) alone, (2) with co-runners
– LLC is partitioned (equal partition) using PALLOC (*)
• Q: Does cache partitioning provide isolation?
11
DRAM
LLC
Core1 Core2 Core3 Core4
subject co-runner(s)
(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Mem
ory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
IsolBench: Synthetic Workloads
• Latency
– A linked-list traversal, data dependency, one outstanding miss
• Bandwidth
– An array reads or writes, no data dependency, multiple misses
• Subject benchmarks: LLC partition fitting
12
Working-set size: (LLC) < ¼ LLC  cache-hits, (DRAM) > 2X LLC  cache misses
Latency(LLC) vs. BwRead(DRAM)
• No interference on Cortex-A7 and Nehalem
• On Cortex-A15, Latency(LLC) suffers 3.8X slowdown
– despite partitioned LLC
13
BwRead(LLC) vs. BwRead(DRAM)
• Up to 10.6X slowdown on Cortex-A15
• Cache partitioning != performance isolation
– On all tested out-of-order cores (A9, A15, Nehalem)
14
BwRead(LLC) vs. BwWrite(DRAM)
• Up to 21X slowdown on Cortex-A15
• Writes generally cause more slowdowns
– Due to write-backs
15
EEMBC and SD-VBS
• X-axis: EEMBC, SD-VBS (cache partition fitting)
– Co-runners: BwWrite(DRAM)
• Cache partitioning != performance isolation
16
Cortex-A7 (in-order) Cortex-A15 (out-of-order)
MSHR Contention
• Shortage of cache MSHRs  lock up the cache
• LLC MSHRs are shared resources
– 4cores x L1 cache MSHRs > LLC MSHRs
17
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
(See paper for our experimental identification method)
Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
– Effect of LLC MSHRs
– Effect of L1 MSHRs
• Propose an OS/architecture collaborative
solution
18
Simulation Setup
• Gem5 full-system simulator
– 4 out-of-order cores (modeling Cortex-A15)
• L1: 32K I&D, LLC (L2): 2048K
– Vary #of L1 and LLC MSHRs
• Linux 3.14
– Use PALLOC to partition LLC
• Workload
– IsolBench, EEMBC, SD-VBS, SPEC2006
19
DRAM
LLC
Core1 Core2 Core3 Core4
subject co-runner(s)
Effect of LLC (Shared) MSHRs
• Increasing LLC MSHRs eliminates the MSHR contention
• Issue: scalable?
– MSHRs are highly associative (must be accessed in parallel)
20
L1:6/LLC:12 MSHRs L1:6/LLC:24 MSHRs
Effect of L1 (Private) MSHRs
• Reducing L1 MSHRs can eliminate contention
• Issue: single thread performance. How much?
– It depends. L1 MSHRs are often underutilized
– EEMBC: low, SD-VBS: medium, SPEC2006: high
21
EEMBC, SD-VBS IsolBench,SPEC2006
Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
• Propose an OS/architecture collaborative
solution
22
OS Controlled MSHR Partitioning
23
• Add two registers in each core’s L1 cache
– TargetCount: max. MSHRs (set by OS)
– ValidCount: used MSHRs (set by HW)
• OS can control each core’s MLP
– By adjusting the core’s TargetCount register
Block Addr.Valid Issue Target Info.
Block Addr.Valid Issue Target Info.
Block Addr.Valid Issue Target Info.
MSHR 1
MSHR 2
MSHR n
TargetCount ValidCount
Last Level Cache (LLC)
Core
1
Core
2
Core
3
Core
4
MSHRs
MSHRs MSHRs MSHRs MSHRs
OS Scheduler Design
• Partition LLC MSHRs by enforcing
• When RT tasks are scheduled
– RT tasks: reserve MSHRs
– Non-RT tasks: share remaining MSHRs
• Implemented in Linux kernel
– prepare_task_switch() at kernel/sched/core.c
24
Case Study
• 4 RT tasks (EEMBC)
– One RT per core
– Reserve 2 MSHRs
– P: 20,30,40,60ms
– C: ~8ms
• 4 NRT tasks
– One NRT per core
– Run on slack
• Up to 20% WCET reduction
– Compared to cache partition only
25
Conclusion
• Evaluated the effect of cache partitioning on four real COTS
multicore architectures and a full system simulator
• Found cache partitioning does not ensure cache (hit)
performance isolation due to MSHR contention
• Proposed low-cost architecture extension and OS scheduler
support to eliminate MSHR contention
• Availability
– https://github.com/CSL-KU/IsolBench
– Synthetic benchmarks (IsolBench),test scripts, kernel patches
26
Exp3: BwRead(LLC) vs. BwRead(LLC)
• Cache bandwidth contention is not the main source
– Higher cache-bandwidth co-runners cause less slowdowns
27
EEMBC, SD-VBS Workload
• Subject
– Subset of EEMBC, SD-VBS
– High L2 hit, Low L2 miss
• Co-runners
– BwWrite(DRAM): High L2 miss, write-intensive
28

Weitere ähnliche Inhalte

Was ist angesagt?

Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
Knowledge representation and reasoning
Knowledge representation and reasoningKnowledge representation and reasoning
Knowledge representation and reasoningMaryam Maleki
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processingPage Maker
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
parallel computing.ppt
parallel computing.pptparallel computing.ppt
parallel computing.pptssuser413a98
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningRoberto Pereira Silveira
 
Computer architecture multi processor
Computer architecture multi processorComputer architecture multi processor
Computer architecture multi processorMazin Alwaaly
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Universitat Politècnica de Catalunya
 
Lec 3 knowledge acquisition representation and inference
Lec 3  knowledge acquisition representation and inferenceLec 3  knowledge acquisition representation and inference
Lec 3 knowledge acquisition representation and inferenceEyob Sisay
 
Computer architecture memory system
Computer architecture memory systemComputer architecture memory system
Computer architecture memory systemMazin Alwaaly
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep LearningNatasha Latysheva
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processinggulshan kumar
 
Multiprocessor
MultiprocessorMultiprocessor
MultiprocessorA B Shinde
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement LearningDavid Jardim
 
Research Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel ProgrammingResearch Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel ProgrammingShitalkumar Sukhdeve
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in VisionSangmin Woo
 
Building blocks of deep learning
Building blocks of deep learningBuilding blocks of deep learning
Building blocks of deep learningKeshan Sodimana
 
GraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBGraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBArangoDB Database
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)VenkateshMurugadas
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashrisheetal katkar
 

Was ist angesagt? (20)

Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Knowledge representation and reasoning
Knowledge representation and reasoningKnowledge representation and reasoning
Knowledge representation and reasoning
 
Applications of paralleL processing
Applications of paralleL processingApplications of paralleL processing
Applications of paralleL processing
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
parallel computing.ppt
parallel computing.pptparallel computing.ppt
parallel computing.ppt
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
 
Computer architecture multi processor
Computer architecture multi processorComputer architecture multi processor
Computer architecture multi processor
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
 
Lec 3 knowledge acquisition representation and inference
Lec 3  knowledge acquisition representation and inferenceLec 3  knowledge acquisition representation and inference
Lec 3 knowledge acquisition representation and inference
 
Computer architecture memory system
Computer architecture memory systemComputer architecture memory system
Computer architecture memory system
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep Learning
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
 
Multiprocessor
MultiprocessorMultiprocessor
Multiprocessor
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement Learning
 
Research Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel ProgrammingResearch Scope in Parallel Computing And Parallel Programming
Research Scope in Parallel Computing And Parallel Programming
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Building blocks of deep learning
Building blocks of deep learningBuilding blocks of deep learning
Building blocks of deep learning
 
GraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBGraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDB
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashri
 

Ähnlich wie Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsHeechul Yun
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureHeechul Yun
 
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...Heechul Yun
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allenjaxconf
 
Ct213 memory subsystem
Ct213 memory subsystemCt213 memory subsystem
Ct213 memory subsystemSandeep Kamath
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Heechul Yun
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization2022002857mbit
 
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityErasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityYue Cheng
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Hsien-Hsin Sean Lee, Ph.D.
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory UsageTomer Perry
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Sarwan ali
 
chap 18 multicore computers
chap 18 multicore computers chap 18 multicore computers
chap 18 multicore computers Sher Shah Merkhel
 

Ähnlich wie Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems (20)

Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System Architecture
 
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
 
Memory (Computer Organization)
Memory (Computer Organization)Memory (Computer Organization)
Memory (Computer Organization)
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Ct213 memory subsystem
Ct213 memory subsystemCt213 memory subsystem
Ct213 memory subsystem
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization
 
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityErasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
 
Lect18
Lect18Lect18
Lect18
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
7_mem_cache.ppt
7_mem_cache.ppt7_mem_cache.ppt
7_mem_cache.ppt
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory Usage
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
 
chap 18 multicore computers
chap 18 multicore computers chap 18 multicore computers
chap 18 multicore computers
 

Kürzlich hochgeladen

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 

Kürzlich hochgeladen (20)

🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

  • 1. Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1
  • 2. High-Performance Multicores for Real-Time Systems • Why? – Intelligence  more performance – Space, weight, power (SWaP), cost 2
  • 3. Time Predictability Challenge 3 Core1 Core2 Core3 Core4 Memory Controller (MC) Shared Cache • Shared hardware resource contention can cause significant interference delays • Shared cache is a major shared resource DRAM Task 1 Task 2 Task 3 Task 4
  • 4. Cache Partitioning • Can be done in either software or hardware – Software: page coloring – Hardware: way partitioning • Eliminate unwanted cache-line evictions • Common assumption – cache partitioning  performance isolation • If working-set fits in the cache partition • Not necessarily true on out-of-order cores using non-blocking caches 4
  • 5. Outline • Evaluate the isolation effect of cache partitioning – On four real COTS multicore architectures – Observed up to 21X slowdown of a task running on a dedicated core accessing a dedicated cache partition • Understand the source of contention – Using a cycle-accurate full system simulator • Propose an OS/architecture collaborative solution – For better cache performance isolation 5
  • 6. Memory-Level Parallelism (MLP) • Broadly defined as the number of concurrent memory requests that a given architecture can handle at a time 6
  • 7. Non-blocking Cache(*) • Can serve cache hits under multiple cache misses – Essential for an out-of-order core and any multicore • Miss-Status-Holding Registers (MSHRs) – On a miss, allocate a MSHR entry to track the req. – On receiving the data, clear the MSHR entry 7 cpu cpu miss hit miss Miss penalty Miss penalty stall only when result is needed Multiple outstanding misses (*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
  • 8. Non-blocking Cache • #of cache MSHRs  memory-level parallelism (MLP) of the cache • What happens if all MSHRs are occupied? – The cache is locked up – Subsequent accesses---including cache hits---to the cache stall – We will see the impact of this in later experiments 8
  • 9. COTS Multicore Platforms Cortex-A7 Cortex-A9 Cortex-A15 Nehalem Core 4core @ 1.4GHz In-order 4core @ 1.7GHz Out-of-order 4core @ 2.0GHz Out-of-order 4core @ 2.8GHz Out-of-order LLC (shared) 512KB 1MB 2MB 8MB 9 • COTS multicore platforms – Odroid-XU4: 4x Cortex-A7 and 4x Cortex-A15 – Odroid-U3: 4x Cortex-A9 – Dell desktop: Intel Xeon quad-core (Nehalem)
  • 10. Identified MLP • Local MLP – MLP of a core-private cache • Global MLP – MLP of the shared cache (and DRAM) 10 (See paper for our experimental identification method) Cortex-A7 Cortex-A9 Cortex-A15 Nehalem Local MLP 1 4 6 10 Global MLP 4 4 11 16
  • 11. Cache Interference Experiments • Measure the performance of the ‘subject’ – (1) alone, (2) with co-runners – LLC is partitioned (equal partition) using PALLOC (*) • Q: Does cache partitioning provide isolation? 11 DRAM LLC Core1 Core2 Core3 Core4 subject co-runner(s) (*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Mem ory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
  • 12. IsolBench: Synthetic Workloads • Latency – A linked-list traversal, data dependency, one outstanding miss • Bandwidth – An array reads or writes, no data dependency, multiple misses • Subject benchmarks: LLC partition fitting 12 Working-set size: (LLC) < ¼ LLC  cache-hits, (DRAM) > 2X LLC  cache misses
  • 13. Latency(LLC) vs. BwRead(DRAM) • No interference on Cortex-A7 and Nehalem • On Cortex-A15, Latency(LLC) suffers 3.8X slowdown – despite partitioned LLC 13
  • 14. BwRead(LLC) vs. BwRead(DRAM) • Up to 10.6X slowdown on Cortex-A15 • Cache partitioning != performance isolation – On all tested out-of-order cores (A9, A15, Nehalem) 14
  • 15. BwRead(LLC) vs. BwWrite(DRAM) • Up to 21X slowdown on Cortex-A15 • Writes generally cause more slowdowns – Due to write-backs 15
  • 16. EEMBC and SD-VBS • X-axis: EEMBC, SD-VBS (cache partition fitting) – Co-runners: BwWrite(DRAM) • Cache partitioning != performance isolation 16 Cortex-A7 (in-order) Cortex-A15 (out-of-order)
  • 17. MSHR Contention • Shortage of cache MSHRs  lock up the cache • LLC MSHRs are shared resources – 4cores x L1 cache MSHRs > LLC MSHRs 17 Cortex-A7 Cortex-A9 Cortex-A15 Nehalem Local MLP 1 4 6 10 Global MLP 4 4 11 16 (See paper for our experimental identification method)
  • 18. Outline • Evaluate the isolation effect of cache partitioning • Understand the source of contention – Effect of LLC MSHRs – Effect of L1 MSHRs • Propose an OS/architecture collaborative solution 18
  • 19. Simulation Setup • Gem5 full-system simulator – 4 out-of-order cores (modeling Cortex-A15) • L1: 32K I&D, LLC (L2): 2048K – Vary #of L1 and LLC MSHRs • Linux 3.14 – Use PALLOC to partition LLC • Workload – IsolBench, EEMBC, SD-VBS, SPEC2006 19 DRAM LLC Core1 Core2 Core3 Core4 subject co-runner(s)
  • 20. Effect of LLC (Shared) MSHRs • Increasing LLC MSHRs eliminates the MSHR contention • Issue: scalable? – MSHRs are highly associative (must be accessed in parallel) 20 L1:6/LLC:12 MSHRs L1:6/LLC:24 MSHRs
  • 21. Effect of L1 (Private) MSHRs • Reducing L1 MSHRs can eliminate contention • Issue: single thread performance. How much? – It depends. L1 MSHRs are often underutilized – EEMBC: low, SD-VBS: medium, SPEC2006: high 21 EEMBC, SD-VBS IsolBench,SPEC2006
  • 22. Outline • Evaluate the isolation effect of cache partitioning • Understand the source of contention • Propose an OS/architecture collaborative solution 22
  • 23. OS Controlled MSHR Partitioning 23 • Add two registers in each core’s L1 cache – TargetCount: max. MSHRs (set by OS) – ValidCount: used MSHRs (set by HW) • OS can control each core’s MLP – By adjusting the core’s TargetCount register Block Addr.Valid Issue Target Info. Block Addr.Valid Issue Target Info. Block Addr.Valid Issue Target Info. MSHR 1 MSHR 2 MSHR n TargetCount ValidCount Last Level Cache (LLC) Core 1 Core 2 Core 3 Core 4 MSHRs MSHRs MSHRs MSHRs MSHRs
  • 24. OS Scheduler Design • Partition LLC MSHRs by enforcing • When RT tasks are scheduled – RT tasks: reserve MSHRs – Non-RT tasks: share remaining MSHRs • Implemented in Linux kernel – prepare_task_switch() at kernel/sched/core.c 24
  • 25. Case Study • 4 RT tasks (EEMBC) – One RT per core – Reserve 2 MSHRs – P: 20,30,40,60ms – C: ~8ms • 4 NRT tasks – One NRT per core – Run on slack • Up to 20% WCET reduction – Compared to cache partition only 25
  • 26. Conclusion • Evaluated the effect of cache partitioning on four real COTS multicore architectures and a full system simulator • Found cache partitioning does not ensure cache (hit) performance isolation due to MSHR contention • Proposed low-cost architecture extension and OS scheduler support to eliminate MSHR contention • Availability – https://github.com/CSL-KU/IsolBench – Synthetic benchmarks (IsolBench),test scripts, kernel patches 26
  • 27. Exp3: BwRead(LLC) vs. BwRead(LLC) • Cache bandwidth contention is not the main source – Higher cache-bandwidth co-runners cause less slowdowns 27
  • 28. EEMBC, SD-VBS Workload • Subject – Subset of EEMBC, SD-VBS – High L2 hit, Low L2 miss • Co-runners – BwWrite(DRAM): High L2 miss, write-intensive 28

Hinweis der Redaktion

  1. This talk is about improving isolation in modern multicore real-time systems focusing on shared cache.
  2. High-performance multicore processors are increasingly demanded in many modern real-time systems to satisfy the ever increasing performance need while reducing size, weight, and power consumption.
  3. Unfortunately, it is well know that providing timing predictability is very difficult in multicore processors due to contention in shared hardware resources. In this work, we focus on shared cache.
  4. Cache partitioning is a well known technique to improve isolation for real-time systems. And it can be done either in software (by the OS) or in hardware. Either way, the goal of partitioning is to eliminate the unwanted cache-line evictions by providing isolated space per core/task. Most literature equate cache partitioning and cache performance isolation. However, as we will show in this work, partitioning cache doesn’t necessarily provide performance isolation in modern high-performance multicore architectures.
  5. This talk is composed of three parts. First, we evaluate the isolation effect cache partitioning on four real COTS multicore architectures. Surprisingly, we observe up to 21X slowdown of a task running on a dedicated core, accessing dedicated cache partition, and almost all memory accesses are cache hits. We then further investigate and better understand the source of the contention using a cycle accurate full system simulator. Finally, based on the understanding, we propose an OS and architecture collaborative solution to to provide true cache performance isolation.
  6. Before we begin, let me briefly introduce essential background on non-blocking caches. A non-blocking cache can continue to serve cache-hit requests under multiple cache misses. Supporting multiple outstanding misses is important to leverage memory-level parallelism. And it is especially important for out-of-order cores, but in-order multicore also can benefit from using non-blocking shared caches. In a non-blocking cache, there are special hardware registers known as miss-status-holing registers (MSHRs). When a cache miss occurs, the cache controller allocates a MSHR entry to track the outgoing cache-miss and it is cleared when the data is delivered from the lower-level memory hierarch,
  7. The number of MSHRs essentially determines the parallelism of the cache. Now, because MSHRs are limited and getting the data from DRAM could take 100s of cpu cycles, it is possible that MSRHs of a cache are fully occupied with outstanding misses. If that happens, the cache lock-up and all the subsequent accesses to the cache will be stalled.
  8. For experiments, we use four COTS multicore architectures. One in-order and three out-of-order architectuers.
  9. On the platforms, we perform a set of cache-interference experiments. The basic setup is as follows. We ran a subject benchmark on one core, and we increase the number of co-runners on the other cores. I would like to stress that the LLC is partitioned on a per-core basis. So, what we would like to see is whether cache portioning provide cache performance isolation.
  10. First, we use a benchmark suite called IsolBench that consists of two synthetic benchmarks latency and bandwidth. The latency benchmark is a simple linked-list traversal program. Due to data dependency, only one outstanding miss can be generated at a time. The bandwidth benchmark is a simple array access program. As there’s no data dependency, multiple concurrent accesses can be made in an out-of-order core. We created 6 workloads with varying their working-set sizes. Here, LLC denotes less than the given cache partiion while (DRAM) denotes twice size of the the LLC. In all workloads, the subject benchmarks always fit in their cache partitions. In other words, accesses are cache hits. So, ideally, cache partitioning should provide performance isolation regardless of the co-runners.
  11. We also used real benchmarks as subjects and get the similar results. That is isolation is provided in in-order cores but not so in out-of-order cores
  12. We attribute this to the MSHR contention problem. This table shows the experimentally obtained core local and global memory-level parallelism of the tested platforms. Detailed methods to obtain MLP is in paper. These numbers were also validated from the manuals of the processors. What this table suggests is that the MSHRs in the LLCs are smaller than the sum of L1 MSHRs on the tested out-of-order cores. As I mentioned before, then the shortage of the LLC MSHRs can lock-up the LLC, even if the cache space is partitioned.
  13. Next, we validate the MSHR contention problem using a cycle-accurate full system simulator. We also perform a set of sensitive analysis to better understand the effects of MSHRs in the local and shared caches.
  14. At first, we suspected that this may be because of shared bus that connects the LLC and the cores. But we found that it was not the case from next experiment. In this experiment, we chaged the corunners’ working-set sizes to fit in their cache partitions. Because now that co-runners’ memory accesses are also cache hits, the cache bandwidth demand, and therefore the shared bus contention would have been increased. Yet, the slowdown becomes much smaller than the previous experiment where cache bandwidth usages were much lower.