In this paper, we show that cache partitioning does
not necessarily ensure predictable cache performance in modern
COTS multicore platforms that use non-blocking caches to exploit
memory-level-parallelism (MLP).
Through carefully designed experiments using three real COTS
multicore platforms (four distinct CPU architectures) and a cycle-accurate
full system simulator, we show that special hardware
registers in non-blocking caches, known as Miss Status Holding
Registers (MSHRs), which track the status of outstanding cache misses,
can be a significant source of contention; we observe up
to 21X WCET increase in a real COTS multicore platform due
to MSHR contention.
We propose a hardware and system software (OS) collaborative
approach to efficiently eliminate MSHR contention for
multicore real-time systems. Our approach includes a low-cost
hardware extension that enables dynamic control of per-core
MLP by the OS. Using the hardware extension, the OS scheduler
then globally controls each core’s MLP in such a way that
eliminates MSHR contention and maximizes overall throughput
of the system.
We implement the hardware extension in a cycle-accurate full-system
simulator and the scheduler modification in Linux 3.14
kernel. We evaluate the effectiveness of our approach using a set
of synthetic and macro benchmarks. In a case study, we achieve
up to 19% WCET reduction (average: 13%) for a set of EEMBC
benchmarks compared to a baseline cache partitioning setup.
1. Taming Non-blocking Caches to Improve
Isolation in Multicore Real-Time Systems
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi
University of Kansas
3. Time Predictability Challenge
• Shared hardware resource contention can cause
significant interference delays
• Shared cache is a major shared resource
[Figure: four tasks on four cores sharing the LLC, the memory controller (MC), and DRAM]
4. Cache Partitioning
• Can be done in either software or hardware
– Software: page coloring
– Hardware: way partitioning
• Eliminate unwanted cache-line evictions
• Common assumption
– cache partitioning → performance isolation
• If working-set fits in the cache partition
• Not necessarily true on out-of-order cores using
non-blocking caches
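Page coloring, the software option above, can be sketched as follows (a minimal illustrative sketch; the cache geometry and helper name are assumptions, not from the slides):

```c
/* Page coloring sketch (hypothetical parameters): a 2 MiB, 16-way LLC
 * with 4 KiB pages yields 2 MiB / (16 * 4 KiB) = 32 page colors. */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                        /* 4 KiB pages */
#define CACHE_SIZE  (2u << 20)                /* 2 MiB LLC   */
#define NUM_WAYS    16
#define NUM_COLORS  (CACHE_SIZE / (NUM_WAYS * (1u << PAGE_SHIFT)))

/* Pages with the same color map to the same group of cache sets, so the
 * OS can partition the cache by giving each core pages of disjoint colors. */
static unsigned page_color(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> PAGE_SHIFT) % NUM_COLORS);
}
```

With these illustrative parameters, physical pages 0, 32, 64, ... share color 0 and therefore the same set group.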
5. Outline
• Evaluate the isolation effect of cache partitioning
– On four real COTS multicore architectures
– Observed up to 21X slowdown of a task running on a
dedicated core accessing a dedicated cache partition
• Understand the source of contention
– Using a cycle-accurate full system simulator
• Propose an OS/architecture collaborative solution
– For better cache performance isolation
6. Memory-Level Parallelism (MLP)
• Broadly defined as the number of concurrent
memory requests that a given architecture
can handle at a time
7. Non-blocking Cache(*)
• Can serve cache hits under multiple cache misses
– Essential for an out-of-order core and any multicore
• Miss-Status-Holding Registers (MSHRs)
– On a miss, allocate an MSHR entry to track the request
– On receiving the data, clear the MSHR entry
[Figure: a blocking cache pays each miss penalty in sequence; a non-blocking cache overlaps multiple outstanding misses, and the CPU stalls only when the result is needed]
(*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
8. Non-blocking Cache
• # of cache MSHRs = memory-level parallelism
(MLP) of the cache
• What happens if all MSHRs are occupied?
– The cache is locked up
– Subsequent accesses to the cache, including
cache hits, stall
– We will see the impact of this in later experiments
10. Identified MLP
• Local MLP
– MLP of a core-private cache
• Global MLP
– MLP of the shared cache (and DRAM)
(See paper for our experimental identification method)
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
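The experimental identification method is in the paper; as an illustration only, a common style of MLP probe (our hypothetical sketch, not the authors' code) chases K independent pointer chains in lockstep and looks for the K at which throughput stops scaling:

```c
/* Hypothetical MLP probe: each chain is serialized by its own data
 * dependency, but the K loads per step are independent, so up to K
 * misses can be outstanding at once. Timing the chase loop for
 * K = 1, 2, 3, ... reveals the MLP as the knee where throughput
 * stops improving. */
#include <assert.h>
#include <stdlib.h>

#define CHAIN_LEN 4096

/* Build one chain: next[i] is the index visited after i, in a
 * pseudo-random single cycle so prefetchers cannot help. */
static void build_chain(size_t *next, size_t n) {
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        next[perm[i]] = perm[(i + 1) % n];
    free(perm);
}

/* Chase k chains (k <= 16) in lockstep; returns total loads issued. */
static size_t chase(size_t *const chains[], size_t k, size_t steps) {
    size_t pos[16] = {0};
    size_t visited = 0;
    for (size_t s = 0; s < steps; s++)
        for (size_t c = 0; c < k; c++) {
            pos[c] = chains[c][pos[c]];         /* dependent load per chain */
            visited++;
        }
    return visited;
}
```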
11. Cache Interference Experiments
• Measure the performance of the ‘subject’
– (1) alone, (2) with co-runners
– LLC is partitioned (equal partition) using PALLOC (*)
• Q: Does cache partitioning provide isolation?
[Figure: subject benchmark on Core1, co-runner(s) on Cores 2-4; the LLC is partitioned per core, DRAM is shared]
(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
12. IsolBench: Synthetic Workloads
• Latency
– A linked-list traversal, data dependency, one outstanding miss
• Bandwidth
– Array reads or writes, no data dependency, multiple misses
• Subject benchmarks fit in their LLC partition
Working-set sizes: (LLC) < ¼ of the LLC, so accesses are cache hits; (DRAM) > 2X the LLC, so accesses are cache misses
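A minimal sketch of the two kernels (simplified from the slide's description; sizes and names are illustrative, and the real IsolBench benchmarks time these loops and size the working sets per platform):

```c
/* Sketch of the IsolBench-style Latency and Bandwidth kernels. */
#include <assert.h>
#include <stdlib.h>

#define N 1024                              /* elements; sized vs. the partition */

struct node { struct node *next; long pad[7]; };   /* one 64-byte line per node */

/* Latency: pointer chasing. Each load depends on the previous one, so
 * at most one cache miss is outstanding at a time (MLP = 1). */
static struct node *latency_kernel(struct node *head, long iters) {
    struct node *p = head;
    for (long i = 0; i < iters; i++)
        p = p->next;
    return p;
}

/* Bandwidth: independent array reads, one per 64-byte cache line. With
 * no data dependency, an out-of-order core keeps many misses in flight,
 * bounded only by the available MSHRs. */
static long bandwidth_kernel(const long *arr, long n) {
    long sum = 0;
    for (long i = 0; i < n; i += 8)          /* stride of one cache line */
        sum += arr[i];
    return sum;
}
```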
13. Latency(LLC) vs. BwRead(DRAM)
• No interference on Cortex-A7 and Nehalem
• On Cortex-A15, Latency(LLC) suffers 3.8X slowdown
– despite partitioned LLC
14. BwRead(LLC) vs. BwRead(DRAM)
• Up to 10.6X slowdown on Cortex-A15
• Cache partitioning != performance isolation
– On all tested out-of-order cores (A9, A15, Nehalem)
15. BwRead(LLC) vs. BwWrite(DRAM)
• Up to 21X slowdown on Cortex-A15
• Writes generally cause more slowdowns
– Due to write-backs
17. MSHR Contention
• A shortage of cache MSHRs locks up the cache
• LLC MSHRs are shared resources
– 4 cores × L1 MSHRs > LLC MSHRs (e.g., Cortex-A15: 4 × 6 = 24 > 11)
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
(See paper for our experimental identification method)
18. Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
– Effect of LLC MSHRs
– Effect of L1 MSHRs
• Propose an OS/architecture collaborative
solution
20. Effect of LLC (Shared) MSHRs
• Increasing LLC MSHRs eliminates the MSHR contention
• Issue: scalable?
– MSHRs are highly associative (must be accessed in parallel)
[Figure: slowdowns with L1:6/LLC:12 MSHRs vs. L1:6/LLC:24 MSHRs]
21. Effect of L1 (Private) MSHRs
• Reducing L1 MSHRs can eliminate contention
• Issue: single-thread performance impact. How much?
– It depends. L1 MSHRs are often underutilized
– EEMBC: low, SD-VBS: medium, SPEC2006: high
[Figure: L1 MSHR demand of the benchmarks: EEMBC and SD-VBS (low) vs. IsolBench and SPEC2006 (high)]
22. Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
• Propose an OS/architecture collaborative
solution
23. OS Controlled MSHR Partitioning
• Add two registers in each core’s L1 cache
– TargetCount: max. MSHRs (set by OS)
– ValidCount: used MSHRs (set by HW)
• OS can control each core’s MLP
– By adjusting the core’s TargetCount register
[Figure: each core's L1 MSHR file (entries: block addr., valid, issue, target info) is extended with a TargetCount and a ValidCount register; all four cores' MSHRs feed the shared LLC MSHRs]
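The mechanism above can be sketched in C; TargetCount and ValidCount are from the slide, while the gating condition and function names are our illustrative reading, not the authors' RTL:

```c
/* Sketch of the proposed per-core MSHR throttling logic. */
#include <assert.h>

typedef struct {
    unsigned target_count;   /* max outstanding misses, written by the OS   */
    unsigned valid_count;    /* MSHRs currently in use, tracked by hardware */
} mshr_ctrl_t;

/* On a cache miss: allocate an MSHR only while under the target;
 * otherwise the core stalls until an in-flight miss completes. */
static int try_allocate_mshr(mshr_ctrl_t *c) {
    if (c->valid_count >= c->target_count)
        return 0;                     /* this core's MLP is capped */
    c->valid_count++;
    return 1;
}

/* On receiving the data from below: the MSHR entry is freed. */
static void complete_miss(mshr_ctrl_t *c) {
    c->valid_count--;
}
```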
24. OS Scheduler Design
• Partition LLC MSHRs by enforcing that the sum of per-core TargetCounts does not exceed the number of LLC MSHRs
• When RT tasks are scheduled
– RT tasks: reserve MSHRs
– Non-RT tasks: share remaining MSHRs
• Implemented in Linux kernel
– prepare_task_switch() at kernel/sched/core.c
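A hedged sketch of one possible enforcement policy (the hook prepare_task_switch() is from the slide; the partitioning rule, constants, and function names are illustrative assumptions, not the authors' kernel patch):

```c
/* Called on each context switch: cores running RT tasks get a fixed
 * MSHR reservation; non-RT cores share whatever remains, so the
 * per-core TargetCounts never oversubscribe the LLC MSHRs. */
#include <assert.h>

#define NCORES       4
#define LLC_MSHRS    11   /* global MSHRs, e.g., the Cortex-A15 figure  */
#define RT_RESERVED  2    /* MSHRs reserved per core running an RT task */

/* Assumes NCORES * RT_RESERVED <= LLC_MSHRS so reservations always fit. */
static void set_target_counts(const int is_rt[NCORES],
                              unsigned target[NCORES]) {
    int n_rt = 0;
    for (int i = 0; i < NCORES; i++)
        n_rt += is_rt[i];
    unsigned remaining = LLC_MSHRS - (unsigned)n_rt * RT_RESERVED;
    int n_nrt = NCORES - n_rt;
    for (int i = 0; i < NCORES; i++)
        target[i] = is_rt[i] ? RT_RESERVED
                             : (n_nrt ? remaining / (unsigned)n_nrt : 0);
}
```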
25. Case Study
• 4 RT tasks (EEMBC)
– One RT per core
– Reserve 2 MSHRs
– Periods (P): 20, 30, 40, 60 ms
– Execution times (C): ~8 ms
• 4 NRT tasks
– One NRT per core
– Run on slack
• Up to 20% WCET reduction
– Compared to cache partition only
26. Conclusion
• Evaluated the effect of cache partitioning on four real COTS
multicore architectures and a full system simulator
• Found cache partitioning does not ensure cache (hit)
performance isolation due to MSHR contention
• Proposed low-cost architecture extension and OS scheduler
support to eliminate MSHR contention
• Availability
– https://github.com/CSL-KU/IsolBench
– Synthetic benchmarks (IsolBench), test scripts, kernel patches
27. Exp3: BwRead(LLC) vs. BwRead(LLC)
• Cache bandwidth contention is not the main source
– Co-runners with higher cache bandwidth cause smaller slowdowns
28. EEMBC, SD-VBS Workload
• Subject
– Subset of EEMBC, SD-VBS
– High L2 hit, Low L2 miss
• Co-runners
– BwWrite(DRAM): High L2 miss, write-intensive
Speaker notes
This talk is about improving isolation in modern multicore real-time systems focusing on shared cache.
High-performance multicore processors are increasingly demanded in many modern real-time systems to satisfy the ever increasing performance need while reducing size, weight, and power consumption.
Unfortunately, it is well known that providing timing predictability is very difficult on multicore processors due to contention for shared hardware resources. In this work, we focus on the shared cache.
Cache partitioning is a well known technique to improve isolation for real-time systems. And it can be done either in software (by the OS) or in hardware. Either way, the goal of partitioning is to eliminate the unwanted cache-line evictions by providing isolated space per core/task.
Most of the literature equates cache partitioning with cache performance isolation. However, as we show in this work, partitioning the cache does not necessarily provide performance isolation on modern high-performance multicore architectures.
This talk is composed of three parts. First, we evaluate the isolation effect of cache partitioning on four real COTS multicore architectures. Surprisingly, we observe up to 21X slowdown of a task running on a dedicated core, accessing a dedicated cache partition, even though almost all of its memory accesses are cache hits.
We then further investigate and better understand the source of the contention using a cycle-accurate full system simulator.
Finally, based on this understanding, we propose an OS and architecture collaborative solution to provide true cache performance isolation.
Before we begin, let me briefly introduce essential background on non-blocking caches. A non-blocking cache can continue to serve cache-hit requests under multiple cache misses. Supporting multiple outstanding misses is important to leverage memory-level parallelism. It is especially important for out-of-order cores, but in-order multicores can also benefit from non-blocking shared caches.
In a non-blocking cache, there are special hardware registers known as miss-status-holding registers (MSHRs). When a cache miss occurs, the cache controller allocates an MSHR entry to track the outstanding miss, and the entry is cleared when the data is delivered from the lower-level memory hierarchy.
The number of MSHRs essentially determines the parallelism of the cache. Because MSHRs are limited and fetching data from DRAM can take hundreds of CPU cycles, it is possible for all of a cache's MSHRs to be occupied by outstanding misses. If that happens, the cache locks up and all subsequent accesses to the cache stall.
For the experiments, we use four COTS multicore architectures: one in-order and three out-of-order.
On these platforms, we perform a set of cache-interference experiments. The basic setup is as follows: we run a subject benchmark on one core and increase the number of co-runners on the other cores. I would like to stress that the LLC is partitioned on a per-core basis. So, what we want to see is whether cache partitioning provides cache performance isolation.
First, we use a benchmark suite called IsolBench that consists of two synthetic benchmarks, latency and bandwidth. The latency benchmark is a simple linked-list traversal program; due to its data dependency, only one outstanding miss can be generated at a time. The bandwidth benchmark is a simple array access program; as there is no data dependency, multiple concurrent accesses can be made on an out-of-order core. We created six workloads by varying their working-set sizes. Here, (LLC) denotes a working set smaller than the given cache partition, while (DRAM) denotes one twice the size of the LLC. In all workloads, the subject benchmarks always fit in their cache partitions; in other words, their accesses are cache hits. So, ideally, cache partitioning should provide performance isolation regardless of the co-runners.
We also used real benchmarks as subjects and obtained similar results: isolation is provided on the in-order core but not on the out-of-order cores.
We attribute this to the MSHR contention problem. This table shows the experimentally obtained core-local and global memory-level parallelism of the tested platforms. The detailed method to obtain the MLP is in the paper; these numbers were also validated against the processor manuals. What the table suggests is that the LLCs have fewer MSHRs than the sum of the L1 MSHRs on the tested out-of-order cores. As mentioned before, the shortage of LLC MSHRs can then lock up the LLC, even though the cache space is partitioned.
Next, we validate the MSHR contention problem using a cycle-accurate full system simulator. We also perform a set of sensitivity analyses to better understand the effects of MSHRs in the local and shared caches.
At first, we suspected the shared bus that connects the LLC and the cores, but the next experiment showed that this is not the case. In this experiment, we changed the co-runners' working-set sizes to fit in their cache partitions. Because the co-runners' memory accesses are now also cache hits, the cache bandwidth demand, and therefore any shared-bus contention, should have increased. Yet the slowdowns are much smaller than in the previous experiment, where cache bandwidth usage was much lower.