PALLOC is a kernel-level memory allocator that exploits page-based virtual-to-physical memory translation to selectively allocate memory pages of each application to the desired DRAM banks. The goal of PALLOC is to control applications' memory locations in a way to minimize memory performance unpredictability in multicore systems by eliminating bank sharing among applications executing in parallel. PALLOC is a software based solution, which is fully compatible with existing COTS hardware platforms and transparent to applications (i.e., no need to modify application code.)
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms
1. PALLOC: DRAM Bank-Aware Memory
Allocator for Performance Isolation on
Multicore Platforms
Heechul Yun*, Renato Mancuso+, Zheng-Pei Wu#, Rodolfo Pellizzoni#
*University of Kansas, +University of Illinois , #University of Waterloo
3. Challenges: Shared Memory Hierarchy
3
CPU
Memory Hierarchy
Unicore
T1 T2
Core
1
Memory Hierarchy
Core
2
Core
3
Core
4
Multicore
T
1
T
2
T
3
T
4
T
5
T
6
T
7
T
8
Performance Impact
4. Impact of Memory Interference
• Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference)
• Slowdown ratio = Solo IPC / Corun IPC
4
Slowdownratio
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
(*) Measured on Intel Xeon 3530 (4 cores), 8GB 1ch DDR3 DRAM
5. Memory Performance Isolation
• Q. How to reduce memory contention?
Part 1 Part 2 Part 3 Part 4
5
Core1 Core2 Core3 Core4
DRAM
Memory Controller
7. Background: DRAM Organization
L3
DRAM DIMM
Memory Controller (MC)
Bank
4
Bank
3
Bank
2
Bank
1
Core1 Core2 Core3 Core4
• Have multiple banks
• Different banks can be
accessed in parallel
11. Worst-case
• 1bank b/w
– Less than peak b/w
– How much?
Slow
L3
DRAM DIMM
Memory Controller (MC)
Bank
4
Bank
3
Bank
2
Bank
1
Core1 Core2 Core3 Core4
13. Real Worst-case
1 bank & always row misses ~1/10 peak b/w
L3
DRAM DIMM
Memory Controller (MC)
Bank
4
Bank
3
Bank
2
Bank
1
Core
1
Core
2
Core
3
Core
4
Row 1
Row 2
Row 3
Row 4
Row 1
Row 2
…
Request order
time
14. Problem
L3
DRAM DIMM
Memory Controller (MC)
Bank
4
Bank
3
Bank
2
Bank
1
Core1 Core2 Core3 Core4
• OS does NOT know
DRAM banks
• OS memory pages are
spread all over multiple
banks
????
Unpredictable
memory
performance
SMP OS
16. DRAM DIMM
PALLOC
CPC
Memory Controller (MC)
Bank
4
Bank
3
Bank
2
Bank
1
Core1 Core2 Core3 Core4
• OS is aware of
DRAM mapping
• Each page can be
allocated to a
desired DRAM
bank
Flexible
allocation
policy
SMP OS
17. PALLOC
L3
DRAM DIMM
Memory Controller (MC)
Bank
4
Bank
3
Bank
2
Bank
1
Core1 Core2 Core3 Core4 • Private banking
– Allocate pages
on certain
exclusively
assigned banks
Eliminate
Inter-core bank
conflicts
18. Identifying Memory Mapping
• Memory mappings are platform specific
• We developed a detection tool software
18
12141921
banks banks
cache-sets
31 06
1418
banks
31 067
channel
cache-sets
16
Intel Xeon 3530 + 4GiB DDR3 DIMM (16 banks)
Freescale P4080 + 2x2 GiB DDR3 DIMM (32 banks)
27. Performance Impact on Unicore
• #of bank vs. performance on a single core
• Finding: bank partitioning negatively affect performance due to reduced
MLP, but not significant for most benchmarks
27
NormalizedIPC
0.00
0.20
0.40
0.60
0.80
1.00
1.20
4banks 8banks 16banks Buddy
28. Performance Isolation on 4 Cores
• Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference)
• PB: DRAM bank partitioning only;
• PB+PC: DRAM bank and Cache partitioning
• Finding: bank (and cache) partitioning improves isolation, but far from ideal
28
Slowdownratio
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
buddy PB PB+PC
30. Conclusion
• PALLOC
– DRAM bank aware kernel level memory allocator
– Can eliminate inter-core bank conflicts
• Findings
– Private banking improves performance isolation
– But, far from ideal isolation: memory bus bottleneck
• Future work
– Integration with memory bandwidth control (MemGuard
[RTAS’13])
– Multichannel, NUMA systems.
30
https://github.com/heechul/palloc
32. Data Intensive Applications
32
• Multimedia processing, object
tracking, game, big data(*), …
• More stress on the memory hierarchy
(*) Source: Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
34. Impact of Cache Partitioning
• PB: private DRAM banking
• PB+PC: private DRAM banking & private cache partitioning
34
Hinweis der Redaktion
Soon more rt/embedded systems will use multicore as well.
In the unicore systems, CPU time is the most important shared resource determining application’s performance. In the multicore systems, however, memory performance is also very important as multiple cores can concurrently access the memory and affect performance in significant ways.