ARI. HiPEAC 2014
1. Viacheslav Fedorov, Sheng Qiu,
Narasimha Reddy, Paul Gratz
Texas A&M University
ARI:
Adaptive Replacement and Insertion
HiPEAC 2013, Vienna, Austria
2. Conventional Main Memory
● Usually we only care about speeding up the cache miss path
[Figure: Core 0 to Core 3, L2$ caches, L3$, Main Memory]
3. Main Memory: Trends
● New Memories emerging
● DRAM not dense enough
● Replace or augment DRAM
[Figure: Core 0 to Core 3, L2$ caches, L3$; main memory options shown: DRAM, PCM, DRAM cache]
4. PCM Technology
● Based on Chalcogenide glass
● Exploits two phases
● Amorphous
● Crystalline
● Higher density than DRAM
● Non-volatile
Image: Stanford NanoHeat Lab
5. DRAM vs PCM
● DRAM is writeback-agnostic
● Write Buffers cushion the impact of writebacks
● State-of-the-art policies target cache misses
● PCM
● High write latency – Write Buffers insufficient
● High write energy – Mobile, embedded devices?
● Low cell endurance – Limited write cycles?
Parameter          DRAM     PCM      PCM vs DRAM
Row Read           210 mW   78 mW    0.3x
Row Write          195 mW   773 mW   4x
Activate           75 mW    25 mW    0.3x
Standby            90 mW    45 mW    0.5x
Refresh            4 mW     0 mW     0x
Initial Row Read   15 ns    28 ns    2x
Row Write          22 ns    150 ns   7x
Same Row R/W       15 ns    15 ns
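The multipliers on this slide can be checked directly from the table values. The following is a minimal sketch; the dictionary layout and function names are illustrative and not part of the original slides.

```python
# Values copied from the DRAM vs PCM table as (DRAM, PCM) pairs.
# The data layout and names here are illustrative assumptions.
power_mw = {
    "Row Read": (210, 78),
    "Row Write": (195, 773),
    "Activate": (75, 25),
    "Standby": (90, 45),
    "Refresh": (4, 0),
}
latency_ns = {
    "Initial Row Read": (15, 28),
    "Row Write": (22, 150),
    "Same Row R/W": (15, 15),
}


def ratios(table):
    """Return the PCM/DRAM ratio for each parameter, rounded to two decimals."""
    return {name: round(pcm / dram, 2) for name, (dram, pcm) in table.items()}


if __name__ == "__main__":
    for title, table in (("Power", power_mw), ("Latency", latency_ns)):
        for name, ratio in ratios(table).items():
            print(f"{title:8s} {name:18s} PCM/DRAM = {ratio}x")
```

The write-related rows stand out: roughly 4x the power and 7x the latency of DRAM, which is why the rest of the deck focuses on writebacks.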
7. Motivation
● PCM is attractive as a Main Memory, but...
● PCM does not favor writes
● High energy
● High latency
● Low write cycle tolerance
● Solution: reduce writes into Main Memory
● Modify LLC policies to reduce Writebacks
● Mind the Miss rate!
8. Application behavior in
High-Associativity Caches
● Bi-Polar block distribution due to LRU policy
● 'Hot' blocks tend to group towards MRU side
● 'Cold' blocks towards LRU side in a set
● Hot blocks have higher Hit-ratio
● Cold blocks tend to have similar Hit-ratios
[Figure: Hit distribution in a high-associativity cache (16-way). Axes: % hit rate vs. position in LRU stack (MRU to LRU), with a 'Hot' region and a 'Cold' region]
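A hit distribution of this kind can be gathered by recording, for every hit, the LRU stack position of the block that was hit. The sketch below models a single set under plain LRU; the function name and the trace format (a sequence of tags) are assumptions made for illustration.

```python
from collections import Counter


def hits_by_stack_position(accesses, ways=16):
    """Simulate one LRU-managed set and count hits per stack position.

    Position 0 is MRU and position ways-1 is LRU.
    `accesses` is a sequence of block tags (hypothetical trace format).
    """
    stack = []        # stack[0] is the MRU block, stack[-1] the LRU block
    hits = Counter()
    for tag in accesses:
        if tag in stack:
            pos = stack.index(tag)
            hits[pos] += 1      # record where in the stack the hit landed
            stack.pop(pos)
        elif len(stack) == ways:
            stack.pop()         # miss in a full set: evict the LRU block
        stack.insert(0, tag)    # the accessed block becomes MRU
    return [hits[p] for p in range(ways)]
```

For example, `hits_by_stack_position(["a", "b", "a", "a", "c", "b"], ways=4)` yields one hit each at positions 1, 0, and 2.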
9. Static LLC policies
● Based on the observed hot-cold distribution
● 16-way cache: 16 static policies, xH16 (x High-Hit blocks out of 16)
● Replace any clean block among the (16-x) Low-Hit blocks
● Drawbacks:
● No single static policy good for all applications
● Fewer writebacks => more cache misses
– When replacing hot blocks
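Victim selection under a static xH16 policy might look like the sketch below. The slide only says that any clean block among the (16-x) low-hit positions may be replaced; scanning from the LRU end and falling back to the LRU block when every low-hit block is dirty are assumptions, as are the function name and the `(tag, dirty)` representation.

```python
def choose_victim(stack, x):
    """Pick a victim index in one set under a static xH16-style policy.

    stack: list of (tag, dirty) tuples ordered MRU (index 0) to LRU (last).
    x:     number of positions on the MRU side treated as the high-hit region.
    """
    lru = len(stack) - 1
    # Assumption: search the low-hit region from the LRU end upwards
    # and take the first clean block found.
    for pos in range(lru, x - 1, -1):
        if not stack[pos][1]:
            return pos
    # Assumption: if all low-hit blocks are dirty, evict the LRU block.
    return lru
```

With a small x, the low-hit region is large and a clean victim is almost always available, but the victim may be a hot block, which is the miss-rate drawback noted above.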
10. Enter ARI:
Adaptive Replacement and Insertion
● Goal: Reduce LLC writebacks!
● Keep miss rate lower than conventional policies
● How?
● Do not replace dirty cache blocks (as long as possible)
● Place fresh incoming blocks into LLC smartly
● Dynamically choose the best policy
11. ARI: Operation
● Evict clean blocks from Low-Hit region
● Insert new blocks into top of Low-Hit region
[Figure: % hit rate vs. position in LRU stack (MRU to LRU), divided into a High-Hit region and a Low-Hit region]
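The two rules on this slide can be combined into a single access routine for one set. This is a rough sketch under stated assumptions, not the authors' implementation: hits promote the block to MRU as in plain LRU, the clean victim is searched from the LRU end, and the LRU block is evicted when no clean low-hit block exists. The name `ari_access` and the `[tag, dirty]` layout are hypothetical.

```python
def ari_access(stack, tag, is_write, x, ways=16):
    """Process one access to a set; return (hit, writeback).

    stack: list of [tag, dirty] entries, index 0 = MRU, last = LRU.
    x:     size of the High-Hit region; positions x..ways-1 are Low-Hit.
    """
    for pos, block in enumerate(stack):
        if block[0] == tag:
            block[1] = block[1] or is_write
            stack.insert(0, stack.pop(pos))   # assumption: LRU-style promotion
            return True, False

    writeback = False
    if len(stack) == ways:
        victim = len(stack) - 1               # assumption: fall back to LRU
        for pos in range(len(stack) - 1, x - 1, -1):
            if not stack[pos][1]:             # clean block in Low-Hit region
                victim = pos
                break
        writeback = stack.pop(victim)[1]      # dirty victim => writeback

    # Insert the new block at the top of the Low-Hit region.
    stack.insert(min(x, len(stack)), [tag, is_write])
    return False, writeback
```

Because new blocks enter at position x rather than at MRU, they must earn a hit before they can displace blocks in the High-Hit region.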
12. ARI: Operation
● Application hit-distributions are not static
● Dynamic policy adaptation based on epochs
● Emulate various static thresholds in LLC tags
● Pick the best one for next epoch (25k LLC accesses)
● Misses + Writebacks metric used
[Figure: % hit rate across the LRU stack, MRU to LRU]
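The epoch mechanism can be sketched as a set of per-threshold cost counters that are compared once per epoch. The class below is an illustration only: the class name, the tie-breaking rule (first threshold with the lowest cost), and resetting counters at each epoch boundary are assumptions. The epoch length of 25k LLC accesses and the Misses + Writebacks metric come from the slide.

```python
class EpochSelector:
    """Pick the threshold with the lowest misses + writebacks each epoch."""

    def __init__(self, thresholds, epoch_len=25_000):
        self.thresholds = list(thresholds)
        self.epoch_len = epoch_len
        self.cost = {x: 0 for x in self.thresholds}
        self.accesses = 0
        self.active = self.thresholds[0]   # assumption: start with the first

    def record(self, events):
        """Account for one LLC access.

        events maps each emulated threshold to (missed, wrote_back),
        i.e. what would have happened under that threshold.
        Returns the threshold to use for the next access.
        """
        for x, (missed, wrote_back) in events.items():
            self.cost[x] += int(missed) + int(wrote_back)
        self.accesses += 1
        if self.accesses == self.epoch_len:
            # End of epoch: the cheapest threshold drives the next epoch.
            self.active = min(self.thresholds, key=lambda x: self.cost[x])
            self.cost = {x: 0 for x in self.thresholds}   # assumption: reset
            self.accesses = 0
        return self.active
```

Counting misses and writebacks in one sum treats both as equally costly; a weighted sum would be a different design and is not what the slide states.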
13. ARI: Implementation
● Emulate static thresholds in shadow tags
● Adapt to the hit-distribution dynamically
[Figure: Core 0 to Core 3, L2$ caches, L3$ with Tag Array, Data Array, and Shadow Tag Array; emulated thresholds 4H16, 10H16, 14H16]
15. Methodology
● gem5 + DRAMSim2 simulators
● NVIDIA Tegra-like out-of-order, dual-issue CPU
● SPEC2006 and PARSEC suites
● Compared against state-of-the-art policies
● ARI beats them in writeback reduction
● Nearly identical in total performance
System        Single core                                  Multicore
L1 cache      32KB I + 64KB D, 2-way, LRU, 64B block       32KB I + 64KB D, 2-way, LRU, 64B block
L2 cache      256KB, 8-way, LRU, 64B block                 256KB, 8-way, LRU, 64B block (private)
L3 cache      2MB, 16-way, LRU, 64B block                  16MB, 16-way, LRU, 64B block (shared)
Main memory   4GB, DDR3-1333 DRAM, 32-entry write buffer   4GB, DDR3-1333 DRAM, 32-entry write buffer
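For orientation, the number of sets in each cache follows from capacity, associativity, and block size. The helper below is a small sketch; its name is illustrative, and it assumes 1KB = 1024 bytes.

```python
KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, block_bytes=64):
    """Number of sets = capacity / (associativity * block size)."""
    return capacity_bytes // (ways * block_bytes)


if __name__ == "__main__":
    print("L2, 256KB 8-way:       ", num_sets(256 * KB, 8))
    print("L3 single core, 2MB:   ", num_sets(2 * MB, 16))
    print("L3 multicore, 16MB:    ", num_sets(16 * MB, 16))
```

Under these assumptions the 16MB, 16-way LLC has 16384 sets, so a scheme that shadows only a few sets per bank samples a small fraction of the cache.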
16. ARI: Writeback reduction
● ARI beats the competition: 33% WB reduction
Writeback improvement, normalized to LRU policy
DIP: M. Qureshi et al, ISCA '07
DBLK: S. Khan et al, MICRO '10
RRIP: A. Jaleel et al, ISCA '10
17. ARI: Miss reduction
● ARI achieves a 4.7% miss reduction
Miss rate improvement, normalized to LRU policy
DIP: M. Qureshi et al, ISCA '07
DBLK: S. Khan et al, MICRO '10
RRIP: A. Jaleel et al, ISCA '10
19. ARI: Dynamic behavior
● ARI adapts to program phases
● Achieves lower WBs than the best static policy
[Figure panels: soplex application, SPEC 2006; mcf application, SPEC 2006. Axis: Writebacks]
21. ARI: PCM lifetime improvement
● ARI facilitates the use of PCM as Main Memory
[Figure: % PCM lifetime improvement (axis 0% to 60%) for DIP, DBLK, RRIP, and ARI. Annotation: decrease lifetime for several apps]
23. ARI: Hardware overhead
● 8 sets shadowed per LLC bank (x8)
● p*2 shadow tags (we use p=9)
● 14kB storage overhead in a 16MB LLC
● Epoch counter – 15 bits
● Performance counters, adders
● Not on critical path
● Can be designed for low power
25. ARI: Summary
● 33% writeback reduction
● 4.7% cache miss rate reduction
● 9% less Main Memory traffic
● System IPC boost of 5%
● Enabling PCM as Main Memory
● 50% lifetime improvement
Win – Win
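The writeback and lifetime figures in this summary are roughly consistent with each other under a simple model. Assuming PCM lifetime is inversely proportional to the volume of writes reaching it (that is, wear is spread evenly across cells, which the slides do not state), a fractional write reduction r gives a lifetime gain of 1/(1 - r) - 1. The function name below is illustrative.

```python
def lifetime_gain(write_reduction):
    """Relative lifetime gain for a given fractional reduction in writes.

    Assumption: lifetime is inversely proportional to total write volume.
    """
    if not 0.0 <= write_reduction < 1.0:
        raise ValueError("write_reduction must be in [0, 1)")
    return 1.0 / (1.0 - write_reduction) - 1.0


if __name__ == "__main__":
    # 33% fewer writebacks, as reported in the summary.
    print(f"{lifetime_gain(0.33):.1%}")
```

For r = 0.33 this gives about 49%, close to the 50% lifetime improvement reported; the reported number comes from the authors' simulations, not from this formula.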
26. Conclusion
● DRAM is hitting a scalability wall
● New memories/architectures proposed
● We target PCM as main memory
● Propose ARI: Adaptive Replacement and
Insertion
● Simple scheme
● Reduce writebacks to main memory
● Boost the PCM performance and lifetime
29. Related Work: PCM
G. Dhiman et al. PDRAM: A hybrid PRAM and DRAM main memory system. DAC ’09
M. K. Qureshi et al. Enhancing Lifetime and Security of PCM-based Main Memory with Start-Gap Wear Leveling. MICRO ’09
B. C. Lee et al. Architecting Phase Change Memory as a Scalable DRAM Alternative. ISCA ’09
M. K. Qureshi et al. Scalable high performance main memory system using phase-change memory technology. ISCA ’09
A. P. Ferreira et al. Increasing PCM main memory lifetime. DATE ’10
30. Related Work: PCM
N. H. Seong et al. Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. ISCA ’10
H. Yoon et al. Row buffer locality aware caching policies for hybrid memories. ICCD ’12
Stuecheli et al. The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. ISCA ’10
M. K. Qureshi & G. H. Loh. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. MICRO ’12