GCMA: Guaranteed
Contiguous Memory Allocator
SeongJae Park <sj38.park@gmail.com>
These slides were presented during
The Kernel Summit 2018
(https://events.linuxfoundation.org/events/linux-kernel-summit-2018/)
This work by SeongJae Park is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
I, SeongJae Park
● SeongJae Park <sj38.park@gmail.com>
● PhD candidate at Seoul National University
● Interested in memory management and parallel programming for OS kernels
Contiguous Memory
Requirements
Who Wants Physically Contiguous Memory?
● Virtual memory removes the need for physically contiguous memory in most cases
○ The CPU uses virtual addresses; virtual regions can be mapped to any physical regions
○ The MMU maps virtual addresses to physical addresses
● In particular, Linux relies on demand paging and page reclaim
[Diagram: the CPU issues logical addresses; the MMU translates them to physical addresses in physical memory]
Devices using Direct Memory Access (DMA)
● Lots of devices need large internal buffers
(e.g., a 12-megapixel camera, a 10 Gbps network interface card, …)
● Because the MMU sits behind the CPU, devices that address memory directly
cannot use virtual addresses
[Diagram: as above, but a device performing DMA addresses physical memory directly, bypassing the MMU]
Systems using Huge Pages
● A system can have multiple sizes of pages, usually 4 KiB (regular) and 2 MiB
● Use of huge pages improves system performance by reducing TLB misses
● The importance of huge pages is growing
○ Modern workloads, including big data, cloud, and machine learning, are memory intensive
○ Systems with terabytes of memory are no longer rare
○ Huge pages are especially important in virtual-machine-based environments
● A 2 MiB huge page is just 512 physically contiguous regular pages
Existing Solutions
H/W Solutions: IOMMU, Scatter / Gather DMA
● The idea is simple: add MMU-like hardware for devices
● IOMMU and scatter/gather DMA give the illusion of contiguous device memory
● Additional hardware increases power consumption and price, which is
unaffordable for low-end devices
● Useless for many cases, including huge pages
[Diagram: device DMA goes through an IOMMU, which translates device addresses to physical addresses, while the CPU keeps using the MMU]
Even H/W Solutions Impose Overhead
● Hardware-based solutions are well known for their low overhead
● Though low, the overhead is inevitable
Christoph Lameter et al., “User space contiguous memory allocation for DMA”,
Linux Plumbers Conference 2017
Reserved Area Technique
● Reserve a sufficient amount of contiguous area at boot time and
let only contiguous memory allocations use the area
● Simple and effective for contiguous memory allocation
● Memory utilization can be low if the reserved area is not fully used
● Widely adopted despite the utilization problem
[Diagram: a reserved area is carved out of system memory; processes get normal allocations from the narrowed system memory and contiguous allocations from the reserved area]
CMA: Contiguous Memory Allocator
● Software-based solution in the Linux kernel
● A generously extended version of the reserved area technique
○ Mainly focuses on memory utilization
○ Lets movable pages use the reserved area
○ If a contiguous allocation needs pages currently used as movable pages,
those pages are migrated out of the reserved area, and the vacated space serves the allocation
○ Solves the memory utilization problem well because most pages are movable
● In other words, CMA gives different priorities to the clients of the reserved area
○ Primary client: contiguous memory allocation
○ Secondary client: movable page allocation
CMA in Real-land
● Measured the latency of taking a photo with the Camera app on a Raspberry Pi 2
under memory stress (Blogbench), 30 times
● 1.6 seconds worst-case latency with the reserved area technique
● 9.8 seconds worst-case latency with CMA; unacceptable
● Measured CMA allocation latency in the same situation
● It is clear that the Camera latency is bounded by CMA
The Phantom Menace: Latency of CMA
CMA: Why So Slow?
● In short, the secondary clients of CMA are not as nice as expected
● Migrating a page out is expensive (copying, rmap updates, LRU churn, …)
● If someone holds a reference to a page (e.g., via get_user_pages()), it cannot be
moved until the reference is released
● Result: unexpectedly long latencies and even allocation failures
● That is why CMA has not been adopted by many devices
● Raspberry Pi does not officially support CMA because of this problem
Buddy Allocator
● Adaptively splits and merges adjacent contiguous pages
● Highly optimized and heavily used in the Linux kernel
● Supports only power-of-two sized allocations, up to 2^(MAX_ORDER - 1)
contiguous pages
● Under fragmentation, it performs time-consuming compaction and retries, or simply
fails
○ Not good news for tasks requesting contiguous memory, either
● THP uses the buddy allocator to allocate huge pages
○ It quickly falls back to regular pages if contiguous allocation fails
○ As a result, it is of little use on highly fragmented memory
THP Performance under High Fragmentation
● Default: THP disabled, Thp: THP always
● .f suffix: On the fragmented memory
● Under fragmentation, the THP benefit disappears
(Lower is better)
Guaranteed Contiguous
Memory Allocator
GCMA: Guaranteed Contiguous Memory Allocator
● GCMA is a variant of CMA that guarantees
○ Fast latency and allocation success
○ While keeping memory utilization high
● The idea is simple
○ Follow the primary / secondary client idea of CMA to keep memory utilization high
○ Unlike CMA, accept only nice secondary clients
■ For fast latency, they should be able to vacate the reserved area as soon as required,
without work such as moving content
■ For guaranteed success, they should be out of the kernel's control
■ For memory utilization, most pages should be eligible to be secondary clients
■ In short: frontswap and cleancache pages
Secondary Clients of GCMA
● Pages for a write-through mode frontswap
○ Can be discarded immediately because their contents were written through to the swap device
○ The kernel thinks they are already swapped out
○ Most anonymous pages can be covered
○ We recommend using Zram as the swap device to minimize the write-through overhead
● Pages for cleancache
○ Can be discarded because the contents in storage are up to date
○ The kernel thinks they are already evicted
○ Many file-backed pages fall into this case
● Additional pro 1: these pages are already expected not to be accessed again soon
○ Discarding them would not affect system performance much
● Additional pro 2: they appear only under heavy workloads
○ In the peaceful case, there is no overhead at all
GCMA: Workflow
● Reserve a memory area at boot time
● When a page is swapped out or evicted from the page cache, keep its content in
the reserved area
○ If the system needs the content again, serve it back from the reserved area
○ If a contiguous memory allocation needs space occupied by those pages, discard them
and use the space for the allocation
[Diagram: on reclaim/swap, page contents go to the swap device or file and are also kept in the GCMA reserved area; hits are served back from the area; contiguous allocations take space from it]
GCMA Architecture
● Reuses the CMA interface
○ Users can turn CMA into GCMA entirely, or use both selectively on a single system
● DMEM: Discardable Memory
○ An abstraction for the backends of frontswap and cleancache
○ Works as a last-chance cache for secondary client pages, using the GCMA reserved area
○ Manages the index to pages with a hash table of RB-tree based buckets
○ Evicts pages with an LRU scheme
[Diagram: the CMA interface sits on top of the CMA logic and the GCMA logic; GCMA uses the Dmem interface and logic, which back cleancache and frontswap over the reserved area]
GCMA Implementation
● Implemented on Linux v3.18, v4.10 and v4.17
● About 1,500 lines of code
● Ported to and evaluated on a Raspberry Pi 2 and a high-end server
● Available under GPL v3 at: https://github.com/sjp38/linux.gcma
● Submitted to LKML (https://lkml.org/lkml/2015/2/23/480)
Evaluations of
GCMA for a Device
Experimental Setup
● Device setup
○ Raspberry Pi 2
○ ARM Cortex-A7, 900 MHz
○ 1 GiB LPDDR2 SDRAM
○ Class 10 SanDisk 16 GiB microSD card
● Configurations
○ Baseline: Linux rpi-v3.18.11 + 100 MiB swap +
256 MiB reserved area
○ CMA: Linux rpi-v3.18.11 + 100 MiB swap +
256 MiB CMA area
○ GCMA: Linux rpi-v3.18.11 + 100 MiB Zram swap +
256 MiB GCMA area
● Workloads
○ Contig Mem Alloc: Contiguous memory allocation using CMA or GCMA
○ Blogbench: Realistic file system benchmark
○ Camera shot: Repeatedly taking a picture with the Raspberry Pi 2 camera module, with a
10-second interval between shots
Contig Mem Alloc without Background Workload
● Average of 30 allocations
● GCMA shows 14.89x to 129.41x lower latency than CMA
● CMA even failed once for a 32,768-page contiguous allocation, even though
there was no background task
4MiB Contig Mem Alloc w/o Background Workload
● The latency of GCMA is more predictable than that of CMA
4MiB Contig Mem Allocation w/ Background Task
● Background workload: Blogbench
● CMA takes 5 seconds (unacceptable!) for one allocation in the bad case
Camera Latency w/ Background Task
● The camera is a realistic, important application of the Raspberry Pi 2
● Background task: Blogbench
● GCMA keeps latency as low as the reserved area technique configuration
Blogbench on CMA / GCMA
● Scores are normalized to those of the Baseline configuration
● ‘/ cam’ means the camera workload runs in the background
● The system using GCMA even slightly outperforms the CMA version owing to
GCMA’s lighter overhead
● Allocation with CMA even degrades system performance because it must migrate
pages out of the reserved area
Evaluations of
GCMA for THP
Experimental Setup
● Device setup
○ Intel Xeon E7-8870 v3
○ L2 TLB with 1024 entries
○ 600 GiB DDR4 memory
○ 500 GiB Intel SSD
○ Linux v4.10 based kernels
● Configurations
○ Nothp: THP disabled
○ Thp.bd: Buddy allocator based THP enabled
○ Thp.cma: GCMA based THP enabled
○ Thp.bc: Buddy allocator based, GCMA fallback using THP enabled
○ .f suffix: On fragmented system
● Workloads
○ 429.mcf and 471.omnetpp from SPEC CPU2006
○ TPC-H power test
Performance of SPEC CPU 2006
● Two memory-intensive workloads were chosen
● Fragmentation decreases the performance of regular pages, too
● GCMA-based THP shows 2.56x higher performance than the original
THP under fragmentation
● The impact is workload dependent, though
Performance of TPC-H
● Power test: measures the latency of each query of an OLAP workload
● Runtime is normalized to Thp.bc.f
● Queries 17 and 19 achieve speedups of more than 2x
● Four queries (3, 9, 14 and 20) show speedups of 1.5x-2x
Plan for GCMA
Unifying Solutions Under CMA Interface
● There are many solutions and interfaces for contiguous memory allocation
● Too many interfaces can confuse newcomers; we don’t want
to add to the confusion with GCMA
● GCMA is developed to coexist with CMA rather than substitute for it; it
already uses the CMA interface
● The interface could let users select the secondary clients for each CMA region
○ None (Reservation)
○ Migratables (CMA)
○ Discardables (GCMA)
● We are currently developing a patchset that aims to be merged into the mainline
● The patchset will include updated evaluation results
Conclusion
● Contiguous memory allocation needs improvement
○ H/W solutions are expensive for low-end devices, impose overhead, and cannot provide truly
contiguous memory
○ CMA is slow in many cases
○ The buddy allocator is restricted
● GCMA guarantees fast latency, allocation success, and reasonable memory
utilization
○ It achieves these goals by using nice secondary clients: the pages for frontswap and cleancache
○ GCMA’s allocation latency is as low as that of the reserved area technique
○ GCMA can be used as a THP allocation fallback
○ 130x and 10x shorter latency for contiguous memory allocation and for taking a photo,
respectively, compared to the CMA-based system
○ More than 50% and 100% performance improvement for 7/24 and 3/24 realistic
workloads, respectively
● We plan to release the official version and updated evaluation results in the near
future
Thanks

Weitere ähnliche Inhalte

Was ist angesagt?

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Heechul Yun
 

Was ist angesagt? (20)

Design pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesDesign pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelines
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
 
Linux Locking Mechanisms
Linux Locking MechanismsLinux Locking Mechanisms
Linux Locking Mechanisms
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
 
Linux Kernel Debugging
Linux Kernel DebuggingLinux Kernel Debugging
Linux Kernel Debugging
 
Using QEMU for cross development
Using QEMU for cross developmentUsing QEMU for cross development
Using QEMU for cross development
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
 
Mastering Real-time Linux
Mastering Real-time LinuxMastering Real-time Linux
Mastering Real-time Linux
 
Linux kernel modules
Linux kernel modulesLinux kernel modules
Linux kernel modules
 
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeKernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
 
Improving Real-Time Performance on Multicore Platforms using MemGuard
Improving Real-Time Performance on Multicore Platforms using MemGuardImproving Real-Time Performance on Multicore Platforms using MemGuard
Improving Real-Time Performance on Multicore Platforms using MemGuard
 
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
 
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
 
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
 
From printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingFrom printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debugging
 

Ähnlich wie GCMA: Guaranteed Contiguous Memory Allocator

Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentation
Karthik Iyr
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphones
Karthik Iyr
 

Ähnlich wie GCMA: Guaranteed Contiguous Memory Allocator (20)

Continguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelContinguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux Kernel
 
Virtualization for Emerging Memory Devices
Virtualization for Emerging Memory DevicesVirtualization for Emerging Memory Devices
Virtualization for Emerging Memory Devices
 
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingLinux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
 
IBM MQ Disaster Recovery
IBM MQ Disaster RecoveryIBM MQ Disaster Recovery
IBM MQ Disaster Recovery
 
RAMinate Invited Talk at NII
RAMinate Invited Talk at NIIRAMinate Invited Talk at NII
RAMinate Invited Talk at NII
 
LCA13: Memory Hotplug on Android
LCA13: Memory Hotplug on AndroidLCA13: Memory Hotplug on Android
LCA13: Memory Hotplug on Android
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms final
 
Software Design for Persistent Memory Systems
Software Design for Persistent Memory SystemsSoftware Design for Persistent Memory Systems
Software Design for Persistent Memory Systems
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics Upstreaming
 
Memory Management in Amoeba
Memory Management in AmoebaMemory Management in Amoeba
Memory Management in Amoeba
 
Recent advancements in cache technology
Recent advancements in cache technologyRecent advancements in cache technology
Recent advancements in cache technology
 
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea ArcangeliKernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
 
Enhanced Live Migration for Intensive Memory Loads
Enhanced Live Migration for Intensive Memory LoadsEnhanced Live Migration for Intensive Memory Loads
Enhanced Live Migration for Intensive Memory Loads
 
ch9_virMem.pdf
ch9_virMem.pdfch9_virMem.pdf
ch9_virMem.pdf
 
Drupal 7 performance and optimization
Drupal 7 performance and optimizationDrupal 7 performance and optimization
Drupal 7 performance and optimization
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentation
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data Plane
 
RAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 TalkRAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 Talk
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphones
 

Mehr von SeongJae Park

Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologue
SeongJae Park
 

Mehr von SeongJae Park (20)

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in go
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalability
 
Brief introduction to kselftest
Brief introduction to kselftestBrief introduction to kselftest
Brief introduction to kselftest
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golang
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.avi
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golang
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using Golang
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_docker
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot public
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internally
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologue
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCS
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflection
 
ash
ashash
ash
 
Hacktime for adk
Hacktime for adkHacktime for adk
Hacktime for adk
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution begin
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without Touching
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
 

Kürzlich hochgeladen

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 

Kürzlich hochgeladen (20)

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

GCMA: Guaranteed Contiguous Memory Allocator

  • 1. GCMA: Guaranteed Contiguous Memory Allocator SeongJae Park <sj38.park@gmail.com>
  • 2. These slides were presented during The Kernel Summit 2018 (https://events.linuxfoundation.org/events/linux-kernel-summit-2018/)
  • 3. This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
  • 4. I, SeongJae Park ● SeongJae Park <sj38.park@gmail.com> ● PhD candidate at Seoul National University ● Interested in memory management and parallel programming for OS kernels
  • 6. Who Wants Physically Contiguous Memory? ● Virtual Memory avoids need of physically contiguous memory ○ CPU uses virtual address; virtual regions can be mapped to any physical regions ○ MMU maps the virtual address to physical address ● Particularly, Linux uses demand paging and reclaim Physical Memory Application MMUCPU Logical address Physical address
  • 7. Devices using Direct Memory Access (DMA) ● Lots of devices need large internal buffers (e.g., 12 Mega-Pixel Camera, 10 Gbps Network interface card, …) ● Because the MMU is behind the CPU, devices directly addressing the memory cannot use the virtual address Physical Memory Application MMUCPU Logical address Physical address Device DMA Physical address
  • 8. Systems using Huge Pages ● A system can have multiple sizes of pages, usually 4 KiB (regular) and 2 MiB ● Use of huge pages improve system performance by reducing TLB misses ● Importance of huge pages is growing ○ Modern workloads including big data, cloud, and machine learning are memory intensive ○ Systems equipping tera-bytes are not rare now ○ It’s especially important on virtual machine based environments ● A 2 MiB huge page is just 512 physically contiguous regular pages
  • 10. H/W Solutions: IOMMU, Scatter / Gather DMA ● Idea is simple; add MMU-like additional hardware for devices ● IOMMU and Scatter/Gather DMA gives the contiguous device memory illusion ● Additional H/W means increase of power consumption and price, which is unaffordable for low-end devices ● Useless for many cases including huge pages Physical Memory Application MMUCPU Logical address Physicals address Device DMA Device address IOMMU Physical address
• 11. Even H/W Solutions Impose Overhead ● Hardware-based solutions are well known for their low overhead ● Low as it is, the overhead is unavoidable (Christoph Lameter et al., “User space contiguous memory allocation for DMA”, Linux Plumbers Conference 2017)
• 12. Reserved Area Technique ● Reserves a sufficient amount of contiguous area at boot time and lets only contiguous memory allocations use it ● Simple and effective for contiguous memory allocation ● Memory space utilization can be low if the reserved area is not fully used ● Widely adopted despite the utilization problem [Figure: normal allocations use the system memory, narrowed by the reservation; contiguous allocations use the reserved area]
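The reserved-area idea can be sketched as a toy model (hypothetical Python names; the kernel's real reservation code is far more involved): a fixed pool, a per-page bitmap, and a first-fit scan for a contiguous run.

```python
class ReservedArea:
    """Toy model of the reserved-area technique: a fixed pool that only
    contiguous allocations may use, tracked by a per-page used bitmap."""

    def __init__(self, nr_pages):
        self.used = [False] * nr_pages

    def alloc(self, count):
        """First-fit search for `count` contiguous free pages.
        Returns the start page number, or None on failure."""
        run = 0
        for i, u in enumerate(self.used):
            run = 0 if u else run + 1
            if run == count:
                start = i - count + 1
                for j in range(start, i + 1):
                    self.used[j] = True
                return start
        return None

    def free(self, start, count):
        for j in range(start, start + count):
            self.used[j] = False
```

Allocation always succeeds while free contiguous runs remain, and nothing ever has to be moved; the cost is that the reserved pages are unusable for anything else, which is exactly the utilization problem the slide names.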
• 13. CMA: Contiguous Memory Allocator ● Software-based solution in the Linux kernel ● A generously extended version of the reserved area technique ○ Mainly focuses on memory utilization ○ Lets movable pages use the reserved area ○ If a contiguous memory allocation requires pages that are in use as movable pages, moves those pages out of the reserved area and uses the vacated area for the allocation ○ Solves the memory utilization problem well, because most pages are movable ● In other words, CMA gives different priorities to the clients of the reserved area ○ Primary client: contiguous memory allocation ○ Secondary client: movable page allocation
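The two-priority scheme can be sketched as a toy model (hypothetical names, not kernel code). It makes visible why CMA's allocation latency grows with the number of movable pages occupying the requested range: each one must be migrated out first.

```python
class CmaModel:
    """Toy model of CMA's two priorities: movable pages borrow reserved
    pages, and a contiguous request must migrate them out first."""

    def __init__(self, nr_pages):
        self.owner = [None] * nr_pages   # None, 'movable', or 'contig'
        self.migrations = 0              # proxy for allocation latency

    def alloc_movable(self, page):
        """Secondary client: a movable page borrows a free reserved page."""
        if self.owner[page] is None:
            self.owner[page] = 'movable'

    def alloc_contig(self, start, count):
        """Primary client: claim [start, start+count), migrating borrowers.
        Each migration stands for copying, rmap adjustment, and LRU work."""
        for i in range(start, start + count):
            if self.owner[i] == 'movable':
                self.migrations += 1
            self.owner[i] = 'contig'
```

In this model migration always succeeds instantly; the next slides show that in the real kernel it is expensive and can even fail, which is the gap GCMA targets.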
• 14. CMA in Real-land ● Measured the latency of taking a photo 30 times using the Camera app on a Raspberry Pi 2 under memory stress (Blogbench) ● 1.6 seconds worst-case latency with the reserved area technique ● 9.8 seconds worst-case latency with CMA: unacceptable
• 15. The Phantom Menace: Latency of CMA ● Measured the latency of CMA under the same situation ● It is clear that the latency of the Camera app is bounded by the latency of CMA
• 16. CMA: Why So Slow? ● In short, the secondary client of CMA is not as nice as expected ● Moving a page out is an expensive task (copying, rmap adjustment, LRU churn, …) ● If someone holds a page (e.g., via get_user_pages()), it cannot be moved until the holder releases it ● Result: unexpectedly long latencies and even failures ● That is why CMA has not been adopted by many devices ● Raspberry Pi does not officially support CMA because of this problem
• 17. Buddy Allocator ● Adaptively splits and merges adjacent contiguous pages ● Highly optimized and heavily used in the Linux kernel ● Supports only power-of-two numbers of contiguous pages, up to order MAX_ORDER - 1 ● Under fragmentation, it does time-consuming compaction and retries, or just fails ○ Not good news for tasks requesting contiguous memory, either ● THP uses the buddy allocator to allocate huge pages ○ It quickly falls back to regular pages if contiguous allocation fails ○ As a result, it cannot be used on highly fragmented memory
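The order limit is easy to make concrete. A small sketch, assuming 4 KiB pages and MAX_ORDER = 11 (the long-standing Linux default; `order_for` is our name): the largest buddy block is 2^10 = 1024 pages (4 MiB), so an order-9 huge page (512 pages) sits just below the allocator's ceiling.

```python
MAX_ORDER = 11  # Linux default: the largest block is 2**(MAX_ORDER - 1) pages

def order_for(nr_pages):
    """Smallest buddy order whose block covers nr_pages, or None if the
    request exceeds what the buddy allocator can hand out at once."""
    order = max(nr_pages - 1, 0).bit_length()  # ceil(log2(nr_pages))
    return order if order < MAX_ORDER else None
```

A 2 MiB huge page is an order-9 request; anything larger than 1024 pages cannot be served by a single buddy allocation at all, regardless of fragmentation.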
• 18. THP Performance under High Fragmentation ● Default: THP disabled, Thp: THP always ● .f suffix: on fragmented memory ● Under fragmentation, the benefits of THP disappear (lower is better)
• 20. GCMA: Guaranteed Contiguous Memory Allocator ● GCMA is a variant of CMA that guarantees ○ fast latency and success of allocation ○ while keeping memory utilization high ● The idea is simple ○ Follow the primary / secondary client idea of CMA to keep memory utilization high ○ Unlike CMA, accept only nice secondary clients ■ For fast latency, the reserved area should be vacatable as soon as required, without work such as moving contents ■ For guaranteed success, the pages should be out of kernel control ■ For memory utilization, most pages should be eligible to be secondary clients ■ In short: frontswap and cleancache
• 21. Secondary Clients of GCMA ● Pages in a write-through mode frontswap ○ Can be discarded immediately, because their contents are written through to the swap device ○ The kernel thinks they are already swapped out ○ Covers most anonymous pages ○ We recommend using Zram as the swap device to minimize the write-through overhead ● Pages in cleancache ○ Can be discarded, because the contents in storage are up to date ○ The kernel thinks they are already evicted ○ Covers lots of file-backed pages ● Additional benefit 1: these pages are already expected not to be accessed again soon ○ Discarding secondary client pages does not affect system performance much ● Additional benefit 2: they appear only under severe workloads ○ In the peaceful case, no overhead at all
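Why write-through makes the pages "nice" can be shown with a toy backend (hypothetical names; the real frontswap backend operates on kernel pages, not Python dicts): because every store also reaches the swap device, the cached copy can be dropped at any instant without any migration or write-back.

```python
class DiscardableStore:
    """Toy write-through frontswap-style backend: every stored page also
    goes to the swap device, so any cached copy is always droppable."""

    def __init__(self):
        self.cache = {}        # copies kept in the reserved area
        self.swap_device = {}  # the authoritative, written-through copy

    def store(self, key, data):
        self.swap_device[key] = data   # write-through: device is up to date
        self.cache[key] = data         # keep a fast copy in the reserved area

    def load(self, key):
        # Serve from the reserved-area copy when present, else from swap.
        return self.cache.get(key, self.swap_device.get(key))

    def discard_all(self):
        """Vacating the area is just dropping the cache; no migration."""
        self.cache.clear()
```

Discarding only costs a later re-read from the swap device (or storage, in the cleancache case), which is why Zram is recommended to keep that fallback path cheap.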
• 22. GCMA: Workflow ● Reserve a memory area at boot time ● If a page is swapped out or evicted from the page cache, keep its contents in the reserved area ○ If the system needs the contents again, give them back from the reserved area ○ If a contiguous memory allocation requires area used by such pages, discard them and use the area for the allocation [Figure: pages being reclaimed or swapped out are cached in the GCMA reserved area and given back on a hit; contiguous allocations discard them]
• 23. GCMA Architecture ● Reuses the CMA interface ○ Users can turn CMA into GCMA entirely, or use both selectively on a single system ● DMEM: Discardable Memory ○ An abstraction for the backend of frontswap and cleancache ○ Works as a last-chance cache for secondary client pages, using the GCMA reserved area ○ Manages an index to the pages using a hash table of RB-tree based buckets ○ Evicts pages in LRU order [Figure: frontswap and cleancache sit on the Dmem interface; the Dmem logic and GCMA logic share the reserved area behind the CMA interface]
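Dmem's index-plus-LRU behavior can be sketched compactly (a flat `OrderedDict` stands in for the real hash table of RB-tree buckets; `Dmem`, `put`, and `get` are our names): entries are refreshed on hit and the least recently used one is dropped when a contiguous allocation needs the space.

```python
from collections import OrderedDict

class Dmem:
    """Toy dmem: a last-chance cache over the reserved area that evicts
    in LRU order when room is needed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> page contents, oldest first

    def put(self, key, page):
        self.entries[key] = page
        self.entries.move_to_end(key)              # newest at the end
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # discard the LRU entry

    def get(self, key):
        if key not in self.entries:
            return None                            # miss: fall back to device
        self.entries.move_to_end(key)              # refresh on hit
        return self.entries[key]
```

Because every entry is discardable (the authoritative copy lives on the swap device or in storage), eviction here is a plain delete; that is the whole latency argument in miniature.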
• 24. GCMA Implementation ● Implemented on Linux v3.18, v4.10 and v4.17 ● About 1,500 lines of code ● Ported to and evaluated on a Raspberry Pi 2 and a high-end server ● Available under GPL v3 at: https://github.com/sjp38/linux.gcma ● Submitted to LKML (https://lkml.org/lkml/2015/2/23/480)
• 26. Experimental Setup ● Device setup ○ Raspberry Pi 2 ○ ARM Cortex-A7 900 MHz ○ 1 GiB LPDDR2 SDRAM ○ Class 10 SanDisk 16 GiB microSD card ● Configurations ○ Baseline: Linux rpi-v3.18.11 + 100 MiB swap + 256 MiB reserved area ○ CMA: Linux rpi-v3.18.11 + 100 MiB swap + 256 MiB CMA area ○ GCMA: Linux rpi-v3.18.11 + 100 MiB Zram swap + 256 MiB GCMA area ● Workloads ○ Contig Mem Alloc: contiguous memory allocation using CMA or GCMA ○ Blogbench: realistic file system benchmark ○ Camera shot: repeatedly taking pictures using the Raspberry Pi 2 camera module with a 10 second interval between shots
• 27. Contig Mem Alloc without Background Workload ● Average of 30 allocations ● GCMA shows 14.89x to 129.41x lower latency than CMA ● CMA even failed once for a 32,768 contiguous pages allocation, even though there was no background task
• 28. 4MiB Contig Mem Alloc w/o Background Workload ● The latency of GCMA is more predictable than that of CMA
• 29. 4MiB Contig Mem Allocation w/ Background Task ● Background workload: Blogbench ● CMA takes 5 seconds (unacceptable!) for one allocation in a bad case
• 30. Camera Latency w/ Background Task ● The camera is a realistic, important application of the Raspberry Pi 2 ● Background task: Blogbench ● GCMA keeps the latency as low as the reserved area technique configuration
• 31. Blogbench on CMA / GCMA ● Scores are normalized to those of the Baseline configuration ● ‘/ cam’ means the camera workload runs in the background ● The system using GCMA even slightly outperforms the CMA version, owing to GCMA's lighter overhead ● Allocation using CMA even degrades system performance, because it must move pages out of the reserved area
  • 33. Experimental Setup ● Device setup ○ Intel Xeon E7-8870 v3 ○ L2 TLB with 1024 entries ○ 600 GiB DDR4 memory ○ 500 GiB Intel SSD ○ Linux v4.10 based kernels ● Configurations ○ Nothp: THP disabled ○ Thp.bd: Buddy allocator based THP enabled ○ Thp.cma: GCMA based THP enabled ○ Thp.bc: Buddy allocator based, GCMA fallback using THP enabled ○ .f suffix: On fragmented system ● Workloads ○ 429.mcf and 471.omnetpp from SPEC CPU2006 ○ TPC-H Strong test
• 34. Performance of SPEC CPU 2006 ● Two memory-intensive workloads were chosen ● Fragmentation decreases the performance of regular pages, too ● GCMA-based THP shows 2.56x higher performance than the original THP under fragmentation ● The impact is workload dependent, though
• 35. Performance of TPC-H ● Power test: measures the latency of each query of an OLAP workload ● Runtime is normalized to Thp.bc.f ● Queries 17 and 19 show speedups of more than 2x ● Four queries (3, 9, 14 and 20) show speedups of 1.5x-2x
• 37. Unifying Solutions Under the CMA Interface ● There are many solutions and interfaces for contiguous memory allocation ● Too many interfaces can confuse newcomers; we do not want GCMA to add to the confusion ● GCMA is developed to coexist with CMA rather than substitute it; it already uses the CMA interface ● The interface could select the secondary clients for each CMA region ○ None (reservation) ○ Migratables (CMA) ○ Discardables (GCMA) ● We are currently developing a patchset that aims to be merged into the mainline ● The patchset will include updated evaluation results
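The proposed per-region knob can be sketched as a simple enumeration (entirely hypothetical names; the real patchset would express this through the kernel's CMA configuration, not Python): one reserved-area interface, three choices of secondary client.

```python
from enum import Enum

class Secondary(Enum):
    """The three secondary-client policies the unified interface offers."""
    NONE = 'reservation'      # nobody borrows the area (plain reservation)
    MIGRATABLE = 'cma'        # movable pages borrow it (classic CMA)
    DISCARDABLE = 'gcma'      # frontswap/cleancache pages borrow it (GCMA)

def describe_region(name, secondary):
    """Render one reserved region's configuration for display."""
    return f'{name}: secondary client = {secondary.value}'
```

Under this shape, a camera buffer region could be declared DISCARDABLE for guaranteed latency while another region stays MIGRATABLE, all behind the one interface the slide argues for.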
• 38. Conclusion ● Contiguous memory allocation needs improvement ○ H/W solutions are expensive for low-end devices, impose overhead, and cannot provide really contiguous memory ○ CMA is slow in many cases ○ The buddy allocator is restricted ● GCMA guarantees fast latency, allocation success, and reasonable memory utilization ○ It achieves these goals by using only nice secondary clients: the pages for frontswap and cleancache ○ The allocation latency of GCMA is as low as that of the reserved area technique ○ GCMA can be used as a THP allocation fallback ○ 130x and 10x lower latency for contiguous memory allocation and taking a photo, respectively, compared to a CMA based system ○ More than 50% performance improvement for 7 of 24 realistic workloads, and more than 100% for 3 of 24 ● We plan to release the official version and updated evaluation results in the near future