Physical Memory (Physical Page Frame)
Management in Linux Kernel
Adrian Huang | July, 2022
* Based on kernel 5.11 (x86_64) – QEMU
* 1-socket CPUs (8 cores/socket)
* 16GB memory
* Kernel parameter: nokaslr norandmaps
* Userspace: ASLR is disabled
* Legacy BIOS
Agenda
• Physical memory (physical page frame) management overview
✓Data structures about node, zone, zonelist, migrate type and per-cpu page frame
cache (per_cpu_pageset struct: PCP)
✓Placement of physical page frames right after system initialization
• Physical memory defragmentation (anti-fragmentation): Approaches
• Physical memory defragmentation (anti-fragmentation): Example
✓How the migration works
✓How the buddy system works
• pageblock
• dmesg output: total pages
• Page frame allocator
✓Watermark
✓Call Path
Agenda
• Not covered in this talk
✓Memory compaction, page reclaiming and the OOM killer
▪ Will be discussed in a future talk
Physical memory management
[Diagram] Per-node structure: each memory node (with its local CPUs) is described by a pglist_data (pg_data_t) containing node_zones, node_zonelists, nr_zones, node_start_pfn, node_present_pages, node_spanned_pages, node_id, and the per-node kswapd and kcompactd threads; node_data[] points to each node's pglist_data.
[Diagram] Per-zone structure: each zone (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, ZONE_DEVICE) contains _watermark[NR_WMARK], lowmem_reserve[MAX_NR_ZONES], zone_pgdat (back-pointer to the node), zone_start_pfn, managed_pages, spanned_pages, present_pages, and free_area[MAX_ORDER] for block sizes 2^0, 2^1, …, 2^10 pages. Each free_area holds one free_list[MIGRATE_TYPES] per migrate type (UNMOVABLE, MOVABLE, RECLAIMABLE, PCPTYPES, …), each a doubly-linked list (list_head: next/prev) of struct page.
Note
• spanned_pages = zone_end_pfn - zone_start_pfn
• present_pages = spanned_pages - page_holes
• managed_pages = present_pages - reserved_pages
• The head struct page of a pageblock is configured as MOVABLE (MIGRATE_MOVABLE) – see memmap_init_zone().
• MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES
Zone Allocator – x86
[Diagram] Each zone has its own buddy system and per-CPU page frame cache:
• ZONE_DMA (physical address: 0-16MB)
• ZONE_NORMAL (physical address: 16-896MB)
• ZONE_HIGHMEM (physical address > 896MB)
Zone Allocator – x86_64
[Diagram] Each zone has its own buddy system and per-CPU page frame cache:
• ZONE_DMA (physical address: 0-16MB)
• ZONE_DMA32 (physical address: 16MB-4GB)
• ZONE_NORMAL (physical address > 4GB)
• ZONE_MOVABLE (for memory offline)
• ZONE_DEVICE (intended for persistent memory)
• ZONE_MOVABLE is similar to ZONE_NORMAL, except that it contains movable pages with a few exceptional cases: see include/linux/mmzone.h
Physical memory management – Zone detail
[Diagram] pglist_data (pg_data_t): node_zones[MAX_NR_ZONES], node_zonelists[MAX_ZONELISTS], nr_zones, node_start_pfn, node_present_pages, node_spanned_pages, node_id, struct task_struct *kswapd, totalreserve_pages, __lruvec, atomic_long_t vm_stat[]
[Diagram] zone: _watermark[NR_WMARK], watermark_boost, nr_reserved_highatomic, lowmem_reserve[MAX_NR_ZONES], zone_pgdat, __percpu *pageset, zone_start_pfn, managed_pages, spanned_pages, present_pages, free_area[MAX_ORDER], atomic_long_t vm_stat[]
[Diagram] per_cpu_pageset: pcp (per_cpu_pages), expire, vm_numa_stat_diff[], stat_threshold, vm_stat_diff[]. per_cpu_pages: count, high, batch, lists[MIGRATE_PCPTYPES] – lists[UNMOVABLE], lists[MOVABLE], lists[RECLAIMABLE], each holding order-0 pages. Per-cpu Page Frame Cache: reduces lock contention between processors.
[Diagram] free_area[0] … free_area[MAX_ORDER - 1]: free_list[MIGRATE_TYPES] and nr_free; free_area[MAX_ORDER - 1] holds order = (MAX_ORDER - 1) pages.
[Diagram] Zonelists (per-node basis): zonelist[ZONELIST_FALLBACK] and zonelist[ZONELIST_NOFALLBACK], each an array _zonerefs[MAX_ZONES_PER_ZONELIST + 1] of zoneref entries { struct zone *zone; zone_idx }. The ZONELIST_NOFALLBACK zonelist is applied if the __GFP_THISNODE flag is set.
How does the kernel know the available memory pages in a system?
[Flow] BIOS e820 → memblock → zone page frame allocator
[Call Path] e820__memblock_setup() → … → __free_pages_core(): memblock frees available memory space to the zone page frame allocator
1. Most physical page frames are stored in MIGRATE_MOVABLE
2. [Right after system boots] Most physical page frames are stored in the order = 10 (MAX_ORDER - 1) free lists
Placement of physical page frames right after system initialization
Physical memory defragmentation (anti-fragmentation): Approaches
1. Buddy system
2. Memory migration (mobility)
3. Memory compaction
Physical memory defragmentation (Anti-fragmentation): Approaches
[Spectrum] Poor → better defragmentation: Buddy System → Memory Migration (Mobility) → Memory Compaction. With poor defragmentation, the end result is page allocation failure.
Physical memory defragmentation (Anti-fragmentation): Buddy System
[Diagram] free_area[0].free_list[MIGRATE_UNMOVABLE]: page order = 0; free_area[1].free_list[MIGRATE_UNMOVABLE]: page order = 1; … free_area[N].free_list[MIGRATE_UNMOVABLE]: page order = N = MAX_ORDER - 1.
Buddy System: adjacent free pages (buddies) are merged to form larger contiguous pages.
Physical memory defragmentation (Anti-fragmentation): Memory Migration (Mobility)
[Diagram] With unmovable and reclaimable pages scattered across a 32-page range, only order-2 pages can be allocated ([Buddy System] memory fragmentation). Grouping the unmovable and reclaimable pages together leaves larger contiguous free ranges.
[Concept] Memory fragmentation is reduced by grouping pages based on mobility (migration type).
Memory migration delays memory fragmentation; it does not solve the problem.
Physical memory defragmentation (Anti-fragmentation): Memory compaction (1/2)
[Diagram] In a MIGRATE_MOVABLE area, allocated pages scattered among free pages are migrated toward one end of the area, producing one large contiguous free region.
Physical memory defragmentation (Anti-fragmentation): Memory compaction (2/2)
[Diagram] The migration scanner builds a list of allocated (movable) pages from one end of the MIGRATE_MOVABLE area; the free scanner builds a list of free pages from the other end. Pages from the migration list are then moved into pages from the free list.
Physical memory defragmentation (anti-
fragmentation)
1. Buddy system
2. Memory migration (mobility) → detailed discussion
3. Memory compaction
Anti-fragmentation: Memory migration (or Mobility)
• MIGRATE_UNMOVABLE: Allocations of the core kernel
• MIGRATE_MOVABLE (__GFP_MOVABLE): Pages that belong to userspace applications
• MIGRATE_RECLAIMABLE (__GFP_RECLAIMABLE): File pages (data mapped from files), periodically reclaimed by the kswapd daemon, and slab/slub allocations that specify SLAB_RECLAIM_ACCOUNT, whose pages can be freed via shrinkers (kswapd): check include/linux/gfp.h.
o SLAB_RECLAIM_ACCOUNT:
o mainly used by file systems for data structure caches (fs/*)
o radix tree: cache for 'struct radix_tree_node'
[Diagram] Each free_area keeps one free_list per designated migration type (free_list[MIGRATE_UNMOVABLE], free_list[MIGRATE_MOVABLE], free_list[MIGRATE_RECLAIMABLE]). If the free list of the requested type is used up, pages are stolen from a fallback free list: check the 'fallbacks' variable.
Page group concept: grouping pages with identical mobility (migration type)
Memory migration (or Mobility): Users
• ___GFP_MOVABLE → MIGRATE_MOVABLE
• ___GFP_RECLAIMABLE → MIGRATE_RECLAIMABLE
• !___GFP_MOVABLE && !___GFP_RECLAIMABLE → MIGRATE_UNMOVABLE
Migrate type = (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT = (gfp_flags & 0x18) >> 3
[Users] GFP_KERNEL (page table allocation and so on) → MIGRATE_UNMOVABLE; GFP_HIGHUSER_MOVABLE (do_user_addr_fault(), wp_page_copy(), do_cow_fault(), do_swap_page()) → MIGRATE_MOVABLE; slab/slub with the SLAB_RECLAIM_ACCOUNT flag (__GFP_RECLAIMABLE) → MIGRATE_RECLAIMABLE
* If page mobility is disabled, all pages are kept in MIGRATE_UNMOVABLE
Page frame allocator: Physical memory
defragmentation (anti-fragmentation)
Let’s see an example:
1. How the migration works
2. How the buddy system works
Example: Scan available page frames from pcp
[Diagram] zone: free_area[0] nr_free = 1; free_area[1] nr_free = 1; … free_area[10] nr_free = 1244; each free_area has free_list[MIGRATE_TYPES] (UNMOVABLE, MOVABLE, RECLAIMABLE, PCPTYPES, …). Per-cpu Page Frame Cache (zone's __percpu *pageset → per_cpu_pageset.pcp, a per_cpu_pages struct): count = 0, high = 0, batch = 1, lists[MIGRATE_PCPTYPES] (UNMOVABLE, MOVABLE, RECLAIMABLE).
1. pud_alloc(): Allocate a page frame with GFP_KERNEL flags → order = 0, UNMOVABLE
2. rmqueue_pcplist(): Is a page available in the per-CPU page frame cache's UNMOVABLE list? (Not here: count = 0)
* This example is illustrated during system init
Example: Scan available page frames from free_area
[Diagram] Same setup as the previous slide.
3. __rmqueue_smallest(): Scan free_area[order].free_list[UNMOVABLE] from the lower order to the higher order; no free page is available in free_list[UNMOVABLE] at any order.
Example: No available page frames in pcp and free_area for the specific migration type → steal from another migration type
4-1. __rmqueue_fallback() → steal_suitable_fallback(): Scan the fallback free lists from MAX_ORDER - 1 down to the minimum order.
4-2. [Steal] A suitable free block is found in free_area[10].free_list[MOVABLE].
Example: No available page frames in pcp and free_area for the specific migration type → steal from another migration type (cont.)
5. steal_suitable_fallback() → move_to_free_list(): Move the stolen free block from free_area[10].free_list[MOVABLE] to free_area[10].free_list[UNMOVABLE].
Example: Re-scan available page frames from free_list
6. [try again] __rmqueue_smallest(): Scan free_list[UNMOVABLE] from the lower order to the higher order again.
Example: Re-scan available page frames from free_list (cont.)
7. __rmqueue_smallest() now finds the stolen order-10 block in free_area[10].free_list[UNMOVABLE]:
1. del_page_from_free_list()
2. expand()
Example: Remove page frames from a higher-order free_list and expand them to lower-order free_lists
[Diagram] After the split: free_area[0] nr_free = 2, free_area[1] nr_free = 2, …, free_area[10] nr_free = 1243.
8. [Buddy System] expand(): The remainder of the order-10 block is split, and one block is linked into each lower-order free_list.
9. Return this page descriptor (struct page) to the caller.
Example: Call Trace (1/2)
breakpoints
Example: Call Trace (2/2)
pageblock
1. How is a pageblock organized?
2. Relationship between pageblock and free_list (migrate list)
3. How to add pages to free_list (migrate list)?
• pageblock, MAX_ORDER -1, or else?
pageblock
[Diagram] zone.present_pages is divided into pageblock #0, pageblock #1, …, pageblock #N, each a contiguous group of struct page.
pageblock size (number of pages):
• CONFIG_HUGETLB_PAGE = Y: 512 (= huge page size)
• CONFIG_HUGETLB_PAGE = N: 1024 (order = MAX_ORDER - 1)
pageblock
[Diagram] present_pages = 1311744; pageblock size = 512 pages.
N = round_up(present_pages / pageblock_size) - 1
Example: pageblocks = round_up(1311744 / 512) = 2562 (dmesg breakdown: 16 + 2544 + 2 = 2562)
Relationship between pageblock and migrate list
Move available pages to zone->free_area[] (use a whole pageblock if possible)
• Call path: mm_init -> … -> __free_one_page -> add_to_free_list
• Two consecutive pageblocks can be merged into the order = 10 free_list
✓ Assume "pageblock = 512 pages (order = 9)"
Memory migration (page mobility) is disabled if the number of pages is too low
pageblock – migration type during OS initialization?
While the OS boots, all pageblocks are marked with migration type 'MIGRATE_MOVABLE'.
How to add pages to free_list (migrate list)?
Which method?
• pageblock?
• MAX_ORDER - 1?
• something else?
Principle
• zone->free_area[MAX_ORDER]: from the highest order to the lowest order, based on the start pfn (iteratively merge lower-order pages if possible: see __free_one_page)
[Diagram] __free_one_page(): a freed pageblock (order = 9, free_area[9]) is checked against its buddy ("possible to merge?"); if the buddy is also free, the two are merged into an order-10 block in free_area[10]. Lower-order free pages (free_area[0], free_area[1], …) are merged the same way, and __free_one_page() continues to merge until no buddy is free.
* This example is illustrated during system init
dmesg output: total pages
* Based on kernel 5.11 (x86_64) – QEMU
* 2-socket CPUs (4 cores/socket)
* 16GB memory
* earlyprintk=serial,ttyS0 console=ttyS0 loglevel=8 nokaslr
2097022 + 2097152 = 4194174, but 4194174 != 4128619. Why?
dmesg output: total pages
2097022 + 2097152 = 4194174, but 4194174 != 4128619. Why?
[Hint] zone fields: zone_start_pfn, managed_pages, spanned_pages, present_pages, free_area[MAX_ORDER]. Which two of them do the dmesg totals correspond to?
dmesg output: total pages
2097022 + 2097152 = 4194174, but 4194174 != 4128619, because the two dmesg lines report different counters:
• "On node … totalpages" reports the total present_pages (2097022 + 2097152 = 4194174)
• "Built … zonelists" reports the total managed_pages (4128619)
calculate_node_totalpages
• Calculates zone.spanned_pages and zone.present_pages for each zone
• Sums each zone.present_pages to get the total present_pages
• Prints the "On node 0 totalpages: …" message
free_area_init_core & zone_init_internals
• Calculate managed_pages and set zone.managed_pages
build_all_zonelists
• Sums all zone.managed_pages to get the total managed_pages and prints the "Built %u zonelists, …" message
Total present_pages: breakdown
Let’s focus on “node 0”
Total present_pages: breakdown
Total present_pages = 786302 + 1310720 = 2097022
Total present_pages: breakdown
Questions
1. What does "… %lu pages used for memmap" mean?
• The number of page structs needed to address a zone
2. What does "… %lu pages reserved" mean?
• The number of pages reserved for the DMA zone (check the global variable 'dma_reserve')
3. Why aren't the above-mentioned pages subtracted from present_pages?
• They are a pre-calculated page space requirement: they will consume part of the total present_pages.
Total present_pages: breakdown
Number of page structs to address a zone:
• sizeof(struct page) = 64
• 12286 * 4096 / 64 = 786304 (page structs)
Total present_pages: breakdown
Number of page structs to address a zone:
• sizeof(struct page) = 64
• 12286 * 4096 / 64 = 786304 (page structs)
This matches the memmap value in the dmesg output: the difference is the page alignment.
Total present_pages: breakdown
The memmap pages are computed per zone in one of two ways:
• Case 1: from spanned_pages
• Case 2: [sparse case] from present_pages
Total present_pages: breakdown
pcp: per-CPU pages pool
• batch: pre-allocate 'batch' pages for per_cpu_pages if per_cpu_pages is empty
Page frame allocator: watermark
1. _watermark[] configuration
2. lowmem_reserve[MAX_NR_ZONES]
3. __zone_watermark_ok()
• free_pages and _watermark[] – adjustment
Zone – watermarks
Zone (!highmem_zone):
• _watermark[WMARK_MIN] = min_free_pages * (zone's managed pages / all zones' managed pages)
• _watermark[WMARK_LOW] = _watermark[WMARK_MIN] * (5/4) (Note 1)
• _watermark[WMARK_HIGH] = _watermark[WMARK_MIN] * (3/2) (Note 1)
Note 1: This is the old formula. The new formula includes the kswapd watermark distance according to the scale factor.
[Diagram] Free pages below WMARK_MIN → direct reclaim; below WMARK_LOW → background reclaim by kswapd; above WMARK_HIGH → allocate pages without reclaiming.
Zone – Update _watermark[]
Let's check nr_free_buffer_pages():
• Stage 1: update 'min_free_kbytes' and _watermark[]
• Stage 2 (huge page): update 'min_free_kbytes' and _watermark[]
[Stage 1] Zone – watermark: min_free_kbytes
• nr_free_buffer_pages() = (764421 - 0) + (3335722 - 0) = 4100143 pages: the number of pages beyond the high watermark
• lowmem_kbytes = 4100143 * 4 = 16400572
• new_min_free_kbytes = min_free_kbytes = int_sqrt(16400572 * 16) = floor(sqrt(16400572 * 16)) = 16199
[min_free_kbytes]
• Forces the page frame allocator to keep a minimum number of kilobytes free.
• The page frame allocator uses this number to compute the watermark[WMARK_MIN] value.
[Stage 1] Zone – Update _watermark[]
nr_free_buffer_pages() = (764421 - 0) + (3335722 - 0) = 4100143
lowmem_kbytes = 4100143 * 4 = 16400572
new_min_free_kbytes = min_free_kbytes = int_sqrt(16400572 * 16) = floor(sqrt(16400572 * 16)) = 16199
Zone (!highmem_zone):
• _watermark[WMARK_MIN] = min_free_pages * (zone's managed pages / all zones' managed pages)
• -- Normal Zone -- _watermark[WMARK_MIN] = (16199 / 4) * 3335722 / (764421 + 3335722) = 4049 * 3335722 / 4100143 = 3294
• _watermark[WMARK_LOW] = _watermark[WMARK_MIN] * (5/4) (Note 1)
• _watermark[WMARK_HIGH] = _watermark[WMARK_MIN] * (3/2) (Note 1)
Note 1: This is the old formula. The new formula includes the kswapd watermark distance according to the scale factor.
[Stage 1] calculate_totalreserve_pages()
[Diagram] For each zone (ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE):
• z->_watermark[]: min, low = min * (5/4), high = min * (3/2)
• z->lowmem_reserve[]: lowmem_reserve[ZONE_DMA32], lowmem_reserve[ZONE_NORMAL], lowmem_reserve[ZONE_MOVABLE]
• Reserved pages per zone = WMARK_HIGH + max(z->lowmem_reserve[])
• The high watermark is considered part of the reserved pages.
• [Per-node basis] pglist_data->totalreserve_pages = summation of each zone's reserved pages
[Stage 1] calculate_totalreserve_pages()
Update pglist_data->totalreserve_pages by iterating over each zone:
pglist_data->totalreserve_pages = (2282 + max(0, 0, 0)) + (9964 + max(0, 0, 0)) = 12246
[Stage 1] lowmem_reserve[MAX_NR_ZONES]
Issue Statement
• [Fallback Mechanism] Insufficient memory in a higher zone → allocate memory from a lower zone
o The memory of the lower zone might be exhausted by requests from the higher zone.
o Example (gfp flag)
▪ GFP_DMA: ZONE_DMA
▪ GFP_DMA32: ZONE_DMA32 -> ZONE_DMA
▪ Otherwise: ZONE_NORMAL -> ZONE_DMA32 -> ZONE_DMA
o Scenario
▪ The memory allocated for the higher zone's request could be pinned pages via mlock().
Solution
• lowmem_reserve[MAX_NR_ZONES]
o Ensures a certain number of free pages in the lower zone are reserved.
[Diagram] Fallback chain: ZONE_MOVABLE (sysctl_lowmem_reserve_ratio = 0) → ZONE_NORMAL (ratio = 32) → ZONE_DMA32 (ratio = 256) → ZONE_DMA (ratio = 256)
[Stage 1] lowmem_reserve[MAX_NR_ZONES]
[Diagram] Fallback chain: ZONE_MOVABLE (managed_pages = 0, ratio = 0) → ZONE_NORMAL (managed_pages = 3335722, ratio = 32) → ZONE_DMA32 (managed_pages = 764421, ratio = 256)
Computed lowmem_reserve[] (zone i reserves, against a request targeting higher zone j, the sum of managed pages of the zones above i up to j, divided by zone i's ratio):
• ZONE_DMA32: lowmem_reserve[ZONE_NORMAL] = 3335722 / 256 = 13030; lowmem_reserve[ZONE_MOVABLE] = (3335722 + 0) / 256 = 13030
• ZONE_NORMAL: lowmem_reserve[ZONE_MOVABLE] = 0 / 32 = 0
• ZONE_MOVABLE: all 0
pglist_data->totalreserve_pages = (2282 + max(0, 13030, 13030)) + (9964 + max(0, 0, 0)) = 25276
Zone – lowmem_reserve[MAX_NR_ZONES]: Usage
[Diagram] Fallback chain: ZONE_MOVABLE (managed_pages = 0, sysctl_lowmem_reserve_ratio = 0) → ZONE_NORMAL (managed_pages = 3335722, ratio = 32) → ZONE_DMA32 (managed_pages = 764421, ratio = 256)
[Stage 2] lowmem_reserve[MAX_NR_ZONES] (CONFIG_TRANSPARENT_HUGEPAGE=y)
nr_free_buffer_pages(): get the number of pages beyond the high watermark
pglist_data->totalreserve_pages = (3628 + max(0, 13030, 13030)) + (15833 + max(0, 0, 0)) = 32491
[gdb] tip: watchpoint
set_recommended_min_free_kbytes() is invoked during hugepage_init() instead of init_per_zone_wmark_min().
__zone_watermark_ok
[Flow]
• free_pages = zone_page_state(z, NR_FREE_PAGES)
• free_pages -= (1 << request_order) - 1
• !ALLOC_OOM and !ALLOC_HARDER: free_pages -= z->nr_reserved_highatomic
• !ALLOC_CMA: free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES)
• min = mark = z->_watermark[mark_index]
• ALLOC_HIGH: min -= min / 2
• ALLOC_OOM or ALLOC_HARDER: min -= min / 2 or min -= min / 4
• free_pages > min + z->lowmem_reserve[highest_zoneidx] → watermark ok. The allocation request can be met.
[Callers] zone_watermark_ok, zone_watermark_fast, zone_watermark_ok_safe (used by should_reclaim_retry) and __compaction_suitable all end up in __zone_watermark_ok.
Page Frame Allocator
1. Page Allocation
2. fallback zonelist
3. Page De-allocation
Page Frame Allocator: Page Allocation
__alloc_pages_nodemask:
• [Preparation] prepare_alloc_pages: init the members of the alloc_context struct (highest_zoneidx, zonelist, migratetype, preferred_zoneref and so on); alloc_flags = ALLOC_WMARK_LOW
• [Fast path] get_page_from_freelist:
o for_next_zone_zonelist_nodemask(): iterate over each zone in the zonelist
o zone_watermark_fast? Y (free_pages > wmark_low) → rmqueue; N → node_reclaim (reclaim page caches for a node)
o rmqueue: page order = 0 → rmqueue_pcplist (per-CPU page frame cache); else → __rmqueue → __rmqueue_smallest
• [Slow path] __alloc_pages_slowpath:
o alloc_flags = gfp_to_alloc_flags(gfp_mask): assign the flag 'ALLOC_WMARK_MIN' to alloc_flags
o wake_all_kswapds: wake up kswapds if ALLOC_KSWAPD is set
o get_page_from_freelist: get pages if "wmark_min < free_pages < wmark_low"
o wake_all_kswapds again, then get_page_from_freelist: another try after adjusting zonelist, alloc_flags and nodemask = NULL
o __alloc_pages_direct_reclaim / __alloc_pages_direct_compact / __alloc_pages_may_oom
Page Frame Allocator: gfp_mask and alloc_flags
[Diagram] gfp_mask:
• Zone modifiers: __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE (selecting among ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, ZONE_DEVICE)
• Page mobility: __GFP_MOVABLE, __GFP_RECLAIMABLE, __GFP_THISNODE
• Watermark modifiers: __GFP_ATOMIC, __GFP_HIGH, __GFP_MEMALLOC, __GFP_NOMEMALLOC
• Reclaim modifiers: __GFP_DIRECT_RECLAIM, __GFP_KSWAPD_RECLAIM
[Diagram] alloc_flags:
• Watermark: ALLOC_WMARK_MIN, ALLOC_WMARK_LOW, ALLOC_WMARK_HIGH
• misc: ALLOC_HARDER (set in __alloc_pages_slowpath()), ALLOC_HIGH (equal to __GFP_HIGH), ALLOC_CPUSET, ALLOC_CMA
[Diagram] gfp_mask & alloc_flags together select the eligible free lists: free_list[MIGRATE_UNMOVABLE = 0], free_list[MIGRATE_MOVABLE = 1], free_list[MIGRATE_RECLAIMABLE = 2], free_list[MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES = 3], free_list[MIGRATE_CMA = 4], free_list[MIGRATE_ISOLATE = 5]
Page Frame Allocator: Page Allocation (cont.)
__alloc_pages_nodemask → prepare_alloc_pages (alloc_flags = ALLOC_WMARK_LOW) → get_page_from_freelist:
• for_next_zone_zonelist_nodemask(): iterate over each zone in the zonelist
• zone_watermark_fast? Y (free_pages > wmark_low) → rmqueue; N → node_reclaim (reclaim page caches for a node)
• node_reclaim returns NODE_RECLAIM_NOSCAN or NODE_RECLAIM_FULL → continue; (try the next zone); otherwise re-check zone_watermark_ok? Y → rmqueue
• rmqueue: page order = 0 → rmqueue_pcplist (per-CPU page frame cache); page order > 0 && alloc_flags & ALLOC_HARDER → get the page from free_list[MIGRATE_HIGHATOMIC]; else → __rmqueue → __rmqueue_smallest
node_reclaim_mode (checked in the node_reclaim function):
• Reclaim memory when a zone runs out of memory
• Representation
o 0 (default value): disabled
o 1: Reclaim dirty pages (write)
o 2: Reclaim page caches
Page Frame Allocator: Page Allocation - *rmqueue
Page Frame Allocator: fallback zone list
* Check the macro "for_next_zone_zonelist_nodemask"
[Diagram] Memory Node #0 (CPU #0): ZONE_DMA32, ZONE_NORMAL. Memory Node #1 (CPU #1): ZONE_NORMAL (ZONE_DMA32 = empty).
Each node's pglist_data has node_zones[MAX_NR_ZONES] and node_zonelists[MAX_ZONELISTS]:
• zonelist[ZONELIST_FALLBACK] (_zonerefs[MAX_ZONES_PER_ZONELIST + 1]): node 0's fallback list is node 0 ZONE_NORMAL (zone_idx = 1) → node 0 ZONE_DMA32 (zone_idx = 0) → node 1 ZONE_NORMAL (zone_idx = 1): intra-node zone fallback & inter-node zone fallback.
• zonelist[ZONELIST_NOFALLBACK] (_zonerefs[MAX_ZONES_PER_ZONELIST + 1]): only the local node's zones – ZONE_NORMAL (zone_idx = 1) → ZONE_DMA32 (zone_idx = 0). The NOFALLBACK zonelist is applied if the __GFP_THISNODE flag is set.
Page Frame Allocator: Page Deallocation
Reference
• https://wdv4758h.github.io/notes/blog/linux-kernel-boot.html
• https://www.cnblogs.com/LoyenWang/p/11626237.html
• https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1
• https://www.programmersought.com/article/81176896338/
Physical Memory Management.pdf
  • 5. Zones
    ✓ Zone Allocator – x86 (each zone has a buddy system and a per-CPU page frame cache):
      o ZONE_DMA (physical address: 0-16MB)
      o ZONE_NORMAL (physical address: 16-896MB)
      o ZONE_HIGHMEM (physical address > 896MB)
    ✓ Zone Allocator – x86_64 (each zone has a buddy system and a per-CPU page frame cache):
      o ZONE_DMA (physical address: 0-16MB)
      o ZONE_DMA32 (physical address: 16MB-4GB)
      o ZONE_NORMAL (physical address > 4GB)
      o ZONE_MOVABLE (for memory offline)
      o ZONE_DEVICE (intended for persistent memory)
    ✓ ZONE_MOVABLE is similar to ZONE_NORMAL, except that it contains movable pages with few exceptional cases: see include/linux/mmzone.h
  • 6. Physical memory management – Zone detail
    ✓ pglist_data (pg_data_t), per-node basis: node_zones[MAX_NR_ZONES], node_zonelists[MAX_ZONELISTS], nr_zones, node_start_pfn, node_present_pages, node_spanned_pages, node_id, struct task_struct *kswapd, totalreserve_pages, __lruvec, atomic_long_t vm_stat[]
    ✓ zone: _watermark[NR_WMARK], watermark_boost, nr_reserved_highatomic, lowmem_reserve[MAX_NR_ZONES], zone_pgdat, __percpu *pageset, zone_start_pfn, managed_pages, spanned_pages, present_pages, free_area[MAX_ORDER], atomic_long_t vm_stat[]
    ✓ free_area[0] … free_area[MAX_ORDER - 1]: each holds free_list[MIGRATE_TYPES] and nr_free; free_area[0] chains order-0 pages, free_area[MAX_ORDER - 1] chains order-(MAX_ORDER - 1) pages
    ✓ per_cpu_pageset (per-CPU page frame cache: reduces lock contention between processors): pcp (per_cpu_pages: count, high, batch, lists[MIGRATE_PCPTYPES] = lists[UNMOVABLE] / lists[MOVABLE] / lists[RECLAIMABLE]), expire, vm_numa_stat_diff[], stat_threshold, vm_stat_diff[]
    ✓ node_zonelists: zonelist[ZONELIST_FALLBACK] and zonelist[ZONELIST_NOFALLBACK] (applied if the __GFP_THISNODE flag is set), each an array _zonerefs[MAX_ZONES_PER_ZONELIST + 1] of zoneref entries {struct zone *zone, zone_idx}
  • 7. How to know available memory pages in a system?
    ✓ Flow: BIOS e820 → memblock → zone page frame allocator
    ✓ [Call path] e820__memblock_setup() → … → __free_pages_core()
    ✓ memblock frees available memory space to the zone page frame allocator
  • 8. Placement of physical page frames right after system initialization
    ✓ Most physical page frames are stored in MIGRATE_MOVABLE
    ✓ [Right after system boots] Most physical page frames are stored in the MAX_ORDER - 1 (order = 10) free lists
  • 9. Physical memory defragmentation (anti-fragmentation): Approaches
    1. Buddy system
    2. Memory migration (mobility)
    3. Memory compaction
  • 10. Physical memory defragmentation (Anti-fragmentation): Approaches
    ✓ Defragmentation capability ranges from poor (buddy system) through memory migration (mobility) to better (memory compaction)
    ✓ On page allocation failure at one stage, the next approach is applied
  • 11. Physical memory defragmentation (Anti-fragmentation): Buddy System
    ✓ free_area[0] (order 0), free_area[1] (order 1), …, free_area[N] with N = MAX_ORDER - 1, each holding free_list[MIGRATE_UNMOVABLE] (and the other migrate types)
    ✓ Buddy system: adjacent pages are merged to form larger continuous pages (in the figure, the free order-0 pfns are 0, 2, 4, 5, 8, 9: the buddy pairs 4/5 and 8/9 can be merged)
  • 12. Physical memory defragmentation (Anti-fragmentation): Memory Migration (Mobility)
    ✓ [Buddy system] memory fragmentation: in the example 32-page region, only order-2 pages can be allocated
    ✓ [Concept] Memory fragmentation is reduced by grouping pages based on mobility (migration): reclaimable pages and unmovable pages each get their own region
    ✓ Memory migration delays memory fragmentation; it does not solve the problem
  • 13.–14. Physical memory defragmentation (Anti-fragmentation): Memory compaction
    ✓ Within a MIGRATE_MOVABLE area, compaction moves allocated pages together so that the free pages form one large contiguous range
    ✓ Two scanners work toward each other:
      o Migration scanner: builds a list of allocated pages
      o Free scanner: builds a list of free pages
      The allocated pages are then migrated into the free pages
  • 15. Physical memory defragmentation (anti-fragmentation)
    1. Buddy system
    2. Memory migration (mobility) → detailed discussion follows
    3. Memory compaction
  • 16. Anti-fragmentation: Memory migration (or Mobility)
    ✓ Page group concept: grouping pages with identical mobility (migration type)
    ✓ MIGRATE_UNMOVABLE: allocations of the core kernel
    ✓ MIGRATE_MOVABLE (__GFP_MOVABLE): pages that belong to userspace applications
    ✓ MIGRATE_RECLAIMABLE (__GFP_RECLAIMABLE):
      o File pages (data mapped from files): periodically reclaimed by the kswapd daemon
      o Slab/slub allocations that specify SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers (kswapd): check include/linux/gfp.h
        ▪ SLAB_RECLAIM_ACCOUNT is mainly used by file systems (data structure caches, fs/*) and the radix tree (cache for 'struct radix_tree_node')
    ✓ Fallback: when the free_list of the designated migration type is used up, pages are stolen from the free_list of another migration type (check the 'fallbacks' variable)
  • 17. Memory migration (or Mobility): Users
    ✓ Migrate type from GFP flags: (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT = (gfp_flags & 0x18) >> 3
      o ___GFP_MOVABLE → MIGRATE_MOVABLE
      o ___GFP_RECLAIMABLE → MIGRATE_RECLAIMABLE
      o !___GFP_MOVABLE && !___GFP_RECLAIMABLE → MIGRATE_UNMOVABLE
    ✓ Examples of memory allocation users:
      o GFP_HIGHUSER_MOVABLE: do_user_addr_fault(), wp_page_copy(), do_cow_fault(), do_swap_page()
      o __GFP_RECLAIMABLE: slab/slub with the SLAB_RECLAIM_ACCOUNT flag
      o GFP_KERNEL: page table allocation … and so on
    ✓ If page mobility is disabled, all pages are kept in MIGRATE_UNMOVABLE
  • 18. Page frame allocator: Physical memory defragmentation (anti-fragmentation) – let's see an example:
    1. How the migration works
    2. How the buddy system works
  • 19. Example: Scan available page frames from pcp (illustrated during system init: pcp count = 0, high = 0, batch = 1; free_area[0].nr_free = 1, free_area[1].nr_free = 1, …, free_area[10].nr_free = 1244)
    ✓ Step 1 – pud_alloc(): allocate a page frame with GFP_KERNEL flags → order = 0, UNMOVABLE
    ✓ Step 2 – rmqueue_pcplist(): check the per-cpu page frame cache (lists[MIGRATE_PCPTYPES]) for an available page
  • 20. Example: Scan available page frames from free_area
    ✓ Step 3 – __rmqueue_smallest(): if the pcp list is empty, scan zone->free_area[] from lower order to higher order for the requested migration type
  • 21.–22. Example: No available page frames in pcp and free_area for the requested migration type → steal from another migration type
    ✓ Steps 4-1, 4-2 – __rmqueue_fallback() → steal_suitable_fallback(): scan from MAX_ORDER - 1 down to the minimum order across the fallback migration types
  • 23. Example (continued)
    ✓ Step 5 – steal_suitable_fallback() → move_to_free_list(): move the stolen pages onto the free_list of the requested migration type
  • 24.–25. Example: Re-scan available page frames from free_list
    ✓ Step 6 – __rmqueue_smallest(): try again after the steal
    ✓ Step 7 – once a page is found: del_page_from_free_list(), then expand()
  • 26. Example: Remove page frames from a higher-order free_list and expand them to the lower-order free_lists
    ✓ Steps 8, 9 – [Buddy system] expand() splits the remainder of the higher-order block down the orders (free_area[0].nr_free and free_area[1].nr_free become 2, free_area[10].nr_free drops to 1243), then the page descriptor is returned
  • 27. Example: Call Trace (1/2) breakpoints
  • 29. pageblock
    1. How is a pageblock organized?
    2. Relationship between pageblock and free_list (migrate list)
    3. How to add pages to free_list (migrate list)? By pageblock, by MAX_ORDER - 1, or else?
  • 30. pageblock size: a zone's present_pages are divided into pageblock #0 … pageblock #N
    ✓ CONFIG_HUGETLB_PAGE=y: 512 pages (= huge page size)
    ✓ CONFIG_HUGETLB_PAGE=n: 1024 pages (order MAX_ORDER - 1)
  • 31. pageblock example: zone with present_pages = 1311744 and pageblock size = 512
    ✓ N = round_up(present_pages / pageblock_size) - 1
    ✓ pageblocks = round_up(1311744 / 512) = 2562 (16 + 2544 + 2 = 2562)
  • 32. Relationship between pageblock and migrate list: move available pages to zone->free_area[] (use pageblock if possible)
    ✓ Call path: mm_init → … → __free_one_page → add_to_free_list
    ✓ Two consecutive pageblocks can be merged into the order-10 free_list (assume "pageblock = 512 pages (order = 9)")
  • 33. Relationship between pageblock and migrate list
    ✓ Memory migration (page mobility) is disabled if the number of pages is too low
  • 34. pageblock – migration type during OS initialization
    ✓ While the OS boots, all pageblocks are marked with migration type MIGRATE_MOVABLE
  • 35.–36. How to add pages to free_list (migrate list)? By pageblock, by MAX_ORDER - 1, or else?
    ✓ Principle: zone->free_area[MAX_ORDER] is filled from the highest order to the lowest order based on the start pfn (iteratively merging lower-order pages if possible: see __free_one_page)
  • 37. __free_one_page(): continue to merge (illustrated during system init)
    ✓ Starting from a pageblock (order 9), the buddy system repeatedly asks "possible to merge?" against the neighbouring block to build order-10 free_area entries
  • 39. dmesg output: total pages * Based on kernel 5.11 (x86_64) – QEMU * 2-socket CPUs (4 cores/socket) * 16GB memory * earlyprintk=serial,ttyS0 console=ttyS0 loglevel=8 nokaslr 2097022 + 2097152 = 4194174 4194174 != 4128619 Why?
  • 40. dmesg output: total pages 2097022 + 2097152 = 4194174 4194174 != 4128619 Why? zone zone_start_pfn managed_pages spanned_pages present_pages free_area[MAX_ORDER] Two of them? * Based on kernel 5.11 (x86_64) – QEMU * 2-socket CPUs (4 cores/socket) * 16GB memory * earlyprintk=serial,ttyS0 console=ttyS0 loglevel=8 nokaslr
• 41. dmesg output: total pages 2097022 + 2097152 = 4194174 4194174 != 4128619 Why? Total managed_pages Total present_pages
• 42. Total present_pages & total managed_pages Total managed_pages Total present_pages Total present_pages Total managed_pages calculate_node_totalpages • Calculate zone.spanned_pages and zone.present_pages for each zone • Sum each zone.present_pages to get total present_pages • Print “On node 0 totalpages:….” message free_area_init_core & zone_init_internals • Calculate managed_pages and set zone.managed_pages build_all_zonelists • Sum all zone.managed_pages to get total managed_pages and print “Built %u zonelists, …” message
• 43. Total present_pages: breakdown Let’s focus on “node 0”
• 44. Total present_pages: breakdown Total present_pages = 786302 + 1310720 = 2097022
• 45. Total present_pages: breakdown Questions 1. What does “… %lu pages used for memmap” mean? • The number of page structs needed to address the zone 2. What does “… %lu pages reserved” mean? • The number of pages reserved for the DMA zone (check the global variable ‘dma_reserve’) 3. Why aren’t the above-mentioned pages accumulated in present_pages? • They are pre-calculated space requirements that will consume part of total present_pages
• 46. Number of page structs to address a zone • sizeof(struct page) = 64 • 12286 * 4096 / 64 = 786304 (page structs) Total present_pages: breakdown
• 47. Number of page structs to address a zone • sizeof(struct page) = 64 • 12286 * 4096 / 64 = 786304 (page structs) Total present_pages: breakdown match: the small difference comes from page alignment
• 48. Total present_pages: breakdown sparse case Case 1: spanned_pages Case 2: [sparse case] use present_pages
• 49. Total present_pages: breakdown spanned_pages case Case 1: spanned_pages Case 2: [sparse case] use present_pages
• 50. pcp: per-cpu-pages pool batch: pre-allocate ‘batch’ pages for per_cpu_pages if per_cpu_pages is empty Total present_pages: breakdown
  • 51. Page frame allocator: watermark 1. _watermark[] configuration 2. lowmem_reserve[MAX_NR_ZONES] 3. __zone_watermark_ok() • free_pages and _watermark[] – adjustment
• 52. _watermark[WMARK_LOW] = _watermark[WMARK_MIN] * (5/4) Note 1 Zone (!highmem_zone) _watermark[WMARK_HIGH] = _watermark[WMARK_MIN] * (3/2) Note 1 page #n _watermark[WMARK_MIN] = min_free_pages * (zone’s managed_pages / all zones’ managed_pages) Zone - watermarks Note 1: This is the old formula. The new formula includes the kswapd watermark distance according to the scale factor. Background reclaim by kswapd Direct reclaim Allocate page without reclaiming pages
  • 53. Zone – Update _watermark[] Let’s check nr_free_buffer_pages() Stage 1: update ‘min_free_kbytes’ and _watermark[] Stage 2 (huge page): update ‘min_free_kbytes’ and _watermark[]
  • 54. [Stage 1] Zone – watermark: min_free_kbytes: • nr_free_buffer_pages() = (764421 – 0 ) + (3335722 – 0) = 4100143 pages: Get number of pages beyond high watermark • lowmem_kbytes = 4100143 * 4 = 16400572 • new_min_free_kbytes = min_free_kbytes = int_sqrt(16400572 * 16) = floor(sqrt(16400572 * 16)) = 16199 [min_free_kbytes] • Force the page frame allocator to keep a minimum number of kilobytes free. • The page frame allocator uses this number to compute a watermark[WMARK_MIN] value.
  • 55. [Stage 1] Zone – watermark: min_free_kbytes: nr_free_buffer_pages() = (764421 – 0 ) + (3335722 – 0) = 4100143 lowmem_kbytes = 4100143 * 4 = 16400572 new_min_free_kbytes = min_free_kbytes = int_sqrt(16400572 * 16) = floor(sqrt(16400572 * 16)) = 16199
• 56. _watermark[WMARK_LOW] = _watermark[WMARK_MIN] * (5/4) Note 1 Zone (!highmem_zone) _watermark[WMARK_HIGH] = _watermark[WMARK_MIN] * (3/2) Note 1 page #n _watermark[WMARK_MIN] = min_free_pages * (zone’s managed_pages / all zones’ managed_pages) -- Normal Zone -- _watermark[WMARK_MIN] = (u64)(16199 / 4) * 3335722 / (764421 + 3335722) = 4049 * 3335722 / 4100143 = 3294 Note 1: This is the old formula. The new formula includes the kswapd watermark distance according to the scale factor. Background reclaim by kswapd Direct reclaim Allocate page without reclaiming pages [Stage 1] Zone – Update _watermark[]
• 57. [Stage 1] calculate_totalreserve_pages() ZONE_NORMAL ZONE_DMA32 WMARK_HIGH WMARK_HIGH z->_watermark[] high = min * (3/2) low = min * (5/4) min high = min * (3/2) low = min * (5/4) min z->lowmem_reserve[] lowmem_reserve[ZONE_MOVABLE] lowmem_reserve[ZONE_NORMAL] lowmem_reserve[ZONE_DMA32] lowmem_reserve[ZONE_MOVABLE] lowmem_reserve[ZONE_NORMAL] lowmem_reserve[ZONE_DMA32] ZONE_MOVABLE WMARK_HIGH high = min * (3/2) low = min * (5/4) min lowmem_reserve[ZONE_MOVABLE] lowmem_reserve[ZONE_NORMAL] lowmem_reserve[ZONE_DMA32] pglist_data->totalreserve_pages high + max(z->lowmem_reserve[]) high + max(z->lowmem_reserve[]) high + max(z->lowmem_reserve[]) • The high watermark is counted as reserved pages. • [Per-node basis] pglist_data->totalreserve_pages = sum of each zone’s reserved pages _watermark[WMARK_MIN] = min_free_pages * (zone’s managed_pages / all zones’ managed_pages) Update pglist_data->totalreserve_pages by iterating each zone
• 58. [Stage 1] calculate_totalreserve_pages() z->lowmem_reserve[] lowmem_reserve[ZONE_MOVABLE] lowmem_reserve[ZONE_NORMAL] lowmem_reserve[ZONE_DMA32] lowmem_reserve[ZONE_MOVABLE] lowmem_reserve[ZONE_NORMAL] lowmem_reserve[ZONE_DMA32] lowmem_reserve[ZONE_MOVABLE] lowmem_reserve[ZONE_NORMAL] lowmem_reserve[ZONE_DMA32] pglist_data->totalreserve_pages high + max(z->lowmem_reserve[]) high + max(z->lowmem_reserve[]) high + max(z->lowmem_reserve[]) • The high watermark is counted as reserved pages. • [Per-node basis] pglist_data->totalreserve_pages = sum of each zone’s reserved pages pglist_data->totalreserve_pages = (2282 + max(0, 0, 0)) + (9964 + max(0,0,0)) = 12246
• 59. [Stage 1] lowmem_reserve[MAX_NR_ZONES] Issue Statement • [Fallback Mechanism] Insufficient memory in a higher zone → Allocate memory from a lower zone o The memory of the lower zone might be exhausted by requests from the higher zone. o Example (gfp flag) ▪ GFP_DMA: ZONE_DMA ▪ GFP_DMA32: ZONE_DMA32 -> ZONE_DMA ▪ Otherwise: ZONE_NORMAL -> ZONE_DMA32 -> ZONE_DMA o Scenario ▪ Memory allocated from the lower zone on behalf of a higher-zone request may be pinned via mlock(), so it cannot be reclaimed later. Solution • lowmem_reserve[MAX_NR_ZONES] o Ensure a certain amount of free pages of the lower zone are reserved. ZONE_MOVABLE sysctl_lowmem_reserve_ratio = 0 ZONE_NORMAL sysctl_lowmem_reserve_ratio = 32 ZONE_DMA32 sysctl_lowmem_reserve_ratio = 256 ZONE_DMA sysctl_lowmem_reserve_ratio = 256 fallback fallback fallback
• 60. ZONE_MOVABLE managed_pages = 0 sysctl_lowmem_reserve_ratio = 0 ZONE_NORMAL managed_pages = 3335722 sysctl_lowmem_reserve_ratio = 32 ZONE_DMA32 managed_pages = 764421 sysctl_lowmem_reserve_ratio = 256 fallback fallback [Stage 1] lowmem_reserve[MAX_NR_ZONES] Resulting lowmem_reserve[] per zone: • ZONE_DMA32: lowmem_reserve[ZONE_DMA32] = 0, lowmem_reserve[ZONE_NORMAL] = 3335722 / 256 = 13030, lowmem_reserve[ZONE_MOVABLE] = (3335722 + 0) / 256 = 13030 • ZONE_NORMAL: lowmem_reserve[ZONE_NORMAL] = 0, lowmem_reserve[ZONE_MOVABLE] = 0 / 32 = 0 • ZONE_MOVABLE: all 0 pglist_data->totalreserve_pages = (2282 + max(0, 13030, 13030)) + (9964 + max(0,0,0)) = 25276
  • 61. Zone – lowmem_reserve[MAX_NR_ZONES]: Usage ZONE_MOVABLE managed_pages = 0 sysctl_lowmem_reserve_ratio = 0 ZONE_NORMAL managed_pages = 3335722 sysctl_lowmem_reserve_ratio = 32 ZONE_DMA32 managed_pages = 764421 sysctl_lowmem_reserve_ratio = 256 fallback fallback
• 62. [Stage 2] lowmem_reserve[MAX_NR_ZONES] (CONFIG_TRANSPARENT_HUGEPAGE=y) nr_free_buffer_pages(): Get number of pages beyond high watermark pglist_data->totalreserve_pages = (3628 + max(0, 13030, 13030)) + (15833 + max(0,0,0)) = 32491
• 63. [Stage 2] lowmem_reserve[MAX_NR_ZONES] (CONFIG_TRANSPARENT_HUGEPAGE=y) [gdb tip: set a watchpoint] set_recommended_min_free_kbytes() is invoked from hugepage_init() instead of init_per_zone_wmark_min()
  • 64. lowmem_reserve[] free_pages = zone_page_state(z, NR_FREE_PAGES) free_pages -= (1 << request_order) - 1 free_pages free_pages -= z->nr_reserved_highatomic free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES) !ALLOC_CMA min = mark = z->_watermark[mark_index] min -= min / 2 min -= min / 2 or min -= min /4 ALLOC_HIGH ALLOC_OOM or ALLOC_HARDER Zone: pages !ALLOC_OOM or !ALLOC_HARDER __zone_watermark_ok __compaction_suitable __zone_watermark_ok zone_watermark_ok zone_watermark_fast zone_watermark_ok_safe should_reclaim_retry free_pages > min + z->lowmem_reserve[highest_zoneidx] → watermark ok. The allocation request can be met.
  • 65. Page Frame Allocator 1. Page Allocation 2. fallback zonelist 3. Page De-allocation
  • 66. Page Frame Allocator: Page Allocation __alloc_pages_nodemask prepare_alloc_pages Init the members of alloc_context struct: * highest_zoneidx, zonelist, migratetype , preferred_zoneref and so on. get_page_from_freelist node_reclaim zone_watermark_fast? rmqueue __alloc_pages_slowpath rmqueue_pcplist __rmqueue_smallest page order = 0 per-cpu page frame cache Preparation Slow path alloc_flags = gfp_to_alloc_flags(gfp_mask) Y: free_pages > wmark_low N for_next_zone_zonelist_nodemask(): Iterate each zone from zonelist Assign flag ‘ALLOC_WMARK_MIN’ to alloc_flags Fast path get_page_from_freelist wake_all_kswapds get_page_from_freelist __alloc_pages_direct_reclaim __alloc_pages_direct_compact __alloc_pages_may_oom Get pages if “wmark_min < free_pages < wmark_low” Wake up kswapds if ALLOC_KSWAPD is set Another try after adjusting zonelist, alloc_flags and nodemask = NULL wake_all_kswapds Wake up kswapds if ALLOC_KSWAPD is set __rmqueue else alloc_flags = ALLOC_WMARK_LOW reclaim page caches for a node
  • 67. free_list[MIGRATE_UNMOVABLE = 0] free_list[MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES = 3] free_list[MIGRATE_MOVABLE = 1] free_list[MIGRATE_RECLAIMABLE = 2] free_list[MIGRATE_CMA = 4] free_list[MIGRATE_ISOLATE = 5] __GFP_DMA __GFP_DMA32 __GFP_MOVABLE Zone modifiers __GFP_MOVABLE __GFP_RECLAIMABLE __GFP_THISNODE Page mobility __GFP_ATOMIC __GFP_HIGH __GFP_MEMALLOC Watermark modifiers __GFP_NOMEMALLOC __GFP_DIRECT_RECLAIM __GFP_KSWAPD_RECLAIM Reclaim modifiers ALLOC_WMARK_MIN ALLOC_WMARK_LOW ALLOC_WMARK_HIGH Watermark ALLOC_HARDER ALLOC_HIGH ALLOC_CPUSET misc ALLOC_CMA equal set ALLOC_HARDER in __alloc_pages_slowpath() ZONE_DMA ZONE_DMA32 ZONE_NORMAL ZONE_MOVABLE ZONE_DEVICE gfp_mask alloc_flags gfp_mask & alloc_flags
• 68. Page Frame Allocator: Page Allocation __alloc_pages_nodemask prepare_alloc_pages Init the members of alloc_context struct: * highest_zoneidx, zonelist, migratetype, preferred_zoneref and so on. get_page_from_freelist node_reclaim zone_watermark_fast? rmqueue rmqueue_pcplist __rmqueue_smallest page order = 0 page order > 0 && alloc_flags & ALLOC_HARDER → Get page from free_list[MIGRATE_HIGHATOMIC] per-cpu page frame cache Preparation Y: free_pages > wmark_low N for_next_zone_zonelist_nodemask(): Iterate each zone from zonelist Fast path __rmqueue else alloc_flags = ALLOC_WMARK_LOW reclaim page caches for a node continue; try next zone zone_watermark_ok? Y NODE_RECLAIM_NOSCAN or NODE_RECLAIM_FULL N node_reclaim_mode (checked in node_reclaim function) • Reclaim memory when a zone runs out of memory • Bitmask values o 0 (default): disabled o 1: node (zone) reclaim on o 2: write out dirty pages during reclaim o 4: swap/unmap pages during reclaim
  • 69. Page Frame Allocator: Page Allocation - *rmqueue
  • 70. Page Frame Allocator: fallback zone list * Check macro “for_next_zone_zonelist_nodemask” CPU #0 Memory Node #0 ZONE_NORMAL ZONE_DMA32 pglist_data (pg_data_t) node_zones node_zonelists CPU #1 Memory Node #1 ZONE_NORMAL ZONE_DMA32 = empty zoneref struct zone *zone zone_idx = 1 zoneref struct zone *zone zone_idx = 0 pglist_data (pg_data_t) node_zones[MAX_NR_ZONES] node_zonelists[MAX_ZONELISTS] zoneref struct zone *zone zone_idx = 1 zoneref struct zone *zone zone_idx = 1 zoneref struct zone *zone zone_idx = 0 zonelist[ZONELIST_FALLBACK] _zonerefs[MAX_ZONES_PER_ZONELIST + 1] zonelist[ZONELIST_NOFALLBACK] _zonerefs[MAX_ZONES_PER_ZONELIST + 1] Apply NOFALLBACK zonelist if __GFP_THISNODE flag is set fallback list nofallback list fallback Intra-node zone fallback & inter-node zone fallback
  • 71. Page Frame Allocator: Page Deallocation
  • 72. Reference • https://wdv4758h.github.io/notes/blog/linux-kernel-boot.html • https://www.cnblogs.com/LoyenWang/p/11626237.html • https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-1 • https://www.programmersought.com/article/81176896338/