Note: When you view the slide deck via a web browser, the screenshots may be blurred. You can download and view them offline (the screenshots are clear).
1. Physical Memory (Physical Page Frame)
Management in Linux Kernel
Adrian Huang | July, 2022
* Based on kernel 5.11 (x86_64) – QEMU
* 1-socket CPUs (8 cores/socket)
* 16GB memory
* Kernel parameter: nokaslr norandmaps
* Userspace: ASLR is disabled
* Legacy BIOS
2. Agenda
• Physical memory (physical page frame) management overview
✓Data structures about node, zone, zonelist, migrate type and per-cpu page frame
cache (per_cpu_pageset struct: PCP)
✓Placement of physical page frames right after system initialization
• Physical memory defragmentation (anti-fragmentation): Approaches
• Physical memory defragmentation (anti-fragmentation): Example
✓How the migration works
✓How the buddy system works
• pageblock
• dmesg output: total pages
• Page frame allocator
✓Watermark
✓Call Path
3. Agenda
• Not covered in this talk
✓Memory compaction, page reclaiming and OOM killer
▪ Will be discussed in the future
5. Zones
Zone Allocator – x86
Each zone has its own buddy system and per-CPU page frame cache:
• ZONE_DMA (physical address: 0-16MB)
• ZONE_NORMAL (physical address: 16-896MB)
• ZONE_HIGHMEM (physical address > 896MB)
Zone Allocator – x86_64
Each zone has its own buddy system and per-CPU page frame cache:
• ZONE_DMA (physical address: 0-16MB)
• ZONE_DMA32 (physical address: 16MB-4GB)
• ZONE_NORMAL (physical address > 4GB)
• ZONE_MOVABLE (for memory offline)
• ZONE_DEVICE (intended for persistent memory)
• ZONE_MOVABLE is similar to ZONE_NORMAL, except that it contains movable pages, with a few exceptional cases: see include/linux/mmzone.h
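The per-zone buddy lists described above can be observed from userspace via /proc/buddyinfo, which prints one line per zone with the number of free blocks at each order. A minimal parser sketch; the sample text and its numbers are made up for illustration and are not taken from the slides:

```python
# Minimal /proc/buddyinfo parser: each line lists, per zone, the number of
# free blocks for each order (order 0 .. MAX_ORDER - 1).
def parse_buddyinfo(text):
    zones = {}
    for line in text.strip().splitlines():
        # Line format: "Node 0, zone   Normal  5  3  7 ..."
        head, _, counts = line.partition("zone")
        node = int(head.split(",")[0].split()[1])
        fields = counts.split()
        zone_name = fields[0]
        zones[(node, zone_name)] = [int(n) for n in fields[1:]]
    return zones

sample = """Node 0, zone      DMA      1      1      0      2      1      1      1      0      1      1      3
Node 0, zone   Normal      5      3      7      2      1      0      1      2      1      1   1244"""
info = parse_buddyinfo(sample)
# Total free pages in a zone = sum over orders of count[order] * 2^order.
free_pages = sum(c << o for o, c in enumerate(info[(0, "Normal")]))
```

On a real system, `open("/proc/buddyinfo").read()` would replace the sample string.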
6. pglist_data (pg_data_t) – Physical memory management, zone detail
• struct pglist_data (one per node): node_zones[MAX_NR_ZONES], node_zonelists[MAX_ZONELISTS], nr_zones, node_start_pfn, node_present_pages, node_spanned_pages, node_id, struct task_struct *kswapd, totalreserve_pages, __lruvec, atomic_long_t vm_stat[]
• struct zone: _watermark[NR_WMARK], watermark_boost, nr_reserved_highatomic, lowmem_reserve[MAX_NR_ZONES], zone_pgdat, __percpu *pageset, zone_start_pfn, managed_pages, spanned_pages, present_pages, free_area[MAX_ORDER], atomic_long_t vm_stat[]
• struct per_cpu_pageset (per-cpu page frame cache: reduces lock contention between processors): pcp, expire, vm_numa_stat_diff[], stat_threshold, vm_stat_diff[]
• struct per_cpu_pages (pcp): count, high, batch, lists[MIGRATE_PCPTYPES] (lists[UNMOVABLE], lists[MOVABLE], lists[RECLAIMABLE])
• zone->free_area[]: free_area[0] … free_area[MAX_ORDER - 1], each with free_list[MIGRATE_TYPES] and nr_free; free_area[0] holds order-0 pages, free_area[MAX_ORDER - 1] holds order = (MAX_ORDER - 1) pages
• node_zonelists[]: zonelist[ZONELIST_FALLBACK] and zonelist[ZONELIST_NOFALLBACK], each an array _zonerefs[MAX_ZONES_PER_ZONELIST + 1] of struct zoneref (struct zone *zone, zone_idx); the NOFALLBACK zonelist is applied when the __GFP_THISNODE flag is set
7. How to know available memory pages in a system?
BIOS e820 → memblock → zone page frame allocator
[Call Path] e820__memblock_setup() → … → __free_pages_core(): memblock frees the available memory space to the zone page frame allocator
8. Placement of physical page frames right after system initialization
1. Most physical page frames are stored in MIGRATE_MOVABLE
2. [Right after the system boots] Most physical page frames are stored at the highest order, MAX_ORDER - 1 (order = 10)
11. Physical memory defragmentation (Anti-fragmentation): Buddy System
• free_area[0] → free_list[MIGRATE_UNMOVABLE]: order-0 pages (e.g. pfns 0, 2, 4, 5, 8, 9)
• free_area[1] → free_list[MIGRATE_UNMOVABLE]: order-1 pages
• …
• free_area[N] → free_list[MIGRATE_UNMOVABLE]: order-N pages, where N = MAX_ORDER - 1
Buddy system: adjacent free pages are merged to form larger contiguous pages
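The merge rule above can be sketched with the classic buddy arithmetic: at order n, the buddy of the block starting at pfn is pfn XOR (1 << n), and two free buddies combine into one order-(n+1) block. A minimal sketch, not kernel code:

```python
def buddy_pfn(pfn, order):
    # The buddy of the block at 'pfn' is found by flipping bit 'order'.
    return pfn ^ (1 << order)

def merge(pfn, order):
    # When both buddies are free, the combined block starts at the lower pfn
    # and has order + 1.
    return min(pfn, buddy_pfn(pfn, order)), order + 1

# Order-0 pages 4 and 5 are buddies; merging yields an order-1 block at pfn 4.
assert buddy_pfn(4, 0) == 5
assert merge(4, 0) == (4, 1)
# That order-1 block's buddy starts at pfn 6; together they form order 2.
assert buddy_pfn(4, 1) == 6
assert merge(6, 1) == (4, 2)
```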
12. Physical memory defragmentation (Anti-fragmentation): Memory Migration (Mobility)
• [Buddy System] Memory fragmentation: with free pages scattered (e.g. at pfns 0, 7, 15, 23, 31), only order-2 pages can be allocated
• [Concept] Memory fragmentation is reduced by grouping pages based on mobility (migration): e.g. reclaimable pages and unmovable pages are kept in separate groups
• Memory migration delays memory fragmentation; it does not solve the problem
16. Anti-fragmentation: Memory migration (or Mobility)
• MIGRATE_UNMOVABLE: Allocations of the core kernel
• MIGRATE_MOVABLE (__GFP_MOVABLE): Pages that belong to userspace applications
• MIGRATE_RECLAIMABLE (__GFP_RECLAIMABLE): File pages (data mapped from files), periodically reclaimed by the kswapd daemon; also slab/slub allocations that specify SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers (kswapd): check include/linux/gfp.h.
o SLAB_RECLAIM_ACCOUNT: mainly used by file systems for data structure caches (fs/*), e.g. the radix tree's cache for 'struct radix_tree_node'
• Each order's free_area has one free_list per migration type: free_list[MIGRATE_UNMOVABLE], free_list[MIGRATE_MOVABLE], free_list[MIGRATE_RECLAIMABLE]
• Designated migration type: steal from a fallback free list if the designated type's pages are used up (fallback order: check the 'fallbacks' variable)
• Page group concept: grouping pages with identical mobility (migration type)
17. Memory migration (or Mobility): Users
• ___GFP_MOVABLE → MIGRATE_MOVABLE
• ___GFP_RECLAIMABLE → MIGRATE_RECLAIMABLE
• !___GFP_MOVABLE && !___GFP_RECLAIMABLE → MIGRATE_UNMOVABLE
• Migrate type = (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT = (gfp_flags & 0x18) >> 3
Memory allocation examples:
• GFP_KERNEL (page table allocation … and so on) → MIGRATE_UNMOVABLE
• GFP_HIGHUSER_MOVABLE (do_user_addr_fault(), wp_page_copy(), do_cow_fault(), do_swap_page()) → MIGRATE_MOVABLE
• __GFP_RECLAIMABLE (slab/slub with the SLAB_RECLAIM_ACCOUNT flag) → MIGRATE_RECLAIMABLE
* If page mobility is disabled, all pages are kept in MIGRATE_UNMOVABLE
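The bit arithmetic on this slide can be reproduced directly. The flag values below match include/linux/gfp.h in kernel 5.11 (___GFP_MOVABLE = 0x08, ___GFP_RECLAIMABLE = 0x10); the function name mirrors the kernel's gfp_migratetype() but this is only an illustrative model:

```python
# GFP-flag-to-migrate-type mapping, mirroring (gfp_flags & 0x18) >> 3.
GFP_RECLAIMABLE = 0x10   # ___GFP_RECLAIMABLE
GFP_MOVABLE     = 0x08   # ___GFP_MOVABLE
GFP_MOVABLE_MASK  = GFP_RECLAIMABLE | GFP_MOVABLE   # 0x18
GFP_MOVABLE_SHIFT = 3

MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE = 0, 1, 2

def gfp_migratetype(gfp_flags):
    # (gfp_flags & 0x18) >> 3 yields 0, 1 or 2.
    return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT

assert gfp_migratetype(0) == MIGRATE_UNMOVABLE             # e.g. GFP_KERNEL
assert gfp_migratetype(GFP_MOVABLE) == MIGRATE_MOVABLE     # GFP_HIGHUSER_MOVABLE
assert gfp_migratetype(GFP_RECLAIMABLE) == MIGRATE_RECLAIMABLE
```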
18. Page frame allocator: Physical memory
defragmentation (anti-fragmentation)
Let’s see an example:
1. How the migration works
2. How the buddy system works
19. Example: Scan available page frames from pcp
[Zone state during system init] free_area[0]: nr_free = 1, free_area[1]: nr_free = 1, …, free_area[10]: nr_free = 1244; each order's free_list[MIGRATE_TYPES] has per-type list_heads (UNMOVABLE, MOVABLE, RECLAIMABLE). Per-cpu page frame cache (zone->pageset → per_cpu_pageset → pcp, a per_cpu_pages struct): count = 0, high = 0, batch = 1, lists[MIGRATE_PCPTYPES].
1. pud_alloc(): allocate a page frame with GFP_KERNEL flags → order = 0, UNMOVABLE
2. rmqueue_pcplist(): is a page available in the per-cpu page frame cache?
* This example is illustrated during system init
20. Example: Scan available page frames from free_area
(Same zone state as the previous slide.)
1. pud_alloc(): allocate a page frame with GFP_KERNEL flags → order = 0, UNMOVABLE
2. rmqueue_pcplist(): no page available in the per-cpu page frame cache
3. __rmqueue_smallest(): scan free_area[] from lower order to higher order, checking each order's free_list for an available page
* This example is illustrated during system init
21. Example: No available page frames in pcp and free_area for the specific migration type → steal from another migration type
1. pud_alloc(): allocate a page frame with GFP_KERNEL flags → order = 0, UNMOVABLE
2. rmqueue_pcplist(): no page available in the per-cpu page frame cache
3. __rmqueue_smallest(): no page available at any order's free_list for the designated migration type
4. __rmqueue_fallback() → steal_suitable_fallback() [Steal]: scan [from MAX_ORDER - 1 down to min_order] (steps 4-1, 4-2)
* This example is illustrated during system init
22. (Same example, continued) __rmqueue_fallback() → steal_suitable_fallback() walks the free lists [from MAX_ORDER - 1 to min_order] looking for a suitable fallback migration type to steal from (steps 4-1, 4-2).
23. (Same example, continued)
5. steal_suitable_fallback() → move_to_free_list(): move the stolen pages to the free_list of the requested migration type
24. Example: Re-scan available page frames from free_list
6. [Try again] __rmqueue_smallest(): scan free_area[] again from lower order to higher order
25. Example: Re-scan available page frames from free_list (continued)
7. Once an available page is found:
1. del_page_from_free_list()
2. expand()
26. Example: Remove page frames from a higher-order free_list and expand them into lower-order free_lists
[Zone state afterwards] free_area[0]: nr_free = 2, free_area[1]: nr_free = 2, …, free_area[10]: nr_free = 1243 – one order-10 page was removed and split downward.
8. [Buddy System] expand(): split the removed higher-order page and put the remainders back on the lower-order free_lists
9. Return this page descriptor
* This example is illustrated during system init
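The whole walk in slides 19–26 can be condensed into a toy allocator: try the per-cpu cache first, then scan free_area[] upward from the requested order for the designated migrate type, then steal from a fallback type from the highest order downward, and finally split (expand) the larger block. This is a simplified model with made-up names (FALLBACKS, alloc, expand), not the kernel's actual code:

```python
MAX_ORDER = 11
FALLBACKS = {"UNMOVABLE": ["RECLAIMABLE", "MOVABLE"],
             "MOVABLE": ["RECLAIMABLE", "UNMOVABLE"],
             "RECLAIMABLE": ["UNMOVABLE", "MOVABLE"]}

def expand(zone, high, low, migratetype):
    # Remove one block of order 'high', split it down to order 'low',
    # returning the split-off buddies to lower-order free lists.
    pfn = zone[high][migratetype].pop()
    while high > low:
        high -= 1
        zone[high][migratetype].append(pfn + (1 << high))  # give back buddy
    return pfn

def alloc(zone, order, migratetype, pcp):
    # 1) order-0 requests try the per-cpu page frame cache first.
    if order == 0 and pcp[migratetype]:
        return pcp[migratetype].pop()
    # 2) __rmqueue_smallest()-style scan: lower order -> higher order.
    for o in range(order, MAX_ORDER):
        if zone[o][migratetype]:
            return expand(zone, o, order, migratetype)
    # 3) __rmqueue_fallback()-style steal: MAX_ORDER - 1 -> requested order.
    for o in range(MAX_ORDER - 1, order - 1, -1):
        for fb in FALLBACKS[migratetype]:
            if zone[o][fb]:
                zone[o][migratetype].append(zone[o][fb].pop())  # steal
                return expand(zone, o, order, migratetype)
    return None

zone = [{t: [] for t in FALLBACKS} for _ in range(MAX_ORDER)]
zone[10]["MOVABLE"] = [0]          # one free order-10 block at pfn 0
pcp = {t: [] for t in FALLBACKS}
page = alloc(zone, 0, "UNMOVABLE", pcp)   # forces the steal + expand path
```

Here the order-10 MOVABLE block is stolen, split down to order 0, and pfn 0 is returned, leaving one free block at every lower order.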
29. pageblock
1. How is a pageblock organized?
2. Relationship between pageblock and free_list (migrate list)
3. How to add pages to free_list (migrate list)?
• pageblock, MAX_ORDER -1, or else?
30. pageblock size
zone->present_pages is divided into pageblock #0, pageblock #1, …, pageblock #N:
CONFIG_HUGETLB_PAGE | Number of pages per pageblock
Y | 512 (= huge page size)
N | 1024 (order MAX_ORDER - 1)
31. pageblock size – example
zone->present_pages = 1311744, divided into pageblock #0 … pageblock #N (pageblock size = 512 pages, CONFIG_HUGETLB_PAGE=Y)
N = round_up(present_pages / pageblock_size) - 1
Example: pageblocks = round_up(1311744 / 512) = 2562 (16 + 2544 + 2 = 2562)
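The pageblock count above is simple arithmetic; a quick check with the slide's numbers (variable names are illustrative):

```python
import math

PAGEBLOCK_ORDER = 9                    # CONFIG_HUGETLB_PAGE: 512 pages
pageblock_size = 1 << PAGEBLOCK_ORDER

present_pages = 1311744
# round_up(present_pages / pageblock_size): a trailing partial pageblock
# still counts as a whole pageblock.
pageblocks = math.ceil(present_pages / pageblock_size)
last_index = pageblocks - 1            # pageblock #N

assert pageblocks == 2562
assert last_index == 2561
```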
32. Move available pages to zone->free_area[] (use pageblock if possible)
• Call path: mm_init → … → __free_one_page → add_to_free_list
• Two consecutive pageblocks can be merged into the order-10 free_list
✓ Assume "pageblock = 512 pages (order = 9)"
Relationship between pageblock and migrate list
33. Relationship between pageblock and migrate list
Memory migration (page mobility) is disabled if the number of pages is too low
34. pageblock – migration type during OS initialization?
While the OS boots, all pageblocks are marked with migration type 'MIGRATE_MOVABLE'
35. How to add pages to free_list (migrate list)?
Which method?
• pageblock?
• MAX_ORDER -1?
• else?
36. How to add pages to free_list (migrate list)?
Which method?
• pageblock?
• MAX_ORDER -1?
• else?
Principle
• zone->free_area[MAX_ORDER]: filled from the highest order to the lowest order based on the start pfn (iteratively merging lower-order pages where possible: see __free_one_page)
37. Example: __free_one_page() continues to merge
[Zone state during system init] free_area[0]: nr_free = 2, free_area[1]: nr_free = 2, …, free_area[9]: nr_free = 1, free_area[10]: nr_free = 1244
• A freed pageblock (order 9) sits in free_area[9]; at each order, __free_one_page() asks "possible to merge?" and keeps merging free buddies into higher orders
* This example is illustrated during system init
40. dmesg output: total pages
2097022 + 2097152 = 4194174, yet 4194174 != 4128619. Why?
Which two zone fields explain the difference: zone_start_pfn, managed_pages, spanned_pages, present_pages, free_area[MAX_ORDER]?
* Based on kernel 5.11 (x86_64) – QEMU
* 2-socket CPUs (4 cores/socket)
* 16GB memory
* earlyprintk=serial,ttyS0 console=ttyS0
loglevel=8 nokaslr
41. dmesg output: total pages
2097022 + 2097152 = 4194174 is the total present_pages, while 4128619 is the total managed_pages.
42. Total present_pages & total managed_pages
calculate_node_totalpages
• Calculate zone.spanned_pages and zone.present_pages for each zone
• Sum each zone.present_pages to get total present_pages
• Print “On node 0 totalpages:….” message
free_area_init_core & zone_init_internals
• Calculate managed_pages and set zone.managed_pages
build_all_zonelists
• Sum all zone.managed_pages to get total managed_pages and print
“Built %u zonelists, …” message
43. Total present_pages: breakdown
Let’s focus on “node 0”
44. Total present_pages: breakdown
Total present_pages = 786302 + 1310720 = 2097022
45. Total present_pages: breakdown
Questions
1. What does "… %lu pages used for memmap" mean?
• The number of pages holding the page structs needed to address the zone
2. What does "… %lu pages reserved" mean?
• The number of pages reserved for the DMA zone (check the global variable 'dma_reserve')
3. Why aren't the above-mentioned pages counted in present_pages?
• The page space requirement is pre-calculated: those pages will consume part of total present_pages
46. Number of page structs to address a zone
• sizeof(struct page) = 64
• 12286 * 4096 / 64 = 786304 (page struct)
Total present_pages: breakdown
47. Number of page structs to address a zone
• sizeof(struct page) = 64
• 12286 * 4096 / 64 = 786304 (page struct)
Total present_pages: breakdown
match: the difference is due to page alignment
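The memmap arithmetic on these two slides checks out directly, with sizeof(struct page) = 64 on x86_64 and PAGE_SIZE = 4096:

```python
import math

PAGE_SIZE = 4096
SIZEOF_STRUCT_PAGE = 64

# 12286 memmap pages hold enough page structs for the whole zone:
assert 12286 * PAGE_SIZE // SIZEOF_STRUCT_PAGE == 786304

# Going the other way: pages needed to describe 786302 present pages.
memmap_pages = math.ceil(786302 * SIZEOF_STRUCT_PAGE / PAGE_SIZE)
assert memmap_pages == 12286   # 786304 - 786302 = 2 is just page alignment
```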
48. Total present_pages: breakdown
sparse case
Case 1: spanned_pages
Case 2: [sparse case] use present_pages
49. Total present_pages: breakdown
spanned_pages case
Case 1: spanned_pages
Case 2: [sparse case] use present_pages
50. pcp: per-cpu-pages pool
batch: pre-allocate ‘batch’ pages for per_cpu_pages if per_cpu_pages is empty
52. Zone – watermarks (for a !highmem_zone)
• _watermark[WMARK_MIN] = min_free_pages ∗ (zone's managed pages / all zones' managed pages)
• _watermark[WMARK_LOW] = _watermark[WMARK_MIN] ∗ 5/4 (Note 1)
• _watermark[WMARK_HIGH] = _watermark[WMARK_MIN] ∗ 3/2 (Note 1)
Free pages below WMARK_MIN → direct reclaim; between WMARK_MIN and WMARK_LOW → background reclaim by kswapd; above WMARK_LOW → allocate pages without reclaiming.
Note 1: This is the old formula. The new formula includes the kswapd watermark distance according to the scale factor.
53. Zone – Update _watermark[]
Let’s check nr_free_buffer_pages()
Stage 1: update ‘min_free_kbytes’
and _watermark[]
Stage 2 (huge page): update
‘min_free_kbytes’ and _watermark[]
54. [Stage 1] Zone – watermark: min_free_kbytes:
• nr_free_buffer_pages() = (764421 – 0 ) + (3335722 – 0) = 4100143 pages: Get number of pages beyond
high watermark
• lowmem_kbytes = 4100143 * 4 = 16400572
• new_min_free_kbytes = min_free_kbytes = int_sqrt(16400572 * 16) = floor(sqrt(16400572 * 16)) = 16199
[min_free_kbytes]
• Force the page frame allocator to keep a minimum number of kilobytes free.
• The page frame allocator uses this number to compute a watermark[WMARK_MIN] value.
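The numbers on this slide follow from the kernel's int_sqrt(), a floor square root; math.isqrt computes the same thing:

```python
from math import isqrt

# Pages above the high watermark in ZONE_DMA32 and ZONE_NORMAL (slide values).
nr_free_buffer_pages = (764421 - 0) + (3335722 - 0)
assert nr_free_buffer_pages == 4100143

lowmem_kbytes = nr_free_buffer_pages * 4          # 4 KiB per page
assert lowmem_kbytes == 16400572

# min_free_kbytes = int_sqrt(lowmem_kbytes * 16) = floor square root
min_free_kbytes = isqrt(lowmem_kbytes * 16)
assert min_free_kbytes == 16199
```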
56. [Stage 1] Zone – Update _watermark[] (for a !highmem_zone)
• _watermark[WMARK_MIN] = min_free_pages ∗ (zone's managed pages / all zones' managed pages)
• _watermark[WMARK_LOW] = _watermark[WMARK_MIN] ∗ 5/4 (Note 1)
• _watermark[WMARK_HIGH] = _watermark[WMARK_MIN] ∗ 3/2 (Note 1)
-- Normal Zone -- _watermark[WMARK_MIN] (u64 integer arithmetic; min_free_kbytes converted to pages: 16199 / 4):
(16199 / 4) ∗ 3335722 / (764421 + 3335722) = 4049 ∗ 3335722 / 4100143 = 3294
Note 1: This is the old formula. The new formula includes the kswapd watermark distance according to the scale factor.
Free pages below WMARK_MIN → direct reclaim; between WMARK_MIN and WMARK_LOW → background reclaim by kswapd; above WMARK_LOW → allocate pages without reclaiming.
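The Normal-zone computation above, redone in integer arithmetic (u64 in the kernel, so every division truncates):

```python
min_free_kbytes = 16199
managed = {"DMA32": 764421, "Normal": 3335722}   # slide values, per zone
total_managed = sum(managed.values())            # 4100143

# Kilobytes -> 4 KiB pages: 16199 * 1024 / 4096 = 16199 / 4
min_free_pages = min_free_kbytes * 1024 // 4096

# _watermark[WMARK_MIN] = min_free_pages * zone_managed / all_zones_managed
wmark_min = min_free_pages * managed["Normal"] // total_managed
assert min_free_pages == 4049
assert wmark_min == 3294

# Old-formula derived marks (Note 1 on the slide):
wmark_low = wmark_min + wmark_min // 4    # min * 5/4
wmark_high = wmark_min + wmark_min // 2   # min * 3/2
```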
57. [Stage 1] calculate_totalreserve_pages()
For each zone (ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE):
• z->_watermark[]: min, low = min ∗ 5/4, high = min ∗ 3/2 (with _watermark[WMARK_MIN] = min_free_pages ∗ zone's managed pages / all zones' managed pages)
• z->lowmem_reserve[]: entries for ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE
• The zone's reserved pages = WMARK_HIGH + max(z->lowmem_reserve[]) – the high watermark is considered part of the reserved pages
• [Per-node basis] pglist_data->totalreserve_pages = summation of each zone's reserved pages
Update pglist_data->totalreserve_pages by iterating over each zone.
59. [Stage 1] lowmem_reserve[MAX_NR_ZONES]
Issue Statement
• [Fallback Mechanism] Insufficient memory in a higher zone → allocate memory from a lower zone
o The memory of the lower zone might be exhausted by requests made on behalf of the higher zone.
o Example (gfp flag)
▪ GFP_DMA: ZONE_DMA
▪ GFP_DMA32: ZONE_DMA32 → ZONE_DMA
▪ Otherwise: ZONE_NORMAL → ZONE_DMA32 → ZONE_DMA
o Scenario
▪ The allocation that fell back from the higher zone could be pinned pages via mlock(), which can never be reclaimed.
Solution
• lowmem_reserve[MAX_NR_ZONES]
o Ensure a certain amount of free pages of the lower zone are reserved.
Fallback chain and sysctl_lowmem_reserve_ratio:
ZONE_MOVABLE (ratio = 0) → fallback → ZONE_NORMAL (ratio = 32) → fallback → ZONE_DMA32 (ratio = 256) → fallback → ZONE_DMA (ratio = 256)
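A sketch of how those ratios translate into lowmem_reserve[] values, assuming the kernel's setup_per_zone_lowmem_reserve() logic: zone i reserves, against allocations that may fall back from a higher zone j, the managed pages of the zones above it divided by its sysctl_lowmem_reserve_ratio. The managed-page numbers below are illustrative, not from the slides:

```python
# lowmem_reserve sketch: pages of zones (i, j] that zone i keeps free for
# fallback allocations targeting zone j.  Ratios are from the slide; managed
# page counts are made up for illustration.
zones = ["DMA", "DMA32", "Normal", "Movable"]
managed = {"DMA": 3840, "DMA32": 764421, "Normal": 3335722, "Movable": 0}
ratio = {"DMA": 256, "DMA32": 256, "Normal": 32, "Movable": 0}

def lowmem_reserve(i, j):
    pages_above = sum(managed[zones[k]] for k in range(i + 1, j + 1))
    return pages_above // ratio[zones[i]] if ratio[zones[i]] else 0

# ZONE_DMA's reserve against GFP_KERNEL (ZONE_NORMAL-targeted) allocations:
assert lowmem_reserve(0, 2) == (764421 + 3335722) // 256   # 16016
# ZONE_DMA32's reserve against ZONE_NORMAL-targeted allocations:
assert lowmem_reserve(1, 2) == 3335722 // 256              # 13030
```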
64. lowmem_reserve[] in the watermark check
__zone_watermark_ok – called via zone_watermark_ok, zone_watermark_fast, zone_watermark_ok_safe (should_reclaim_retry) and __compaction_suitable – computes:
• free_pages = zone_page_state(z, NR_FREE_PAGES)
• free_pages -= (1 << request_order) - 1
• [!ALLOC_OOM or !ALLOC_HARDER] free_pages -= z->nr_reserved_highatomic
• [!ALLOC_CMA] free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES)
• min = mark = z->_watermark[mark_index]
• [ALLOC_HIGH] min -= min / 2
• [ALLOC_OOM or ALLOC_HARDER] min -= min / 2 or min -= min / 4
• free_pages > min + z->lowmem_reserve[highest_zoneidx] → watermark ok; the allocation request can be met.
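The flow above amounts to the following check. This is a simplified model of __zone_watermark_ok(); the flag names come from the slide, and the sample numbers are illustrative:

```python
def zone_watermark_ok(free_pages, order, mark, lowmem_reserve,
                      alloc_high=False, alloc_harder=False, alloc_oom=False,
                      reserved_highatomic=0, free_cma=0, alloc_cma=True):
    # A request of 2^order pages can leave up to (2^order - 1) pages unusable.
    free_pages -= (1 << order) - 1
    if not (alloc_oom or alloc_harder):
        free_pages -= reserved_highatomic      # keep the highatomic reserve
    if not alloc_cma:
        free_pages -= free_cma                 # CMA pages are off limits
    min_wm = mark
    if alloc_high:
        min_wm -= min_wm // 2                  # ALLOC_HIGH: dig deeper
    if alloc_oom:
        min_wm -= min_wm // 2
    elif alloc_harder:
        min_wm -= min_wm // 4
    return free_pages > min_wm + lowmem_reserve

# Plain requests against the Normal zone's WMARK_MIN = 3294 from slide 56:
assert zone_watermark_ok(5000, 0, 3294, 0) is True
assert zone_watermark_ok(3000, 0, 3294, 0) is False
# ALLOC_HARDER lowers the bar (min -= min / 4), so the same request passes:
assert zone_watermark_ok(3000, 0, 3294, 0, alloc_harder=True) is True
```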
66. Page Frame Allocator: Page Allocation
__alloc_pages_nodemask:
• [Preparation] prepare_alloc_pages: init the members of the alloc_context struct (highest_zoneidx, zonelist, migratetype, preferred_zoneref and so on); alloc_flags = ALLOC_WMARK_LOW
• [Fast path] get_page_from_freelist
o for_next_zone_zonelist_nodemask(): iterate over each zone in the zonelist
o zone_watermark_fast? Y (free_pages > wmark_low) → rmqueue: page order = 0 → rmqueue_pcplist (per-cpu page frame cache); else → __rmqueue → __rmqueue_smallest
o zone_watermark_fast? N → node_reclaim: reclaim page caches for a node
• [Slow path] __alloc_pages_slowpath
o alloc_flags = gfp_to_alloc_flags(gfp_mask): assign flag 'ALLOC_WMARK_MIN' to alloc_flags
o wake_all_kswapds: wake up kswapds if ALLOC_KSWAPD is set
o get_page_from_freelist: get pages if "wmark_min < free_pages < wmark_low"
o Another try after adjusting zonelist, alloc_flags and nodemask = NULL: get_page_from_freelist
o wake_all_kswapds (wake up kswapds if ALLOC_KSWAPD is set), then __alloc_pages_direct_reclaim, __alloc_pages_direct_compact and __alloc_pages_may_oom
68. Page Frame Allocator: Page Allocation (fast path detail)
__alloc_pages_nodemask:
• [Preparation] prepare_alloc_pages: init the members of the alloc_context struct (highest_zoneidx, zonelist, migratetype, preferred_zoneref and so on); alloc_flags = ALLOC_WMARK_LOW
• [Fast path] get_page_from_freelist: for_next_zone_zonelist_nodemask() iterates over each zone in the zonelist
o zone_watermark_fast? Y (free_pages > wmark_low) → rmqueue
▪ page order = 0 → rmqueue_pcplist (per-cpu page frame cache)
▪ page order > 0 && alloc_flags & ALLOC_HARDER → get a page from free_list[MIGRATE_HIGHATOMIC]
▪ else → __rmqueue → __rmqueue_smallest
o zone_watermark_fast? N → node_reclaim (reclaim page caches for a node)
▪ NODE_RECLAIM_NOSCAN or NODE_RECLAIM_FULL → continue; try the next zone
▪ Otherwise: zone_watermark_ok? Y → rmqueue
• node_reclaim_mode (checked in the node_reclaim function)
o Reclaim memory when a zone runs out of memory
o Representation
▪ 0 (default value): disabled
▪ 1: Reclaim write dirty pages
▪ 2: Reclaim page caches