2. Agenda
• Initialization function list
• The list of the functions called from the kernel startup
function (start_kernel)
• The list of the functions called from selected callees of
the start_kernel function
• setup_arch
• rest_init, and the following functions
• Initialization topics
• Multiprocessor (SMP) Initialization
4. Initialization Overview
(Flow of the boot process; each stage is listed with the source location that implements it.)
• Booting code: arch/*/boot/
(Preparing CPU states, gathering HW information, decompressing vmlinux, etc.)
• Low-level initialization: arch/*/kernel/head*.S, head*.c
(Switching to the virtual memory world, getting prepared for C programs)
• Initialization: init/main.c (start_kernel)
(Initializing all the kernel features, including architecture-dependent parts;
calls into arch/*/kernel, arch/*/mm, …)
• Creating the “init” process and letting it do the rest of the initialization: init/main.c (rest_init)
(Setting up multiprocessing, scheduling)
• “Swapper” (PID=0) now sleeps: kernel/sched/idle.c (cpu_idle_loop)
• Performing the final initialization and “exec”ing the user-space “init” (PID=1): init/main.c (kernel_init)
(The stages from the low-level initialization onward are part of vmlinux.)
5. start_kernel (1)
# Function Category Description
1 lockdep_init Debug Lock validator
2 smp_setup_processor_id* SMP Initialize processor ID (some architecture)
3 debug_objects_early_init Debug Lifetime debugging facility for objects
4 boot_init_stack_canary* Debug Decide the canary value for the stack
protector
5 cgroup_init_early cgroup Early init for some cgroup subsystems
6 boot_cpu_init SMP Set the boot cpu for various cpumasks
7 page_address_init MM Initialize hash for kmap (highmem)
8 setup_arch*
9 mm_init_owner MM Set init_mm’s owner to init_task
10 mm_init_cpumask MM Set the cpu mask pointer to the mm’s cpumask
(only if CPUMASK_OFFSTACK)
11 setup_command_line Init Copy the command line parameter to newly
allocated buffer (allocated by memblock)
12 setup_nr_cpu_ids SMP Set “nr_cpu_ids” according to the last bit in
cpu_possible_mask
(Functions marked with * are mostly architecture-dependent code.)
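Item 12 derives nr_cpu_ids from the position of the highest set bit in the possible-CPU mask. A minimal user-space sketch of that calculation follows. It assumes the mask fits in a single unsigned long (the kernel uses a multi-word bitmap), and nr_cpu_ids_from_mask is a hypothetical name, not a kernel function.

```c
#include <limits.h>

/*
 * Hypothetical helper: returns (index of the highest set bit) + 1,
 * or 0 when the mask is empty.
 */
unsigned int nr_cpu_ids_from_mask(unsigned long possible_mask)
{
    unsigned int ids = 0;
    unsigned int bit;

    for (bit = 0; bit < sizeof(possible_mask) * CHAR_BIT; bit++) {
        if (possible_mask & (1UL << bit))
            ids = bit + 1;   /* remember the highest bit seen so far */
    }
    return ids;
}
```

With a sparse mask such as 0x5 (CPUs 0 and 2 possible), the result is 3, not 2: the value is an upper bound on CPU IDs rather than a count of CPUs.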
6. start_kernel (2)
# Function Category Description
13 setup_per_cpu_areas* SMP Allocate and initialize percpu areas
14 smp_prepare_boot_cpu* SMP Prepare for SMP boot
15 build_all_zonelists MM Initializes “zonelist”
16 page_alloc_init MM Add a handler for CPU hotplug (to drain pages)
17 parse_early_param Init Parse “early” options
18 parse_args Init Parse the rest of options
19 jump_label_init Option Jump label (self-modification)
20 setup_log_buf Debug Allocate and initialize printk log buffer
21 pidhash_init Sched Initialize PID hash
22 vfs_caches_init_early FS Early initialization of the VFS hash tables
(dcache, inode)
23 sort_main_extable MM Sort the exception table (used in page faults)
24 trap_init* CPU Initialize trap handlers
7. start_kernel (3)
# Function Category Description
25 mm_init MM Initialize MM
25A page_cgroup_init_flatmem MM Allocate pages for page_cgroup
25B mem_init* MM Free pages for buddy allocator
25C kmem_cache_init MM Initialize cache
25D percpu_init_late MM Replaces per-cpu chunks with those
allocated by slab
25E pgtable_init* MM Create cache for ptlock and pgtable (SH etc.)
25F vmalloc_init MM Initialize vmalloc
26 sched_init Sched Initialize scheduler
27 idr_cache_init Util Initialize IDR (ID to pointer translation)
28 rcu_init SMP Initialize RCU
29 tick_nohz_init Sched Initialize NOHZ (enable context tracking)
30 radix_tree_init Util Initialize radix tree (create cache, etc.)
31 early_irq_init* CPU Initialize irq_desc.
8. start_kernel (4)
# Function Category Description
32 init_IRQ* CPU Initialize various IRQs (on x86, set gates for
APIC interrupts, etc.)
33 tick_init Timer Tick broadcast (to emulate local timer)
34 init_timers Timer Timer stats, notifier, and timer softirq
35 hrtimers_init Timer hrtimer notifier, and hrtimer softirq
36 softirq_init Sched Tasklet lists, and tasklet softirqs
37 timekeeping_init Timer Clocksource
38 time_init* Timer (Platform-dependent) timer initialization
39 sched_clock_postinit Sched Start the hrtimer
40 perf_event_init Debug Perf events
41 profile_init Debug (Simple) profiler
42 call_function_init SMP Initialize csd (call single data) queue
local_irq_enable CPU At this point, interrupts are enabled
9. start_kernel (5)
# Function Category Description
43 kmem_cache_init_late MM Post-initialization of cache (slab)
44 console_init Console Call console initcalls
45 lockdep_info Debug Print lockdep information
46 locking_selftest Debug Test spinlocks, rwlocks, mutexes, and
rwsemaphores
47 page_cgroup_init cgroup Page cgroup
48 debug_objects_mem_init Debug Enable dynamic allocation for debugobjects
(#3), and replace the static objects with newly
allocated ones
49 kmemleak_init Debug kmemleak (Memory leak check facility)
50 setup_per_cpu_pageset MM Per-cpu pageset
51 numa_policy_init MM NUMA (VMA) policy
52 late_time_init* Timer Late initialization
(In x86, HPET and TSC are initialized)
10. start_kernel (6)
# Function Category Description
53 sched_clock_init Sched Set the time info for scheduler
54 calibrate_delay Timer Calibrate for the “delay” functions
55 pidmap_init Process Init PID map for initial PID namespace
56 anon_vma_init MM Create cache for “anon_vma”
57 acpi_early_init ACPI ACPI Subsystems, load DSDT
58 thread_info_cache_init Process Allocate cache for thread_info if its size is
less than PAGE_SIZE
59 cred_init Security Task credential
60 fork_init Process Allocate a cache for task_struct
61 proc_caches_init MM Allocate caches for mm_struct, etc.
62 buffer_init FS Allocate a cache for buffer_head
63 key_init Security Allocate a cache for key_jar
64 security_init Security Call security_initcall’s
65 dbg_late_init Debug Late init for kgdb
11. start_kernel (7)
# Function Category Description
66 vfs_caches_init FS Allocate SLAB caches and hashtables for
various VFS caches (dcache, inode_cache, …)
67 signals_init Sched Allocate a cache for sigqueue
68 page_writeback_init MM Initialize the ratio for the dirty pages
69 proc_root_init Procfs Create the root for procfs and some
directories
70 cgroup_init Cgroup Initialize the rest of cgroups
71 cpuset_init Sched The top-level cpuset
72 taskstats_init_early Sched Task statistics exposed to the user level
73 delayacct_init Sched Task delay accounting
74 check_bugs* CPU Work around some architecture-dependent bugs
(on x86_64, alternatives are initialized and the
first 2MB page is split into 4K pages)
75 sfi_init_late SFI Map the area again using ioremap
13. setup_arch (x86) (1)
# Function Category Description
1 memblock_reserve MM Reserve the text area
2 early_reserve_initrd MM Reserve the initrd area
3 clone_pgd_range, load_cr3 MM Switch to swapper_pg_dir (i386 only)
4 olpc_ofw_detect Platform OLPC OFW Stuff
5 early_trap_init CPU Init debug and int3 gate
6 early_cpu_init CPU Detect CPU’s vendor (registered in
cpu_dev_register: Intel, AMD, Cyrix…) and
calls early_init and bsp_init
7 early_ioremap_init MM Init early ioremap
8 setup_olpc_ofw_pgd Platform OLPC OFW Stuff
9 (Parsing boot parameters) Setup --
10 x86_init.oem.arch_setup Platform OEM-dependent setup (Intel MID etc.)
11 setup_memory_map MM Copy and print e820 information
12 parse_setup_data Setup Parse setup_data in boot_params
14. setup_arch (x86) (2)
# Function Category Description
13 copy_edd Setup Copy BIOS EDD information
14 (prepare init_mm) MM Set start_code, end_code, etc. for init_mm
15 (command line stuffs) Setup
16 x86_configure_nx MM Set ptemask according to whether NX is
supported by CPU
17 parse_early_param Setup (=#17 in start_kernel)
18 x86_report_nx MM Print NX information
19 memblock_x86_reserve_range_setup_data MM Reserve the setup_data area
20 acpi_mps_check SMP Check if ACPI is disabled and MPS code is not
built-in
21 early_pci_dump_devices Device Dump PCI info before PCI is initialized
22 e820_reserve_setup_data MM Reserve the setup_data area in e820
23 finish_e820_parsing Setup Sanitize e820 info and print e820 info.
16. setup_arch (x86) (4)
# Function Cat. Description
24 dmi_scan_machine DMI Check if DMI (Desktop Management Interface)
is present or not
25 dmi_memdev_walk DMI Walk through the DMI table
26 dmi_set_dump_stack_arch_desc DMI Set architecture description* for dump_stack
27 init_hypervisor_platform VM Get the hypervisor information and init
(e.g. Get Hz using special I/O port when
running on VMWare)
28 probe_roms MM Request resources for Video ROM, Extension
ROMs, etc.
29 insert_resource MM Insert resources for kernel’s code, data, BSS
30 e820_add_kernel_range MM Add the kernel code and data areas to e820 if
they are not marked as E820_RAM
31 trim_bios_range MM Reserve BIOS areas in e820
(*) The architecture description is the “Hardware name:” line in output such as:
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
CPU: 3 PID: 2763 Comm: irqbalance Tainted: G W 3.14.13 #1
Hardware name: Supermicro X9SRH-7F/7TF/X9SRH-7F77TF, BIOS 3.00 07/05/2013
17. setup_arch (x86) (5)
# Function Category Description
32 early_gart_iommu_check Device Check GART (Graphics Address Remapping
Table)
33 (Assignment to max_pfn) MM Set max_pfn to the last page in e820
34 mtrr_bp_init CPU MTRRs (Memory Type Range Registers)
35 check_x2apic CPU Enable X2APIC if available
36 find_smp_config SMP Find the SMP config for Intel MP Spec.
37 reserve_ibft_region Device Reserve iSCSI Boot Format Table
38 early_alloc_pgt_buf MM Allocate page table buffer (to be used in the
early stage)
39 reserve_brk MM Reserve brk area
40 cleanup_highmap MM Unmap out-of-range areas in the kernel map
41 memblock_set_current_limit MM Set the memblock’s allocation limit to
ISA_END_ADDRESS
42 memblock_x86_fill MM Fill the memblock info according to e820
18. setup_arch (x86) (6)
# Function Category Description
43 early_reserve_e820_mpc_new SMP Allocate memory for the mptable
44 setup_bios_corruption_check Setup Fill 64KB of low memory with a pattern to
detect whether the BIOS corrupts the area
45 reserve_real_mode CPU/SMP Reserve some low memory for the trampoline
46 trim_platform_memory_ranges Setup Special tricks (reserve) for some platforms
(some Sandy Bridge)
47 trim_low_memory_range MM Reserve the first 4KB page in memblock
48 init_mem_mapping MM Reconstruct memory mapping
49 early_trap_pf_init CPU Set page fault handler
50 setup_real_mode CPU/SMP Setup the trampoline code
51 memblock_set_current_limit MM Change the limit to the last page mapped
52 dma_contiguous_reserve MM Allocate contiguous area for DMA
19. setup_arch (x86) (7)
# Function Cat. Description
53 setup_log_buf Debug Setup printk log buffer
54 reserve_initrd MM Reserve the initrd
55 acpi_initrd_override ACPI Find the ACPI override info in initrd
56 vsmp_init Setup vSMP (ScaleMP Inc.)
57 io_delay_init Setup Check DMI override for I/O delay strategy
58 acpi_boot_table_init ACPI ACPI BOOT table parsing
59 early_acpi_boot_init ACPI Parse MADT in ACPI
60 initmem_init MM Setup node information based on ACPI (if
NUMA)
61 reserve_crashkernel Debug Reserve memory for crashkernel
62 memblock_find_dma_reserve MM Count the reserved pages in DMA zone
63 pagetable_init MM Initialize sparse mem, and zone sizes
64 tboot_init CPU Intel TXT (Trusted eXecution Technology)
support
20. setup_arch (x86) (8)
# Function Cat. Description
65 map_vsyscall CPU Map vsyscall
66 generic_apic_probe CPU Probe APIC driver
67 early_quirks PCI Apply some quirks for certain devices
68 acpi_boot_init ACPI Parse (again) BOOT, FADT, MADT, HPET etc.
69 sfi_init SFI SFI (Simple Firmware Interface)
70 x86_dtb_init Setup Device tree
71 get_smp_config SMP (If ACPI is not found) construct the table
72 prefill_possible_map SMP Set the possible CPU map
73 init_cpu_to_node NUMA Set up the cpu to node map
74 init_apic_mappings CPU Set the local APIC address
75 x86_io_apic_ops.init CPU I/O APIC
76 kvm_guest_init Virt. KVM Guest (paravirt ops, etc.)
77 e820_reserve_resources MM Reserve resources for e820 entries
21. setup_arch (x86) (9)
# Function Cat. Description
78 e820_mark_nosave_regions PM Add non-RAM area in e820 to nosave regions
79 x86_init.resources.reserve_resources I/O Reserve standard I/O resources (Timer, KB, …)
80 e820_setup_gap MM Find the largest gap in e820, and let PCI
use the gap to allocate new MMIO areas
81 x86_init.oem.banner Debug “Booting paravirtualized kernel on %s”
82 x86_init.timers.wallclock_init Timer (NOP; defined in MID only)
83 mcheck_init CPU Machine check (temperature)
84 arch_init_ideal_nops CPU Set the NOP instructions ideal to the current
platform
85 register_refined_jiffies Timer Register “refined_jiffies” clocksource
22. setup_arch (ARM) (1)
# Function Category Description
1 setup_processor CPU Processor initialization
2 setup_machine_fdt Setup Parse the device tree
3 setup_machine_tags Setup If #2 fails, parse the ATAGs
4 (prepare init_mm) MM Set start_code, end_code, etc. for init_mm
5 (command line stuffs) Setup (=#15 in x86)
6 parse_early_param Setup (=#17 in x86)
7 (sort meminfo) MM Sort the memory information
8 early_paging_init MM Recreate the page table prepared during boot
9 setup_dma_zone MM Setup the dma zone information
10 sanity_check_meminfo MM Sanitize the meminfo
11 arm_memblock_init MM Add free memory from meminfo, and reserve
various reserved areas.
12 paging_init MM Permanent kmap area
23. setup_arch (ARM) (2)
# Function Category Description
13 request_standard_resources MM Reserve resources for system memory, video
RAM
14 unflatten_device_tree Setup Create a tree from FDT
15 arm_dt_init_cpu_maps CPU Create CPU logical map based on the device
tree
16 psci_init CPU Read the method to be used for CPU on, off,
etc.
17 smp_init_cpus SMP Initialize the CPU cores available
18 smp_build_mpidr_hash SMP Precompute shifts required to get index from
MPIDR (Multiprocessor Affinity Register) value
19 hyp_mode_check Virt. Check if the CPU is running in HYP mode
20 reserve_crashkernel Debug Reserve memory for crashkernel
21 mdesc->init_early (Platform-specific initialization)
24. The rest of initialization
• rest_init (init/main.c)
• Create two kernel threads
• “init” (PID = 1; it gradually becomes the init user process)
• “kthreadd” (PID = 2; it allows init to create other kernel threads)
static noinline void __init_refok rest_init(void)
{
	rcu_scheduler_starting();
	...
	/* Spawn "init" first so that it obtains PID 1 */
	kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
	numa_default_policy();
	/* Spawn "kthreadd" (PID 2) and look up its task_struct */
	pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
	rcu_read_lock();
	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
	rcu_read_unlock();
	/* Let kernel_init proceed: kthreadd is now ready */
	complete(&kthreadd_done);
	...
	init_idle_bootup_task(current);
	schedule_preempt_disabled();
	...
	/* The boot task (PID 0) enters the idle loop */
	cpu_startup_entry(CPUHP_ONLINE);
}
26. kernel_init
• Call the remaining init functions (kernel_init_freeable)
• Synchronize all the asynchronous operations
• Free the initmem (free_initmem)
• Mark RO Data to RO (and NX) (mark_rodata_ro)
• Set the system state to SYSTEM_RUNNING
• Set the current NUMA policy to default
(numa_default_policy)
• Try to execve(2) the “init” program
• If the rdinit parameter is set, exec that path
• If the init parameter is set, exec that path
• Otherwise, try “/sbin/init”, “/etc/init”, “/bin/init”, and “/bin/sh” in this order
• If nothing worked, panic with a familiar message:
"No working init found. Try passing init= option to kernel. See Linux
Documentation/init.txt for guidance."
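The fallback order described above can be sketched in user space as follows. This is a simplified illustration, not the kernel's code: pick_init, try_exec_fn, and the demo_* predicates are hypothetical names, the callback stands in for a real exec attempt, and a failed rdinit= or init= path is assumed to fall through to the next candidate.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns 0 if executing `path` would succeed. */
typedef int (*try_exec_fn)(const char *path);

/* Demo predicates standing in for a real exec attempt. */
static int demo_only_bin_sh(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

static int demo_nothing_works(const char *path)
{
    (void)path;
    return -1;
}

/*
 * Returns the first candidate accepted by try_exec, or NULL when
 * nothing works (the point at which the kernel panics).
 */
const char *pick_init(const char *rdinit, const char *init,
                      try_exec_fn try_exec)
{
    static const char *const defaults[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };
    size_t i;

    if (rdinit != NULL && try_exec(rdinit) == 0)
        return rdinit;
    if (init != NULL && try_exec(init) == 0)
        return init;
    for (i = 0; i < sizeof(defaults) / sizeof(defaults[0]); i++) {
        if (try_exec(defaults[i]) == 0)
            return defaults[i];
    }
    return NULL;
}
```

In the kernel a successful exec never returns, so the "first candidate that works" is simply the program that ends up running as PID 1.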
27. kernel_init_freeable
• First, wait for the completion of kthreadd’s setup
• Set init’s allowed cpus/mems to all CPUs and nodes
• Set cad_pid to init’s PID
• Prepare to boot other CPUs (smp_prepare_cpus)
• Call early initcalls (do_pre_smp_initcalls)
• Initialize lockup_detector (lockup_detector_init)
• Initialize multiprocessor (smp_init)
• Boots up other cores/sockets
• Initialize the scheduler (sched_init_smp)
• Call the do_basic_setup function (-> Next slide)
• Open “/dev/console” and dup twice (fd : 0 to 2)
• Check if the ramdisk is available
• If not, try to mount root (prepare_namespace)
• Load the I/O scheduler (elevator) module
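The “open /dev/console and dup twice” step above has a direct user-space analogue, sketched below. In the kernel the new process has no open descriptors yet, so the three results are 0, 1, and 2. In an ordinary program they are the lowest free numbers instead. open_console_triplet is a hypothetical helper name.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens `path` read/write and duplicates the descriptor twice.
 * On success returns 0 and stores the three descriptors in fds[];
 * on failure returns -1 and leaves nothing open.
 */
int open_console_triplet(const char *path, int fds[3])
{
    fds[0] = open(path, O_RDWR);
    if (fds[0] < 0)
        return -1;

    fds[1] = dup(fds[0]);
    if (fds[1] < 0) {
        close(fds[0]);
        return -1;
    }

    fds[2] = dup(fds[0]);
    if (fds[2] < 0) {
        close(fds[1]);
        close(fds[0]);
        return -1;
    }
    return 0;
}
```

All three descriptors refer to the same open file, which is why stdin, stdout, and stderr of init initially share one console.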
28. do_basic_setup
• Re-initialize cpuset to the active CPUs
(cpuset_init_smp)
• Initialize user-mode helper (khelper)
• Initialize tmpfs (shmem_init)
• Initialize drivers (driver_init)
• Create proc directories and files for IRQs (init_irq_proc)
• Call constructors (do_ctors) (CONFIG_CONSTRUCTORS)
• Enable the user-mode helper workqueue
• Call all the initcalls (do_initcalls)
• Initialize random values (random_int_secret_init)
29. initcalls
• Facility to call initialization functions during the
initialization (in the kernel_init_freeable function)
• Example
static int cpu_pm_init(void)
{
	register_syscore_ops(&cpu_pm_syscore_ops);
	return 0;
}
core_initcall(cpu_pm_init);
(kernel/cpu_pm.c)
30. Level of initcalls
• Several levels (the order to call) are defined
Macro Lv. # Description
early_initcall early called before smp
pure_initcall 0 no dependency, variable initialization
core_initcall{,_sync} 1, 1s
postcore_initcall{,_sync} 2, 2s
arch_initcall{,_sync} 3, 3s
subsys_initcall{,_sync} 4, 4s
fs_initcall{,_sync} 5, 5s
rootfs_initcall rootfs
device_initcall{,_sync} 6, 6s
late_initcall{,_sync} 7, 7s
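The effect of the levels can be illustrated with a small user-space simulation: functions are registered in arbitrary order, but they run in ascending level order. This is only a sketch under simplifying assumptions. The names (initcall_entry, run_initcalls_by_level, demo_*) are hypothetical, the _sync and rootfs variants are left out, and the kernel does not search a table like this (it walks per-level linker sections).

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

struct initcall_entry {
    int level;          /* 0..7, the "Lv. #" column above */
    initcall_t fn;
};

/* Trace buffer so that the call order can be inspected afterwards. */
static char demo_trace[8];
static size_t demo_trace_len;

static int demo_record(char tag)
{
    if (demo_trace_len < sizeof(demo_trace) - 1)
        demo_trace[demo_trace_len++] = tag;
    return 0;
}

static int demo_core(void)   { return demo_record('c'); }  /* level 1 */
static int demo_device(void) { return demo_record('d'); }  /* level 6 */
static int demo_late(void)   { return demo_record('l'); }  /* level 7 */

/* Deliberately listed out of order. */
static const struct initcall_entry demo_table[] = {
    { 7, demo_late },
    { 1, demo_core },
    { 6, demo_device },
};

/* Calls every entry, lowest level first; returns the number of calls. */
size_t run_initcalls_by_level(const struct initcall_entry *tab, size_t n)
{
    size_t called = 0;
    size_t i;
    int level;

    for (level = 0; level <= 7; level++) {
        for (i = 0; i < n; i++) {
            if (tab[i].level == level) {
                tab[i].fn();
                called++;
            }
        }
    }
    return called;
}
```

Running run_initcalls_by_level(demo_table, 3) leaves the trace in core, device, late order even though the table lists the late entry first.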
31. Initcall definition
• Collect all the pointers for initcall functions at
certain sections
• Section name: “.initcall<lv>.init” (<lv> is the level ID)
• E.g. for “core_initcall”, the section will be “.initcall1.init”
#define __define_initcall(fn, id) \
	static initcall_t __initcall_##fn##id __used \
	__attribute__((__section__(".initcall" #id ".init"))) = fn; \
	LTO_REFERENCE_INITCALL(__initcall_##fn##id)
(include/linux/init.h)
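The same “collect function pointers in a dedicated section” technique can be tried in user space. The sketch below assumes GCC or Clang on an ELF target linked with GNU ld or lld, which provide __start_<name> and __stop_<name> symbols for sections whose names are valid C identifiers. DEMO_INITCALL, the section name demo_initcalls, and the demo_* functions are hypothetical; the kernel instead delimits its sections in its own linker script.

```c
typedef int (*initcall_t)(void);

/*
 * Simplified stand-in for __define_initcall: place a pointer to fn
 * into the (assumed) section "demo_initcalls".
 */
#define DEMO_INITCALL(fn) \
    static initcall_t demo_initcall_ptr_##fn \
    __attribute__((used, section("demo_initcalls"))) = fn

/* Section boundaries, generated by the linker (GNU ld / lld on ELF). */
extern initcall_t __start_demo_initcalls[];
extern initcall_t __stop_demo_initcalls[];

static int demo_hits;

static int demo_a(void) { demo_hits += 1;  return 0; }
static int demo_b(void) { demo_hits += 10; return 0; }

DEMO_INITCALL(demo_a);
DEMO_INITCALL(demo_b);

/* Walks the section and calls each pointer; returns the number of calls. */
int run_demo_initcalls(void)
{
    initcall_t *p;
    int n = 0;

    for (p = __start_demo_initcalls; p != __stop_demo_initcalls; p++) {
        (*p)();
        n++;
    }
    return n;
}
```

The caller never names demo_a or demo_b: adding another DEMO_INITCALL line anywhere in the program is enough to have it picked up, which is the property the kernel relies on.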
33. Special initcalls
• console_initcall
• Called from console_init (in start_kernel)
• security_initcall
• Called from security_init (in start_kernel)
• When used in loadable modules (not
recommended), it’s replaced by module_init
#else /* MODULE */
/* Don't use these in loadable modules, but some people do... */
#define early_initcall(fn) module_init(fn)
#define core_initcall(fn) module_init(fn)
...
(include/linux/init.h)
34. Initcall debug
• Kernel command-line option: “initcall_debug”
• Shows the debug message
• The kernel prints one message before calling each initcall function and
another after it returns; the second one includes the elapsed time
static int __init_or_module do_one_initcall_debug(initcall_t fn)
{
	...
	pr_debug("calling %pF @ %i\n", fn, task_pid_nr(current));
	calltime = ktime_get();
	ret = fn();
	rettime = ktime_get();
	...
	pr_debug("initcall %pF returned %d after %lld usecs\n",
		 fn, ret, duration);
	...
}
(init/main.c)
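The measure-around-the-call pattern used by initcall_debug can be reproduced in user space. The sketch below is an illustration only: call_timed and demo_initcall are hypothetical names, and C11 timespec_get (wall-clock time) is assumed in place of the kernel's ktime_get.

```c
#include <stdio.h>
#include <time.h>

typedef int (*initcall_t)(void);

/*
 * Calls fn, stores the elapsed time in microseconds in *usecs,
 * and returns fn's return value.
 */
int call_timed(initcall_t fn, const char *name, long long *usecs)
{
    struct timespec t0, t1;
    int ret;

    printf("calling %s\n", name);
    timespec_get(&t0, TIME_UTC);
    ret = fn();
    timespec_get(&t1, TIME_UTC);

    /* Seconds and nanoseconds are combined into microseconds. */
    *usecs = (long long)(t1.tv_sec - t0.tv_sec) * 1000000LL
           + (t1.tv_nsec - t0.tv_nsec) / 1000;
    printf("initcall %s returned %d after %lld usecs\n", name, ret, *usecs);
    return ret;
}

/* Trivial stand-in for an initcall function. */
static int demo_initcall(void)
{
    return 0;
}
```

A call such as call_timed(demo_initcall, "demo_initcall", &us) prints the two messages and leaves the duration in us, mirroring the pair of pr_debug lines in the excerpt above.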
36. How are the multiple cores started?
• Two types
(Figure: the two boot sequences after HW power-on)
• Type 1: only Core 0 starts the Linux kernel; while initializing SMP, it wakes up Core 1, Core 2, …
• Type 2: all the cores start; Core 1 and Core 2 stop and wait until Core 0 wakes them up
37. How are the multiple cores started?
• The first type
• x86, ARM, etc.
• (x86) The first processor (core) is determined by HW,
and called “the bootstrap processor” (BSP). The
remaining processor(s) (cores) are called “application
processor(s)” (APs).
• The second type
• PowerPC (some models), etc.
38. MP Detection
• How to detect the number of cores available in the
hardware?
• Firmware Information
• ACPI MADT (Multiple APIC Description Table) (x86)
• SFI (Simple Firmware Interface) (Xeon Phi)
• MP Configuration Table (Very old x86)
• DeviceTree (ARM)
• Or hardcoded (ARM…)
• Kernel boot parameters
• nosmp
• maxcpus=<n>
• Kernel configuration
• CONFIG_NR_CPUS
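How these three sources combine can be sketched as a small user-space function. It is a simplified model, not the kernel's logic: cpus_to_boot is a hypothetical name, the command line is assumed to consist of space-separated tokens, and the result is the number of CPUs brought up at boot (the boot CPU always runs, so the minimum is 1).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * detected:       number of CPUs reported by firmware (MADT, device tree, ...)
 * config_nr_cpus: compile-time limit (CONFIG_NR_CPUS)
 * cmdline:        kernel command line, space-separated tokens
 */
unsigned int cpus_to_boot(unsigned int detected, unsigned int config_nr_cpus,
                          const char *cmdline)
{
    unsigned int limit = config_nr_cpus;
    char buf[256];
    char *tok;

    /* strtok modifies its input, so work on a bounded copy. */
    snprintf(buf, sizeof(buf), "%s", cmdline);

    for (tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (strcmp(tok, "nosmp") == 0) {
            limit = 0;                          /* no secondary CPUs */
        } else if (strncmp(tok, "maxcpus=", 8) == 0) {
            unsigned long n = strtoul(tok + 8, NULL, 10);
            if (n < limit)
                limit = (unsigned int)n;
        }
    }

    if (detected < limit)
        limit = detected;
    return limit == 0 ? 1 : limit;              /* the boot CPU always runs */
}
```

For example, eight detected CPUs with “maxcpus=2” on the command line yield 2, and “nosmp” yields 1 regardless of the other inputs.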
39. MP Booting
• x86
• INIT IPI
• The sequence of INIT, INIT, STARTUP IPI.
• NMI (For CPU0)
• “This works to wake up soft offline CPU0 only”
• ARM
• “enable-method” property in the device tree
• Depends on the board (march)
• ARM64
• “enable-method” property in the device tree
• “spin-table”
• Cores spin at some memory area (outside the kernel). When a
value is written to the area, the core jumps to the written address.
• “psci” (Power State Coordination Interface)
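The “spin-table” handshake described above can be modelled in a single-threaded user-space sketch: the secondary core polls a release slot, and once the boot CPU writes an entry address there, the secondary jumps to it. All names here (demo_release_addr, release_secondary, secondary_poll_once, demo_secondary_entry) are hypothetical, and the conversion between a function pointer and uintptr_t is assumed to work as it does on common POSIX platforms. Real hardware waits between polls instead of returning to a caller.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef void (*entry_fn)(void);

/* The release slot; zero means "keep waiting". */
static _Atomic uintptr_t demo_release_addr;

static int demo_secondary_started;

/* Stand-in for the secondary entry point of the kernel. */
static void demo_secondary_entry(void)
{
    demo_secondary_started = 1;
}

/* Boot CPU side: publish the entry point for the waiting core. */
void release_secondary(entry_fn entry)
{
    atomic_store(&demo_release_addr, (uintptr_t)entry);
}

/*
 * Secondary side: one polling step.
 * Returns 0 if the core is still held, 1 if it jumped to the entry.
 */
int secondary_poll_once(void)
{
    uintptr_t addr = atomic_load(&demo_release_addr);

    if (addr == 0)
        return 0;
    ((entry_fn)addr)();
    return 1;
}
```

Before release_secondary is called, secondary_poll_once keeps returning 0; afterwards the next poll runs the entry function, which corresponds to the core leaving its holding area.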
40. AP Initialization
• Where does an AP start executing after it is woken up?
• x86
• First, “trampoline code”
• Switches from real mode to 32-bit or 64-bit mode
• Located in very low memory, since the new core starts in
real mode
• Then, jump to the secondary entrypoint
• 32-bit : startup_32_smp (arch/x86/kernel/head_32.S)
• 64-bit : secondary_startup_64 (arch/x86/kernel/head_64.S)
• ARM64
• First, “secondary_holding_pen” (arch/arm64/kernel/head.S)
• After being woken up, all the secondary cores are held in this function
• Then, secondary_startup
41. AP Initialization (2)
• Initializes the CPU state for the new core at the
assembly level
• Paging on
• Some special registers…
• Then, goes to the C code
• start_secondary (in x86, arch/x86/kernel/smpboot.c)
• secondary_start_kernel (in ARM/ARM64,
arch/arm{,64}/kernel/smp.c)
• Finally, it goes to the idle loop as the boot task
• cpu_startup_entry
42. start_secondary (x86)
# Function Category Description
1 cpu_init CPU Various CPU states
2 x86_cpuinit.early_percpu_clock_init
3 smp_callin SMP Notify the BSP of the AP’s boot-up
4 check_tsc_sync_target
5 set_cpu_online SMP Set the cpu_online_mask
6 x86_platform.nmi_init CPU
7 boot_init_stack_canary Debug
8 x86_cpuinit.setup_percpu_clockev
9 cpu_startup_entry
43. secondary_start_kernel (ARM64)
# Function Category Description
1 (Set the current mm to init_mm) MM
2 set_my_cpu_offset SMP Set per-cpu offset
3 cpu_set_reserved_ttbr0 CPU Set TTBR0 to the zero page
4 cpu_ops[cpu]->cpu_postboot CPU
5 notify_cpu_starting
6 smp_store_cpu_info
7 set_cpu_online
8 complete Notify the boot CPU of the core’s boot
9 cpu_startup_entry Go to the idle loop