BitVisor Summit 11「2. BitVisor on Aarch64」
2. Agenda
◼ Current requirements
◼ How VMM works on Aarch64
◼ BitVisor Aarch64 initialization
◼ Interrupt handling
◼ MMIO handling
◼ Multiple core support
◼ Current limitation
◼ Ongoing tasks
◼ QEMU bugs we found
◼ Demo
Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
3. Current requirements
◼ Armv8.1 or later
– Needs the Virtualization Host Extensions (VHE) for the process
implementation
◼ Generic Interrupt Controller v3 (GICv3)
– Guest interrupt injection
◼ EL3 and Power State Coordination Interface (PSCI)
– Firmware running in EL3
– For secondary core start-up
◼ UEFI environment and ACPI
– BitVisor currently relies on them
4. How VMM works on Aarch64
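[Figure: Exception levels. EL0 runs processes P0-P3, EL1 runs OS0 and OS1, EL2 runs the hypervisor, and EL3 runs the firmware. SVC enters EL1, HVC enters EL2, and SMC enters EL3.]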
5. How VMM works on Aarch64
[Figure: Standard vs. with VHE. Standard: Hypervisor at EL2, OS0 at EL1, P0 and P1 at EL0. With VHE: Host OS/Hypervisor at EL2, OS0 at EL1, P0, P1, and P2 at EL0.]
6. How VMM works on Aarch64
◼ Main system registers related to virtualization
– HCR_EL2
• Enable/Disable hypervisor
• Hypervisor behavior
• Register trapping
– VTTBR_EL2
• Stage-2 translation page table
– VTCR_EL2
• Stage-2 translation control
– VMPIDR_EL2
• Multiprocessor affinity value returned when the guest reads MPIDR_EL1
– VPIDR_EL2
• Processor ID value returned when the guest reads MIDR_EL1
7. How VMM works on Aarch64
◼ Page table basics on Aarch64
– Typically, an OS sets up TTBR0_EL1 for a process’s page
table and TTBR1_EL1 for the kernel page table
• Addresses with 0xF… prefix are mapped in TTBR1_EL1
– Normally, only TTBR0_EL2 is available at EL2
– With the VHE feature, we can make EL2 behave the same as
EL1
• TTBR1_EL2 becomes accessible
• System registers related to translation change their layouts
– Ex. TCR_EL2 bit definition becomes like TCR_EL1
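As a rough illustration of the TTBR0/TTBR1 split, the sketch below picks the translation table base from the upper bits of a virtual address. It assumes both halves use the same address width (for example 48 bits, T0SZ = T1SZ = 16) and ignores top-byte-ignore. The names select_ttbr and ttbr_sel are hypothetical and not taken from BitVisor.

```c
#include <assert.h>
#include <stdint.h>

enum ttbr_sel { TTBR0, TTBR1, TTBR_FAULT };

/* va_bits = 64 - TxSZ, assumed to be the same for both halves */
static enum ttbr_sel
select_ttbr (uint64_t va, unsigned int va_bits)
{
	uint64_t top;

	assert (va_bits >= 16 && va_bits < 64);
	top = va >> va_bits; /* Bits above the translated range */
	if (top == 0)
		return TTBR0; /* Low half of the address space */
	if (top == (UINT64_MAX >> va_bits))
		return TTBR1; /* High half, all upper bits are ones */
	return TTBR_FAULT; /* Neither half, translation fault */
}
```

With 48-bit addresses, the virtual base address 0xFFFF000000000000 falls in the TTBR1 half, which is why TTBR1_EL2 must be usable for a hypervisor mapped there.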
8. How VMM works on Aarch64
◼ Execution returns from the guest OS to EL2 from time to time
through exceptions
– Interrupt
• If the hypervisor chooses to route interrupts to EL2
– Trapping
• Register accesses
• Intermediate Physical Address translation fault
9. How VMM works on Aarch64
◼ When an exception occurs
– The entry point is one of the entries in the vector table
pointed to by VBAR_EL2
• Depending on the current running EL/exception type/mode
– The first thing to do is to save the current context
• General-purpose registers x0-x30
• Floating-point registers if necessary
• Other system registers if necessary
– In BitVisor's case (to switch between our processes and the guest)
» HCR_EL2
» ELR_EL2, SPSR_EL2, FAR_EL2, ESR_EL2
» SP_EL0, TPIDR_EL0
10. How VMM works on Aarch64
◼ Handling an exception
– Interrupt (Asynchronous)
• Interrupt controller handler
– Scheduling
– Forwarding to the guest
– Hand over to the appropriate device driver
– Trapping (Synchronous)
• Read ESR_EL2 for exception syndrome
• Handle them accordingly
◼ After handling the exception, return to the guest
– Restore the entry context
– eret instruction to return to either EL0 or EL1 depending on
SPSR_EL2
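A minimal sketch of the synchronous path, assuming the handler dispatches on the exception class (EC) field in bits [31:26] of ESR_EL2. The EC values are from the Arm architecture; the enum and function names are hypothetical and only illustrate the dispatch, not BitVisor's actual handler.

```c
#include <assert.h>
#include <stdint.h>

/* ESR_ELx.EC values defined by the Arm architecture */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK 0x3FULL
#define EC_HVC64 0x16 /* HVC executed in AArch64 state */
#define EC_SMC64 0x17 /* Trapped SMC in AArch64 state */
#define EC_SYSREG 0x18 /* Trapped MSR/MRS/system instruction */
#define EC_IABT_LOWER 0x20 /* Instruction abort from a lower EL */
#define EC_DABT_LOWER 0x24 /* Data abort from a lower EL */

/* Hypothetical handler categories */
enum sync_action { ACT_HVC, ACT_SMC, ACT_SYSREG, ACT_MEM_FAULT, ACT_UNKNOWN };

static unsigned int
esr_ec (uint64_t esr)
{
	return (unsigned int)((esr >> ESR_EC_SHIFT) & ESR_EC_MASK);
}

static enum sync_action
classify_sync (uint64_t esr)
{
	switch (esr_ec (esr)) {
	case EC_HVC64:
		return ACT_HVC;
	case EC_SMC64:
		return ACT_SMC;
	case EC_SYSREG:
		return ACT_SYSREG;
	case EC_IABT_LOWER:
	case EC_DABT_LOWER:
		return ACT_MEM_FAULT; /* Includes stage-2 faults */
	default:
		return ACT_UNKNOWN;
	}
}
```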
11. Early Aarch64 boot
◼ Relocation
– To be able to run code at any address, we need a table
structure that tells us where and what to adjust to get the final
addresses
• Usually for global variables
– In the linker script, we have a special section for this table
named .rela.dyn
…
.rela.dyn : AT (phys + (_rela_start - head)) {
_rela_start = .;
*(.rela)
*(.rela.*)
_rela_end = .;
}
…
12. Early Aarch64 boot
◼ Relocation
– It contains an array of the following structure
– For BitVisor, we currently handle only the R_AARCH64_RELATIVE
relocation type
struct rela_entry {
u64 r_offset; /* Location to apply relocation */
u64 r_info; /* Determine operation to perform */
u64 r_addend;
};
13. Early Aarch64 boot
◼ Relocation
– Resolve the R_AARCH64_RELATIVE type with the Delta(S) +
Addend operation according to the AArch64 ELF ABI document (aaelf64)
• S is the static address of a symbol
• Delta(S) is the difference between the static link
address of S and the execution address of S
– In other words
• If head_linktime_addr is 0, diff is head_runtime_addr
– BitVisor head_linktime_addr is currently 0
diff = head_runtime_addr - head_linktime_addr;
*(u64 *)(diff + r_offset) = diff + r_addend;
14. Early Aarch64 boot
◼ Relocation
int SECTION_ENTRY_TEXT
apply_reloc (phys_t base, struct rela_entry *start, struct rela_entry *end)
{
struct rela_entry *entries = start;
u64 *target;
unsigned int i, n_entries = end - start;
for (i = 0; i < n_entries; i++) {
switch (entries[i].r_info) {
case R_AARCH64_NONE:
break; /* Do nothing */
case R_AARCH64_RELATIVE:
/*
* Static head address is 0. That means Delta(S) is
* the runtime address.
*/
target = (u64 *)(base + entries[i].r_offset);
*target = base + entries[i].r_addend;
break;
default:
/* Currently deal with only R_AARCH64_RELATIVE */
return -1;
}
}
return 0;
}
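To try the relocation logic outside the hypervisor, the sketch below applies the same Delta(S) + Addend rule to a byte buffer standing in for a loaded image. It assumes a link-time base of 0, as stated above, so Delta(S) is simply the buffer address. relocate_image and struct rela are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define R_AARCH64_RELATIVE 1027 /* Value from the AArch64 ELF ABI */

struct rela {
	uint64_t r_offset; /* Location to patch, relative to link base 0 */
	uint64_t r_info; /* Relocation type (symbol index is 0 here) */
	int64_t r_addend; /* Link-time value of the target */
};

/* Returns 0 on success, -1 on an unsupported or out-of-range entry */
static int
relocate_image (uint8_t *image, size_t image_size, const struct rela *r,
		size_t n)
{
	uint64_t base, value;
	size_t i;

	assert (image != NULL);
	base = (uint64_t)(uintptr_t)image; /* Delta(S) when link base is 0 */
	for (i = 0; i < n; i++) {
		if (r[i].r_info != R_AARCH64_RELATIVE)
			return -1;
		if (image_size < sizeof value ||
		    r[i].r_offset > image_size - sizeof value)
			return -1;
		value = base + (uint64_t)r[i].r_addend;
		/* memcpy avoids unaligned stores on the host */
		memcpy (image + r[i].r_offset, &value, sizeof value);
	}
	return 0;
}
```

The bounds check is an addition for host-side safety and is not part of the slide's routine.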
15. Early Aarch64 boot
◼ Cross-compiling UEFI loader
– MinGW currently has no toolchain for Aarch64
– We switched to clang for cross-compiling instead
– Most of the code for the UEFI loader remains the same
◼ UEFI loader and bitvisor.elf relation
– UEFI loader looks for bitvisor.elf
– It then loads the first 64KB of bitvisor.elf for
bootstrapping
• Early initialization + loading the rest of BitVisor into memory
• .entry section of BitVisor must be within the first 64KB
– Once bootstrapping is done, we can jump to the newly
allocated BitVisor, and start the remaining initialization
16. Early Aarch64 boot
◼ Entering BitVisor code
– First, save the context at entry
entry:
…
adrp x9, _uefi_entry_ctx
add x9, x9, :lo12:_uefi_entry_ctx
stp x19, x20, [x9], #16
…
stp x29, x30, [x9], #16
…
mov x10, sp
…
mrs x10, TTBR0_EL2
str x10, [x9], #8
mrs x10, VBAR_EL2
str x10, [x9], #8
…
17. Early Aarch64 boot
◼ Entering BitVisor code
– Apply relocation to correct the addresses listed in the .rela.*
sections
entry:
…
adrp x0, head
add x0, x0, :lo12:head
adrp x1, _rela_start
add x1, x1, :lo12:_rela_start
adrp x2, _rela_end
add x2, x2, :lo12:_rela_end
bl apply_reloc64k
cmp x0, 0
bne .L1 /* Return if relocation fails */
…
18. Early Aarch64 boot
◼ Entering BitVisor code
– Then, enter uefi_entry()
• Save some UEFI routine addresses
• Load the entire BitVisor to a newly allocated location
• Set up virtual addressing
– Enable HCR_E2H so that TTBR1_EL2 becomes effective
– Set up the TTBR1_EL2 table for hypervisor memory mapping
– Set up MAIR_EL2, TCR_EL2, and SCTLR_EL2
– 0xFFFF000000000000 is our current virtual base address
• Return the virtual address base to the assembly code so that we
can jump to the new location using virtual addresses
19. Early Aarch64 boot
◼ Entering BitVisor code
– Jump to asm_bitvisor_entry()
– Apply relocation again with the new virtual address base, plus
additional setup for C code entry
/*
* x0 now contains new virtual memory base address.
* Next, calculate the position of asm_bitvisor_entry()
* relative to x0.
*/
adrp x21, head /* Old head */
add x21, x21, :lo12:head
adrp x11, asm_bitvisor_entry /* Same symbol as the :lo12: below */
add x11, x11, :lo12:asm_bitvisor_entry
sub x11, x11, x21
add x11, x11, x0
br x11 /* Jump to newly located asm_bitvisor_entry */
20. Early Aarch64 boot
◼ Before calling vmm_main()
– Initialize exception handling
void
bitvisor_entry (void)
{
uefi_booted = true;
/* Save this for secondary core start */
mair_host = mrs (MAIR_EL2);
tcr_host = mrs (TCR_EL2);
sctlr_host = mrs (SCTLR_EL2);
serial_init ();
disable_interrupt ();
init_default_exception_handler ();
init_exception ();
vmm_main ();
}
21. BitVisor Aarch64 initialization
◼ The initialization flow is roughly the same as in the current
BitVisor
– Mainly done through call_initfunc()
– Some Aarch64-specific initialization needs to be taken care of
• MMU/memory mapping/MMIO handling, GIC initialization, etc.
◼ The original code needs some adjustment
– Separate platform-specific code into separate files and
create interfaces for platform-independent code to call them
• Ex. in the process implementation
– x86 assembly in call_msgfunc0() is replaced by
process_exec()
– The actual implementation of process_exec() is in either
x86/process.c or aarch64/process.c
22. BitVisor Aarch64 initialization
◼ Entering guest
void
vm_start (void)
{
u64 orig_tcr, val;
…
/* Setting up EL1 environment */
msr (SP_EL1, _uefi_entry_ctx.sp);
msr (ESR_EL12, _uefi_entry_ctx.esr_el2);
msr (FAR_EL12, _uefi_entry_ctx.far_el2);
msr (MAIR_EL12, _uefi_entry_ctx.mair_el2);
…
msr (TCR_EL12, val);
msr (TPIDR_EL1, _uefi_entry_ctx.tpidr_el2);
msr (TTBR0_EL12, _uefi_entry_ctx.ttbr0_el2);
msr (VBAR_EL12, _uefi_entry_ctx.vbar_el2);
msr (CPACR_EL12, CPACR_ZEN (3) | CPACR_FPEN (3) | CPACR_SMEN (3));
val = (_uefi_entry_ctx.spsr_el2 & ~0xF) | 0x5; /* EL1h */
msr (SPSR_EL2, val);
msr (ELR_EL2, _uefi_entry_ctx.x30);
msr (CPTR_EL2, CPTR_FLAGS);
msr (HCR_EL2, HCR_FLAGS);
start_guest (&_uefi_entry_ctx);
}
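The SPSR_EL2 value decides where eret lands. The sketch below spells out the mode-field rewrite used in vm_start(): the low four bits M[3:0] are replaced while the remaining flags (for example the DAIF mask bits) are kept. The mode encodings are from the Arm architecture; spsr_with_mode and the SPSR_M_* names are hypothetical helpers for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* SPSR_ELx.M[3:0] encodings for AArch64 state */
#define SPSR_M_MASK 0xFULL
#define SPSR_M_EL0T 0x0ULL /* EL0 */
#define SPSR_M_EL1T 0x4ULL /* EL1 using SP_EL0 */
#define SPSR_M_EL1H 0x5ULL /* EL1 using SP_EL1 */
#define SPSR_M_EL2H 0x9ULL /* EL2 using SP_EL2 */

/* Replace only the target mode, keep all other saved state bits */
static uint64_t
spsr_with_mode (uint64_t spsr, uint64_t mode)
{
	assert ((mode & ~SPSR_M_MASK) == 0);
	return (spsr & ~SPSR_M_MASK) | mode;
}
```

For example, a value saved while running at EL2h with all of DAIF masked (0x3C9) becomes 0x3C5, so eret enters the guest at EL1 on SP_EL1 with the same mask bits.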
23. BitVisor Aarch64 initialization
◼ Entering guest
start_guest:
ldp x19, x20, [x0], #16
ldp x21, x22, [x0], #16
ldp x23, x24, [x0], #16
ldp x25, x26, [x0], #16
ldp x27, x28, [x0], #16
ldp x29, x30, [x0], #16
/* Clear all caller-saved registers */
eor x15, x15, x15
eor x14, x14, x14
eor x13, x13, x13
…
mov x0, #1 /* Return 1 as success upon entering the guest */
dsb ish
isb
eret
/* Prevent speculative execution */
dsb nsh
isb
24. Interrupt handling
◼ Physical interrupt and virtual interrupt
– The physical one is from an actual device
• Guest can receive physical interrupts if the hypervisor chooses
not to handle interrupts
– The virtual one is the one that the hypervisor injects into the
guest
• Cannot be trapped to EL2/3
– Interrupt type
• FIQ/vFIQ, high priority interrupt
• IRQ/vIRQ, low priority interrupt
• SError/vSError, erroneous memory accesses (Ex. Bus error)
– No non-maskable interrupt until Armv8.8-A/Armv9.3-A
• QEMU still does not support this
• No need to worry about this for now
25. Interrupt handling
◼ Injecting interrupts
– Via system registers
• We can set bits in HCR_EL2
– Set HCR_VF in HCR_EL2 to make vFIQ pending
– Set HCR_VI in HCR_EL2 to make vIRQ pending
– Set HCR_VSE in HCR_EL2 to make vSError pending
• Then, we need to emulate an interrupt controller
– Via GIC (our focus)
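For reference, the system-register route can be sketched as below. The bit positions are those of HCR_EL2 in the Arm architecture; the virtual pending bits only take effect when the matching routing bit (FMO/IMO/AMO) is set, which the assertions check. hcr_set_pending and enum vint_type are hypothetical names, and the function only computes the new register value rather than writing the register.

```c
#include <assert.h>
#include <stdint.h>

/* HCR_EL2 bit positions from the Arm architecture */
#define HCR_FMO (1ULL << 3) /* Route physical FIQ to EL2 */
#define HCR_IMO (1ULL << 4) /* Route physical IRQ to EL2 */
#define HCR_AMO (1ULL << 5) /* Route physical SError to EL2 */
#define HCR_VF (1ULL << 6) /* Virtual FIQ pending */
#define HCR_VI (1ULL << 7) /* Virtual IRQ pending */
#define HCR_VSE (1ULL << 8) /* Virtual SError pending */

enum vint_type { VINT_FIQ, VINT_IRQ, VINT_SERROR };

/* Returns the HCR_EL2 value with the requested virtual interrupt pending */
static uint64_t
hcr_set_pending (uint64_t hcr, enum vint_type type)
{
	switch (type) {
	case VINT_FIQ:
		assert (hcr & HCR_FMO);
		return hcr | HCR_VF;
	case VINT_IRQ:
		assert (hcr & HCR_IMO);
		return hcr | HCR_VI;
	case VINT_SERROR:
		assert (hcr & HCR_AMO);
		return hcr | HCR_VSE;
	}
	return hcr;
}
```

This route carries no interrupt ID, which is why an emulated interrupt controller is needed on top of it.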
26. Interrupt handling
◼ Overview
[Figure: The GIC delivers an interrupt to the hypervisor (IMO=1, FMO=1). The hypervisor saves context, identifies the interrupt, forwards it by injecting a virtual interrupt, and returns to the guest. The guest receives the virtual interrupt.]
27. Interrupt handling
◼ BitVisor GIC initialization
– Set HCR_FMO, HCR_IMO, and HCR_AMO in HCR_EL2
– Set ICH_HCR_EN in ICH_HCR_EL2
– Configure ICH_VMCR_EL2 to initialize vGIC states
– Need to change how we end an interrupt
• Make an EOI write only drop the priority
• The guest deactivates the interrupt in its own interrupt handling
◼ Interrupt handling
– Read ICC_IAR0/1_EL1 to get intid and acknowledge the
interrupt
– Schedule and run tasks
– Write ICC_EOIR0/1_EL1 with intid to drop priority
– Inject the interrupt into the guest
28. Interrupt handling
◼ Injecting interrupts with GIC
– Each core has a set of List Registers (LRs) for injecting virtual
interrupts
• ICH_LR0_EL2 up to (at most) ICH_LR15_EL2
– The actual number is platform specific
– To inject a virtual interrupt, simply write to one of the empty
ICH_LR registers
– The virtual interrupt is taken by the guest once we
return to the guest
29. Interrupt handling
◼ Injecting interrupts with GIC
static void
try_inject_vint (u64 intid, u64 rpr, uint group)
{
…
/* Currently vintid = pintid */
g = !!group;
val = ICH_LR_VINTID (intid) | ICH_LR_PINTID (intid) |
ICH_LR_PRIORITY (rpr) | ICH_LR_GROUP (g) | ICH_LR_HW |
ICH_LR_STATE (LR_STATE_PENDING);
enqueue_lr (currentcpu, val);
elrsr = mrs (ICH_ELRSR_EL2);
for (i = 0; elrsr != 0 && i < currentcpu->max_int_slot; i++) {
empty = !!(elrsr & 0x1);
if (empty) {
if (dequeue_lr (currentcpu, &lr_val))
set_lr (i, lr_val);
else
break;
}
elrsr >>= 1;
}
}
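The ICH_LR_* macros used above are not shown on the slide. The sketch below is one plausible way to define them, following the ICH_LR<n>_EL2 field layout in the GICv3 architecture (vINTID in bits [31:0], pINTID in [44:32], priority in [55:48], group in bit 60, HW in bit 61, state in [63:62]). The macro bodies, make_hw_lr, and first_empty_lr are assumptions for illustration, not BitVisor's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definitions following the GICv3 ICH_LR<n>_EL2 layout */
#define ICH_LR_VINTID(v) ((uint64_t)(v) & 0xFFFFFFFFULL)
#define ICH_LR_PINTID(p) (((uint64_t)(p) & 0x1FFFULL) << 32)
#define ICH_LR_PRIORITY(p) (((uint64_t)(p) & 0xFFULL) << 48)
#define ICH_LR_GROUP(g) (((uint64_t)(g) & 0x1ULL) << 60)
#define ICH_LR_HW (1ULL << 61)
#define ICH_LR_STATE(s) (((uint64_t)(s) & 0x3ULL) << 62)
#define LR_STATE_INVALID 0
#define LR_STATE_PENDING 1

/* Build a pending, hardware-backed entry with vintid == pintid */
static uint64_t
make_hw_lr (uint32_t intid, uint8_t priority, unsigned int group)
{
	return ICH_LR_VINTID (intid) | ICH_LR_PINTID (intid) |
		ICH_LR_PRIORITY (priority) | ICH_LR_GROUP (!!group) |
		ICH_LR_HW | ICH_LR_STATE (LR_STATE_PENDING);
}

/* In ICH_ELRSR_EL2, bit n set means List Register n is empty */
static int
first_empty_lr (uint64_t elrsr, unsigned int n_lrs)
{
	unsigned int i;

	assert (n_lrs <= 16);
	for (i = 0; i < n_lrs; i++)
		if (elrsr & (1ULL << i))
			return (int)i;
	return -1; /* All implemented LRs are in use */
}
```

With the HW bit set, the guest's deactivation of the virtual interrupt also deactivates the physical interrupt named by pINTID.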
30. Interrupt handling
◼ Injecting interrupts with GIC
static void
set_lr (uint lr_idx, u64 val)
{
switch (lr_idx) {
case 0:
msr (ICH_LR0_EL2, val);
break;
case 1:
msr (ICH_LR1_EL2, val);
break;
case 2:
msr (ICH_LR2_EL2, val);
break;
case 3:
msr (ICH_LR3_EL2, val);
break;
…
default:
panic ("lr_idx out of bound");
break;
}
}
31. MMIO handling
◼ Stage-1 and Stage-2 memory translation
– Stage-1 is for translating a virtual address (VA) to a physical
address (PA) or an intermediate physical address (IPA)
• For EL1, IPA is PA if stage-2 translation is not enabled
– Stage-2 is for translating the IPA to an actual PA
• Need to set up
– VTTBR_EL2 for stage-2 page tables
– VTCR_EL2 for stage-2 translation control
• In our case, IPA and PA are identity mapped
◼ In general, MMIO handling is done through stage-2
translation faults
– Not limited to MMIO addresses but any PA
32. MMIO handling
◼ Implementation concept
– During initialization, we create identity mapping for stage-2
address translation
• Not many page tables are needed as we can use 1GB
block mappings
– mmio_register() provides the PA and size we want to monitor
• We unmap the address range from stage-2 translation
• From the MMU implementation point of view, we break the
big block mapping down into smaller blocks and leave a hole at the address
– An exception is triggered once the guest accesses
monitored addresses
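To make the "hole" concrete, the sketch below computes which table entries cover a monitored address, assuming a 4KB granule (implied by the 1GB block mappings) where level 1 entries cover 1GB, level 2 entries cover 2MB, and level 3 entries cover 4KB. Splitting a 1GB block to isolate one page then needs one level 2 table and one level 3 table. The names are hypothetical, and the actual start level depends on the VTCR_EL2 configuration.

```c
#include <assert.h>
#include <stdint.h>

#define S2_PAGE_SHIFT 12 /* 4KB granule assumed */
#define S2_IDX_MASK 0x1FFULL /* 512 entries per table */

struct s2_index {
	unsigned int l1; /* 1GB block index */
	unsigned int l2; /* 2MB block index within the 1GB block */
	unsigned int l3; /* 4KB page index within the 2MB block */
};

static struct s2_index
s2_indices (uint64_t ipa)
{
	struct s2_index idx;

	idx.l1 = (unsigned int)((ipa >> 30) & S2_IDX_MASK);
	idx.l2 = (unsigned int)((ipa >> 21) & S2_IDX_MASK);
	idx.l3 = (unsigned int)((ipa >> S2_PAGE_SHIFT) & S2_IDX_MASK);
	return idx;
}

/* Number of 4KB pages that must be unmapped to cover [pa, pa + size) */
static uint64_t
pages_to_unmap (uint64_t pa, uint64_t size)
{
	uint64_t first, last;

	assert (size > 0);
	first = pa >> S2_PAGE_SHIFT;
	last = (pa + size - 1) >> S2_PAGE_SHIFT;
	return last - first + 1;
}
```

Because unmapping works at page granularity, a monitored range that is not page aligned also traps accesses to its neighbors in the same page, and the handler has to pass those through.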
33. MMIO handling
◼ Implementation concept
– We need to emulate those accesses
• Get instruction address from ELR_EL2 register
• Get fault address from FAR_EL2 register
• Decode the instruction to get source/destination registers
• Gather all necessary info and pass it to a handler
– Once we finish access handling
• Skip the instruction by adding 4 to ELR_EL2
– An instruction is 4 bytes
• Update guest registers in saved context if necessary
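The decode step can be sketched for one instruction class only: integer LDR/STR with an unsigned immediate offset (for example ldr x1, [x2] or str w3, [x4, #8]). A real decoder has to cover many more encodings. struct mmio_access and the function names are hypothetical and not taken from BitVisor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mmio_access {
	bool is_write;
	unsigned int size; /* Access size in bytes: 1, 2, 4, or 8 */
	unsigned int rt; /* Transfer register number, 31 means XZR/WZR */
};

/*
 * Load/store register (unsigned immediate), integer form:
 * bits [31:30] size, [29:27] = 111, [26] V = 0, [25:24] = 01,
 * [23:22] opc, [21:10] imm12, [9:5] Rn, [4:0] Rt
 */
static bool
decode_ldst_uimm (uint32_t insn, struct mmio_access *out)
{
	unsigned int opc;

	assert (out != NULL);
	if ((insn & 0x3F000000U) != 0x39000000U)
		return false; /* Not this instruction class */
	opc = (insn >> 22) & 0x3;
	if (opc > 1)
		return false; /* Signed loads and prefetch not handled */
	out->is_write = (opc == 0);
	out->size = 1U << (insn >> 30);
	out->rt = insn & 0x1F;
	return true;
}

/* Every A64 instruction is 4 bytes, so skipping one is ELR + 4 */
static uint64_t
skip_instruction (uint64_t elr)
{
	return elr + 4;
}
```

For a read, the handler's result is written to the saved copy of register rt before returning to the guest. For a write, the value is taken from it, with register 31 reading as zero.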
34. Multiple core support
◼ On platforms that support PSCI, multiple core support
is straightforward
◼ When the guest wants to start a secondary core
– It issues an SMC instruction
– The call follows the SMC Calling Convention (SMCCC)
• smc #0
• x0: Function ID, x1 onward: Parameters
◼ BitVisor simply needs to intercept SMC instructions
– Set HCR_TSC bit in HCR_EL2 register
– Check for CPU_ON Function ID
– Save entry_address and context_id information
• entry_address is a physical address
• context_id appears in x0 on secondary core entry
35. Multiple core support
◼ BitVisor then issues SMC on behalf of the guest
– Copy guest’s CPU_ON command
– Replace entry_address and context_id with our values
◼ Secondary core entry
– Set up MMU and stack
– Jump to the designated virtual address to continue per-core
initialization
– Finally, we start the guest at its entry_address with its
context_id in x0
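The interception can be sketched as follows. The CPU_ON function IDs and argument registers (x1 target CPU, x2 entry point address, x3 context ID) are from the PSCI specification. The structures, intercept_cpu_on, and the vmm_entry/vmm_context parameters are hypothetical stand-ins for whatever BitVisor actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* PSCI CPU_ON function IDs (SMC32 and SMC64 calling conventions) */
#define PSCI_CPU_ON_SMC32 0x84000003U
#define PSCI_CPU_ON_SMC64 0xC4000003U

struct smc_regs {
	uint64_t x0; /* Function ID */
	uint64_t x1; /* target_cpu (MPIDR affinity) */
	uint64_t x2; /* entry_point_address */
	uint64_t x3; /* context_id */
};

struct guest_boot_info {
	uint64_t entry_address;
	uint64_t context_id;
};

/*
 * If the trapped SMC is CPU_ON, remember the guest's entry point and
 * context ID, then substitute the hypervisor's own values so that the
 * secondary core enters the hypervisor first.  Returns true if rewritten.
 */
static bool
intercept_cpu_on (struct smc_regs *r, struct guest_boot_info *saved,
		  uint64_t vmm_entry, uint64_t vmm_context)
{
	uint32_t fid;

	assert (r != NULL && saved != NULL);
	fid = (uint32_t)r->x0;
	if (fid != PSCI_CPU_ON_SMC32 && fid != PSCI_CPU_ON_SMC64)
		return false; /* Forward other calls unchanged */
	saved->entry_address = r->x2;
	saved->context_id = r->x3;
	r->x2 = vmm_entry; /* Must be a physical address */
	r->x3 = vmm_context;
	return true;
}
```

After the rewrite, the hypervisor issues the SMC itself with the modified registers and returns the firmware's result to the guest in x0.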
36. Current limitation
◼ No Aarch32 for now
– For simplicity
◼ No Suspend/Resume for now
– Going to implement later
– PSCI SMC handling
◼ No EL0 debug shell through hypercall
– hvc instruction is not available at EL0
– Need to find an alternative
• Virtual serial?
37. Current limitation
◼ No 52-bit address support for now
– BitVisor itself does not need 52-bit addresses
– To allow the guest OS to use this, we need either
• 64KB page size
– Quite a waste of memory for our use cases
• Armv8.7 FEAT_LPA2 for 4KB and 16KB page size
– 4KB page size needs 5-level page table
– We have seen no real hardware that supports this yet
– Not the current priority
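The 5-level figure follows from simple arithmetic: each table occupies one page of 8-byte descriptors, so a level resolves (page_shift - 3) address bits, and the page offset covers page_shift bits. The sketch below computes the resulting level count for a single-stage walk without concatenated tables. translation_levels is a hypothetical helper.

```c
#include <assert.h>

/* Levels needed to resolve addr_bits with pages of 2^page_shift bytes */
static unsigned int
translation_levels (unsigned int addr_bits, unsigned int page_shift)
{
	unsigned int bits_per_level, remaining;

	assert (page_shift > 3 && addr_bits > page_shift);
	bits_per_level = page_shift - 3; /* 8-byte descriptors */
	remaining = addr_bits - page_shift; /* Bits the tables must resolve */
	return (remaining + bits_per_level - 1) / bits_per_level;
}
```

With 4KB pages, 48 bits need 4 levels and 52 bits need 5. With 16KB pages, 52 bits need 4 levels, and with 64KB pages only 3.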
38. Ongoing tasks
◼ Integrating Aarch64 implementation with the
mainstream
– Finalizing interfaces for platform specific implementation
– Cross-compiling implementation
39. QEMU patches
◼ e1000e: Fix possible interrupt loss when using MSI
– There was a logic error that could delay an MSI indefinitely
◼ target/arm: honor HCR_E2H and HCR_TGE in
arm_excp_unmasked()
– Found this problem when trying to run a process in EL0 with
interrupts masked
• This is valid according to the architecture manual
• It was impossible before this patch
◼ target/arm: Honor HCR_E2H and HCR_TGE in
ats_write64()
– AT instruction implementation forgot to honor HCR_E2H and
HCR_TGE
– Found this because there was a weird memory error panic
40. QEMU patches
◼ e1000e: Fix possible interrupt loss when using MSI
– https://github.com/qemu/qemu/commit/dd0ef128669c29734a197ca9195e7ab64e20ba2c
◼ target/arm: honor HCR_E2H and HCR_TGE in
arm_excp_unmasked()
– https://github.com/qemu/qemu/commit/c939a7c7b93ee44a4963fabe81454e1f956ecd4b
◼ target/arm: Honor HCR_E2H and HCR_TGE in
ats_write64()
– https://github.com/qemu/qemu/commit/638d5dbd78ea81c943959e2f2c65c109e5278a78