Kexec (and kdump) are valuable multi-purpose mechanisms in Linux that are currently unavailable to Xen guests. What makes Xen guests so special, and how does Xen differ from other hypervisors here? Can we solve these issues in the foreseeable future? This presentation gives an overview of several approaches to making kexec possible for PVHVM Linux guests and describes the difficulties we face in implementing them. I will show which changes need to be made to the hypervisor, the toolstack, and the Linux kernel to make things work. While the talk focuses mainly on PVHVM x86 guests, I will also try to cover the implications for PVH and ARM.
Seattle2015 xen
1. PVHVM Linux guest: why doesn't kexec work?
Vitaly Kuznetsov
Red Hat
Xen Developer Summit, 2015
2. Why?
● We support Red Hat Enterprise Linux.
● Bare hardware, virtualized and cloud environments, ...
● Kernel issues happen.
● Analyse stack traces.
● In complicated cases use kdump!
3. Kexec/kdump
● “kexec … is a mechanism of the Linux kernel that
allows "live" booting of a new kernel "over" the
currently running kernel”
● Kdump uses kexec:
● Some memory is reserved at boot (crashkernel=).
● The crash kernel and initrd are loaded into that area.
● On a crash we trigger the crash kernel's boot.
● The crash initrd dumps all of the domain's memory and reboots.
● You have a crash file to analyse! (profit!!!)
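The reservation step is requested on the kernel command line. As a minimal sketch, assuming only the simple `crashkernel=size[KMG][@offset[KMG]]` form with upper-case suffixes (range-based and other extended forms are deliberately not handled), the request can be decoded like this. `parse_crashkernel_simple` is a hypothetical helper written for illustration; it is not the kernel's own parser.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_SIMPLE = re.compile(r"^(\d+)([KMG]?)(?:@(\d+)([KMG]?))?$")


def parse_crashkernel_simple(cmdline):
    """Return (size_bytes, offset_bytes or None), or None if absent/unsupported."""
    for token in cmdline.split():
        if not token.startswith("crashkernel="):
            continue
        match = _SIMPLE.match(token[len("crashkernel="):])
        if match is None:
            return None  # extended syntax: out of scope for this sketch
        size = int(match.group(1)) * _UNITS[match.group(2)]
        offset = None
        if match.group(3) is not None:
            offset = int(match.group(3)) * _UNITS[match.group(4)]
        return size, offset
    return None
```

For example, `parse_crashkernel_simple("ro crashkernel=128M@16M")` yields a 128 MiB reservation at a 16 MiB offset, while a command line without the parameter yields `None`.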
5. Issues with Kexec on PVHVM
● Structures set up by the previous kernel cause problems,
and there is no good way to hand that state over to the kexec kernel.
● And we need these interfaces working!
● Xen/guest interfaces we need to re-establish:
● shared_info frame (XENMAPSPACE_shared_info)
● VCPU_info (VCPUOP_register_vcpu_info)
● Event channels (EVTCHNOP_bind_*, ABI)
● + Emuirq/pirq mappings (PHYSDEVOP_map_pirq)
● Granted pages
6. shared_info page:
● 4 KiB page that belongs to the Xen hypervisor.
● Required for event delivery; vcpu_info for the first 32 VCPUs
lives here.
● At boot the guest chooses one of its own pages to sacrifice.
● XENMEM_add_to_physmap(XENMAPSPACE_shared_info)
frees the guest's frame and maps shared_info there.
● The kexec kernel does the same for another frame → we
get a hole, because shared_info is unmapped from its
previous location.
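The hole can be illustrated with a toy model of the guest physmap. This is only a sketch under simplifying assumptions: `ToyPhysmap` and `map_shared_info` are hypothetical names, frames are plain dictionary entries, and nothing here calls into Xen.

```python
SHARED_INFO = "xen:shared_info"  # marker for the hypervisor-owned page


class ToyPhysmap:
    """Toy guest physmap: gfn -> backing page, or None for a hole."""

    def __init__(self, nr_frames):
        self.frames = {gfn: "ram" for gfn in range(nr_frames)}
        self.shared_info_gfn = None

    def map_shared_info(self, gfn):
        """Model XENMEM_add_to_physmap(XENMAPSPACE_shared_info) at gfn."""
        if self.shared_info_gfn is not None:
            # The page moves; the RAM freed earlier is not given back.
            self.frames[self.shared_info_gfn] = None
        self.frames[gfn] = SHARED_INFO  # the guest's own frame is freed
        self.shared_info_gfn = gfn

    def holes(self):
        """Return the sorted list of gfns with nothing behind them."""
        return sorted(g for g, page in self.frames.items() if page is None)
```

If the first kernel calls `map_shared_info(3)` and the kexec kernel later calls `map_shared_info(5)`, `holes()` reports frame 3: the first kernel gave up its RAM there and nothing replaces it.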
7. Event channels:
● Already bound event channels
● “(XEN) event_channel.c:370:d2v0 EVTCHNOP failure: error -17”
● 2-level → FIFO ABI switch at boot
● Mapped control block, event array pages.
● Some INTERDOMAIN channels are set up by
the toolstack:
● Xenstore, xenconsole, ...
● EVTCHNOP_reset resets everything; there is no
way back.
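Error -17 in the log line is -EEXIST, which is what a second bind of an already bound channel would be expected to return. The sketch below is a toy model of both problems on this slide: rebinding fails, and a blanket reset also closes the channels the toolstack created. `ToyEventChannels` and its methods are hypothetical names, not Xen's data structures.

```python
import errno


class ToyEventChannels:
    """Toy per-domain event channel table (not Xen's real data structures)."""

    def __init__(self):
        self.ports = {}       # port -> description of the binding
        self.virq_port = {}   # (vcpu, virq) -> port
        self.next_port = 1

    def _alloc(self, what):
        port = self.next_port
        self.next_port += 1
        self.ports[port] = what
        return port

    def toolstack_bind_interdomain(self, name):
        """Channel created on the guest's behalf, e.g. xenstore or console."""
        return self._alloc(("interdomain", name))

    def bind_virq(self, vcpu, virq):
        """Return a new port, or -EEXIST if this VIRQ is already bound."""
        if (vcpu, virq) in self.virq_port:
            return -errno.EEXIST
        port = self._alloc(("virq", vcpu, virq))
        self.virq_port[(vcpu, virq)] = port
        return port

    def reset(self):
        """Model EVTCHNOP_reset: every port is closed, toolstack ones too."""
        self.ports.clear()
        self.virq_port.clear()
```

In this model the kexec kernel's repeated `bind_virq(0, 0)` returns -17 until `reset()` is called, but after `reset()` the xenstore port is gone as well, and the guest has no means of recreating it on its own.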
8. Grant pages:
● Memory sharing mechanism in Xen.
● We can't do anything guest-side:
● Forcibly unmapping a page from the backend domain
will crash it.
● Requesting new pages requires additional memory.
● Some grants are “persistent”.
● Maybe not an issue for kdump, because its memory
region is separate, but
● we still need functional backends for the kexec kernel!
10. “Obvious solution”
● Implement a set of hypercalls to tear all interfaces down:
● reset_vcpu_info
● evtchn_switch_to_2l
● unmap_shared_info
● do_something_with_granted_pages
● …
● Good from the “if there is a way to set something up, there
should be one to tear it down” point of view.
● Good for hypervisor testing :-)
11. “Obvious solution”
● Issues:
● The domain needs to follow a special protocol – what if
it doesn't?
● The granted pages story is complicated.
● Not all bits are set up by the domain itself.
● Too many possible issues (including security).
12. “New domain with the same memory”
● Destroy the original domain, leaving its memory intact.
● Create a new domain, reassign all memory pages, copy
vcpu contexts.
● Benefits:
● No cumbersome teardown required!
● The migration path is reused!
● Supportability: new interfaces/objects should “just
work”.
13. “New domain with the same memory”
● Issues:
● Memory reassignment appears to be
cumbersome :-(
● Superpages, PoD, mem_access issues.
● No m2p on ARM.
● Non-trivial toolstack part that duplicates migration code.
● Too complicated.
14. “Reset everything”
● No cumbersome memory reassignment.
● Explicit list of interfaces to reset with one hypercall:
● shared_info, vcpu_info, event channels,
pirq_to_emuirq, ioreq servers.
● Toolstack involvement required:
● Restart device model.
● Reopen xenstore/xenconsole event channels.
● ..
● Hypervisor maintainers like it :-)
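The shape of this approach can be sketched as one call that walks an explicit list of interfaces and leaves guest memory alone, followed by the toolstack's share of the work. This is a toy model only: the dictionary layout, `toy_soft_reset`, and the step strings are assumptions made for illustration and do not reflect the actual hypercall interface.

```python
import copy

# Pristine values for the interfaces named on the slide (toy representation).
RESET_STATE = {
    "shared_info_gfn": None,
    "vcpu_info": {},
    "event_channels": {},
    "pirq_to_emuirq": {},
    "ioreq_servers": [],
}

# Work that remains on the toolstack side after the reset.
TOOLSTACK_STEPS = (
    "restart device model",
    "reopen xenstore event channel",
    "reopen xenconsole event channel",
)


def toy_soft_reset(domain):
    """Reset the listed interfaces in one call; guest memory is left alone."""
    for key, pristine in RESET_STATE.items():
        domain[key] = copy.deepcopy(pristine)
    return list(TOOLSTACK_STEPS)
```

Keeping the list explicit is the point of the design: anything not named in it, guest memory in particular, is untouched, so the kexec kernel image already loaded there survives the reset.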
15. “Reset everything”
● Granted pages: let's do (almost) nothing!
● Remove the domain from xenstore and add it back;
all backends are supposed to release their
mappings.
● Xenconsoled doesn't release its mapping (but that's
fine).
● Special debug print to find future issues.
● Hunt for misbehaving backends (if there are any)!
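The idea of relying on backends to let go, and then reporting whatever is still mapped, can be sketched with a toy model. `ToyBackend`, `cycle_xenstore_entries`, and the grant reference numbers are hypothetical; the returned dictionary stands in for the debug print mentioned above.

```python
class ToyBackend:
    """Toy backend that may or may not drop its grant mappings on removal."""

    def __init__(self, name, releases_on_removal=True):
        self.name = name
        self.releases_on_removal = releases_on_removal
        self.mapped_grants = set()

    def on_domain_removed(self):
        """Called when the frontend domain's xenstore entries disappear."""
        if self.releases_on_removal:
            self.mapped_grants.clear()


def cycle_xenstore_entries(backends):
    """Remove the domain from xenstore, then report grants still mapped."""
    for backend in backends:
        backend.on_domain_removed()
    # Stand-in for the debug print: which backends still hold mappings?
    return {b.name: sorted(b.mapped_grants) for b in backends if b.mapped_grants}
```

With a well-behaved block backend and a console daemon that keeps its page, the report contains only the console entry, which matches the "but that's fine" case; any other name showing up would be a backend worth hunting down.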
17. Current status and future work
● [PATCH v10 00/11] “toolstack-assisted approach to
PVHVM guest kexec” is out, waiting for reviewers!
● … and testers too!
● PVH (as "HVM without device model") should "just
work".
● Not tested; minor issues are possible.
● The ARM-specific part is an -ENOSYS stub for now.
● The shared_info page needs handling (same as x86).
● Some GIC cleanup?