
XPDSS19: Live-Updating Xen - Amit Shah & David Woodhouse, Amazon


Xen currently has two major mechanisms to maintain security while hosting untrusted VMs without causing disruption to those guests: live patching, and live migration. We introduce a third method: live updating Xen. A live-update operation involves loading of the newly-staged hypervisor into RAM, the currently-running Xen serializing its state, and then transferring control to the newly-staged Xen, all without disrupting running instances, beyond a little downtime when neither hypervisor is running guest vCPUs.

We present a proposal on the design of such a feature, and invite comments and feedback.

Published in: Technology

  1. Live-Updating Xen
     Amit Shah <aams@amazon.com>, David Woodhouse <dwmw@amazon.com>
     10 July 2019
     © 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Live Update
     • Update the running hypervisor with a new build
     • Gracefully transfer running guests to the new Xen
     • Guests may only notice a small pause
  3. Why Do This
     • AWS operates a large fleet of hosts
       – Not much opportunity to reboot
         ● Long-running guests
       – Operationally, we need to be ready to fix customer pain
     • Roll out fixes
       – Bug fixes
       – Security fixes
     • Bring new features
     • Maintenance
       – Reduce number of hypervisor versions needing support
     • Development
       – Reduce development time through faster testing and prototyping
  4. Existing Techniques
     • Live Patching
       – Works well, operationally proven
       – Requires backporting to multiple supported hypervisor versions
       – Effort required increases with patch complexity
       – Recurring work for each livepatch
     • Live Migration
       – Guest workload-dependent
       – Not applicable for all device models in use
  5. Live Update
     • Currently restricting to minor version updates
       – e.g. 4.11.1 → 4.11.2
     • Considering just hypervisor updates
       – Dom0, userspace, etc., out of scope for now
     • Most of this talk is a request for comments
       – General idea and design is being presented here
       – Prototyping on these ideas started recently
       – Design is deliberately fluid to incorporate feedback and various use cases
  6. Terminology
     • Running Xen
       – Current hypervisor on a host
       – The “source” in the live-update operation
     • Target Xen
       – New build of hypervisor
       – The “target” in the live-update operation
  7. General Idea
     • Load Target Xen in memory
     • Initiate Live-Update
       – Pause all domains
       – Mask interrupts for domains
       – Serialize domain states
       – Serialize Running Xen state
       – Jump to Target Xen
     • Target Xen takes over
       – Deserialize state
       – Unpause domains
       – Unmask interrupts
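
The ordering on this slide can be sketched as a toy model. This is only an illustration of the sequence: the `Domain` class, both function names, and the dictionary used as "state" are hypothetical and do not correspond to Xen data structures or interfaces.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    # Hypothetical stand-in for a guest domain; not a Xen structure.
    domid: int
    paused: bool = False
    irqs_masked: bool = False


def running_xen_prepare(domains):
    """Running Xen side: pause, mask, serialize, then hand over."""
    steps = []
    for d in domains:
        d.paused = True
    steps.append("pause domains")
    for d in domains:
        d.irqs_masked = True
    steps.append("mask interrupts")
    state = {"domains": [(d.domid, d.paused, d.irqs_masked) for d in domains]}
    steps.append("serialize domain states")
    state["xen"] = {"placeholder": True}  # stands in for hypervisor-wide state
    steps.append("serialize Running Xen state")
    steps.append("jump to Target Xen")
    return state, steps


def target_xen_resume(state):
    """Target Xen side: deserialize, unpause, unmask."""
    domains = [Domain(domid, paused, masked)
               for domid, paused, masked in state["domains"]]
    steps = ["deserialize state"]
    for d in domains:
        d.paused = False
    steps.append("unpause domains")
    for d in domains:
        d.irqs_masked = False
    steps.append("unmask interrupts")
    return domains, steps
```

The sketch unpauses every domain unconditionally, which is a simplification; a real design would need to track which domains were already paused before the update began.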
  10. Load Target Xen in Memory
      • crashkernel area and kexec
      • Load new Xen binary in crashkernel region
        – kexec -l
      • To load currently, we have to:
        – $ zcat /boot/xen-4.12.gz > xen-4.12
        – $ echo -en '\x3' | dd of=xen-4.12 bs=1 seek=16 conv=notrunc
        – $ kexec -l xen-4.12 --append="..." --module "/boot/vmlinuz ..." --module /boot/initramfs -d --mem-min=0x2000000
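
The `dd` step above overwrites one byte at offset 16 of the decompressed image. In an ELF file that offset is the start of the `e_type` field (it follows the 16-byte `e_ident` array), and the value 3 is `ET_DYN`. The slide does not state the purpose; the sketch below assumes the intent is to mark the image as relocatable for the loader, and the function name is hypothetical.

```python
import struct

ET_EXEC = 2          # ELF e_type value: executable file
ET_DYN = 3           # ELF e_type value: shared object / position-independent
E_TYPE_OFFSET = 16   # e_type follows the 16-byte e_ident array


def mark_elf_as_dyn(image: bytearray) -> None:
    """Write byte 0x03 at offset 16, as the dd command on the slide does.

    Assumption: the image is little-endian, so this byte is the low byte
    of e_type and the result reads back as ET_DYN.
    """
    if bytes(image[:4]) != b"\x7fELF":
        raise ValueError("not an ELF image")
    image[E_TYPE_OFFSET] = ET_DYN


def read_e_type(image: bytes) -> int:
    """Return e_type as a little-endian 16-bit integer."""
    return struct.unpack_from("<H", image, E_TYPE_OFFSET)[0]
```

For example, given a header whose `e_type` is `ET_EXEC`, `mark_elf_as_dyn` leaves every other byte untouched and `read_e_type` then returns `ET_DYN`.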
  11. Solutions to Challenges in Load
      • Patches merged for kexec-tools v2.0.20
        – Adds multiboot2 support
        – Gets us relocation support: can now load in crashkernel area
        – Don’t use lowmem areas
        – Can directly use the ELF binary
        – From Varad Gautam
          ● “[PATCH 1/2] elf: Support ELF loading with relocation”
          ● “[PATCH 2/2] x86: Support multiboot2 images”
  13. Jump to Target Xen
      • Needs new hypercall, just `kexec -e` not sufficient
        – As this needs to be an atomic operation with pausing dom0
      • Do not drop to Real Mode
        – Start in protected mode (or, later, even long mode)
        – Stop using real-mode low memory
        – Patches on the list from David
          ● “[RFC PATCH 0/7] Clean up x86_64 boot code”
      • Skip startup
  14. Consume State in Target Xen
      • Two ways to transfer state
        – Pointer to memory region via kexec command line
        – Multiboot module with state
      • Deserialize state from Running Xen
        – Xen state; domain state
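
Whichever transfer mechanism is chosen, Target Xen has to recognise and validate the blob before consuming it. The sketch below shows one possible shape: a small fixed header followed by a payload. The magic value, header layout, JSON payload, and function names are all assumptions made for illustration; the slides do not define a format.

```python
import json
import struct

MAGIC = b"LUPD"                    # hypothetical marker, not defined by Xen
HEADER = struct.Struct("<4sHI")    # magic, format version, payload length


def serialize_state(xen_state, domain_states, version=1):
    """Pack Xen-wide and per-domain state behind a fixed-size header."""
    payload = json.dumps({"xen": xen_state, "domains": domain_states}).encode()
    return HEADER.pack(MAGIC, version, len(payload)) + payload


def deserialize_state(blob, supported_version=1):
    """Validate the header, then return (xen_state, domain_states)."""
    magic, version, length = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("no live-update state found")
    if version != supported_version:
        raise ValueError("unsupported state format version")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated state blob")
    data = json.loads(payload)
    return data["xen"], data["domains"]
```

Rejecting an unknown version up front mirrors the talk's restriction to minor version updates, where state layout is not expected to change.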
  16. Persisting Guest State
      • We have Live Migration
        – For minor version upgrades, state changes not expected
        – Just slightly different from LM: migration across time, not space
      • Persist memory
      • Persist domain structures
      • Collect state information
        – domheap, page tables, start_info, shared_info_frame
  17. Persisting Host State
      • IOMMU state
        – Mask interrupts
        – DMA requests continue as normal
      • Memory regions
        – Xen memory, domain memory spread out
        – Have to ensure these areas are not overwritten
          ● And carefully relocate Target Xen later
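
The constraint on memory regions amounts to an interval-overlap check: the range Target Xen is loaded into (or later relocated to) must not intersect any range that has to survive the update. A minimal sketch, assuming half-open `(start, end)` address ranges; the function names and example addresses are made up.

```python
def overlaps(a, b):
    """True if half-open ranges a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def find_conflicts(target_range, preserved_ranges):
    """Return the preserved ranges that the target range would overwrite."""
    return [r for r in preserved_ranges if overlaps(target_range, r)]


# Illustrative addresses only.
preserved = [(0x0200_0000, 0x0400_0000), (0x1000_0000, 0x1800_0000)]
safe = find_conflicts((0x0400_0000, 0x0500_0000), preserved)
clash = find_conflicts((0x03F0_0000, 0x0410_0000), preserved)
```

Here `safe` is empty because the candidate range starts exactly where the first preserved range ends, while `clash` contains the first preserved range.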
  18. Prototyping in Persisting Guest State
      • Ongoing work for a PV guest
      • Modified `xl save` workflow to start serialization
        – Skip memory scrubbing
        – Allow domain destruction
        – Store pointers in well-known location
      • Launch new domain
        – Re-use state information from previously-destroyed domain
        – See if guest continues running
      • Later
        – Extend this to Dom0
        – HVM domains
        – Across kexec
  19. Things to be Aware of (1/2)
      • Pause time
        – Should not result in guest noticing much of this activity
        – A decent estimate could be “network connections don’t time out”
          ● 3 TCP RTT
        – About 1-2 seconds OK to begin with
        – Leaving memory pages in RAM, not initializing IOMMU, skipping startup: all help
      • Interrupts could get lost
        – May have to find a way to queue them and reinject
      • Domain states
        – Already-paused domains should remain paused
      • Ordering of pausing/masking activities during setup phase
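
One way to satisfy "already-paused domains should remain paused" is to treat pausing as a counter rather than a flag, so the live update adds one pause reference and later removes only its own. The sketch below assumes such a reference-counted scheme; the class and function names are hypothetical and this is not presented as the design from the talk.

```python
class Domain:
    """Toy domain with a pause reference count (hypothetical model)."""

    def __init__(self, domid, pause_count=0):
        self.domid = domid
        self.pause_count = pause_count

    @property
    def running(self):
        return self.pause_count == 0


def pause_all(domains):
    """Live update takes one pause reference on every domain."""
    for d in domains:
        d.pause_count += 1


def unpause_all(domains):
    """Live update drops only the reference it took."""
    for d in domains:
        d.pause_count -= 1
```

A domain that an administrator had paused before the update starts with a count of 1, goes to 2 during the update, and returns to 1 afterwards, so it stays paused without any special-casing.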
  20. Things to be Aware of (2/2)
      • Host Time: Target Xen re-initializes RTC
        – This can be off compared to Running Xen
      • Guest Timekeeping
        – pvclock sync
      • Internal state / struct changes
        – Handling major version updates
        – Can also sneak in for security fixes
        – Thoughts for the future
          ● Static annotation in source code / compile-time warnings
      • Controlling capabilities per domain
        – Currently, spread out: xen cmdline, global config, domain config, compile-time
        – Control feature advertisements at launch based on Running Xen capabilities
  21. More Information
      • Discussions ongoing on IRC and devel list
      • Sending out RFC patches as we write them
      • Design session
      • Wiki page
        – https://wiki.xen.org/wiki/Live-Updating_Xen
        – Links to WIP trees
        – JIRA board
        – General status information
