Stateless Hypervisors at Scale
Antony Messerli
amesserl@rackspace.com
@ntonym
ABOUT MYSELF
• Almost 14 years with Rackspace
• Hardware Development for Rackspace
• Rackspace Cloud Servers
• Slicehost
• OpenStack Public Cloud
• R&D and Prototyping
• Twitter: @ntonym
• GitHub: antonym
• IRC (freenode): antonym
• OpenStack Public Cloud in production since August 2012
• Six geographic regions around the globe
• Tens of thousands of hypervisors (over 340,000 cores, just over 1.2 petabytes of RAM)
• Over 10 different hardware platforms
• Primarily use the Citrix XenServer hypervisor today
TRADITIONAL HYPERVISORS
Components of a Hypervisor in OpenStack
• Bare metal
• Operating System
• Configuration Management (Ansible, Chef, Puppet)
• Nova Compute
• Instance settings
• Instance virtual disks
Hypervisor’s Mission
Needs to:
• Be stable
• Be secure
• Provision and run instances reliably
• Be consistent with other servers
Problems With Hypervisors At Scale
• Operating System
‣ Multiple versions of XenServer
‣ Each version has its own variation of patches, kernels, or Xen hypervisor
‣ More variations = More work
• Server Hardware
‣ Incorrect BIOS settings, firmware, or modules can cause different behaviors
• Operational Issues
‣ OpenStack or hypervisor bugs can leave things in undesirable states.
How We Solved Some Of Those Problems
• Factory-style provisioning using iPXE and Ansible
• Consolidated hypervisor versions to reduce variations
• Attempt to correct inconsistencies on the hypervisors automatically
But… we’re still running a traditional operating system!
Our Goals
• Rapidly deploy hypervisors
• Take advantage of server reboots!
• Reproducible builds
• Consistency within hardware platforms and operating systems
After all, these are cattle, not pets!
THE CONCEPT: LIVE BOOTED HYPERVISORS
What Is A Live OS?
• A bootable image that runs in a system’s memory
• Predictable and portable
• Typically used for installs or rescue, booted from CD or network
• Doesn’t make changes to existing configuration
What if we applied this same concept to run our hypervisor?
“We’ll Do It Live!”
• Network booted stateless LiveOS
• Built from scratch using Ansible
• Operating system is separated from customer data
• Reboot for the latest build
But Where Does The Persistent Data Go?
• A systemd unit file mounts the disk early in the boot process
• Symlinks are created from the LiveOS to the persistent store
• For example:
  /dev/sda2 is mounted to /data
  /var/lib/nova -> /data/var/lib/nova
• You can create a symlink for each directory you want to persist
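The symlink step can be sketched in Python roughly as follows. This is a minimal illustration under assumptions, not the tooling used in the talk: the helper name `link_persistent_dirs` and its `root` parameter are hypothetical, and the persistent disk is assumed to be mounted already (by the systemd unit) at `/data`.

```python
import os
from pathlib import Path


def link_persistent_dirs(dirs, data_mount="/data", root="/"):
    """Symlink each LiveOS directory to its counterpart on the persistent store.

    Hypothetical helper: `dirs` holds absolute paths such as "/var/lib/nova",
    `data_mount` is where the persistent disk (e.g. /dev/sda2) is mounted, and
    `root` is a prefix that allows a dry run inside a scratch directory.
    """
    created = []
    for d in dirs:
        rel = d.lstrip("/")
        link = Path(root) / rel                             # e.g. /var/lib/nova
        target = Path(root) / data_mount.lstrip("/") / rel  # e.g. /data/var/lib/nova
        target.mkdir(parents=True, exist_ok=True)           # make sure the store exists
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()                                   # replace a stale link
        elif link.exists():
            # Refuse to clobber a real directory shipped in the image.
            raise FileExistsError(f"{link} exists and is not a symlink")
        os.symlink(target, link)
        created.append((str(link), str(target)))
    return created
```

Calling `link_persistent_dirs(["/var/lib/nova"], root="/tmp/scratch")` exercises the logic without touching the real root filesystem; on a booted LiveOS the default `root="/"` would apply.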
How Is This Possible?
• We leverage the dracut project.
• Dracut runs in the initramfs during boot time
• Main goal is to transition to the real root filesystem
• Has lots of functionality for network boot
• Set options from kernel command line
• More information @ https://dracut.wiki.kernel.org
[Figure: Dracut config example (image not recoverable from the transcript)]
Why Use A LiveOS
• Everything boots from a single image.
• Can make changes without a reboot, but the image should be updated to match.
• Can update to a new release of the OS and roll back to the previous one if needed.
• Portable and easy to test and develop on.
• Memory is cheap!
THE IMAGE BUILD PROCESS
Squashible
• Combination of SquashFS and Ansible
• Ansible Playbooks automate the build process of creating the images
• Supports multiple OS versions
• Configuration management done during image build
• All changes to our build live within the repo, fully tracked and easily reproducible
The Initial Bootstrap
• Ansible uses Docker to create a minimal chroot
• Installs:
‣ Package manager
‣ Init system
• Copies the chroot to Jenkins
• Ansible destroys the Docker container
[Diagram labels: minimal OS in chroot (dnf, apt, zypper); filesystem on Jenkins server; Docker container or systemd-nspawn]
Preparing The chroot
• Ansible uses its chroot connection plugin to bring the OS up to date
• Version-tracking metadata is added to the image
• Package manager configurations are applied
• All packages are updated to the latest available versions from the distribution's mirrors
[Diagram labels: Live OS chroot; yum/apt configuration; versioning metadata]
Common Configuration
Ansible applies the configuration that should be included in all live images:
• Authentication
• Auditing
• Common packages
• Logging configuration
• Security configurations
• SSH configurations
• Enable/disable services on boot
[Diagram labels: Live OS chroot; security configuration (auth, sshd, SELinux, AppArmor, auditd); logging configuration (journald, rsyslog); service startup configuration (via systemd)]
Apply The Personality
• Ansible takes the common live OS chroot and configures it based on the desired “personality”
• Each role has the packages to install along with any special configurations required for the hypervisor to function
[Diagram labels: common Live OS chroot, specialized via additional Ansible roles into: basic server; KVM hypervisor; Xen hypervisor; LXC hypervisor; XenServer hypervisor]
Publishing The Build
• Kernel and initramfs are copied to the deployment server
• Root filesystem (entire chroot) is tarballed and copied to the deployment server
• mktorrent generates a torrent file for the rootfs
• rtorrent seeds the initial torrent of the rootfs
[Diagram labels: common Live OS chroot; vmlinuz (kernel); initrd (ramdisk); root filesystem tarball of chroot; root filesystem torrent file (mktorrent); rtorrent (seeds rootfs); opentracker; Deployment Server (HTTP): vmlinuz, initramfs, rootfs.img, rootfs.img.torrent]
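As a rough sketch of the torrent step, the Python below only assembles the `mktorrent` command line (`-a` for the tracker announce URL, `-o` for the output file) and lists the artifacts that end up on the deployment server. The function name, the tracker URL, and the file names in the usage note are assumptions for illustration; the talk does not show the actual publish job.

```python
def publish_plan(rootfs="rootfs.img", announce_url="http://tracker.example:6969/announce"):
    """Return the mktorrent argv and the artifact list for one build.

    Assumptions: `announce_url` points at the opentracker instance, and the
    kernel/initramfs file names match the deployment-server listing.
    """
    torrent = f"{rootfs}.torrent"
    # mktorrent -a <announce url> -o <output .torrent> <file to share>
    make_torrent = ["mktorrent", "-a", announce_url, "-o", torrent, rootfs]
    artifacts = ["vmlinuz", "initramfs", rootfs, torrent]
    return make_torrent, artifacts


if __name__ == "__main__":
    argv, artifacts = publish_plan()
    print(" ".join(argv))   # command a build job could run
    print(artifacts)        # files to copy to the HTTP deployment server
```

A build job could pass the returned list to `subprocess.run(argv, check=True)` and then start seeding the resulting torrent with rtorrent.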
THE BOOT PROCESS
Ok, We Built An Image, Now What? Boot It!
• Boot from network with iPXE
• Boot from local disk with GRUB
• If the network fails, can revert to local boot
• Lots of open source provisioning systems available
Boot with iPXE
#!ipxe
:netboot
imgfree
set dracut_ip ip=${mgmt_ip_address}::${mgmt_gateway_ip}:${mgmt_netmask}:${hostname}:${mgmt_device}:none nameserver=${dns}
kernel ${vmlinuz_url} || goto netboot
module ${initrd_url} || goto netboot
imgargs vmlinuz root=live:${torrent_url} ${dracut_ip} rd.writable.fsimg ${console}
boot || goto netboot
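The `ip=` value in the script follows dracut's static-network syntax, `ip=<client-ip>:<peer>:<gateway>:<netmask>:<hostname>:<interface>:none`, with the peer field left empty. A small Python sketch of assembling it is shown below; the function name and the IPv4-only check are assumptions made for this example.

```python
def dracut_ip_arg(address, gateway, netmask, hostname, device, dns=None):
    """Build the dracut `ip=` argument in the field order used by the iPXE script.

    Hypothetical helper, IPv4 only: dracut expects IPv6 addresses in brackets,
    which this sketch does not handle, so values containing ':' are rejected.
    """
    for value in (address, gateway, netmask, hostname, device):
        if not value or ":" in value or " " in value:
            raise ValueError(f"unsupported field value: {value!r}")
    # The second field (peer) is intentionally empty, hence the '::'.
    fields = [address, "", gateway, netmask, hostname, device, "none"]
    arg = "ip=" + ":".join(fields)
    if dns:
        arg += f" nameserver={dns}"
    return arg


if __name__ == "__main__":
    print(dracut_ip_arg("10.0.0.5", "10.0.0.1", "255.255.255.0",
                        "hv01", "eth0", dns="10.0.0.2"))
    # ip=10.0.0.5::10.0.0.1:255.255.255.0:hv01:eth0:none nameserver=10.0.0.2
```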
Boot via extlinux
LABEL latestbuild-$GIT_COMMIT
menu label latestbuild-$GIT_COMMIT
kernel $KERNEL root=live:/dev/sda1 rd.live.dir=/boot/builds/$GIT_COMMIT ${dracut_ip} rd.writable.fsimg booted_as=local
initrd $INITRD
• After booting from the network, you can create a local disk cache of the image.
• If network boot fails, you can still boot a previously loaded image from disk.
• Could roll out images ahead of time and skip network boot.
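One way to picture the per-build menu entry is a small Python function that renders it from the git commit. This is only a sketch: the function name, the default device, and the kernel and initrd paths in the usage note are assumptions, while the layout mirrors the entry shown on the slide.

```python
def extlinux_entry(git_commit, kernel, initrd, dracut_ip, device="/dev/sda1"):
    """Render an extlinux menu entry for a locally cached build.

    Hypothetical helper: `kernel` and `initrd` are paths as seen by the
    bootloader; `dracut_ip` is the network argument string (e.g. "ip=dhcp").
    """
    label = f"latestbuild-{git_commit}"
    args = " ".join([
        f"root=live:{device}",                      # live image lives on the local disk
        f"rd.live.dir=/boot/builds/{git_commit}",   # directory holding this build
        dracut_ip,
        "rd.writable.fsimg",
        "booted_as=local",
    ])
    lines = [
        f"LABEL {label}",
        f"menu label {label}",
        f"kernel {kernel} {args}",
        f"initrd {initrd}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `extlinux_entry("abc123", "/boot/builds/abc123/vmlinuz", "/boot/builds/abc123/initrd.img", "ip=dhcp")` yields one entry per cached build, so an older commit stays selectable for rollback.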
Boot via kexec
kexec -l vmlinuz --initrd=initrd.img \
  --command-line="root=live:http://$deployment_server/images/fedora-23-kvm/rootfs.img \
ip=dhcp nameserver=8.8.8.8 rd.writable.fsimg rd.info rd.shell"
kexec -e
• Useful for testing it out from a running machine
• Also useful for reloading your OS to the latest build of the image
• Have to make sure your hardware drivers work well with kexec
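For reloading a running host into the latest build, the two kexec steps could be wrapped along these lines. The sketch only builds and prints the commands; the function name and default arguments are assumptions, and actually running them requires root and replaces the running kernel.

```python
import shlex


def kexec_commands(kernel, initrd, rootfs_url,
                   extra_args=("ip=dhcp", "rd.writable.fsimg")):
    """Return the `kexec -l` (load) and `kexec -e` (execute) argv lists.

    Hypothetical helper: `rootfs_url` is the live root image location that
    dracut fetches via root=live:<url>.
    """
    cmdline = " ".join([f"root=live:{rootfs_url}", *extra_args])
    load = ["kexec", "-l", kernel, f"--initrd={initrd}",
            f"--command-line={cmdline}"]
    execute = ["kexec", "-e"]
    return load, execute


if __name__ == "__main__":
    load, execute = kexec_commands(
        "vmlinuz", "initrd.img",
        "http://deploy.example/images/fedora-23-kvm/rootfs.img")
    # Printed rather than executed: running these reboots into the new image.
    print(shlex.join(load))
    print(shlex.join(execute))
```

Passing the lists to `subprocess.run(..., check=True)` in that order would perform the actual reload; building argv lists avoids the shell-quoting problems that the long command line invites.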
Our Primary Boot Method, Terraform
• Server makes a DHCP request and retrieves the iPXE kernel
• Identifies itself using LLDP
• Gets all attributes and plugs them into an iPXE template
• Our Utility LiveOS:
‣ Brings firmware and BIOS settings to the latest spec
‣ Storage and OBM
‣ Inventory
‣ Kexecs into the primary image
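Plugging host attributes into an iPXE template can be illustrated with Python's `string.Template`, whose `${name}` placeholders happen to match iPXE's variable syntax. This is a sketch under assumptions, not the internal tool: the attribute values are made up, and `safe_substitute` is used so that anything not supplied (here `${dracut_ip}` and `${console}`) is left for iPXE to expand at boot.

```python
from string import Template

# Template text taken from the iPXE script shown earlier in the deck.
IPXE_TEMPLATE = Template("""\
#!ipxe
:netboot
imgfree
set dracut_ip ip=${mgmt_ip_address}::${mgmt_gateway_ip}:${mgmt_netmask}:${hostname}:${mgmt_device}:none nameserver=${dns}
kernel ${vmlinuz_url} || goto netboot
module ${initrd_url} || goto netboot
imgargs vmlinuz root=live:${torrent_url} ${dracut_ip} rd.writable.fsimg ${console}
boot || goto netboot
""")


def render_ipxe(attrs):
    """Fill in per-host attributes; unknown placeholders are left untouched."""
    return IPXE_TEMPLATE.safe_substitute(attrs)


if __name__ == "__main__":
    # Hypothetical inventory record for one host.
    host = {
        "mgmt_ip_address": "10.0.0.5",
        "mgmt_gateway_ip": "10.0.0.1",
        "mgmt_netmask": "255.255.255.0",
        "hostname": "hv01",
        "mgmt_device": "eth0",
        "dns": "10.0.0.2",
        "vmlinuz_url": "http://deploy.example/vmlinuz",
        "initrd_url": "http://deploy.example/initramfs",
        "torrent_url": "http://deploy.example/rootfs.img.torrent",
    }
    print(render_ipxe(host))
```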
Our Initial Scale Tests (x86_64)
• Heavily tested on 200+ x86 hosts running a Fedora 23 based LiveOS
• Time to build and package a live image from git commit: ~10 minutes
• Time to boot a server once POST completes: ~60 seconds
• Re-provision time for 200 servers from reboot to provisioning instances: ~15 minutes
OpenPOWER “Barreleye” (ppc64le)
• Currently testing the OpenStack KVM stack with LiveOS builds using Fedora 23 on OpenPOWER Barreleye
• More information about Barreleye @ http://blog.rackspace.com/openpower-open-compute-barreleye/
Future Ideas
• Embedded configuration management
‣ Image would run automation and retrieve its own configuration on boot
‣ Regenerates itself on every boot
• Stateless instances
‣ Boot from Config Drive
‣ Reset state or upgrade with a reboot
Give It A Try
Squashible - Cross-Platform Linux Live Image Builder
http://squashible.com
Sample iPXE Boot menus:
https://github.com/squashible/boot.squashible.com
Thank you!
Antony Messerli
amesserl@rackspace.com
@ntonym
