Enabling High Availability for KVM hosts can greatly improve QoS: CloudStack handles a problematic host (by fencing or recovering it) and restarts its stopped VMs on healthy hosts. However, CloudStack HA for KVM has a limitation: it relies mainly on NFS heartbeat script checks. This talk illustrates how CloudStack HA works for KVM hosts and presents a way to improve its implementation so that KVM HA works with any storage system that can be plugged into KVM, not just NFS.
About Gabriel Bräscher - https://blogs.apache.org/cloudstack/
------------------------------------------
The CloudStack European User Group Virtual took place on May 27th. The first CSEUG Virtual proved to be a huge success. It brought together people from 23 countries – Germany, the United Kingdom, Switzerland, India, Bulgaria, Greece, Poland, Serbia, Brazil, Chile, Russia, USA, Canada, Japan, France, Uruguay, Korea …
We also had a record number of registrations and attendees for a CloudStack User Group event. Physical distance was no obstacle for our speakers, who joined the event from 6 different countries.
------------------------------------------
About CloudStack: https://cloudstack.apache.org/
KVM High Availability Regardless of Storage - Gabriel Bräscher, VP of Apache CloudStack - CloudStack European User Group Virtual, May 2021
2. Who am I?
gabriel@apache.org
• Gabriel Beims Bräscher, Brazilian
• Software Developer at PCextreme B.V.
○ Dutch hosting company founded in 2004
• 2013: First time using CloudStack (CloudStack 4.1.0)
• 2017: Apache CloudStack Committer
• 2019: CloudStack Project Management Committee (PMC)
• 2021: Appointed by the ASF as PMC Chair (VP) of CloudStack
CloudStack™ European User Group Virtual - May 27th 2021
3. Summary
What this presentation brings?
• CloudStack KVM HA
• Health Check with NFS
• Can we have KVM HA without NFS?
• KVM HA regardless of storage
• Take away: future
4. CloudStack KVM HA
Why configure HA for Hosts?
Why?
• Improve QoS
○ VMs should run as much as possible
○ Hosts should not stay “Down”
5. CloudStack KVM HA
Why configure HA for Hosts?
How it works?
Why?
• Improve QoS
○ VMs should run as much as possible
○ Hosts should not stay “Down”
How?
• Detect problematic Host
• Re-start its stopped VMs
6. CloudStack KVM HA
Why configure HA for Hosts?
How it works?
Why?
• Improve QoS
○ VMs should run as much as possible
○ Hosts should not stay “Down”
How?
• Detect problematic Host
• Recover or Fence it
• Re-start its stopped VMs
We don’t want 2 VMs mapped to the same storage path
• CloudStack cannot reach a Host
• Its VMs may still be running and reading from/writing to storage
7. CloudStack KVM HA
Why configure HA for Hosts?
How it works?
HA States
Link: https://github.com/apache/cloudstack/blob/master/api/src/main/java/org/apache/cloudstack/ha/HAConfig.java
Host HA States
• Disabled: HA Operations disabled
• Available: The resource is healthy
• Ineligible: The current state does not support HA/recovery
• Suspect: Most recent health check failed
• Degraded: The resource cannot be managed, but services end user requests
• Checking: The activity checks are currently being performed
• Recovering: The resource is undergoing recovery operation
• Recovered: The resource is recovered
• Fencing: The resource is undergoing fence operation
• Fenced: The resource is fenced
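For readers who prefer code, the ten states above can be written down as an enumeration. The Python sketch below only mirrors the list on this slide (the class name `HostHAState` is made up for the example); the linked HAConfig.java remains the authoritative definition.

```python
from enum import Enum


class HostHAState(Enum):
    """Host HA states with the descriptions given on the slide."""
    DISABLED = "HA operations disabled"
    AVAILABLE = "The resource is healthy"
    INELIGIBLE = "The current state does not support HA/recovery"
    SUSPECT = "Most recent health check failed"
    DEGRADED = "The resource cannot be managed, but services end user requests"
    CHECKING = "The activity checks are currently being performed"
    RECOVERING = "The resource is undergoing recovery operation"
    RECOVERED = "The resource is recovered"
    FENCING = "The resource is undergoing fence operation"
    FENCED = "The resource is fenced"


if __name__ == "__main__":
    # Print the states in the same "Name: description" form as the slide.
    for state in HostHAState:
        print(f"{state.name.title()}: {state.value}")
```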
8. CloudStack KVM HA
Why configure HA for Hosts?
How it works?
HA States
Link: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
9. CloudStack KVM HA
Why configure HA for Hosts?
How it works?
HA States
Requirements
Out-of-band management
• IPMI
• Redfish (CloudStack 4.15.0+)
Enable HA
• VM service offerings enabled for HA
• Hosts enabled for HA
Use NFS as shared primary storage pool
10. Health Check with NFS
Why use NFS?
Why NFS?
• Hosts in the same cluster can check the same storage
• Check the storage activity
How it works?
• A heartbeat script running on the KVM nodes checks whether it can write to and read from the mounted NFS partition
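The write/read check on this slide can be sketched roughly as follows. This is a minimal Python illustration, not CloudStack's actual heartbeat script; the `hb-<host>` file name, the function names, and the 60-second staleness threshold are assumptions made for the example.

```python
import os
import tempfile
import time


def write_heartbeat(mount_point: str, host: str) -> str:
    """Write the current Unix timestamp to a per-host file under mount_point."""
    path = os.path.join(mount_point, f"hb-{host}")  # hypothetical file name
    with open(path, "w") as f:
        f.write(str(int(time.time())))
        f.flush()
        os.fsync(f.fileno())  # push the write through to the storage
    return path


def heartbeat_is_fresh(mount_point: str, host: str, max_age_s: int = 60) -> bool:
    """Return True if the host's heartbeat file exists and is recent enough."""
    path = os.path.join(mount_point, f"hb-{host}")
    try:
        with open(path) as f:
            last_beat = int(f.read().strip())
    except (OSError, ValueError):
        return False  # an unreadable or corrupt heartbeat counts as a failure
    return time.time() - last_beat <= max_age_s


if __name__ == "__main__":
    # A temporary directory stands in for the mounted NFS partition.
    with tempfile.TemporaryDirectory() as mount:
        write_heartbeat(mount, "kvm-host-1")
        print(heartbeat_is_fresh(mount, "kvm-host-1"))
```

Because every host in the cluster mounts the same share, a neighbour could call something like `heartbeat_is_fresh` on another host's file. It also shows the weakness discussed in the talk: a slow or unavailable share makes the check fail even when the host itself is healthy.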
11. Health Check with NFS
Today, with NFS
12. Health Check with NFS
Why use NFS?
Currently KVM HA works by monitoring an NFS based heartbeat file and it can often fail whenever this network share becomes slower, causing the hypervisors to reboot.
This can be particularly annoying when you have different kinds of primary storages in place which are working fine (people running CEPH etc).
...
This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors doing it?
– Nux, 09 October 2015
JIRA Issue: CLOUDSTACK-8943
Link: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
13. Can we have KVM HA without NFS?
What are the possible validations?
Possible validations
• Request to the CloudStack Agent (JVM) -- Java can crash
• Check storage activity -- costly to implement & maintain (for each storage)
• Check via Libvirt
• Ping host -- Ping is limited and is often blocked by firewalls
14. KVM HA regardless of storage
CloudStack + KVM + HA - NFS
15. KVM HA regardless of storage
Today, with NFS
16. KVM HA regardless of storage
Proposal with KVM HA Agent Helper web-service
17. KVM HA regardless of storage
HTTP Request for checking neighbour hosts
18. KVM HA regardless of storage
What if NFS check fails?
19. KVM HA regardless of storage
What if NFS check fails?
What if KVM HA Helper Fails?
20. KVM HA regardless of storage
What if NFS check fails?
What if KVM HA Helper Fails?
What if both fails?
21. KVM HA regardless of storage
In a nutshell
HTTP REST API that checks Libvirt - KVM HA Agent
• The web service runs Libvirt commands to list VMs ( ~$ virsh list )
• Checks neighbour hosts via the same agent
• The KVM HA Agent checks can be enabled or disabled
• If NFS is used on the cluster, it is also taken into account
• If no NFS is used, heartbeat checks are skipped
Example:
• HTTP GET -> http://host.name:8080/
○ response: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}
• HTTP GET -> http://host.name:8080/check-neighbour/neighbour.name:8080
○ response: {"status": "Up"} OR {"status": "Down"}
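To make the two example responses concrete, the Python sketch below shows how such payloads might be produced and consumed. It is not the code of the KVM HA Agent; the function names are hypothetical, and using the output of `virsh list --name` as input is an assumption (the slide only mentions `virsh list`).

```python
import json


def build_vm_list_response(virsh_names_output: str) -> str:
    """Turn the text printed by `virsh list --name` into the JSON shape shown above."""
    names = [line.strip() for line in virsh_names_output.splitlines() if line.strip()]
    return json.dumps({"count": len(names), "virtualmachines": names})


def neighbour_check_url(host: str, neighbour: str, port: int = 8080) -> str:
    """Build the URL used to ask `host` about the state of `neighbour`."""
    return f"http://{host}:{port}/check-neighbour/{neighbour}:{port}"


def neighbour_is_up(response_body: str) -> bool:
    """Interpret a {"status": "Up"} or {"status": "Down"} response body."""
    return json.loads(response_body).get("status") == "Up"


if __name__ == "__main__":
    # Sample text in the one-name-per-line form that `virsh list --name` prints.
    sample = "r-123-VM\nv-134-VM\ns-111-VM\n\n"
    print(build_vm_list_response(sample))
    print(neighbour_check_url("host.name", "neighbour.name"))
    print(neighbour_is_up('{"status": "Down"}'))
```

Keeping the parsing separate from the HTTP transport means the same functions can be exercised without a running host, which is convenient in a test environment.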
22. KVM HA regardless of storage
Possible outcomes
All Good
• HTTP request gets a response listing VMs that match the DB
Warning
• HTTP request gets a response, but the listed VMs do not match the DB
Recover/Fence
• HTTP request gets a response listing zero VMs, but according to the DB there are VMs running
• HTTP request gets an error code (e.g. 404); the service is not reachable
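The three outcomes can be expressed as a small decision function. This is an illustrative Python sketch, not the logic from the pull request; the names `Outcome` and `classify`, and the use of `None` to represent a failed request, are assumptions.

```python
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    ALL_GOOD = "All Good"
    WARNING = "Warning"
    RECOVER_OR_FENCE = "Recover/Fence"


def classify(reported: Optional[Iterable[str]], expected: Iterable[str]) -> Outcome:
    """Compare the VMs reported by a host's web service with the VMs the DB expects.

    `reported` is None when the HTTP request failed (error code or unreachable).
    """
    if reported is None:
        return Outcome.RECOVER_OR_FENCE
    reported_set, expected_set = set(reported), set(expected)
    if not reported_set and expected_set:
        # The host answers but lists zero VMs while the DB says some are running.
        return Outcome.RECOVER_OR_FENCE
    if reported_set == expected_set:
        return Outcome.ALL_GOOD
    return Outcome.WARNING


if __name__ == "__main__":
    db_vms = ["r-123-VM", "v-134-VM", "s-111-VM"]
    print(classify(db_vms, db_vms).value)
    print(classify(["r-123-VM"], db_vms).value)
    print(classify([], db_vms).value)
    print(classify(None, db_vms).value)
```

Treating a partial mismatch as a warning rather than an immediate fence is the cautious choice here, since fencing a host that is still serving VMs is disruptive.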
23. Take away
Future
• HA systems are critical and will always need attention
• HA can be done regardless of storage
• However, combining multiple checks can lead to robust systems
• Code is already available at PR #4978
• Running on a test environment
• Target release: 4.16.0.0 or the next LTS
Link for PR: https://github.com/apache/cloudstack/pull/4978