This document discusses achieving 99.999% availability for OpenStack cloud services running on Ceph storage. It describes Deutsche Telekom's motivation to build highly available NFV clouds across multiple data centers. Various failure scenarios are considered, such as power outages, network failures, hardware faults, human error, and disasters. Setting up OpenStack and Ceph for high availability requires redundant components and careful planning. Ensuring quorum across Ceph monitors and OpenStack services is critical. Achieving five-nines availability ultimately requires distributing applications across multiple regions so that data center or regional failures can be tolerated.
4. NFV Cloud @ Deutsche Telekom
● Datacenter design
○ Backend DCs
■ Few but classic DCs
■ High SLAs for infrastructure and services
■ For private/customer data and services
○ Frontend DCs
■ Small but many
■ Close to the customer
■ Lower SLAs, can fail at any time
■ NFVs:
● Spread over many frontend DCs (FDCs)
● Failures are handled by the services, not the infrastructure
● Run telco core services on OpenStack/KVM/Ceph
6. Availability
● Measured relative to “100 % operational”
availability   downtime              classification
99.9%          8.76 hours/year       high availability
99.99%         52.6 minutes/year     very high availability
99.999%        5.26 minutes/year     highest availability
99.9999%       0.526 minutes/year    disaster tolerant
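The downtime column is pure arithmetic on the availability percentage; a minimal Python sketch:

    # Convert an availability percentage into allowed downtime per year.
    HOURS_PER_YEAR = 24 * 365  # 8760 h, the basis of the figures above

    def downtime_per_year(availability_pct):
        """Allowed downtime in hours per year for a given availability."""
        return (1 - availability_pct / 100) * HOURS_PER_YEAR

    for a in (99.9, 99.99, 99.999, 99.9999):
        h = downtime_per_year(a)
        print(f"{a}%: {h:.3f} h/year = {h * 60:.2f} min/year")
    # 99.9%:   8.760 h/year
    # 99.999%: 0.088 h/year = 5.26 min/year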
7. High Availability
● Continuous system availability in case of component failures
● Which availability?
○ Server
○ Network
○ Datacenter
○ Cloud
○ Application/Service
● End-to-end availability is the most interesting
8. High Availability
● Calculation
○ Each component contributes to the service availability
■ Infrastructure
■ Hardware
■ Software
■ Processes
○ Likelihood of disaster and failure scenarios
○ The model can get very complex (see the composition sketch below)
● SLAs
○ ITIL (IT Infrastructure Library)
○ Planned maintenance may be excluded, depending on the SLA
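To make the calculation concrete: under the common simplifying assumption of independent failures, availabilities of components in series multiply, while redundancy in parallel means the combination only fails when all replicas fail. A minimal sketch:

    from math import prod

    def series(*avail):
        """Availability of a chain where ALL components must work
        (assumes independent failures)."""
        return prod(avail)

    def parallel(*avail):
        """Availability of redundant components where ONE is enough
        (assumes independent failures)."""
        return 1 - prod(1 - a for a in avail)

    # A 99.9% service on 99.99% hardware is worse than either alone:
    print(series(0.999, 0.9999))   # ~0.9989
    # Two redundant 99.9% instances reach six nines together:
    print(parallel(0.999, 0.999))  # 0.999999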
10. Failure scenarios
● Power outage
○ External
○ Internal
○ Backup UPS/Generator
● Network outage
○ External connectivity
○ Internal
■ Cables
■ Switches, routers
● Failure of a server or a component
● Failure of a software service
11. Failure scenarios
● Human error is still often the leading cause of outages
○ Misconfiguration
○ Accidents
○ Emergency power-off
● Disaster
○ Fire
○ Flood
○ Earthquake
○ Plane crash
○ Nuclear accident
13. Mitigation
● Identify potential SPoFs (single points of failure)
● Use redundant components
● Careful planning
○ Network design (external/internal)
○ Power management (external/internal)
○ Fire suppression
○ Disaster management
○ Monitoring
● Five nines on the DC/HW level is hard to achieve
○ Tier IV is usually too expensive (compared with Tier III or III+)
○ Requires an HA concept on the cloud and application level
17. Architecture: Ceph Components
● OSDs
○ 10s - 1000s per cluster
○ One per device (HDD/SSD/RAID group, SAN …)
○ Store objects
○ Handle replication and recovery
● MONs:
○ Maintain cluster membership and states
○ Use the Paxos protocol to establish quorum consensus
○ Small, lightweight
○ Odd number (a quorum-status sketch follows below)
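As the quorum-status sketch, the python-rados bindings can ask the MONs directly which monitors are currently in quorum; the conffile path is an assumption for the example:

    import json
    import rados  # python-rados bindings shipped with Ceph

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # path is deployment-specific
    cluster.connect()

    # Send the "quorum_status" command to the monitors
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({"prefix": "quorum_status", "format": "json"}), b"")
    status = json.loads(outbuf)

    print("MONs in quorum:", status["quorum_names"])
    print("quorum leader: ", status["quorum_leader_name"])

    cluster.shutdown()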
19. HA - Critical Components
Which services need to be HA?
● Control plane
○ Provisioning, management
○ API endpoints and services
○ Admin nodes
○ Control nodes
● Data plane
○ Steady states
○ Storage
○ Network
20. HA Setup
● Stateless services
○ No dependency between requests
○ After reply no further attention required
○ API endpoints (e.g. nova-api, glance-api,...) or nova-scheduler
● Stateful service
○ An action typically consists of multiple requests
○ Subsequent requests depend on the results of earlier requests
○ Databases, RabbitMQ
21. HA Setup
            | active/passive                    | active/active
stateless   | load balance redundant services   | load balance redundant services
stateful    | bring replacement resource online | redundant services, all with the same state; state changes are passed to all instances
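For the stateless active/active cell, a load balancer only has to spread requests over redundant endpoints and skip dead ones; because no state ties a client to a replica, retrying elsewhere is always safe. A minimal failover sketch in Python (the endpoint URLs are placeholders):

    import urllib.request

    # Hypothetical redundant, stateless API endpoints (e.g. nova-api replicas)
    ENDPOINTS = [
        "http://api-1.example.net:8774",
        "http://api-2.example.net:8774",
    ]

    def call_api(path):
        """Try each stateless replica in turn; any one can answer any request."""
        last_error = None
        for base in ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=2) as resp:
                    return resp.read()
            except OSError as exc:
                last_error = exc  # replica down or unreachable, try the next one
        raise RuntimeError("all endpoints failed") from last_error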
23. Quorum?
● Required to decide which cluster partition/member is primary, to prevent data/service corruption
● Examples:
○ Databases
■ MariaDB/Galera, MongoDB, Cassandra
○ Pacemaker/corosync
○ Ceph Monitors
■ Paxos
■ Odd number of MONs required
■ At least 3 MONs for HA, simple majority (2:3, 3:5, 4:7, …; see the sketch below)
■ Without quorum:
● No changes to cluster membership (e.g. adding new MONs/OSDs)
● Clients can't connect to the cluster
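The majority ratios above follow from simple arithmetic, which also shows why an even MON count buys nothing; a minimal sketch:

    def quorum_size(n_mons):
        """Smallest simple majority: floor(n/2) + 1."""
        return n_mons // 2 + 1

    def tolerable_failures(n_mons):
        """MONs that may fail while quorum survives."""
        return n_mons - quorum_size(n_mons)

    for n in (3, 4, 5, 7):
        print(f"{n} MONs -> quorum {quorum_size(n)}:{n}, "
              f"tolerates {tolerable_failures(n)} failure(s)")
    # 3 MONs -> quorum 2:3, tolerates 1 failure(s)
    # 4 MONs -> quorum 3:4, tolerates 1 failure(s)  (no gain over 3 MONs)
    # 5 MONs -> quorum 3:5, tolerates 2 failure(s)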
25. SPoF
● OpenStack HA
○ No SPoF assumed
● Ceph
○ No SPoF assumed
○ Availability of RBDs is critical to VMs
○ Availability of RadosGW can be easily managed via HAProxy
● What about failures at a higher level?
○ Data center cores or fire compartments
○ Network
■ Physical
■ Misconfiguration
○ Power
29. Failure scenarios - Split brain
● Ceph:
○ Quorum selects B
○ Storage in A stops
● OpenStack HA:
○ Selects B
○ VMs in B still running
● Best-case scenario
30. Failure scenarios - Split brain
● Ceph:
○ Quorum selects B
○ Storage in A stops
● OpenStack HA:
○ Selects A
○ VMs in A and B stop working
● Worst-case scenario
31. Other issues
● Replica distribution
○ Two room setup:
■ 2 or 3 replicas carry the risk of having only one replica left (see the sketch after this list)
■ Would require 4 replicas (2:2)
● Reduced performance
● Increased traffic and costs
○ Alternative: erasure coding
■ Reduced performance, less space required
● Spare capacity
○ The remaining room requires spare capacity to restore redundancy
○ Depends on
■ Failure/restore scenario
■ Replication vs erasure code
○ Costs
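The two-room risk named above is plain combinatorics: with replicas spread as evenly as possible over two rooms, one room always holds at least half of them. A minimal sketch:

    def worst_case_survivors(replicas, rooms=2):
        """Replicas left after losing the room holding the most copies,
        assuming copies are spread as evenly as possible across rooms."""
        biggest_room = -(-replicas // rooms)  # ceiling division
        return replicas - biggest_room

    for size in (2, 3, 4):
        print(f"size={size}: worst case leaves {worst_case_survivors(size)} replica(s)")
    # size=2: 1 replica left
    # size=3: 1 replica left   <- the risk with 2 or 3 replicas
    # size=4: 2 replicas left  <- the 2:2 setup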
32. Mitigation - Three FCs
● Third FC/failure zone hosting all services
● Usually higher costs
● More resistant against failures
● Better replica distribution
● More east/west traffic
33. Mitigation - Quorum Room
● Most DCs have backup rooms
● Only a few servers needed to host quorum-related services
● Less cost intensive
● Can mitigate split brain between FCs (depending on network layout)
34. Mitigation - Pets vs Cattle
● NO pets allowed !!!
● Only cloud-ready applications
35. Mitigation - Failure tolerant applications
● Tier level is not the most relevant layer
● Applications must build their own cluster mechanisms on top of the DC → this increases availability significantly
● Data replication must be done across multiple regions
● In case of a disaster, route traffic to a different DC
● Many VNFs (virtual network functions) already support such setups
36. Mitigation - Federated Object Stores
● The best way to synchronize and replicate data across multiple DCs is to use object storage
● Sync is done asynchronously
Open issues:
● Doesn't solve replication of databases
● Many applications don't support object storage and need to be adapted
● Applications also need to support regions/zones
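Since RadosGW speaks the S3 API, applications can stay on a standard S3 client while RGW multisite replicates objects between the DCs asynchronously. A minimal sketch with boto3; the endpoint, bucket, and credentials are placeholders:

    import boto3

    # One RadosGW zone in the local DC; with RGW multisite configured,
    # objects written here are replicated to the remote zone asynchronously.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.dc1.example.net:7480",  # placeholder RGW endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.put_object(Bucket="app-data", Key="state/snapshot-001",
                  Body=b"application state ...")
    # Reads against the second DC's zone see the object only after the
    # asynchronous sync catches up (eventual consistency across sites).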
37. Mitigation - Outlook
● “OpenStack follows Storage”
○ Use RBDs as fencing devices
○ Extend Ceph MONs
■ Include information about physical placement similar to CRUSH map
■ Enable HA setup to query quorum decisions and map quorum to physical layout
● Passive standby Ceph MONs to ease deployment of MONs if quorum fails
○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors
● Generic quorum service/library?
39. Conclusions
● OpenStack and Ceph provide HA if carefully planned
○ Be aware of potential failure scenarios!
○ All quorums must be synced
○ A third room must be used
○ Replica distribution and spare capacity must be considered
○ Ceph needs more extended quorum information
● Target for five 9s is E2E
○ Five 9s on the data center level is very expensive
○ No pets !!!
○ Distribute applications or services over multiple DCs