2. What is a disaster for the cloud
• A disaster for the cloud is a hardware/software failure, a network/power outage, or physical damage to the data center (DC)
• A disaster can cause partial or complete DC failure
• As a result, VMs become unresponsive and need to be restored in another data center
• The goal of DR products is to prepare VMs for failover and recover them in a short time frame
3. Existing DR solutions in CS
• Recurring snapshots feature
• No out-of-the-box cross-zone recovery solution
4. What the new DR service does
• Lets the admin configure the recovery service without adding extra scripts and config files
• Prepares for disaster and restores the VM and all its metadata (networks, networking rules)
• Recovers VMs across zones
• Updates the recovery VMs' metadata in real time, which helps keep MTTR (Mean Time to Repair) low
• Provides a tiered DR service: the most important apps/accounts can be recovered first
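The tiered recovery in the last bullet can be pictured as a simple priority ordering. This is a minimal sketch, assuming each VM carries a numeric tier where a lower number means more important. The `RecoveryCandidate` class, the `tier` field, and the VM names are hypothetical and not part of the DR service API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryCandidate:
    """Hypothetical record: a VM and the tier of its DR offering."""
    vm_name: str
    tier: int  # assumption: 1 = most important, larger = less important


def recovery_order(candidates):
    """Return the VMs in the order they would be failed over: lowest tier first.

    sorted() is stable, so VMs within the same tier keep their input order.
    """
    return sorted(candidates, key=lambda c: c.tier)


vms = [
    RecoveryCandidate("batch-worker", tier=3),
    RecoveryCandidate("billing-db", tier=1),
    RecoveryCandidate("web-frontend", tier=2),
]
ordered_names = [c.vm_name for c in recovery_order(vms)]
```

With this input, `billing-db` would be recovered before `web-frontend`, and `batch-worker` last.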
5. Things DR service doesn’t cover
• The DR service does no storage replication, only metadata replication
• Storage replication is handled by the admin outside of CS (NetApp’s SnapMirror)
6. Which version of CloudStack
is supported by DR?
DR works with:
• CloudStack 4.5
• The next Citrix CloudPlatform release, based on ASF 4.4
7. Design principles followed while writing
the DR
• Develop as a CS plugin in V1, with the ability to run as a separate service in future versions
• No changes to core/server CS code that are specific to DR only
• No direct access to the CS DB; all data manipulation goes through CS APIs only
• The DR service doesn’t have its own DB in V1; all DR data is stored in the CS DB in the form of resources’ metadata
• Rely on MTBF (Mean Time Between Failures) being high: never fail a VM in its original zone if its preparation fails; let the admin fix things and retry
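The last two principles (DR data kept as resource metadata, and a failed preparation never touching the original VM) can be sketched as follows. This is an illustration only: the dict stands in for a VM's metadata, the detail names `DR_STATE` and `DR_ALERT` and the value `FAILED_TO_PREPARE_FOR_DR` come from the metadata slide of this deck, while `READY_TO_FAILOVER`, `VM_STATE`, and the function names are assumptions.

```python
def prepare_for_dr(vm_details, prepare_step):
    """Run one preparation step and record the outcome in the VM's metadata.

    vm_details is a plain dict standing in for the VM's resource details.
    On failure, only the DR-related details change; the VM itself is left
    alone so the admin can fix the problem and call this again.
    """
    try:
        prepare_step()
    except Exception as err:
        vm_details["DR_STATE"] = "FAILED_TO_PREPARE_FOR_DR"
        vm_details["DR_ALERT"] = str(err)
        return False
    vm_details["DR_STATE"] = "READY_TO_FAILOVER"  # assumed state name
    vm_details.pop("DR_ALERT", None)  # clear any alert from an earlier attempt
    return True


def failing_step():
    # Simulated preparation failure, using the alert text from the deck.
    raise RuntimeError("Failed to attach Nic to the Recovery vm")


details = {"VM_STATE": "Running"}  # hypothetical marker for the original VM
first_ok = prepare_for_dr(details, failing_step)
after_failure = dict(details)  # snapshot taken before the retry
retry_ok = prepare_for_dr(details, lambda: None)  # admin fixed things, retry
```

After the failed attempt the VM is still marked as running and carries the alert; after the retry the alert is gone and the state is ready.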
8. DR Service deployment
[Diagram: the DR service (DR UI plugin, DR API plugin, DR Events listener, DR Server) shown alongside CloudStack (CS UI, CS API, CS Orchestration engine, Event message bus, CS Services/Plugins)]
9. DR process
• Configuration - configuring the DR service
• Preparation - preparing VM for failover
• Failover - failing over the VM to the Recovery zone
• Failback - failing back the VM to its Original zone
10. Configuring DR
• Set up the Active zone with the Recovery zone
• Configure DR offerings (SLAs)
• Tag storage for placement of the DR VMs’ volumes
11. Preparing VM for failover
• The DR service listens to events from CS and deploys/updates the recovery VM's metadata in the Recovery zone
• The recovery VM doesn’t occupy physical resources on the CS side
• The recovery VM is invisible to the end user
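The event-driven preparation described above can be sketched as a small handler that keeps a metadata-only copy of each VM up to date. This is a minimal sketch under stated assumptions: the `RecoveryMetadataStore` class and the event dict layout are hypothetical, the event type strings are illustrative labels, and the in-memory dict stands in for recovery VM metadata that the real service would manage through CS APIs.

```python
class RecoveryMetadataStore:
    """Hypothetical stand-in for recovery VM metadata in the Recovery zone."""

    def __init__(self):
        # active-zone VM id -> metadata of its recovery VM (no physical resources)
        self.recovery_vms = {}

    def handle_event(self, event):
        """Apply one event from the Active zone to the recovery metadata."""
        vm_id = event["vm_id"]
        kind = event["type"]
        if kind == "VM.CREATE":
            # Deploy the recovery VM as metadata only.
            self.recovery_vms.setdefault(vm_id, {"nics": []})
        elif kind == "NIC.CREATE" and vm_id in self.recovery_vms:
            self.recovery_vms[vm_id]["nics"].append(event["nic"])
        elif kind == "NIC.DELETE" and vm_id in self.recovery_vms:
            nics = self.recovery_vms[vm_id]["nics"]
            if event["nic"] in nics:
                nics.remove(event["nic"])
        # Any other event type is ignored in this sketch.


store = RecoveryMetadataStore()
for ev in [
    {"type": "VM.CREATE", "vm_id": 1},
    {"type": "NIC.CREATE", "vm_id": 1, "nic": "Nic1"},
    {"type": "NIC.CREATE", "vm_id": 1, "nic": "Nic2"},
]:
    store.handle_event(ev)
```

After these three events the store holds one recovery VM with two NICs, matching the Active zone VM.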
12. Preparing VM for failover
[Diagram: a UserVm with Nic1 and Nic2 in the Active zone, a matching UserVm with Nic1 and Nic2 in the Recovery zone, and the DR Service between the two zones]
13. Failover process
The process of restoring a failed VM in the Recovery zone
• The DR service does not automatically detect that a disaster has happened
• The DR admin triggers failover for the VM by calling the DR API
• The DR service performs the failover process
14. Failover process
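[Diagram: in the Active zone, a UserVm (UUID1) with Volume1 and Volume2 on CS storage1, backed by Physical storage1. In the Recovery zone, a UserVm (UUID1) with Volume1 and Volume2 on CS storage2, backed by Physical storage2. NetApp SnapMirror links the two physical storages, and the DR Service sits between the zones]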
15. Failback process
The process of moving a VM back to its original zone
• VM metadata is preserved in the original zone and reused when the VM is recovered
• The recovery VM’s volumes get reintroduced to the original zone and attached to the original VM
• The VM in the recovery zone gets disabled
• The VM in the original zone gets enabled
• A UUID swap happens
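The disable/enable/swap steps above can be sketched as follows. This is an illustration, assuming the user-facing UUID always follows whichever copy of the VM is enabled. The `VmRecord` class and the placeholder UUID strings are hypothetical, and volume reattachment is left out.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    """Hypothetical minimal view of a VM in one zone."""
    zone: str
    uuid: str
    enabled: bool


def failback(original, recovery):
    """Move the active role back to the original zone's VM.

    Mirrors the last three steps of the list: disable the recovery VM,
    enable the original VM, then swap the two UUIDs.
    """
    recovery.enabled = False
    original.enabled = True
    original.uuid, recovery.uuid = recovery.uuid, original.uuid


# State after a failover: the recovery VM is active and holds the user-facing UUID.
original_vm = VmRecord(zone="original", uuid="placeholder-uuid", enabled=False)
recovery_vm = VmRecord(zone="recovery", uuid="UUID1", enabled=True)

failback(original_vm, recovery_vm)
```

After the call, the original VM is enabled and carries `UUID1` again, so under this assumption the end user keeps addressing the VM by the same identifier.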
16. DR metadata in CS DB
CS DB

user_vm
id  name      zone_id
1   VM-user1  1
2   VM-user1  2

user_vm_details
vm_id  detail_name     detail_value
1      DR_RECOVERY_ID  2
1      DR_STATE        FAILED_TO_PREPARE_FOR_DR
1      DR_ALERT        Failed to attach Nic to the Recovery vm
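To show how the two tables relate, here is a minimal sketch that resolves a VM's recovery VM from its details rows. The rows are copied from the slide into plain Python lists. The helper names are hypothetical, and since the design avoids direct DB access, the real service would obtain this data through CS APIs rather than from the tables themselves.

```python
# Rows from the slide, as plain dicts.
user_vm = [
    {"id": 1, "name": "VM-user1", "zone_id": 1},
    {"id": 2, "name": "VM-user1", "zone_id": 2},
]
user_vm_details = [
    {"vm_id": 1, "detail_name": "DR_RECOVERY_ID", "detail_value": "2"},
    {"vm_id": 1, "detail_name": "DR_STATE",
     "detail_value": "FAILED_TO_PREPARE_FOR_DR"},
    {"vm_id": 1, "detail_name": "DR_ALERT",
     "detail_value": "Failed to attach Nic to the Recovery vm"},
]


def details_for(vm_id):
    """Collect a VM's detail rows into a name -> value dict."""
    return {
        row["detail_name"]: row["detail_value"]
        for row in user_vm_details
        if row["vm_id"] == vm_id
    }


def recovery_vm_for(vm_id):
    """Follow DR_RECOVERY_ID to the recovery VM's row, or return None."""
    recovery_id = details_for(vm_id).get("DR_RECOVERY_ID")
    if recovery_id is None:
        return None
    return next((vm for vm in user_vm if vm["id"] == int(recovery_id)), None)


recovery = recovery_vm_for(1)
```

Here VM 1 in zone 1 points at VM 2 in zone 2 as its recovery VM, and VM 2 has no DR details of its own.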
17. Who controls the DR
process
• The admin controls the recovery process on behalf of users’ VMs
• The end user can monitor:
- The DR state of their VMs: “Ready to Failover”/“FailedOver”
- Recovery zone info: the zone to which the VM recovers in case of failure
- Recovery public IP address(es) info, needed to reconfigure their public DNS
18. CS API enhancements
• Added some missing data to CS API responses
• Added missing “resource_details” tables for some CS
resources
• Added support for CS services to publish alerts via CS APIs
• Introduced External UUID management
• Implemented resource creation with delayed start for
some objects (VPC)
19. Things yet to fix in CS
• Single sign-on is missing
• Resource creation in the DB and actual
implementation are not granular enough
20. Summary
If you are an API developer for an open source IaaS product:
• Always think from the end user/customer use-case perspective when adding or modifying end-user APIs
• Look at which plugins/services/bug fixes people write for your software. This helps identify missing pieces and common problems in your software