My experience writing DR service for CloudStack

Alena Prokharchyk
Citrix
@Lemonjet

What is a disaster for the cloud
• A disaster for the cloud is a hardware or software failure, a network or power outage, or physical damage to the data center (DC)
• A disaster can cause partial or complete DC failure
• As a result, VMs become unresponsive and need to be restored in another data center
• The goal of a DR product is to prepare VMs for failover and recover them within a short time frame

Existing DR solutions in CS
• Recurring snapshots feature
• No out-of-the-box cross-zone recovery solution

What the new DR service does
• Lets the admin configure the recovery service without adding extra scripts or config files
• Prepares for disaster and restores a VM with all its metadata (networks and networking rules)
• Recovers VMs across zones
• Updates the recovery VMs' metadata in real time, which helps keep MTTR (Mean Time to Repair) low
• Provides a tiered DR service: the most important apps/accounts can be recovered first
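
The tiered recovery idea can be sketched as a simple priority sort. This is an illustrative sketch only, not the DR service's code: the `DrOffering` and `ProtectedVm` classes, the `tier` field, and the VM names are hypothetical, and it assumes that a lower tier number means a more important workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOffering:
    """Hypothetical DR offering (SLA); a lower tier is recovered earlier."""
    name: str
    tier: int


@dataclass(frozen=True)
class ProtectedVm:
    """Hypothetical record of a VM covered by a DR offering."""
    name: str
    offering: DrOffering


def recovery_order(vms):
    """Return the VMs sorted so the most important tier fails over first."""
    # sorted() is stable, so VMs in the same tier keep their input order.
    return sorted(vms, key=lambda vm: vm.offering.tier)


gold = DrOffering("gold", tier=1)
bronze = DrOffering("bronze", tier=3)
vms = [
    ProtectedVm("batch-1", bronze),
    ProtectedVm("db-1", gold),
    ProtectedVm("web-1", gold),
]
ordered = [vm.name for vm in recovery_order(vms)]
```

Here the two gold-tier VMs come before the bronze-tier one, in the order they were listed.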

Things the DR service doesn’t cover
• The DR service does no storage replication, only metadata replication
• Storage replication is handled by the admin outside of CS (for example, NetApp’s SnapMirror)

Which version of CloudStack is supported by DR?
DR works with:
• CloudStack 4.5
• The next Citrix CloudPlatform release, based on ASF 4.4

Design principles followed while writing the DR service
• Developed as a CS plugin in V1, with the ability to run as a separate service in future versions
• No changes to core/server CS code that are specific to DR only
• No direct access to the CS DB; all data manipulation goes through CS APIs only
• The DR service doesn’t have its own DB in V1; all DR data is stored in the CS DB as resource metadata
• Rely on MTBF (Mean Time Between Failures) being high: never fail a VM in its original zone if its preparation fails; let the admin fix things and retry

DR Service deployment
[Diagram: the DR service (DR UI plugin, DR API plugin, DR Events listener, DR Server) shown next to CloudStack (CS UI, CS API, CS Orchestration engine, CS Services/Plugins, Event message bus).]

DR process
• Configuration - configuring the DR service
• Preparation - preparing the VM for failover
• Failover - failing over the VM to the Recovery zone
• Failback - failing back the VM to its Original zone

Configuring DR
• Set up the Active zone with its Recovery zone
• Configure DR offerings (SLAs)
• Tag storage for placement of the DR VMs’ volumes

Preparing VM for failover
• The DR service listens to events from CS and deploys or updates the recovery VM's metadata in the Recovery zone
• The recovery VM doesn’t occupy physical resources on the CS side
• The recovery VM is invisible to the end user
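
The event-driven preparation described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the DR service's implementation: the `RecoveryZoneMirror` class, the event dictionary shape, and the event type strings are assumptions made for the example, and a real service would call CS APIs instead of updating a dict.

```python
class RecoveryZoneMirror:
    """Hypothetical stand-in for recovery VM metadata kept in the Recovery zone."""

    def __init__(self):
        # Maps the Active-zone VM id to the mirrored metadata of its recovery VM.
        self.recovery_vms = {}

    def handle_event(self, event):
        """Apply one CS event (assumed shape: type, vm_id, optional nic)."""
        kind = event["type"]
        vm_id = event["vm_id"]
        if kind == "VM.CREATE":
            # Metadata only: no physical resources, hidden from the end user.
            self.recovery_vms[vm_id] = {"nics": [], "visible_to_end_user": False}
        elif kind == "VM.DESTROY":
            self.recovery_vms.pop(vm_id, None)
        elif vm_id in self.recovery_vms:
            nics = self.recovery_vms[vm_id]["nics"]
            if kind == "NIC.CREATE":
                nics.append(event["nic"])
            elif kind == "NIC.DELETE" and event["nic"] in nics:
                nics.remove(event["nic"])
        # Events for unknown VMs or of other types are ignored in this sketch.


mirror = RecoveryZoneMirror()
for ev in [
    {"type": "VM.CREATE", "vm_id": 1},
    {"type": "NIC.CREATE", "vm_id": 1, "nic": "Nic1"},
    {"type": "NIC.CREATE", "vm_id": 1, "nic": "Nic 2"},
    {"type": "NIC.DELETE", "vm_id": 1, "nic": "Nic 2"},
]:
    mirror.handle_event(ev)
```

Because every change in the Active zone is applied as it arrives, the recovery VM's metadata is already current when a failover is requested.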

Preparing VM for failover
[Diagram: a UserVm with Nic1 and Nic 2 in the Active zone, and a matching UserVm with Nic1 and Nic 2 in the Recovery zone, connected through the DR Service.]

Failover process
The process of restoring a failed VM in the Recovery zone
• DR does not automatically detect that a disaster has happened
• The DR admin triggers failover for the VM by calling the DR API
• The DR service performs the failover process
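
An admin-triggered failover can be pictured as a guarded state change. The sketch below is hypothetical: the `trigger_failover` function and the `FailoverNotAllowed` exception are invented for illustration and are not the DR API, and the rule that only a VM in the "Ready to Failover" state may fail over is an assumption. The state strings themselves are the ones quoted elsewhere in this deck.

```python
READY = "Ready to Failover"
FAILED_OVER = "FailedOver"
FAILED_TO_PREPARE = "FAILED_TO_PREPARE_FOR_DR"


class FailoverNotAllowed(Exception):
    """Raised when a VM is not in a state that allows failover (assumed rule)."""


def trigger_failover(dr_states, vm_id):
    """Hypothetical admin call: move a prepared VM to the failed-over state."""
    state = dr_states.get(vm_id)
    if state != READY:
        # Nothing is changed for this VM; the admin can fix the problem and retry.
        raise FailoverNotAllowed(f"VM {vm_id} is in state {state!r}, not {READY!r}")
    dr_states[vm_id] = FAILED_OVER
    return dr_states[vm_id]


states = {1: READY, 2: FAILED_TO_PREPARE}
result = trigger_failover(states, 1)
```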

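Failover process
[Diagram: in the Active zone, a UserVm with Volume1 and Volume2 on CS storage1, backed by Physical storage1; in the Recovery zone, a UserVm with Volume1 and Volume2 on CS storage2, backed by Physical storage2. NetApp SnapMirror links the two physical storages, the DR Service sits between the zones, and both VMs are labeled UUID1.]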

Failback process
The process of moving a VM back to its original zone
• VM metadata is preserved in the original zone and reused when the VM is recovered
• The recovery VM’s volumes are reintroduced to the original zone and attached to the original VM
• The VM in the recovery zone is disabled
• The VM in the original zone is enabled
• A UUID swap happens
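
The failback steps above can be sketched on two in-memory records. This is only an illustration under stated assumptions: the `VmRecord` class and `fail_back` function are hypothetical, and the UUID swap is modeled as a plain exchange of the two records' `uuid` fields, which may differ from how the DR service manages external UUIDs.

```python
from dataclasses import dataclass, field


@dataclass
class VmRecord:
    """Hypothetical VM record; not a CloudStack type."""
    uuid: str
    zone: str
    enabled: bool
    volumes: list = field(default_factory=list)


def fail_back(original, recovery):
    """Move the workload from the recovery VM back to the original VM."""
    # Reintroduce the recovery VM's volumes to the original zone and attach them.
    original.volumes = list(recovery.volumes)
    recovery.volumes = []
    # Disable the VM in the recovery zone, enable the one in the original zone.
    recovery.enabled = False
    original.enabled = True
    # Swap UUIDs (assumed: a simple exchange between the two records).
    original.uuid, recovery.uuid = recovery.uuid, original.uuid


original = VmRecord(uuid="uuid-hidden", zone="original", enabled=False)
recovery = VmRecord(uuid="UUID1", zone="recovery", enabled=True,
                    volumes=["Volume1", "Volume2"])
fail_back(original, recovery)
```

After the call, the original VM is enabled, carries `UUID1`, and owns both volumes, while the recovery VM is disabled.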

DR metadata in CS DB
CS DB table user_vm:
id  name      zone_id
1   VM-user1  1
2   VM-user1  2

CS DB table user_vm_details:
vm_id  detail_name     detail_value
1      DR_RECOVERY_ID  2
1      DR_STATE        FAILED_TO_PREPARE_FOR_DR
1      DR_ALERT        Failed to attach Nic to the Recovery vm
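
The way DR data lives in resource details can be sketched with a small key-value store. The helper functions below are hypothetical and use a plain dict in place of the `user_vm_details` table; the detail names and values are the ones shown in the example rows. The sketch also reflects the design principle of recording a preparation failure as metadata rather than failing the VM.

```python
def link_recovery_vm(details, vm_id, recovery_vm_id):
    """Record which recovery VM mirrors the given Active-zone VM."""
    details[(vm_id, "DR_RECOVERY_ID")] = str(recovery_vm_id)


def record_prepare_failure(details, vm_id, reason):
    """Store the failure as DR metadata; the VM itself is left running."""
    details[(vm_id, "DR_STATE")] = "FAILED_TO_PREPARE_FOR_DR"
    details[(vm_id, "DR_ALERT")] = reason


def dr_details_for(details, vm_id):
    """Return all DR details of one VM as a {detail_name: detail_value} dict."""
    return {name: value for (vid, name), value in details.items() if vid == vm_id}


# Stand-in for user_vm_details: (vm_id, detail_name) -> detail_value
details = {}
link_recovery_vm(details, 1, 2)
record_prepare_failure(details, 1, "Failed to attach Nic to the Recovery vm")
vm1 = dr_details_for(details, 1)
```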

Who controls the DR process
• The admin controls the recovery process on behalf of the users’ VMs
• The end user can monitor:
- The DR state of their VMs: “Ready to Failover”/“FailedOver”
- Recovery zone info: the zone to which the VM recovers in case of failure
- Recovery public IP address(es) info, needed to reconfigure their public DNS

CS API enhancements
• Added some missing data to CS API responses
• Added missing “resource_details” tables for some CS
resources
• Added support for CS services to publish alerts via CS APIs
• Introduced External UUID management
• Implemented resource creation with delayed start for
some objects (VPC)

Things yet to fix in CS
• Single sign-on is missing
• Resource creation in the DB and actual
implementation are not granular enough

Summary
If you are an API developer for an open source IaaS product:
• Always think from the end user/customer use case perspective when adding or modifying end user APIs
• Watch which plugins, services, and bug fixes people write for your software; this helps identify missing pieces and common problems in it
