1. Leveraging the Public Cloud for Disaster Recovery
Lahav Savir, Architect & CEO
Emind Systems Ltd.
lahavs@emind.co
2. About
Lahav Savir
• 15+ years’ experience in the online industry
• Architect and CEO @ Emind Systems
Emind Systems (est. 2006)
• Boutique system integrator
• ~100 AWS customers
• AWS solution provider
4. Disaster Recovery in a Nutshell
• Business continuity
• Minimize downtime and data loss
• Recovery Time Objective (RTO)
• Recovery Point Objective (RPO)
• Price
7. Why Amazon?
Flexible, Global Infrastructure
• N. Virginia
• Oregon
• N. California
• Ireland
• Singapore
• Tokyo
• Sydney
• São Paulo
• GovCloud
8. Secure
• VPC (Virtual Private Cloud) on AWS's infrastructure
• Specify private IP address range
• Bridge your onsite IT infrastructure and the VPC with a VPN connection or Direct Connect
• Extend your existing security and management policies to the cloud
9. A different cost model
[Chart: infrastructure cost and demand plotted over time, comparing 2nd-site cost with AWS cost across two tests, a failover and a failback. Callouts: cost savings with AWS; ability to scale; no arbitrary time limit to failback.]
11. Disaster Recovery Terms
• RTO: Recovery Time Objective
– Acceptable time period within which normal
operation (or degraded operation) needs to be
restored after event
• RPO: Recovery Point Objective
– Acceptable data loss measured in time
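The two terms above can be made concrete with a little date arithmetic. The following is a minimal sketch, not part of the original deck: the function names and all timestamps are hypothetical, chosen only to show how measured data loss is compared against the RPO and measured downtime against the RTO.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_copy, outage_start, service_restored):
    """Return (data_loss, downtime) for one incident as timedeltas."""
    data_loss = outage_start - last_good_copy    # compared against the RPO
    downtime = service_restored - outage_start   # compared against the RTO
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """True when both measured values are within their objectives."""
    return data_loss <= rpo and downtime <= rto


# Illustrative incident; every timestamp here is made up.
last_backup = datetime(2024, 1, 1, 2, 0)
outage = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 12, 30)

data_loss, downtime = recovery_metrics(last_backup, outage, restored)
ok_daily = meets_objectives(data_loss, downtime,
                            rpo=timedelta(hours=24), rto=timedelta(hours=4))
ok_hourly = meets_objectives(data_loss, downtime,
                             rpo=timedelta(hours=1), rto=timedelta(hours=4))
```

In this example the incident satisfies a 24-hour RPO with a 4-hour RTO, but it would miss a 1-hour RPO because the last good copy was seven and a half hours old.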
12. Backup and Restore
[Diagram: data from a traditional server in the on-premises infrastructure is copied to an S3 bucket with objects in AWS. Import/Export and Amazon Route 53 are also shown.]
13. Backup and Restore
[Diagram: within an AWS Region and Availability Zone, an Amazon EC2 instance is quickly provisioned from an AMI pre-bundled with OS and applications. Data is copied from objects in the Amazon S3 bucket to the instance's data volume.]
14. Backup and Restore
• Advantages
– Simple to get started
– Extremely cost effective (mostly backup storage)
• Preparation Phase
– Take backups of current systems
– Store backups in S3
– Describe procedure to restore from backup on AWS
• Know which AMI to use, build your own as needed
• Know how to restore system from backups
• Know how to switch to new system
15. Backup and Restore
• In Case of Disaster
– Retrieve backups from S3
– Bring up required infrastructure
• EC2 instances with prepared AMIs, Load Balancing, etc.
– Restore system from backup
– Switch over to the new system
• Adjust DNS records to point to AWS
• Objectives
– RTO: as long as it takes to bring up infrastructure and
restore system from backups
– RPO: time since last backup
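The "Adjust DNS records to point to AWS" step can be sketched as a small helper that builds a Route 53 change batch. This is an illustration under assumptions, not the presenter's procedure: the function name, the record name and the target host are hypothetical, and a CNAME record is assumed (a CNAME cannot be used at the zone apex).

```python
def dns_failover_change(record_name, dr_target, ttl=60):
    """Build a Route 53 change batch that repoints record_name at dr_target.

    UPSERT creates the record if it is missing and overwrites it otherwise.
    """
    return {
        "Comment": "DR failover: point traffic at AWS",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": dr_target}],
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
batch = dns_failover_change("www.example.com.", "dr-lb.example.net.")
```

With boto3, a dictionary of this shape would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of the zone. Keeping the TTL short ahead of time reduces how long resolvers keep serving the old address after the switch.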
16. Pilot Light
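[Diagram: Amazon Route 53 directs the user or system to the primary site's web server, application server and database server with its data volume. In AWS, the web server and application server are not running; a smaller database server instance with its own data volume is kept current through data mirroring/replication.]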
17. Pilot Light
[Diagram: Pilot Light layout with Amazon Route 53 in front of both sites. The primary site has web, application and database servers plus a data volume. On the AWS side the web and application servers are marked "Not Running" and the database server is marked "Smaller Instance", fed by data mirroring/replication.]
18. Pilot Light
[Diagram: Pilot Light during recovery. Amazon Route 53 sits in front of both sites; on the AWS side the web and application servers are marked "Start in minutes" and the database server is marked "Resize as desired", with its data volume populated through data mirroring/replication.]
19. Pilot Light
• Advantages
– Very cost effective (fewer 24/7 resources)
• Preparation Phase
– Enable replication of all critical data to AWS
– Prepare all required resources for automatic start
• AMIs, Network Settings, Load Balancing, etc.
20. Pilot Light
• In Case of Disaster
– Automatically bring up resources around the replicated core data set
– Scale the system as needed to handle current production traffic
– Switch over to the new system
• Adjust DNS records to point to AWS
• Objectives
– RTO: as long as it takes to detect need for DR and automatically scale
up replacement system
– RPO: depends on replication type
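For the "scale the system as needed to handle current production traffic" step, a rough sizing calculation might look like the following. This is a minimal sketch: the function name, the headroom factor and the per-instance throughput figure are assumptions for illustration, and real sizing would come from load testing.

```python
import math


def instances_needed(current_rps, rps_per_instance, headroom=0.2, minimum=1):
    """Estimate how many instances to start for the current request rate.

    headroom adds spare capacity on top of the measured load (0.2 = 20%).
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    required = current_rps * (1 + headroom) / rps_per_instance
    return max(minimum, math.ceil(required))


# Illustrative figures only: 900 requests/s, 100 requests/s per instance.
web_tier = instances_needed(900, 100)
idle_tier = instances_needed(0, 100)
```

The result would feed whatever mechanism starts the pre-built AMIs, for example the desired capacity of an Auto Scaling group.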
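21. Fully-Working Low-Capacity Standby
[Diagram: Amazon Route 53 directs the user or system to the primary site's web server, application server and database server with its data volume. In AWS, a low-capacity web server, app server and DB server are running, and the data volume is kept current through data mirroring/replication.]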
23. Fully-Working Low-Capacity Standby
[Diagram: the same two-site layout during failover. On the AWS side the web, application and database tiers are marked "Grow Capacity", with additional servers added in each tier; data mirroring/replication links the two data volumes and Amazon Route 53 sits in front of both sites.]
25. Fully-Working Low-Capacity Standby
• Advantages
– Can take some production traffic at any time
– Cost savings (IT footprint smaller than full DR)
• Preparation
– Similar to Pilot Light
– All necessary components running 24/7, but not scaled for production
traffic
– Best practice – continuous testing
• “Trickle” a statistical subset of production traffic to DR site
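One way to "trickle" a subset of traffic is weighted DNS routing, where each site receives a share equal to its weight divided by the sum of all weights (Route 53 weighted records accept weights from 0 to 255). The sketch below only computes the weights; the function names are hypothetical, and the deck does not say which mechanism the presenter used.

```python
def trickle_weights(dr_percent):
    """Return (primary_weight, dr_weight) sending dr_percent of traffic to DR.

    dr_percent is a whole-number percentage between 0 and 100.
    """
    if not 0 <= dr_percent <= 100:
        raise ValueError("dr_percent must be between 0 and 100")
    return 100 - dr_percent, dr_percent


def traffic_share(weight, all_weights):
    """Fraction of requests a record with this weight is expected to get."""
    return weight / sum(all_weights)


# Illustrative split: send 5% of production traffic to the DR site.
primary_weight, dr_weight = trickle_weights(5)
dr_share = traffic_share(dr_weight, (primary_weight, dr_weight))
```

Raising `dr_percent` to 100 during a disaster gives weights of (0, 100), which corresponds to the full failover described on the next slide.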
26. Fully-Working Low-Capacity Standby
• In Case of Disaster
– Immediately fail over most critical production load
• Adjust DNS records to point to AWS
– (Auto) Scale the system further to handle all production load
• Objectives
– RTO: for critical load: as long as it takes to fail over; for all other
load, as long as it takes to scale further
– RPO: depends on replication type
27. Multi-Site Hot Standby
[Diagram: Amazon Route 53 directs the user or system to both sites. The primary site and AWS each run web, application and database servers at full capacity, with data mirroring/replication between the two data volumes.]
28. Multi-Site Hot Standby
• Advantages
– At any moment can take all production load
• Preparation
– Similar to Low-Capacity Standby
– Scales in/out fully with production load
• In Case of Disaster
– Immediately fail over all production load
• Adjust DNS records to point to AWS
• Objectives
– RTO: as long as it takes to fail over
– RPO: depends on replication type
29. Summary
• Plan
– Analyze your existing applications and services
– Find the right approach per case
• Adapt
– Match your plan to RTO, RPO and Budget
• POC
– Validate your plan
• Test
– Periodic testing
• Monitor
– Ensure continuous operation of all
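The "match your plan to RTO, RPO and Budget" step can be sketched as picking the cheapest approach that satisfies all three constraints. Everything numeric below is a placeholder invented for illustration, not a figure from this deck or from AWS; real estimates would come from your own analysis and POC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Approach:
    name: str
    rto_hours: float      # estimated time to restore service
    rpo_hours: float      # estimated worst-case data loss
    monthly_cost: float   # estimated running cost


def cheapest_fit(approaches: List[Approach], rto_target: float,
                 rpo_target: float, budget: float) -> Optional[Approach]:
    """Return the lowest-cost approach meeting RTO, RPO and budget, if any."""
    fits = [a for a in approaches
            if a.rto_hours <= rto_target
            and a.rpo_hours <= rpo_target
            and a.monthly_cost <= budget]
    return min(fits, key=lambda a: a.monthly_cost) if fits else None


# Placeholder estimates only; replace with your own measurements.
candidates = [
    Approach("Backup and Restore", 24, 24, 100),
    Approach("Pilot Light", 4, 1, 500),
    Approach("Low-Capacity Standby", 1, 1, 1500),
    Approach("Multi-Site Hot Standby", 0.1, 0.1, 5000),
]

choice = cheapest_fit(candidates, rto_target=4, rpo_target=1, budget=2000)
no_fit = cheapest_fit(candidates, rto_target=0.5, rpo_target=1, budget=1000)
```

A result of `None` means the targets and the budget cannot both be met with the current estimates, so one of them has to change.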
30. goCloud – Emind’s optimal road to the cloud
– Secure cloud architecture
– Scalable & high-availability design
– Customized system deployment
– Orchestrating cloud and software
– Cloud operation team
– Monitoring and alerting
– 24x7 SLA