OSDC 2019 | RTO & RPO – Best Practices in Hybrid Architectures by Fernando Honig

RTO & RPO
Best Practices
in Hybrid Architectures
OSDC - May 2019
Fernando Hönig
fernando@nubego.io
fernandohonig

RTO vs RPO
What is this?
© 2019, nubeGO or its Affiliates. All rights reserved.

RTO vs RPO
Apples vs Oranges
RTO (Recovery Time Objective)
● It defines how quickly you need to recover.
● It is the target time you set for the recovery.
RPO (Recovery Point Objective)
● It is focused on data and your company’s loss tolerance in relation to your data.
● It is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
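As a rough illustration of "the time between data backups", the sketch below finds the largest gap in a list of backup timestamps, which is the most data (measured in time) that could be lost if a failure hit just before the next backup. The function name and the sample timestamps are assumptions made for this example, not part of the talk.

```python
from datetime import datetime, timedelta
from typing import Sequence


def worst_case_data_loss(backup_times: Sequence[datetime]) -> timedelta:
    """Return the largest gap between two consecutive backups."""
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical backup history: the third backup ran late.
backups = [
    datetime(2019, 5, 14, 0, 0),
    datetime(2019, 5, 14, 12, 0),
    datetime(2019, 5, 15, 6, 0),
]
gap = worst_case_data_loss(backups)
print(gap)  # 18:00:00
```

If the largest gap exceeds the RPO you have agreed with the business, the backup schedule does not support that RPO.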

RPO and RTO
RPO: The business can recover from losing (at most) the last 12 hours of data.
RTO: The application can be unavailable for a maximum of 1 hour.
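A minimal sketch of how those two targets could be checked against a real incident. The 12-hour RPO and 1-hour RTO come from the slide; the class, the function and the incident timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rpo: timedelta  # maximum tolerated data loss, expressed as time
    rto: timedelta  # maximum tolerated downtime


def evaluate(objectives: Objectives, last_backup: datetime,
             outage_start: datetime, service_restored: datetime) -> dict:
    """Compare the data loss and downtime of one incident with the targets."""
    data_loss = outage_start - last_backup
    downtime = service_restored - outage_start
    return {
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


targets = Objectives(rpo=timedelta(hours=12), rto=timedelta(hours=1))

# Hypothetical incident: 8.5 hours since the last backup, 75 minutes of downtime.
result = evaluate(
    targets,
    last_backup=datetime(2019, 5, 14, 2, 0),
    outage_start=datetime(2019, 5, 14, 10, 30),
    service_restored=datetime(2019, 5, 14, 11, 45),
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```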

AVAILABILITY CONCEPTS
High Availability: Minimizing downtime for your application
Backup: Making your data safe
Disaster Recovery: Getting your applications and data back after a major disaster

What could go wrong?
HOW DO WE FIX IT? QUICKLY?
Small events
● Instance restart failure
● Application deployment failure
Large Scale events
● Availability Zones down
● Unavailable services
Colossal events
● Unavailable region
● Infrastructure destruction by error

Latest Events
Small events
● Instance restart failure
● Application deployment failure
Large Scale events
● GitHub S3 AZ Unavailable
● UK’s Petition System Unavailable
Colossal events
● Data Unavailable - Failed Backups
● GitLab Database Destruction

DISASTER PLANNING
RECOVERY OPTIONS


Operating System
Machine Images
● Snapshot to other regions
● Share it across your accounts/projects
UserData
● Create scripts to execute during start up
● Patch / Update your OS and stay up to date

Storage
Object storage
● Replicate to other regions
● Enable versioning
Block storage
● Create point-in-time Snapshots
● Copy snapshots across regions and accounts

Machine Images and Snapshots
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/images/ami_lifecycle.png

Networking
DNS
● Enable health checks
● Enable Latency records
Load balancing
● Failover options
● Health Checks with HTTP Code
VPC
● Extend your network to the Cloud
Direct Connect
● Enable fast and consistent replication/backup options from on-premises environments to the cloud
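The idea behind health checks with an HTTP code plus failover can be sketched in a few lines: classify each endpoint by its last status code and send traffic to the first healthy one in priority order. This is a simplified model, not how any particular DNS or load-balancing service is implemented; the host names are placeholders and treating every 2xx/3xx code as healthy is an assumption.

```python
from typing import Dict, List, Optional


def is_healthy(status_code: Optional[int]) -> bool:
    """Assumption: 2xx and 3xx count as healthy; None means no response."""
    return status_code is not None and 200 <= status_code < 400


def pick_endpoint(priority: List[str],
                  last_status: Dict[str, Optional[int]]) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in priority:
        if is_healthy(last_status.get(endpoint)):
            return endpoint
    return None


# Hypothetical endpoints: primary on-premises, secondary in the cloud.
order = ["onprem.example.com", "cloud.example.com"]

print(pick_endpoint(order, {"onprem.example.com": 200, "cloud.example.com": 200}))
# onprem.example.com
print(pick_endpoint(order, {"onprem.example.com": 503, "cloud.example.com": 200}))
# cloud.example.com
```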

Databases
RDS
● Snapshot data and save it in a separate region
● Combine Read Replicas with Multi-AZ to build a resilient disaster recovery strategy

Infrastructure
IaC (Infrastructure as Code)
● Use templates to quickly deploy collections of resources as needed
● Treat it as code, test it and deploy new changes with your application releases

BACKUP AND RESTORE

Backup Phase
● Take backups of current systems.
● Store backups in Object Storage Services.
● Describe the procedure to restore from backup in the cloud.
● Know which machine template to use; build your own as needed.
● Know how to restore the system from backups.
● Know how to switch to the new system.
● Know how to configure the deployment.

Backup Options
FILES: NFS, SMB
VOLUMES: iSCSI
TAPES: iSCSI Virtual Tape Library

Hybrid Backup
https://d1.awsstatic.com/product-marketing/AWS%20Backup/product-page-diagram_aws_backup_hybrid.e5132f9c5fd6cd0299187d8d41147a3f7964d09a.png

Restore Phase
In case of disaster…
● Retrieve backups from Object Storage.
● Bring up required infrastructure.
  ● Cloud instances with prepared machine images, Load Balancers, etc.
  ● Use infrastructure as code to automate deployment of core networking.
● Restore system from backup.
● Switch over to the new system.
  ● Adjust DNS records to point to the cloud systems.
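One way to sanity-check a restore procedure like this is to write the steps down with time estimates and compare the total with the RTO. The sketch below does only that; every duration is a made-up placeholder that you would replace with timings measured in your own restore tests.

```python
from datetime import timedelta

# Step names follow the slide; the durations are ASSUMED placeholders.
RESTORE_STEPS = [
    ("retrieve backups from object storage", timedelta(minutes=20)),
    ("bring up required infrastructure", timedelta(minutes=15)),
    ("restore system from backup", timedelta(minutes=30)),
    ("switch over and adjust DNS records", timedelta(minutes=10)),
]


def estimated_recovery_time(steps) -> timedelta:
    """Sum the step durations, assuming the steps run one after another."""
    return sum((duration for _, duration in steps), timedelta())


def fits_rto(steps, rto: timedelta) -> bool:
    return estimated_recovery_time(steps) <= rto


total = estimated_recovery_time(RESTORE_STEPS)
print(total)                                         # 1:15:00
print(fits_rto(RESTORE_STEPS, timedelta(hours=1)))   # False
```

With these placeholder numbers a 1-hour RTO is missed by 15 minutes, which points at the longest step (the data restore) as the first thing to shorten.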

RECOVERY STRATEGIES

Pilot Light
[Diagram: a user or system reaches the primary site (web server, app server, database server) through an Amazon Route 53 hosted zone. The secondary site holds web and app servers that are not running, plus a secondary DB kept up to date through data mirroring/replication.]
Pilot Light
[Diagram: the same layout during recovery. The web and app servers on the secondary site start in minutes, next to the DB and the secondary DB, behind the Amazon Route 53 hosted zone used by the user or system.]

Pilot Light
Advantage
● Very cost-effective (uses fewer 24/7 resources)
Preparation Phase
● Set up instances to replicate or mirror data.
● Ensure that you have all supporting custom software packages available in the cloud.
● Create and maintain Machine Images of key servers where fast recovery is required.
● Regularly run these servers, test them, and apply any software updates and configuration changes.
● Consider automating the provisioning of cloud resources.

Pilot Light
In case of disaster…
● Automatically bring up resources around the replicated core data set.
● Scale the system as needed to handle current production traffic.
● Switch over to the new system.
  ● Adjust DNS records to point to the cloud.
Objectives
● RTO: As long as it takes to detect the need for DR and automatically scale up the replacement system.
● RPO: Depends on replication type.

Fully Working Low-Capacity Standby
[Diagram: primary and secondary sites, each with web, app and database servers, behind an Amazon Route 53 hosted zone used by the user or system. The secondary site is labelled "Low capacity", its web and app servers sit in Auto Scaling groups, and it holds the secondary DB.]

Fully Working Low-Capacity Standby
[Diagram: the same two sites behind the Amazon Route 53 hosted zone, with additional web and app servers shown; the secondary site is still labelled "Low capacity" and holds the secondary DB.]

Fully Working Low-Capacity Standby
Advantages
● Can take some production traffic at any time.
● Cost savings (IT footprint smaller than full DR)
Preparation
● Similar to Pilot Light
● All necessary components running 24/7, but not scaled for production traffic
● Best practice: continuous testing
  ● “Trickle” a statistical subset of production traffic to DR site.
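Sending a statistical subset of production traffic to the DR site can be modelled as weighted routing. The sketch below hashes a request ID so that a fixed share of requests goes to the DR site and the same request always takes the same route. In practice this is usually done with weighted DNS records or load-balancer rules; the function, the 5% share and the request IDs are assumptions for illustration.

```python
import hashlib


def route(request_id: str, dr_share: float = 0.05) -> str:
    """Return "dr" for roughly `dr_share` of all request IDs, else "primary"."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "dr" if bucket < dr_share * 2**64 else "primary"


# Route 10,000 hypothetical requests and measure the share that hit the DR site.
sent_to_dr = sum(route(f"req-{i}") == "dr" for i in range(10_000))
print(sent_to_dr / 10_000)  # close to 0.05
```

Hashing rather than calling a random number generator keeps the decision repeatable, which makes it easier to trace a given request during a test.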

Fully Working Low-Capacity Standby
In case of disaster...
● Immediately fail over most critical production load.
  ● Adjust DNS records to point to the cloud.
● (Auto) Scale the system further to handle all production load.
Objectives
● RTO: For critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further.

Multi-Site Active-Active
[Diagram: two sites, each with web, app and database servers, behind an Amazon Route 53 hosted zone used by the user or system. The second site is labelled "Full capacity" and holds the secondary DB.]

Multi-Site Active-Active
Advantages
● At any moment, can take all production load.
Preparation
● Similar to low-capacity standby.
● Fully scaling in/out with production load.
In case of disaster…
● Immediately fail over all production load.
Objectives
● RTO: As long as it takes to fail over.

Recovery Strategies
Backup and Restore (Cost: $)
▪ Lower priority use cases
▪ Solutions: Object Storage, Archive Storage
Pilot Light (Cost: $$)
▪ Meeting lower RTO and RPO requirements
▪ Core services
▪ Scale cloud resources in response to a DR event
Low-Capacity Standby (Cost: $$$)
▪ Solutions that require RTO and RPO in minutes
▪ Business-critical services
Multi-Site Active-Active (Cost: $$$$)
▪ Auto-failover of your environment in the cloud to a running duplicate
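The cost-versus-recovery trade-off between the four strategies (Backup and Restore, Pilot Light, Low-Capacity Standby, Multi-Site Active-Active) can be expressed as "pick the cheapest strategy that still meets the required RTO". The sketch below shows that selection logic only. The cost tiers mirror the $ to $$$$ scale; the RTO figure attached to each strategy is an invented placeholder, not a number from the talk, and has to be replaced by what you measure in your own environment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_tier: int          # 1 = $, 4 = $$$$
    typical_rto: timedelta  # ASSUMED placeholder, not from the talk


STRATEGIES = [
    Strategy("Backup and Restore", 1, timedelta(hours=24)),
    Strategy("Pilot Light", 2, timedelta(hours=4)),
    Strategy("Low-Capacity Standby", 3, timedelta(minutes=30)),
    Strategy("Multi-Site Active-Active", 4, timedelta(minutes=1)),
]


def cheapest_strategy(required_rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose assumed RTO meets the requirement."""
    candidates = [s for s in STRATEGIES if s.typical_rto <= required_rto]
    return min(candidates, key=lambda s: s.cost_tier) if candidates else None


choice = cheapest_strategy(timedelta(hours=1))
print(choice.name)  # Low-Capacity Standby
```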

SCENARIO TIME!

CASE SCENARIO #1
Bob is in charge of defining the best DR strategy for a hybrid architecture and he did the setup based on the following requirements:
● We need to have an RTO of 60 minutes.
● Our backups are stored in the cloud and are taken daily.
● The RPO has to be less than 8 hours, and we need to be able to build a new environment quickly.
● Our application runs in the cloud but our database is still in our local datacenter.

CASE SCENARIO
RTO = 1h
RPO = 8h
CAN IT BE ACHIEVED?

CASE SCENARIO
RTO/RPO
● There is no certainty they can achieve a 1h RTO and an 8h RPO.
● Backups run daily, so the RPO can’t be 8h.
DATABASE
● How much time would it take to build a new DB and import the data?
ON PREM
● How much time would it take you to copy from the cloud to your on-prem DB?
CODE
● APP: Is your app code full of variables to cope with a change of endpoints?
● INFRA: Is your infrastructure treated as code? Can you deploy a new environment within tens of minutes?
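The RPO part of the scenario is plain arithmetic and can be checked directly. The 24-hour backup interval and the 8-hour RPO come from the scenario; the variable names are just for this sketch.

```python
import math
from datetime import timedelta

backup_interval = timedelta(hours=24)  # backups are taken daily
required_rpo = timedelta(hours=8)      # target from the scenario

# Worst case, a failure happens just before the next backup runs,
# so up to one full backup interval of data is lost.
rpo_achievable = backup_interval <= required_rpo

# Evenly spaced backups needed per day to stay within the target.
min_backups_per_day = math.ceil(timedelta(days=1) / required_rpo)

print(rpo_achievable)       # False
print(min_backups_per_day)  # 3
```

Three evenly spaced backups give a worst case of exactly 8 hours, so keeping the loss strictly under 8 hours needs more frequent backups or continuous replication.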

TIPS TIME!

MTTR: How to reduce it?
● START SIMPLE
● CHECK FOR SOFTWARE LICENSING ISSUES
● PRACTICE “GAME DAY” EXERCISES

Practice Failure Through Chaos Engineering
Chaos engineering can answer critical questions...
● Did a system fail in the way you expected?
● Were you able to fix it promptly?
● What did the monitoring data look like?
● How long did it take for the service to be available again?

Train the entire team on different roles and functions
● Intensive cross-training across your engineering team reduces MTTR.
● Avoid burning out tech specialists by fostering a general understanding of how to resolve issues when an incident arises!

Follow up on incidents to uncover root causes
● What happened?
● How did it happen?
● Root causes?
● How can we prevent it?
Reducing MTTR

Calibrate your alerting tools
Programmatic alerting will help you sort through large amounts of information about your systems and develop clear plans for how to use the data.
Mean time to detection (MTTD): How long it takes you to detect the occurrence of a customer-impacting issue in your system.
The earlier you catch the problem, the sooner you can reduce your MTTR!
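A small sketch of how MTTD and MTTR could be computed from incident records. It assumes MTTR is measured from the start of customer impact to resolution, so detection time is part of it (definitions vary between teams); the `Incident` fields and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Incident:
    started: datetime   # customer impact begins
    detected: datetime  # alert fires or someone notices
    resolved: datetime  # service is available again


def mean(deltas: Iterable[timedelta]) -> timedelta:
    items = list(deltas)
    return sum(items, timedelta()) / len(items)


def mttd(incidents: Sequence[Incident]) -> timedelta:
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents: Sequence[Incident]) -> timedelta:
    # Assumption: the clock starts at the beginning of impact.
    return mean(i.resolved - i.started for i in incidents)


history = [
    Incident(datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 10),
             datetime(2019, 5, 1, 11, 0)),
    Incident(datetime(2019, 5, 9, 14, 0), datetime(2019, 5, 9, 14, 30),
             datetime(2019, 5, 9, 16, 0)),
]
print(mttd(history))  # 0:20:00
print(mttr(history))  # 1:30:00
```

Under this definition every minute removed from detection is also removed from MTTR, which is the point the slide makes.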

Create runbooks
● Incident response procedures
● Monitoring and alerting practices
● Creating runbooks

Focus on the correct fix, not the fastest one
When trying to reduce MTTR, resist the urge to take shortcuts by focusing on the correct fix.

Get up to 10% of your AWS bill on
AWS credits
to spend on your infrastructure!
nubego.io/aws-credits

Q/A
Wrap Up!
fernando@nubego.io
fernandohonig

We’re Hiring!
https://nubego.io
info@nubego.io
careers@nubego.io
+44 (0) 20 8123 5282
