OSDC 2019 | RTO & RPO – Best Practices in Hybrid Architectures by Fernando Honig
1. RTO & RPO
Best Practices
in Hybrid Architectures
OSDC - May 2019
Fernando Hönig
fernando@nubego.io
fernandohonig
2. RTO vs RPO
What is this?
© 2019, nubeGO or its Affiliates. All rights reserved.
3. RTO vs RPO
Apples vs Oranges
RTO (Recovery Time Objective): how quickly you need to recover. It is the target time you set for the recovery.
RPO (Recovery Point Objective): focused on data and your company’s loss tolerance in relation to your data. It is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
4. RPO and RTO
RPO: The business can recover from losing (at most) the last 12 hours of data.
RTO: The application can be unavailable for a maximum of 1 hour.
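As a minimal sketch of how the two targets on this slide could be checked, the snippet below compares a backup interval and an estimated recovery time against them. The function name `meets_objectives` and the 6-hour and 45-minute sample figures are assumptions for illustration; only the 12-hour RPO and 1-hour RTO come from the slide.

```python
from datetime import timedelta


def meets_objectives(backup_interval, estimated_recovery, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple periodic-backup setup.

    Worst case, a failure happens just before the next backup, so the
    data lost equals one full backup interval.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = estimated_recovery <= rto
    return rpo_ok, rto_ok


# Targets from the slide: RPO of 12 hours, RTO of 1 hour.
RPO = timedelta(hours=12)
RTO = timedelta(hours=1)

# Hypothetical setup: backups every 6 hours, recovery takes about 45 minutes.
result = meets_objectives(timedelta(hours=6), timedelta(minutes=45), RPO, RTO)
print(result)
```

The check deliberately uses the worst case for data loss; average loss is lower, but an objective is only met if the worst case fits inside it.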
5. AVAILABILITY CONCEPTS
HIGH Availability: Minimizing downtime for your application
Backup: Making your data safe
Disaster Recovery: Getting your applications and data back after a major disaster
6. What could go wrong?
HOW DO WE FIX IT? QUICKLY?
Small events
● Instance restart failure
● Application deployment failure
Large Scale events
● Availability Zones down
● Unavailable services
Colossal events
● Unavailable region
● Infrastructure destruction by error
7. Latest Events
Small events
Large Scale events
Colossal events
Instance restart failure
Application deployment failure
GitHub S3 AZ Unavailable
UK’s Petition System Unavailable
Data Unavailable - Failed Backups
GitLab Database Destruction
10. Operating System
Machine Images
● Snapshot to other regions
● Share it across your accounts/projects
UserData
● Create scripts to execute during start up
● Patch / Update your OS and stay up to date
11. Storage
Object storage
● Replicate to other regions
● Enable versioning
Block storage
● Create point-in-time Snapshots
● Copy snapshots across regions and accounts
12. Machine Images and Snapshots
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/images/ami_lifecycle.png
13. Networking
DNS
● Enable health checks
● Enable Latency records
Load balancing
● Failover options
● Health Checks with HTTP Code
VPC
● Extend your network to the Cloud
Direct Connect
● Enable fast and consistent replication/backup options from on-premises environments to the cloud
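The health-check and failover ideas on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular DNS or load-balancing service works: the names `is_healthy` and `pick_endpoint`, the "2xx or 3xx means healthy" policy, and the example.com endpoints are all hypothetical.

```python
def is_healthy(status_code):
    """Treat any 2xx or 3xx HTTP status as healthy (an assumed policy)."""
    return 200 <= status_code < 400


def pick_endpoint(endpoints, probe):
    """Return the first endpoint whose probe reports a healthy status.

    `endpoints` is ordered by preference (primary first). `probe` is any
    callable that takes an endpoint and returns an HTTP status code, or
    raises OSError when the endpoint cannot be reached.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(probe(endpoint)):
                return endpoint
        except OSError:
            continue  # an unreachable endpoint counts as unhealthy
    return None  # nothing healthy: no failover target available


# Hypothetical endpoints and a fake probe standing in for a real HTTP check.
statuses = {
    "https://onprem.example.com/health": 503,
    "https://cloud.example.com/health": 200,
}
active = pick_endpoint(list(statuses), statuses.__getitem__)
print(active)
```

Passing the probe in as a callable keeps the selection logic testable without network access; a real probe could wrap `urllib.request.urlopen`, whose `URLError` is a subclass of `OSError`.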
14. Databases
RDS
● Snapshot data and save it in a separate region
● Combine Read Replicas with Multi-AZ to build a resilient disaster recovery strategy
15. Infrastructure
IAC (Infrastructure as Code)
● Use templates to quickly deploy collections of resources as needed
● Treat it as code, test it and deploy new changes with your application releases
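To make the template idea concrete, here is a toy sketch in which one shared definition is rendered into several environments. It does not use any real infrastructure-as-code tool; `TEMPLATE`, `render_environment`, the resource counts, and the region names are hypothetical sample values.

```python
import copy

# A hypothetical template describing one environment's resources.
TEMPLATE = {
    "web_servers": 2,
    "app_servers": 2,
    "database": {"engine": "postgres", "multi_az": True},
}


def render_environment(name, region, overrides=None):
    """Return a new environment definition built from TEMPLATE."""
    env = copy.deepcopy(TEMPLATE)  # never mutate the shared template
    env.update(overrides or {})
    env["name"] = name
    env["region"] = region
    return env


production = render_environment("prod", "eu-west-1")
standby = render_environment(
    "dr", "eu-central-1", {"web_servers": 1, "app_servers": 1}
)
print(standby)
```

Because the definition is plain data produced by a function, it can be unit-tested and versioned alongside the application, which is the point of treating infrastructure as code.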
17. Backup Phase
● Take backups of current systems.
● Store backups in Object Storage Services.
● Describe procedure to restore from backup on Cloud.
● Know which machine template to use; build your own as needed.
● Know how to restore system from backups.
● Know how to switch to new system.
● Know how to configure the deployment.
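The backup and restore steps listed above can be rehearsed end to end on a small scale. The sketch below is an assumption-laden stand-in: a local directory plays the role of the object storage service, and the helper names `take_backup` and `restore_backup` are hypothetical. It archives a directory, records a checksum, and refuses to restore an archive whose checksum does not match.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def take_backup(source_dir, store_dir, name):
    """Archive source_dir into store_dir and return (archive_path, digest)."""
    archive = Path(store_dir) / f"{name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for item in sorted(Path(source_dir).iterdir()):
            tar.add(item, arcname=item.name)
    return archive, sha256_of(archive)


def restore_backup(archive, expected_digest, target_dir):
    """Verify the archive's checksum, then unpack it into target_dir."""
    if sha256_of(archive) != expected_digest:
        raise ValueError("backup archive is corrupt or was modified")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    source, store, target = (Path(tmp) / n for n in ("source", "store", "target"))
    for directory in (source, store, target):
        directory.mkdir()
    (source / "app.conf").write_text("endpoint=db.internal\n")

    archive, digest = take_backup(source, store, "daily")
    restore_backup(archive, digest, target)
    restored = (target / "app.conf").read_text()
    print(restored)
```

Running a restore as part of every backup job is one way to "know how to restore": a backup that has never been restored is only assumed to work.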
18. Backup Options
FILES: NFS, SMB
VOLUMES: iSCSI
TAPES: iSCSI Virtual Tape Library
19. Hybrid Backup
https://d1.awsstatic.com/product-marketing/AWS%20Backup/product-page-diagram_aws_backup_hybrid.e5132f9c5fd6cd0299187d8d41147a3f7964d09a.png
20. Restore Phase
In case of disaster…
● Retrieve backups from Object Storage.
● Bring up required infrastructure.
● Cloud instances with prepared machine images, Load Balancers, etc.
● Use infrastructure as code to automate deployment of core networking.
● Restore system from backup.
● Switch over to the new system.
● Adjust DNS records to point to the cloud systems.
24. Pilot Light
Advantage
● Very cost-effective (uses fewer 24/7 resources)
Preparation Phase
● Set up instances to replicate or mirror data.
● Ensure that you have all supporting custom software packages available in the cloud.
● Create and maintain Machine Images of key servers where fast recovery is required.
● Regularly run these servers, test them, and apply any software updates and configuration changes.
● Consider automating the provisioning of cloud resources.
25. Pilot Light
In case of disaster…
● Automatically bring up resources around the replicated core data set.
● Scale the system as needed to handle current production traffic.
● Switch over to the new system.
● Adjust DNS records to point to the cloud.
Objectives
● RTO: As long as it takes to detect the need for DR and automatically scale up the replacement system.
● RPO: Depends on replication type.
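The RTO statement for Pilot Light is effectively a sum: detection time plus the time to bring up and scale the replacement system. A minimal sketch of that arithmetic follows; the function name, the step names, and every duration are hypothetical placeholders, not measurements.

```python
from datetime import timedelta


def pilot_light_recovery_time(detection, scale_up_steps):
    """Estimate recovery time as detection time plus sequential scale-up steps."""
    return detection + sum(scale_up_steps.values(), timedelta())


# All durations below are hypothetical placeholders, not measurements.
steps = {
    "start instances from machine images": timedelta(minutes=10),
    "scale to production capacity": timedelta(minutes=15),
    "switch DNS records": timedelta(minutes=5),
}
estimate = pilot_light_recovery_time(timedelta(minutes=5), steps)
print(estimate)
```

The sketch assumes the steps run one after another; steps that can run in parallel would shorten the estimate, and the numbers are only trustworthy once they come from an actual failover rehearsal.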
26. Fully Working Low-Capacity Standby
[Diagram: a user or system reaches web, app, and database servers through an Amazon Route 53 hosted zone; a low-capacity standby with Auto Scaling runs alongside, with data mirroring/replication from the database server to a secondary DB.]
27. Fully Working Low-Capacity Standby
[Diagram: a user or system reaches web and app servers through an Amazon Route 53 hosted zone; a low-capacity standby runs alongside, with data mirroring/replication from the database server to a secondary DB.]
28. Fully Working Low-Capacity Standby
Advantages
● Can take some production traffic at any time.
● Cost savings (IT footprint smaller than full DR)
Preparation
● Similar to Pilot Light
● All necessary components running 24/7, but not scaled for production traffic
● Best practice: continuous testing
● “Trickle” a statistical subset of production traffic to the DR site.
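The continuous-testing practice above, sending a small statistical subset of production traffic to the DR site, can be sketched as a deterministic split on a request identifier. The function name `route`, the request IDs, and the 5% share are assumptions for illustration only.

```python
import hashlib


def route(request_id, dr_percent):
    """Send a stable dr_percent share of requests to the DR site.

    Hashing the request ID (rather than drawing a random number) keeps the
    decision deterministic, so the same ID always lands on the same site.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "dr" if bucket < dr_percent else "primary"


# Hypothetical request IDs; 5% is an assumed share, not a recommendation.
sample = [f"req-{i}" for i in range(10_000)]
dr_share = sum(route(r, 5) == "dr" for r in sample) / len(sample)
print(f"{dr_share:.1%} of sample requests went to the DR site")
```

In practice this kind of split is often configured at the DNS or load-balancer layer rather than in application code; the sketch only shows the underlying idea.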
29. Fully Working Low-Capacity Standby
In case of disaster...
● Immediately fail over most critical production load.
● Adjust DNS records to point to the cloud.
● (Auto) Scale the system further to handle all production load.
Objectives
● RTO: For critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further.
● RPO: Depends on replication type.
31. Multi-Site Active-Active
Advantages
● At any moment, can take all production load.
Preparation
● Similar to low-capacity standby.
● Fully scaling in/out with production load.
In case of disaster…
● Immediately fail over all production load.
Objectives
● RTO: As long as it takes to fail over.
● RPO: Depends on replication type.
32. Recovery Strategies
Cost: $
▪ Lower priority use cases
▪ Solutions: Object Storage, Archive Storage
Cost: $$
▪ Meeting lower RTO and RPO requirements
▪ Core services
▪ Scale cloud resources in response to a DR event
Cost: $$$
▪ Solutions that require RTO and RPO in minutes
▪ Business-critical services
Cost: $$$$
▪ Auto-failover of your environment in the cloud to a running duplicate
34. CASE SCENARIO #1
Bob is in charge of defining the best DR strategy for a hybrid architecture, and he did the setup based on the following requirements:
● We need to have an RTO of 60 minutes
● Our backups are stored in the cloud and are taken daily
● The RPO has to be less than 8 hours, and we need to be able to build a new environment quickly
● Our application runs in the cloud but our database is still in our local datacenter
35. CASE SCENARIO
RTO = 1h
RPO = 8h
CAN IT BE ACHIEVED?
36. CASE SCENARIO
RTO/RPO
● There is no certainty they can achieve a 1h RTO and an 8h RPO.
● Backups run daily, so the RPO can’t be 8h.
DATABASE
● How much time would it take to build a new DB and import the data?
ON PREM
● How much time would it take you to copy from the cloud to your on-prem DB?
CODE
● APP: Is your app code full of variables to cope with a change of endpoints?
● INFRA: Is your infrastructure treated as code? Can you deploy a new environment within tens of minutes?
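The RPO part of this scenario is simple arithmetic: with one backup a day, the worst-case data loss is 24 hours, which exceeds the 8-hour target. The sketch below works that out and derives how many evenly spaced backups per day would be needed. The function names are hypothetical, and the target is treated as "at most 8 hours" for simplicity.

```python
import math
from datetime import timedelta

ONE_DAY = timedelta(days=1)


def worst_case_data_loss(backups_per_day):
    """With evenly spaced backups, the worst case loses one full interval."""
    return ONE_DAY / backups_per_day


def backups_per_day_needed(rpo):
    """Smallest number of evenly spaced daily backups that keeps loss <= rpo."""
    return math.ceil(ONE_DAY / rpo)


rpo_target = timedelta(hours=8)            # requirement from the scenario
daily_loss = worst_case_data_loss(1)       # backups are taken daily
needed = backups_per_day_needed(rpo_target)

print(f"worst-case loss with daily backups: {daily_loss}")
print(f"meets RPO: {daily_loss <= rpo_target}, backups per day needed: {needed}")
```

The RTO side cannot be settled by arithmetic alone: it depends on the measured time to rebuild the database, move the data, and redeploy, which is what the questions on this slide are asking.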
38. MTTR: How to reduce it?
● START SIMPLE
● CHECK FOR SOFTWARE LICENSING ISSUES
● PRACTICE “GAME DAY” EXERCISES
39. Practice Failure Through Chaos Engineering
Chaos engineering can answer critical questions...
● Did a system fail in the way you expected?
● Were you able to fix it promptly?
● What did the monitoring data look like?
● How long did it take for the service to be available again?
40. Train the entire team on different roles and functions
Intensive cross-training across your engineering team helps reduce MTTR.
Avoid burning out tech specialists by fostering a general understanding of how to resolve issues when an incident arises!
41. Follow up on incidents to uncover root causes
● What happened?
● How did it happen?
● Root causes?
● How can we prevent it?
Reducing MTTR
42. Calibrate your alerting tools
Programmatic alerting will help you sort through large amounts of information about your systems and develop clear plans for how to use the data.
Mean time to detection (MTTD): How long it takes you to detect the occurrence of a customer-impacting issue in your system.
The earlier you catch the problem, the sooner you can reduce your MTTR!
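MTTD and MTTR are both averages over past incidents, so they can be computed directly from incident timestamps. The sketch below assumes each incident records when it started, when it was detected, and when service was restored; the `Incident` class and the sample incidents are hypothetical, and MTTR is measured here from the start of the incident, which is one of several definitions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the customer-impacting issue began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when the service was available again


def mean_delta(deltas):
    """Average a non-empty collection of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detection: average of (detected - started)."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recovery, measured here from start to resolution."""
    return mean_delta(i.resolved - i.started for i in incidents)


# Hypothetical incident log.
incidents = [
    Incident(datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 10),
             datetime(2019, 5, 1, 11, 0)),
    Incident(datetime(2019, 5, 2, 14, 0), datetime(2019, 5, 2, 14, 20),
             datetime(2019, 5, 2, 14, 40)),
]
print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Since detection time is part of every recovery, any reduction in MTTD shows up directly in MTTR under this definition.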
43. Create runbooks
● Incident response procedures
● Monitoring and alerting practices
● Creating runbooks
44. Focus on the correct fix—not the fastest one
When trying to reduce MTTR, resist the urge to take shortcuts and focus on the correct fix.
45. Get up to 10% of your AWS bill on AWS credits to spend on your infrastructure!
nubego.io/aws-credits
47. We’re Hiring!
https://nubego.io
info@nubego.io
careers@nubego.io
+44 (0) 20 8123 5282