RTO & RPO
Best Practices
in Hybrid Architectures
OSDC - May 2019
Fernando Hönig
fernando@nubego.io
fernandohonig
RTO vs RPO
What is this?
© 2019, nubeGO or its Affiliates. All rights reserved.
RTO vs RPO
Apples vs Oranges
RTO (Recovery Time Objective)
It defines how quickly you need to recover.
It is the target time you set for the recovery.
RPO (Recovery Point Objective)
It is focused on data and your company’s loss tolerance in relation to your data.
It is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
RPO and RTO
RPO: The business can recover from losing (at most) the last 12 hours of data.
RTO: The application can be unavailable for a maximum of 1 hour.
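A minimal sketch of how these two targets could be checked after an incident or a DR test. The 12-hour and 1-hour values come from the slide; the class name, function name, and the sample incident are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business (values taken from the slide)."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable downtime


def meets_objectives(objectives, data_lost, downtime):
    """Return (rpo_met, rto_met) for one incident or DR test."""
    return data_lost <= objectives.rpo, downtime <= objectives.rto


objectives = RecoveryObjectives(rpo=timedelta(hours=12), rto=timedelta(hours=1))

# Hypothetical incident: the last good backup was 9 hours old, recovery took 95 minutes.
rpo_met, rto_met = meets_objectives(
    objectives, data_lost=timedelta(hours=9), downtime=timedelta(minutes=95)
)
print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

Keeping the two checks separate mirrors the "apples vs oranges" point: an incident can satisfy one objective and miss the other.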
AVAILABILITY CONCEPTS
High Availability: Minimizing downtime for your application
Backup: Making your data safe
Disaster Recovery: Getting your applications and data back after a major disaster
What could go wrong?
Small events: Instance restart failure, Application deployment failure
Large Scale events: Availability Zones down, Unavailable services
Colossal events: Unavailable region, Infrastructure destruction by error
HOW DO WE FIX IT? QUICKLY?
Latest Events
Small events: Instance restart failure, Application deployment failure
Large Scale events: GitHub S3 AZ Unavailable, UK’s Petition System Unavailable
Colossal events: Data Unavailable - Failed Backups, GitLab Database Destruction
DISASTER PLANNING
RECOVERY OPTIONS
Operating System
Machine Images
● Snapshot to other regions
● Share it across your accounts/projects
UserData
● Create scripts to execute during start up
● Patch / Update your OS and stay up to date
Storage
Object storage
● Replicate to other regions
● Enable versioning
Block storage
● Create point-in-time Snapshots
● Copy snapshots across regions and accounts
Machine Images and Snapshots
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/images/ami_lifecycle.png
Networking
DNS
● Enable health checks
● Enable Latency records
Load balancing
● Failover options
● Health Checks with HTTP Code
VPC
● Extend your network to the Cloud
Direct Connect
● Enable fast and consistent replication/backup options from on-premises environments to the cloud
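A rough sketch of the failover decision behind "Health Checks with HTTP Code": answer with the primary endpoint while its last check returned a healthy status code, otherwise fall back to the secondary. The endpoint names, the documentation-range IP addresses, and the "2xx/3xx is healthy" rule are assumptions for illustration, not the behaviour of any particular DNS or load-balancing service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    address: str


def is_healthy(status_code):
    """Treat 2xx and 3xx responses as healthy; None means the check timed out."""
    return status_code is not None and 200 <= status_code < 400


def pick_endpoint(primary, secondary, last_status):
    """Return the endpoint to answer with, given the latest check results.

    last_status maps endpoint name -> HTTP status code (or None on timeout).
    """
    if is_healthy(last_status.get(primary.name)):
        return primary
    if is_healthy(last_status.get(secondary.name)):
        return secondary
    # Nothing is healthy: keep answering with the primary rather than with nothing.
    return primary


on_prem = Endpoint("on-prem", "203.0.113.10")  # hypothetical addresses
cloud = Endpoint("cloud", "198.51.100.20")

print(pick_endpoint(on_prem, cloud, {"on-prem": 200, "cloud": 200}).name)
print(pick_endpoint(on_prem, cloud, {"on-prem": 503, "cloud": 200}).name)
```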
Databases
RDS
● Snapshot data and save it in a separate region
● Combine Read Replicas with Multi-AZ to build a resilient disaster recovery strategy
Infrastructure
IaC (Infrastructure as Code)
● Use templates to quickly deploy collections of resources as needed
● Treat it as code, test it and deploy new changes with your application releases
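A small sketch of the "template" idea, assuming a single parameterised function stands in for a real infrastructure template. Every resource name, region name, and field below is hypothetical; a real setup would use the provider's own template format.

```python
import json


def render_stack(environment, region, web_instances):
    """Render one environment from a single parameterised template."""
    if web_instances < 1:
        raise ValueError("web_instances must be at least 1")
    return {
        "name": f"app-{environment}",
        "region": region,
        "resources": {
            "network": {"cidr": "10.0.0.0/16"},
            "load_balancer": {"health_check_path": "/health"},
            "web": {"count": web_instances, "image": "web-server-image"},
        },
    }


production = render_stack("prod", region="region-a", web_instances=4)
dr = render_stack("dr", region="region-b", web_instances=1)

# The DR stack must have the same shape as production; only parameters differ.
assert production["resources"].keys() == dr["resources"].keys()
print(json.dumps(dr, indent=2))
```

Because both environments come from the same template, a test like the shape check above can run with every application release.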
BACKUP AND RESTORE
Backup Phase
● Take backups of current systems.
● Store backups in Object Storage Services.
● Document the procedure to restore from backups in the cloud.
● Know which machine template to use; build your own as needed.
● Know how to restore system from backups.
● Know how to switch to new system.
● Know how to configure the deployment.
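"Know how to restore" is easiest to trust when it is tested. The sketch below takes a backup, restores it into a scratch directory, and compares checksums. It assumes a local zip file as a stand-in for an object storage upload; the file names and helper functions are hypothetical.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup(source_dir, archive_path):
    """Archive every file under source_dir (a stand-in for an upload to object storage)."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())


def restore_and_verify(archive_path, source_dir):
    """Restore into a scratch directory and compare every file's checksum with the source."""
    source_dir = Path(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(scratch)
        for original in source_dir.rglob("*"):
            if original.is_file():
                restored = Path(scratch) / original.relative_to(source_dir)
                if not restored.is_file() or sha256_of(restored) != sha256_of(original):
                    return False
    return True


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "app-data"
    source.mkdir()
    (source / "orders.csv").write_text("id,total\n1,9.99\n")
    archive_file = Path(workdir) / "backup.zip"
    backup(source, archive_file)
    backup_is_restorable = restore_and_verify(archive_file, source)

print("backup restorable:", backup_is_restorable)
```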
Backup Options
FILES: NFS, SMB
VOLUMES: iSCSI
TAPES: iSCSI Virtual Tape Library
Hybrid Backup
https://d1.awsstatic.com/product-marketing/AWS%20Backup/product-page-diagram_aws_backup_hybrid.e5132f9c5fd6cd0299187d8d41147a3f7964d09a.png
Restore Phase
In case of disaster…
● Retrieve backups from Object Storage.
● Bring up required infrastructure.
● Cloud instances with prepared machine images, Load Balancers, etc.
● Use infrastructure as code to automate deployment of core networking.
● Restore system from backup.
● Switch over to the new system.
● Adjust DNS records to point to the cloud systems.
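The restore steps above can be rehearsed as a timed drill, so the measured total can be compared with the RTO. This is a sketch only: the step functions are placeholders that sleep briefly, and the 1-hour RTO is an assumed target.

```python
import time
from datetime import timedelta


def run_restore_drill(steps, rto):
    """Run each (name, callable) step in order, time it, and compare the total with the RTO."""
    timings = []
    for name, action in steps:
        started = time.monotonic()
        action()
        timings.append((name, timedelta(seconds=time.monotonic() - started)))
    total = sum((elapsed for _, elapsed in timings), timedelta())
    return timings, total, total <= rto


def placeholder():
    """Stand-in for real work such as downloading a backup or applying a template."""
    time.sleep(0.01)


steps = [
    ("Retrieve backups from object storage", placeholder),
    ("Bring up required infrastructure", placeholder),
    ("Restore system from backup", placeholder),
    ("Switch over: adjust DNS records", placeholder),
]

timings, total, within_rto = run_restore_drill(steps, rto=timedelta(hours=1))
for name, elapsed in timings:
    print(f"{name}: {elapsed.total_seconds():.2f}s")
print("within RTO:", within_rto)
```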
RECOVERY STRATEGIES
Pilot Light
[Diagram: Pilot Light, normal operation. User or system → Amazon Route 53 hosted zone → primary site (Web Server, App Server, Database Server, DB). Data mirroring/replication feeds a secondary DB at the standby site, whose Web Server and App Server are not running.]
Pilot Light
[Diagram: Pilot Light, recovery. Labels: User or system, Amazon Route 53 hosted zone, Web Server and App Server (starts in minutes), DB and secondary DB linked by data mirroring/replication.]
Pilot Light
Advantage
● Very cost-effective (uses fewer 24/7 resources)
Preparation Phase
● Set up instances to replicate or mirror data.
● Ensure that you have all supporting custom software packages available in the cloud.
● Create and maintain Machine Images of key servers where fast recovery is required.
● Regularly run these servers, test them, and apply any software updates and configuration changes.
● Consider automating the provisioning of cloud resources.
Pilot Light
In case of disaster…
● Automatically bring up resources around the replicated core data set.
● Scale the system as needed to handle current production traffic.
● Switch over to the new system.
● Adjust DNS records to point to the cloud.
Objectives
● RTO: As long as it takes to detect the need for DR and automatically scale up the replacement system.
● RPO: Depends on replication type.
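A back-of-the-envelope sketch of these two objectives. It assumes the RTO is the sum of detection time, scale-up time, and a DNS TTL, and that the RPO is zero for synchronous replication and equal to the replication lag for asynchronous replication. All durations are hypothetical inputs, not measurements.

```python
from datetime import timedelta


def estimate_rto(detection, scale_up, dns_ttl):
    """RTO estimate: detect the need for DR, scale up, then wait for DNS caches to expire."""
    return detection + scale_up + dns_ttl


def estimate_rpo(replication, lag=timedelta(0)):
    """Worst-case data loss for a replication type ("sync" or "async")."""
    if replication == "sync":
        return timedelta(0)  # a write is acknowledged only once the replica has it
    if replication == "async":
        return lag  # anything not yet shipped to the replica is lost
    raise ValueError(f"unknown replication type: {replication!r}")


rto = estimate_rto(
    detection=timedelta(minutes=5),
    scale_up=timedelta(minutes=15),
    dns_ttl=timedelta(seconds=60),
)
rpo = estimate_rpo("async", lag=timedelta(seconds=30))
print(f"estimated RTO: {rto}, estimated RPO: {rpo}")
```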
Fully Working Low-Capacity Standby
[Diagram: Low-Capacity Standby, normal operation. User or system → Amazon Route 53 hosted zone → primary site (Web server, App server, Database Server, DB). A low-capacity copy (Web server, App server, Database Server) runs at the standby site with Auto Scaling; data mirroring/replication keeps the secondary DB current.]
Fully Working Low-Capacity Standby
[Diagram: Low-Capacity Standby, recovery. Labels: User or system, Amazon Route 53 hosted zone, Web servers, App servers, Database Servers, low-capacity site, DB and secondary DB linked by data mirroring/replication.]
Fully Working Low-Capacity Standby
Advantages
● Can take some production traffic at any time.
● Cost savings (IT footprint smaller than full DR)
Preparation
● Similar to Pilot Light
● All necessary components running 24/7, but not scaled for production traffic
● Best practice: continuous testing
● “Trickle” a statistical subset of production traffic to DR site.
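One way to send a statistical subset of traffic to the DR site is to hash a request identifier into a bucket and route a fixed fraction of buckets to DR. This is a sketch under assumptions: the 1% fraction, the request ID format, and the function name are hypothetical, and in practice weighted DNS or load balancer rules would usually do this job.

```python
import hashlib


def route(request_id, dr_fraction=0.01):
    """Send a stable, pseudo-random fraction of requests to the DR site."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "dr" if bucket < dr_fraction else "primary"


requests = [f"request-{i}" for i in range(100_000)]
dr_share = sum(route(r) == "dr" for r in requests) / len(requests)
print(f"share of traffic sent to DR: {dr_share:.2%}")
```

Hashing instead of calling a random number generator keeps the decision stable for a given request ID, which makes failures at the DR site easier to reproduce.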
Fully Working Low-Capacity Standby
In case of disaster...
● Immediately fail over most critical production load.
● Adjust DNS records to point to the cloud.
● (Auto) Scale the system further to handle all production load.
Objectives
● RTO: For critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further.
● RPO: Depends on replication type.
Multi-Site Active-Active
[Diagram: Multi-Site Active-Active. User or system → Amazon Route 53 hosted zone → two sites, each with Web servers, App servers, and Database Servers at full capacity. DB and secondary DB are linked by data mirroring/replication.]
Multi-Site Active-Active
Advantages
● At any moment, can take all production load.
Preparation
● Similar to low-capacity standby.
● Fully scaling in/out with production load.
In case of disaster…
● Immediately fail over all production load.
Objectives
● RTO: As long as it takes to fail over.
● RPO: Depends on replication type.
Recovery Strategies
Backup and Restore (Cost: $)
▪ Lower priority use cases
▪ Solutions: Object Storage, Archive Storage
Pilot Light (Cost: $$)
▪ Meeting lower RTO and RPO requirements
▪ Core services
▪ Scale cloud resources in response to a DR event
Fully Working Low-Capacity Standby (Cost: $$$)
▪ Solutions that require RTO and RPO in minutes
▪ Business-critical services
Multi-Site Active-Active (Cost: $$$$)
▪ Auto-failover of your environment in the cloud to a running duplicate
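A sketch of how the cost tiers could drive a choice: pick the cheapest strategy whose achievable RTO and RPO meet the targets. The strategy names and cost ordering follow the deck; the RTO/RPO figures attached to each strategy are HYPOTHETICAL placeholders and should be replaced with numbers measured in your own DR tests.

```python
from datetime import timedelta

# (name, relative cost tier, achievable RTO, achievable RPO) -- HYPOTHETICAL figures;
# replace them with numbers measured in your own DR tests.
STRATEGIES = [
    ("Backup and Restore", 1, timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", 2, timedelta(hours=1), timedelta(minutes=15)),
    ("Fully Working Low-Capacity Standby", 3, timedelta(minutes=10), timedelta(minutes=5)),
    ("Multi-Site Active-Active", 4, timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the lowest-cost strategy whose figures meet both targets, or None."""
    candidates = [s for s in STRATEGIES if s[2] <= rto_target and s[3] <= rpo_target]
    return min(candidates, key=lambda s: s[1])[0] if candidates else None


print(cheapest_strategy(timedelta(hours=1), timedelta(hours=8)))
print(cheapest_strategy(timedelta(minutes=5), timedelta(minutes=5)))
```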
SCENARIO TIME!
CASE SCENARIO #1
Bob is in charge of defining the best DR strategy for a hybrid architecture, and he did the setup based on the following requirements:
● We need to have an RTO of 60 minutes.
● Our backups are stored in the cloud and are taken daily.
● The RPO has to be less than 8 hours, and we need to be able to build a new environment quickly.
● Our application runs in the cloud, but our database is still in our local datacenter.
RTO = 1 h
RPO = 8 h
CASE SCENARIO
CAN IT BE ACHIEVED?
RTO/RPO
● There is no certainty they can achieve a 1 h RTO and an 8 h RPO.
● Backups run daily, so the RPO can’t be 8 h: up to 24 hours of data could be lost.
DATABASE
● How much time would it take to build a new DB and import the data?
ON PREM
● How much time would it take you to copy from the cloud to your on-prem DB?
CODE
● APP: Is your app code parameterized with variables to cope with a change of endpoints?
● INFRA: Is your infrastructure treated as code? Can you deploy a new environment within tens of minutes?
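The scenario's arithmetic can be sketched as follows. The daily backup interval and the 1 h / 8 h targets come from the scenario; the database size, transfer throughput, and provisioning time are hypothetical values chosen only to show the calculation.

```python
import math
from datetime import timedelta

rto_target = timedelta(hours=1)
rpo_target = timedelta(hours=8)
backup_interval = timedelta(hours=24)  # "backups are taken daily"

# A failure just before the next backup loses a full interval of data.
rpo_met = backup_interval <= rpo_target
backups_per_day_needed = math.ceil(timedelta(days=1) / rpo_target)


def restore_time(data_gb, throughput_mb_per_s, provisioning):
    """Time to build a new database and import the data over the network."""
    transfer_seconds = data_gb * 1024 / throughput_mb_per_s
    return provisioning + timedelta(seconds=transfer_seconds)


# Hypothetical figures: 500 GB database, 100 MB/s sustained, 15 min to provision.
estimated = restore_time(500, 100, provisioning=timedelta(minutes=15))
rto_met = estimated <= rto_target

print(f"RPO met with daily backups: {rpo_met} (need {backups_per_day_needed} backups/day)")
print(f"estimated restore time: {estimated}, RTO met: {rto_met}")
```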
TIPS TIME!
MTTR: How to reduce it?
● START SIMPLE
● CHECK FOR SOFTWARE LICENSING ISSUES
● PRACTICE “GAME DAY” EXERCISES
Practice Failure Through Chaos Engineering
Chaos engineering can answer critical questions...
● Did a system fail in the way you expected?
● Were you able to fix it promptly?
● What did the monitoring data look like?
● How long did it take for the service to be available again?
Train the entire team on different roles and functions
● Intensive cross-training across your engineering team helps reduce MTTR.
● Avoid burning out tech specialists by fostering a general understanding of how to resolve issues when an incident arises!
Follow up on incidents to uncover root causes
● What happened?
● How did it happen?
● Root causes?
● How can we prevent it?
Reducing MTTR
Calibrate your alerting tools
Programmatic alerting will help you sort through large amounts of information about your systems and develop clear plans for how to use the data.
Mean time to detection (MTTD): How long it takes you to detect the occurrence of a customer-impacting issue in your system.
The earlier you catch the problem, the sooner you can reduce your MTTR!
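A minimal sketch of computing both metrics from incident timestamps. The incidents are hypothetical sample data, and MTTR is assumed here to run from the start of the issue to its resolution (some teams measure it from detection instead).

```python
from datetime import datetime
from statistics import mean

# (started, detected, resolved) for each incident -- hypothetical sample data.
incidents = [
    (datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 12), datetime(2019, 5, 1, 11, 0)),
    (datetime(2019, 5, 9, 2, 30), datetime(2019, 5, 9, 2, 34), datetime(2019, 5, 9, 3, 0)),
]


def mean_minutes(deltas):
    """Average a list of timedeltas and express the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


mttd = mean_minutes([detected - started for started, detected, _ in incidents])
mttr = mean_minutes([resolved - started for started, _, resolved in incidents])
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Because detection time is contained in recovery time under this definition, every minute removed from MTTD also comes off MTTR.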
Create runbooks
● Incident response procedures
● Monitoring and alerting practices
● Creating runbooks
Focus on the correct fix—not the fastest one
When trying to reduce MTTR, resist the urge to take shortcuts and keep focusing on the correct fix.
Get up to 10% of your AWS bill on
AWS credits
to spend on your infrastructure!
nubego.io/aws-credits
Q/A
Wrap Up!
fernando@nubego.io
fernandohonig
We’re Hiring!
https://nubego.io
info@nubego.io
careers@nubego.io
+44 (0) 20 8123 5282

Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 

OSDC 2019 | RTO & RPO – Best Practices in Hybrid Architectures by Fernando Honig

  • 1. RTO & RPO Best Practices in Hybrid Architectures OSDC - May 2019 Fernando Hönig fernando@nubego.io fernandohonig
  • 2. RTO vs RPO: What is this? © 2019, nubeGO or its Affiliates. All rights reserved.
  • 3. RTO vs RPO: Apples vs Oranges. RTO: It defines how quickly you need to recover. It is the target time you set for the recovery. RPO: It is focused on data and your company’s loss tolerance in relation to your data. It is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
  • 4. RPO and RTO. RPO: The business can recover from losing (at most) the last 12 hours of data. RTO: The application can be unavailable for a maximum of 1 hour.
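As a rough illustration of the two targets on this slide, the sketch below measures an incident against them: the data-loss window is the time between the last good backup and the failure, and the downtime is the time between the failure and the restored service. The function names and the timestamps are hypothetical; only the 12-hour RPO and 1-hour RTO come from the slide.

```python
from datetime import datetime, timedelta

# Targets taken from the slide.
RPO = timedelta(hours=12)  # maximum tolerated data loss
RTO = timedelta(hours=1)   # maximum tolerated downtime


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Data written after the last backup is lost when the failure hits."""
    return failure - last_backup


def downtime(failure: datetime, restored: datetime) -> timedelta:
    """Time during which the application was unavailable."""
    return restored - failure


# Hypothetical incident timeline, for illustration only.
last_backup = datetime(2019, 5, 14, 2, 0)
failure = datetime(2019, 5, 14, 11, 30)
restored = datetime(2019, 5, 14, 12, 15)

loss = data_loss_window(last_backup, failure)
outage = downtime(failure, restored)

print(f"data loss window: {loss}, within RPO: {loss <= RPO}")
print(f"downtime: {outage}, within RTO: {outage <= RTO}")
```

The two checks are independent: an incident can stay within the RPO and still miss the RTO, or the other way around.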
  • 5. AVAILABILITY CONCEPTS. High Availability: Minimizing downtime for your application. Backup: Making your data safe. Disaster Recovery: Getting your applications and data back after a major disaster.
  • 6. What could go wrong? Small events: Instance restart failure; Application deployment failure. Large-scale events: Availability Zones down; Unavailable services. Colossal events: Unavailable region; Infrastructure destruction by error. HOW DO WE FIX IT? QUICKLY?
  • 7. Latest Events. Small events: Instance restart failure; Application deployment failure. Large-scale events: GitHub S3 AZ Unavailable; UK’s Petition System Unavailable. Colossal events: Data Unavailable - Failed Backups; GitLab Database Destruction.
  • 8. DISASTER PLANNING: RECOVERY OPTIONS
  • 9. DISASTER PLANNING
  • 10. Operating System. Machine Images: Snapshot to other regions; Share it across your accounts/projects. UserData: Create scripts to execute during start up; Patch / Update your OS and stay up to date.
  • 11. Storage. Object storage: Replicate to other regions; Enable versioning. Block storage: Create point-in-time snapshots; Copy snapshots across regions and accounts.
  • 12. Machine Images and Snapshots. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/images/ami_lifecycle.png
  • 13. Networking. DNS: Enable health checks; Enable latency records. Load balancing: Failover options; Health checks with HTTP code. VPC: Extend your network to the Cloud. Direct Connect: Enable fast and consistent replication/backup options from on-premises environments to the cloud.
  • 14. Databases. RDS: Snapshot data and save it in a separate region; Combine Read Replicas with Multi-AZ to build a resilient disaster recovery strategy.
  • 15. Infrastructure. IaC: Use templates to quickly deploy collections of resources as needed; Treat it as code, test it and deploy new changes with your application releases.
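The Networking slide mentions health checks based on HTTP codes combined with failover. A minimal sketch of that decision logic follows, assuming that any 2xx or 3xx status counts as healthy and that endpoints are tried in priority order. The endpoint names and the `pick_endpoint` helper are hypothetical and do not represent the API of any DNS or load-balancing service.

```python
from typing import Dict, List, Optional


def is_healthy(status_code: Optional[int]) -> bool:
    """Treat 2xx/3xx as healthy; None means the check timed out or failed."""
    return status_code is not None and 200 <= status_code < 400


def pick_endpoint(
    ordered_endpoints: List[str],
    statuses: Dict[str, Optional[int]],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in ordered_endpoints:
        if is_healthy(statuses.get(endpoint)):
            return endpoint
    return None


# Hypothetical endpoints: primary on-premises site, secondary in the cloud.
endpoints = ["onprem.example.com", "cloud.example.com"]

# The primary answers with HTTP 503, so traffic fails over to the cloud.
print(pick_endpoint(endpoints, {"onprem.example.com": 503,
                                "cloud.example.com": 200}))
```

In practice the status codes would come from periodic probes, and a real setup would usually require several consecutive failures before switching, to avoid flapping.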
  • 16. BACKUP AND RESTORE
  • 17. Backup Phase. ● Take backups of current systems. ● Store backups in Object Storage Services. ● Describe procedure to restore from backup on Cloud. ● Know which machine template to use; build your own as needed. ● Know how to restore system from backups. ● Know how to switch to new system. ● Know how to configure the deployment.
  • 18. Backup Options. Files: NFS, SMB. Volumes: iSCSI. Tapes: iSCSI Virtual Tape Library.
  • 19. Hybrid Backup. https://d1.awsstatic.com/product-marketing/AWS%20Backup/product-page-diagram_aws_backup_hybrid.e5132f9c5fd6cd0299187d8d41147a3f7964d09a.png
  • 20. Restore Phase. In case of disaster… ● Retrieve backups from Object Storage. ● Bring up required infrastructure. ● Cloud instances with prepared machine images, Load Balancers, etc. ● Use infrastructure as code to automate deployment of core networking. ● Restore system from backup. ● Switch over to the new system. ● Adjust DNS records to point to the cloud systems.
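The restore steps above are sequential, so they can be captured as an ordered runbook that stops at the first failure and records how long each step took. The sketch below is one possible shape for such a runner. The step functions are empty placeholders (assumptions, not real restore logic), and `run_runbook` is a hypothetical helper name.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]
Result = Tuple[str, float, bool]  # (step name, elapsed seconds, succeeded)


def run_runbook(steps: List[Step]) -> List[Result]:
    """Run the steps in order and stop at the first one that raises."""
    results: List[Result] = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        results.append((name, time.monotonic() - start, ok))
        if not ok:
            break  # later steps depend on this one, so do not continue
    return results


def placeholder() -> None:
    """Stand-in for real work; replace with actual restore logic."""


# Step names follow the Restore Phase slide.
RESTORE_STEPS: List[Step] = [
    ("Retrieve backups from Object Storage", placeholder),
    ("Bring up required infrastructure", placeholder),
    ("Restore system from backup", placeholder),
    ("Switch over to the new system", placeholder),
    ("Adjust DNS records to point to the cloud systems", placeholder),
]

for name, seconds, ok in run_runbook(RESTORE_STEPS):
    print(f"{'OK  ' if ok else 'FAIL'} {name} ({seconds:.2f}s)")
```

Summing the elapsed times from a rehearsal gives a measured recovery time that can be compared against the RTO.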
  • 21. RECOVERY STRATEGIES
  • 22. Pilot Light. [Diagram labels: Web Server, App Server, Database Server, DB, DB secondary, Data mirroring/replication, Not running, User or system, Amazon Route 53 hosted zone.]
  • 23. Pilot Light. [Diagram labels: Web Server, App Server, DB, DB secondary, Data mirroring/replication, Starts in minutes, User or system, Amazon Route 53 hosted zone.]
  • 24. Pilot Light. Advantage: Very cost-effective (uses fewer 24/7 resources). Preparation phase: Set up instances to replicate or mirror data. Ensure that you have all supporting custom software packages available in the cloud. Create and maintain Machine Images of key servers where fast recovery is required. Regularly run these servers, test them, and apply any software updates and configuration changes. Consider automating the provisioning of cloud resources.
  • 25. Pilot Light. In case of disaster… Automatically bring up resources around the replicated core data set. Scale the system as needed to handle current production traffic. Switch over to the new system. ● Adjust DNS records to point to the cloud. Objectives: RTO: As long as it takes to detect the need for DR and automatically scale up the replacement system. RPO: Depends on replication type.
  • 26. Fully Working Low-Capacity Standby. [Diagram labels: Web server, App server, Database Server, Auto Scaling, Low capacity, DB, DB secondary, Data mirroring/replication, User or system, Amazon Route 53 hosted zone.]
  • 27. Fully Working Low-Capacity Standby. [Diagram labels: Web server, App server, Database Server, Low capacity, DB, DB secondary, Data mirroring/replication, User or system, Amazon Route 53 hosted zone.]
  • 28. Fully Working Low-Capacity Standby. Advantages: Can take some production traffic at any time. Cost savings (IT footprint smaller than full DR). Preparation: Similar to Pilot Light. All necessary components running 24/7, but not scaled for production traffic. Best practice: continuous testing. ● “Trickle” a statistical subset of production traffic to the DR site.
  • 29. Fully Working Low-Capacity Standby. In case of disaster... Immediately fail over the most critical production load. ● Adjust DNS records to point to the cloud. (Auto) Scale the system further to handle all production load. Objectives: RTO: For critical load, as long as it takes to fail over; for all other load, as long as it takes to scale further. RPO: Depends on replication type.
  • 30. Multi-Site Active-Active. [Diagram labels: Web server, App server, Database Server, Full capacity, DB, DB secondary, Data mirroring/replication, User or system, Amazon Route 53 hosted zone.]
  • 31. Multi-Site Active-Active. Advantages: At any moment, can take all production load. Preparation: Similar to low-capacity standby. Fully scaling in/out with production load. In case of disaster… Immediately fail over all production load. Objectives: RTO: As long as it takes to fail over. RPO: Depends on replication type.
  • 32. Recovery Strategies. Cost $: ▪ Lower priority use cases ▪ Solutions: Object Storage, Archive Storage. Cost $$: ▪ Meeting lower RTO and RPO requirements ▪ Core services ▪ Scale cloud resources in response to a DR event. Cost $$$: ▪ Solutions that require RTO and RPO in minutes ▪ Business-critical services. Cost $$$$: ▪ Auto-failover of your environment in the cloud to a running duplicate.
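The cost tiers on this slide suggest a simple selection rule: pick the cheapest strategy whose recovery time still fits the RTO target. The sketch below assumes the four cost tiers correspond to the strategies in the order the deck presents them (Backup and Restore, Pilot Light, Low-Capacity Standby, Multi-Site Active-Active). The `estimated_rto` values are invented placeholders, not figures from the talk; real values should come from your own recovery tests.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_rank: int            # 1 = "$" ... 4 = "$$$$", as on the slide
    estimated_rto: timedelta  # ASSUMPTION: placeholder, measure your own


# Pairing of names and cost tiers is an assumption (deck order).
STRATEGIES: List[Strategy] = [
    Strategy("Backup and Restore", 1, timedelta(hours=24)),
    Strategy("Pilot Light", 2, timedelta(hours=4)),
    Strategy("Low-Capacity Standby", 3, timedelta(minutes=30)),
    Strategy("Multi-Site Active-Active", 4, timedelta(minutes=5)),
]


def cheapest_strategy(
    target_rto: timedelta,
    strategies: List[Strategy] = STRATEGIES,
) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets the RTO, or None."""
    candidates = [s for s in strategies if s.estimated_rto <= target_rto]
    return min(candidates, key=lambda s: s.cost_rank) if candidates else None


choice = cheapest_strategy(timedelta(hours=1))
print(choice.name if choice else "no strategy meets the target")
```

RPO is left out on purpose: per the slides it depends on the replication type rather than on the strategy tier.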
  • 33. SCENARIO TIME!
  • 34. CASE SCENARIO #1. Bob is in charge of defining the best DR strategy for a hybrid architecture, and he did the setup based on the following requirements: We need to have an RTO of 60 minutes. Our backups are stored in the cloud and are taken daily. The RPO has to be less than 8 hours, and we need to be able to build a new environment quickly. Our application runs in the cloud, but our database is still in our local datacenter.
  • 35. CASE SCENARIO. RTO = 1 h, RPO = 8 h. CAN IT BE ACHIEVED?
  • 36. CASE SCENARIO. There is no certainty they can achieve a 1 h RTO and an 8 h RPO. RTO/RPO: Backups run daily, so the RPO can’t be 8 h. DATABASE: How much time would it take to build a new DB and import the data? ON PREM: How much time would it take you to copy from the cloud to your on-prem DB? CODE: APP: Is your app code full of variables to cope with a change of endpoints? INFRA: Is your infrastructure treated as code? Can you deploy a new environment within tens of minutes?
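The reasoning in this scenario can be written down as arithmetic. With daily backups the worst-case data loss is 24 hours, which exceeds the 8-hour RPO no matter what else is done. For the RTO, the open questions on the slide become durations that have to be summed. The step durations below are invented purely for illustration; the scenario does not state them.

```python
from datetime import timedelta
from typing import Dict

# Requirements stated in the scenario.
RTO_TARGET = timedelta(minutes=60)
RPO_TARGET = timedelta(hours=8)
BACKUP_INTERVAL = timedelta(hours=24)  # "backups ... are taken daily"

# ASSUMPTION: illustrative durations only; measure these in a rehearsal.
RECOVERY_STEPS: Dict[str, timedelta] = {
    "copy backup from the cloud to on-prem": timedelta(minutes=40),
    "build a new DB and import the data": timedelta(minutes=45),
    "deploy a new environment from templates": timedelta(minutes=20),
    "repoint application endpoints": timedelta(minutes=5),
}


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses a full interval of data."""
    return backup_interval


def estimated_recovery_time(steps: Dict[str, timedelta]) -> timedelta:
    """Steps run one after another, so their durations add up."""
    return sum(steps.values(), timedelta())


rpo_ok = worst_case_data_loss(BACKUP_INTERVAL) <= RPO_TARGET
rto_ok = estimated_recovery_time(RECOVERY_STEPS) <= RTO_TARGET

print(f"RPO achievable: {rpo_ok}")
print(f"RTO achievable: {rto_ok} "
      f"(estimated {estimated_recovery_time(RECOVERY_STEPS)})")
```

To meet the RPO, the backup interval would have to drop to 8 hours or less, or the database would need continuous replication.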
  • 37. TIPS TIME!
  • 38. MTTR: How to reduce it? START SIMPLE. CHECK FOR SOFTWARE LICENSING ISSUES. PRACTICE “GAME DAY” EXERCISES.
  • 39. Practice Failure Through Chaos Engineering. Chaos engineering can answer critical questions... Did a system fail in the way you expected? Were you able to fix it promptly? What did the monitoring data look like? How long did it take for the service to be available again?
  • 40. Train the entire team on different roles and functions. Reducing MTTR: Intensive cross-training across your engineering team. Avoid burning out tech specialists by fostering a general understanding of how to resolve issues when an incident arises!
  • 41. Follow up on incidents to uncover root causes. Reducing MTTR: What happened? How did it happen? Root causes? How can we prevent it?
  • 42. Calibrate your alerting tools. Programmatic alerting will help you sort through large amounts of information about your systems and develop clear plans for how to use the data. Mean time to detection (MTTD): How long it takes you to detect the occurrence of a customer-impacting issue in your system. The earlier you catch the problem, the sooner you can reduce your MTTR!
  • 43. Create runbooks. Creating runbooks: Incident response procedures; Monitoring and alerting practices.
  • 44. Focus on the correct fix, not the fastest one. When trying to reduce MTTR... resist the urge to take shortcuts by focusing on the correct fix.
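The tips above revolve around MTTD and MTTR, which are easy to track once incidents are recorded with timestamps. A minimal sketch follows, assuming MTTR is measured from the start of the incident to its resolution (some teams measure from detection instead). The `Incident` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the customer impact began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when the service was available again


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty collection of durations."""
    values = list(deltas)
    return sum(values, timedelta()) / len(values)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detection: start of impact until detection."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to recovery: start of impact until resolution (assumption)."""
    return mean_delta(i.resolved - i.started for i in incidents)


# Hypothetical incident log, for illustration only.
incidents = [
    Incident(datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 10),
             datetime(2019, 5, 1, 11, 0)),
    Incident(datetime(2019, 5, 8, 14, 0), datetime(2019, 5, 8, 14, 30),
             datetime(2019, 5, 8, 16, 0)),
]

print(f"MTTD: {mttd(incidents)}")
print(f"MTTR: {mttr(incidents)}")
```

Because detection time is contained in recovery time under this definition, lowering MTTD lowers MTTR directly, which matches the point made on the alerting slide.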
  • 45. Get up to 10% of your AWS bill on AWS credits to spend on your infrastructure! nubego.io/aws-credits
  • 46. Q/A Wrap Up! fernando@nubego.io fernandohonig
  • 47. We’re Hiring! https://nubego.io info@nubego.io careers@nubego.io +44 (0) 20 8123 5282