
OSDC 2019 | RTO & RPO – Best Practices in Hybrid Architectures by Fernando Honig

In this presentation we will explore best practices and how to plan a hybrid strategy, with disaster recovery planned according to the size of the event. We will discuss the different concepts: Availability, Backup, Recovery, RTO and RPO. We will cover how to replicate your storage, networking, databases and compute instances in the cloud, and the different types of recovery available in a hybrid architecture. We will go from a Pilot Light model to Multi-Site Active-Active and explain the cost and time to recover for each of these options.


  1. RTO & RPO: Best Practices in Hybrid Architectures. OSDC, May 2019. Fernando Hönig, fernando@nubego.io, fernandohonig
  2. RTO vs RPO: What is this? © 2019, nubeGO or its Affiliates. All rights reserved.
  3. RTO vs RPO: Apples vs Oranges. RTO (Recovery Time Objective): It defines how quickly you need to recover. It is the target time you set for the recovery. RPO (Recovery Point Objective): It is focused on data and your company’s loss tolerance in relation to your data. It is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
  4. RPO and RTO. RPO: The business can recover from losing (at most) the last 12 hours of data. RTO: The application can be unavailable for a maximum of 1 hour.
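
The two example objectives on this slide can be checked against a single incident with a few lines of Python. This is only a sketch: the function name `meets_objectives` and the incident timestamps are made up for illustration, and it assumes data loss is measured from the last good backup to the failure, and downtime from the failure to the restored service.

```python
from datetime import datetime, timedelta

# Objectives from the slide: at most 12 hours of data loss, at most 1 hour of downtime.
RPO = timedelta(hours=12)
RTO = timedelta(hours=1)


def meets_objectives(last_backup, failure, restored, rpo=RPO, rto=RTO):
    """Return (rpo_met, rto_met) for a single incident.

    Data loss is measured from the last good backup to the failure;
    downtime is measured from the failure to the restored service.
    """
    data_loss = failure - last_backup
    downtime = restored - failure
    return data_loss <= rpo, downtime <= rto


# Hypothetical incident: backup at 02:00, failure at 11:30, service back at 12:15.
incident = (
    datetime(2019, 5, 14, 2, 0),
    datetime(2019, 5, 14, 11, 30),
    datetime(2019, 5, 14, 12, 15),
)
print(meets_objectives(*incident))  # (True, True): 9.5 h of data lost, 45 min of downtime
```
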
  5. Availability Concepts. High Availability: Minimizing downtime for your application. Backup: Making your data safe. Disaster Recovery: Getting your applications and data back after a major disaster.
  6. What could go wrong? Small events: Instance restart failure; Application deployment failure. Large Scale events: Availability Zones down; Unavailable services. Colossal events: Unavailable region; Infrastructure destruction by error. How do we fix it? Quickly?
  7. Latest Events. Small events Large Scale events Colossal events Instance restart failure Application deployment failure GitHub S3 AZ Unavailable UK’s Petition System Unavailable Data Unavailable - Failed Backups GitLab Database Destruction
  8. DISASTER PLANNING: RECOVERY OPTIONS
  9. DISASTER PLANNING
  10. Operating System. Machine Images: Snapshot to other regions. Share it across your accounts/projects. UserData: Create scripts to execute during start up. Patch / Update your OS and stay up to date.
  11. Storage. Object storage: Replicate to other regions. Enable versioning. Block storage: Create point-in-time snapshots. Copy snapshots across regions and accounts.
  12. Machine Images and Snapshots. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/images/ami_lifecycle.png
  13. Networking. DNS: Enable health checks. Enable latency records. Load balancing: Failover options. Health checks with HTTP code. VPC: Extend your network to the cloud. Direct Connect: Enable fast and consistent replication/backup options from on-premises environments to the cloud.
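
The combination of HTTP health checks and DNS failover on this slide can be illustrated with a small decision function. This is a sketch under stated assumptions, not how Route 53 or any load balancer is implemented: the hostnames, the three-failure threshold, and the rule that 2xx/3xx status codes count as healthy are all chosen for the example.

```python
def is_healthy(status_codes, threshold=3):
    """Treat an endpoint as unhealthy once the last `threshold` checks all failed.

    A check counts as passed when its HTTP status code is in the 2xx/3xx range
    (an assumption made for this sketch).
    """
    recent = status_codes[-threshold:]
    if len(recent) < threshold:
        return True  # not enough evidence yet to declare a failure
    return any(200 <= code < 400 for code in recent)


def pick_dns_target(primary_checks,
                    primary="onprem.example.com",   # hypothetical primary record
                    secondary="cloud.example.com"):  # hypothetical failover record
    """Return the record DNS should answer with, given recent primary health checks."""
    return primary if is_healthy(primary_checks) else secondary


print(pick_dns_target([200, 200, 503]))        # onprem.example.com
print(pick_dns_target([200, 503, 503, 503]))   # cloud.example.com
```

Requiring several consecutive failures before failing over trades a slightly longer detection time for fewer false alarms.
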
  14. Databases (RDS). Snapshot data and save it in a separate region. Combine Read Replicas with Multi-AZ to build a resilient disaster recovery strategy.
  15. Infrastructure (IaC). Use templates to quickly deploy collections of resources as needed. Treat it as code, test it and deploy new changes with your application releases.
  16. BACKUP AND RESTORE
  17. Backup Phase. ● Take backups of current systems. ● Store backups in Object Storage Services. ● Describe procedure to restore from backup on Cloud. ● Know which machine template to use; build your own as needed. ● Know how to restore system from backups. ● Know how to switch to new system. ● Know how to configure the deployment.
  18. Backup Options. Files: NFS, SMB. Volumes: iSCSI. Tapes: iSCSI Virtual Tape Library.
  19. Hybrid Backup. https://d1.awsstatic.com/product-marketing/AWS%20Backup/product-page-diagram_aws_backup_hybrid.e5132f9c5fd6cd0299187d8d41147a3f7964d09a.png
  20. Restore Phase. In case of disaster… ● Retrieve backups from Object Storage. ● Bring up required infrastructure. ● Cloud instances with prepared machine images, Load Balancers, etc. ● Use infrastructure as code to automate deployment of core networking. ● Restore system from backup. ● Switch over to the new system. ● Adjust DNS records to point to the cloud systems.
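
One way to reason about a backup-and-restore plan is to time each restore step during a drill and compare the total with the RTO. The sketch below does that in Python; the step durations are hypothetical placeholders rather than measurements, and it assumes the steps run strictly one after another.

```python
from datetime import timedelta

# Restore steps from the slide, each with a hypothetical duration in minutes.
RESTORE_STEPS = [
    ("Retrieve backups from object storage", 15),
    ("Bring up required infrastructure", 10),
    ("Restore system from backup", 25),
    ("Switch over and adjust DNS records", 5),
]


def estimated_recovery_time(steps):
    """Sum the step durations, assuming the steps run sequentially."""
    return timedelta(minutes=sum(minutes for _, minutes in steps))


def fits_rto(steps, rto):
    """True when the sequential restore estimate is within the recovery time objective."""
    return estimated_recovery_time(steps) <= rto


print(estimated_recovery_time(RESTORE_STEPS))        # 0:55:00
print(fits_rto(RESTORE_STEPS, timedelta(hours=1)))   # True
```

Replacing the placeholder numbers with timings from a real restore drill shows which step dominates the recovery time.
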
  21. RECOVERY STRATEGIES
  22. Pilot Light (diagram). Labels: User or system; Amazon Route 53 hosted zone; Web Server; App Server; Database Server; DB; DB secondary; Data mirroring/replication; Not running.
  23. Pilot Light (diagram). Labels: User or system; Amazon Route 53 hosted zone; Web Server; App Server; DB; DB secondary; Data mirroring/replication; Starts in minutes.
  24. Pilot Light. Advantage: Very cost-effective (uses fewer 24/7 resources). Preparation Phase: Set up instances to replicate or mirror data. Ensure that you have all supporting custom software packages available in the cloud. Create and maintain Machine Images of key servers where fast recovery is required. Regularly run these servers, test them, and apply any software updates and configuration changes. Consider automating the provisioning of cloud resources.
  25. Pilot Light. In case of disaster… Automatically bring up resources around the replicated core data set. Scale the system as needed to handle current production traffic. Switch over to the new system. ● Adjust DNS records to point to the cloud. Objectives: RTO: As long as it takes to detect need for DR and automatically scale up replacement system. RPO: Depends on replication type.
  26. Fully Working Low-Capacity Standby (diagram). Labels: User or system; Amazon Route 53 hosted zone; Web server; App server; Database Server; DB; DB secondary; Auto Scaling; Low capacity; Data mirroring/replication.
  27. Fully Working Low-Capacity Standby (diagram). Labels: User or system; Amazon Route 53 hosted zone; Web server; App server; Database Server; DB; DB secondary; Low capacity; Data mirroring/replication.
  28. Fully Working Low-Capacity Standby. Advantages: Can take some production traffic at any time. Cost savings (IT footprint smaller than full DR). Preparation: Similar to Pilot Light. All necessary components running 24/7, but not scaled for production traffic. Best practice: continuous testing. ● “Trickle” a statistical subset of production traffic to DR site.
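
Sending a statistical subset of production traffic to the DR site can be sketched as a deterministic, hash-based split. This is an illustration only: the `route` function, the request ID format, and the site names are assumptions, and in practice this kind of split is often done with weighted DNS records or load balancer rules instead of application code.

```python
import hashlib


def route(request_id, dr_percent=1.0):
    """Send roughly `dr_percent` percent of requests to the DR site.

    Hashing the request ID keeps the decision deterministic, so the same
    ID always lands on the same site.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return "dr" if bucket < dr_percent * 100 else "primary"


# Hypothetical request IDs; about 5% of them should be routed to the DR site.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(r, dr_percent=5.0) == "dr" for r in ids) / len(ids)
print(f"share sent to DR: {share:.2%}")
```

Keeping a small, steady share of real traffic on the standby continuously tests that it actually works before a disaster forces the full failover.
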
  29. Fully Working Low-Capacity Standby. In case of disaster... Immediately fail over most critical production load. Adjust DNS records to point to the cloud. (Auto) Scale the system further to handle all production load. Objectives: RTO: For critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further. RPO: Depends on replication type.
  30. Multi-Site Active-Active (diagram). Labels: User or system; Amazon Route 53 hosted zone; Web server; App server; Database Server; DB; DB secondary; Full capacity; Data mirroring/replication.
  31. Multi-Site Active-Active. Advantages: At any moment, can take all production load. Preparation: Similar to low-capacity standby. Fully scaling in/out with production load. In case of disaster… Immediately fail over all production load. Objectives: RTO: As long as it takes to fail over. RPO: Depends on replication type.
  32. Recovery Strategies. Cost $: ▪ Lower priority use cases ▪ Solutions: Object Storage, Archive Storage. Cost $$: ▪ Meeting lower RTO and RPO requirements ▪ Core services ▪ Scale cloud resources in response to a DR event. Cost $$$: ▪ Solutions that require RTO and RPO in minutes ▪ Business-critical services. Cost $$$$: ▪ Auto-failover of your environment in the cloud to a running duplicate.
  33. SCENARIO TIME!
  34. Case Scenario #1. Bob is in charge of defining the best DR strategy for a hybrid architecture, and he did the setup based on the following requirements: We need to have an RTO of 60 minutes. Our backups are stored in the cloud and are taken daily. The RPO has to be less than 8 hours, and we need to be able to build a new environment quickly. Our application runs in the cloud, but our database is still in our local datacenter.
  35. Case Scenario. RTO = 1h, RPO = 8h. Can it be achieved?
  36. Case Scenario. Topics: Database, RTO/RPO, Code, On-prem. There is no certainty they can achieve a 1h RTO and an 8h RPO. Backups run daily, so the RPO can’t be 8h. How much time would it take to build a new DB and import the data? How much time would it take you to copy from the cloud to your on-prem DB? APP: Is your app code full of variables to cope with a change of endpoints? INFRA: Is your infrastructure treated as code? Can you deploy a new environment within tens of minutes?
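
The RPO argument in this scenario is simple arithmetic, sketched below in Python. The function names are illustrative. The sketch assumes evenly spaced backups, treats the objective as "at most 8 hours", and ignores the time a backup takes to complete or transfer.

```python
import math
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """A failure just before the next backup loses one full interval of data."""
    return backup_interval


def rpo_achievable(backup_interval, rpo):
    """True when the worst-case loss between backups stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def backups_per_day_needed(rpo):
    """Minimum number of evenly spaced backups per day to stay within the RPO."""
    return math.ceil(timedelta(days=1) / rpo)


daily = timedelta(days=1)
rpo = timedelta(hours=8)
print(rpo_achievable(daily, rpo))     # False: daily backups can lose up to 24 hours
print(backups_per_day_needed(rpo))    # 3
```
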
  37. TIPS TIME!
  38. MTTR: How to reduce it? Start simple. Check for software licensing issues. Practice “Game Day” exercises.
  39. Practice Failure Through Chaos Engineering. Chaos engineering can answer critical questions... Did a system fail in the way you expected? Were you able to fix it promptly? What did the monitoring data look like? How long did it take for the service to be available again?
  40. Train the entire team on different roles and functions. Intensive cross-training across your engineering team reduces MTTR. Avoid burning out tech specialists by fostering a general understanding of how to resolve issues when an incident arises!
  41. Follow up on incidents to uncover root causes. Reducing MTTR: What happened? How did it happen? Root causes? How can we prevent it?
  42. Calibrate your alerting tools. Programmatic alerting will help you sort through large amounts of information about your systems and develop clear plans for how to use the data. Mean time to detection (MTTD): How long it takes you to detect the occurrence of a customer-impacting issue in your system. The earlier you catch the problem, the sooner you can reduce your MTTR!
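
MTTD and MTTR can be computed from a simple incident log, as in the sketch below. The incident timestamps are invented for illustration, and MTTR is assumed here to run from the start of customer impact to resolution, so it includes the detection time.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: (impact started, detected, resolved).
incidents = [
    (datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 10), datetime(2019, 5, 1, 11, 0)),
    (datetime(2019, 5, 7, 14, 0), datetime(2019, 5, 7, 14, 20), datetime(2019, 5, 7, 14, 50)),
    (datetime(2019, 5, 9, 9, 0), datetime(2019, 5, 9, 9, 6), datetime(2019, 5, 9, 9, 40)),
]


def mean_delta(deltas):
    """Average a sequence of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(log):
    """Mean time to detection: from start of impact to detection."""
    return mean_delta([detected - started for started, detected, _ in log])


def mttr(log):
    """Mean time to recovery: from start of impact to resolution (includes detection)."""
    return mean_delta([resolved - started for started, _, resolved in log])


print(mttd(incidents))  # 0:12:00
print(mttr(incidents))  # 0:50:00
```

Because detection time is part of the recovery time under this definition, every minute shaved off MTTD comes straight off MTTR.
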
  43. Create runbooks. Incident response procedures. Monitoring and alerting practices. Creating runbooks.
  44. Focus on the correct fix, not the fastest one. When trying to reduce MTTR, resist the urge to take shortcuts and focus on the correct fix.
  45. Get up to 10% of your AWS bill in AWS credits to spend on your infrastructure! nubego.io/aws-credits
  46. Q/A Wrap Up! fernando@nubego.io fernandohonig
  47. We’re Hiring! https://nubego.io info@nubego.io careers@nubego.io +44 (0) 20 8123 5282
