
MTBF / MTTR - Energized Work TekTalk, Mar 2012


  1. MTBF / MTTR: Availability or recoverability? Presented by Michael Richardson, Energized Work, 21 March 2012. ENERGIZED WORK, 25 MACKLIN STREET, LONDON WC2B 5NN, +44 (0)20 7691 8933, WWW.ENERGIZEDWORK.COM
  2. Michael Richardson. Twitter: @mr_spb. Email: michael@energizedwork.com. #ewtektalk. © 2012 Energized Work - www.energizedwork.com
  3. So what is high availability?
     • Five nines?
     • No single point of failure?
     • Multiple data centres?
     • Fault tolerance?
     • Load balancing?
     • Uptime?
  4. Nines of availability
  5. Nines of availability
     Availability              Downtime per year
     One nine (90%)            36.5 days
     Two nines (99%)           3.65 days
     Three nines (99.9%)       8.76 hours
     Four nines (99.99%)       52.56 minutes
     Five nines (99.999%)      5.26 minutes
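
The downtime figures in the table above follow from simple arithmetic: the unavailable fraction multiplied by the hours in a year. The following is a minimal Python sketch, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative and does not come from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours per year a service may be down at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for percent in (90, 99, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(percent)
        # Print hours, days and minutes so each row of the table can be checked.
        print(f"{percent}%: {hours:.4f} h = {hours / 24:.4f} d = {hours * 60:.2f} min")
```

Read in whichever unit suits the row: days for one and two nines, hours for three nines, minutes for four and five nines.
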
  6. Problem with the nines
     • What do they mean?
     • Guaranteed or just an SLA?
     • Multiplicity (99.9% * 99.9% * 99.9% ≈ 99.7%)
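
The multiplicity point can be checked directly: when a request has to pass through several components in series, the combined availability is the product of the individual ones. This is a minimal sketch that assumes the components fail independently, which real systems may not satisfy; `serial_availability` is a hypothetical helper name.

```python
from math import prod
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Combined availability of components that must all be up at once.

    Assumes independent failures; each value is a fraction such as 0.999.
    """
    return prod(component_availabilities)


if __name__ == "__main__":
    combined = serial_availability([0.999, 0.999, 0.999])
    print(f"Three components at 99.9% each: {combined * 100:.1f}% combined")
```

Each additional component in the chain lowers the combined figure, so a stack of "three nines" parts does not deliver three nines end to end.
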
  7. SLA availability numbers just aim to provide a level of confidence in a website’s service
  8. No single point of failure (SPOF)
  9. Two of everything?
  10. Start with this: Users, Index.html
  11. End with this: Users, Firewall 1, Firewall 2, Switch 1, Switch 2, WEB1, WEB2, APP1, APP2, DB1, DB2
  12. Problems with eliminating SPOF
     • It’s expensive
     • Where do you draw the line?
     • Are failures independent?
     • Can you guarantee no SPOF?
     • Increased complexity
  13. Problem: Data centres fail
  14. Solution: Get a second data centre
  15. Hot – Hot multisite
     • Full range of services available in multiple locations
     • Easy to automate failover of sites
     • Data consistency is hard
     • Capacity planning concerns
  16. Hot – Warm multisite
     • Simpler than hot – hot
     • Read / Write ratio dependent
     • Synchronously or asynchronously replicate data?
  17. Hot – Cold multisite
     • Easy to set up
     • Will it work?
     • Can it be trusted?
     • Cold site rapidly becomes stale
     • Is it actually valuable?
  18. DR multisite
     • Fingers crossed you never need it
     • How can / should you test it?
     • Cloud?
  19. Problems with multiple sites
     • It’s expensive
     • Managing more systems
     • Managing data consistency
     • Managing capacity
     • Is it still fail proof?
     • Unless you test it, it’s just a plan
  20. We now have a complex system
  21. Complex systems
     • More redundancy and automation leads to more complexity
     • More complexity often adds more points of failure
  22. How complex systems fail - Dr. Richard Cook
     • Catastrophe is always just around the corner
     • Human operators have dual roles
     • Change introduces new forms of failure
  23. Failure and recovery
  24. Questions for the business
     • What is the cost of downtime?
     • What are the Recovery Time Objectives (RTO)?
     • What are the Recovery Point Objectives (RPO)?
  25. Aggressive RTO and RPO are expensive and have a performance impact
  26. RTO / RPO example. Problem:
     • Simple DB
     • Business can tolerate up to 15 minutes downtime
     • 10-minute window of data loss
  27. RTO / RPO example. Possible solution:
     • Continuously replicate data to second host
     • Continue with nightly backups and also copy DB transaction logs from the primary host to another system
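
One way to reason about the example above is to compare a candidate setup against the two stated limits: 15 minutes of downtime (RTO) and 10 minutes of data loss (RPO). The sketch below is illustrative only. The `RecoveryPlan` fields, their names and the sample values are assumptions, and worst-case data loss is approximated by the log copy interval, ignoring transfer time.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical description of a recovery setup; every field is an assumption."""
    log_copy_interval_min: float  # how often transaction logs leave the primary host
    detection_min: float          # time to notice that the primary has failed
    switchover_min: float         # time to bring the second host into service


def worst_case_data_loss_min(plan: RecoveryPlan) -> float:
    # Anything written since the last copied log is lost along with the primary.
    return plan.log_copy_interval_min


def worst_case_downtime_min(plan: RecoveryPlan) -> float:
    # Downtime runs from the failure until the second host is serving traffic.
    return plan.detection_min + plan.switchover_min


def meets_objectives(plan: RecoveryPlan, rto_min: float = 15.0,
                     rpo_min: float = 10.0) -> bool:
    """Check a plan against the RTO and RPO from the example (15 and 10 minutes)."""
    return (worst_case_downtime_min(plan) <= rto_min
            and worst_case_data_loss_min(plan) <= rpo_min)


if __name__ == "__main__":
    plan = RecoveryPlan(log_copy_interval_min=5, detection_min=3, switchover_min=8)
    print("Meets RTO and RPO:", meets_objectives(plan))
```

Note that detection time counts against the RTO, so slow monitoring can break the objective even when the switchover itself is fast.
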
  28. So what is more important: increasing availability or reducing recovery time?
  29. MTBF or MTTR? What about MTTD?
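
A common way to relate these terms is the steady-state formula availability = MTBF / (MTBF + MTTR), taking MTBF as the mean time the system runs between failures. The sketch below uses that formula, which the slides themselves do not state. Adding MTTD (mean time to detect) to the downtime is a further assumption made here for illustration, and the function names are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by uptime plus repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_with_detection(mtbf_hours: float, mttd_hours: float,
                                mttr_hours: float) -> float:
    # Assumption: time spent detecting a failure is downtime on top of repair time.
    return mtbf_hours / (mtbf_hours + mttd_hours + mttr_hours)


if __name__ == "__main__":
    baseline = availability(mtbf_hours=1000, mttr_hours=1)
    doubled_mtbf = availability(mtbf_hours=2000, mttr_hours=1)
    halved_mttr = availability(mtbf_hours=1000, mttr_hours=0.5)
    print(f"baseline:     {baseline:.6f}")
    print(f"doubled MTBF: {doubled_mtbf:.6f}")
    print(f"halved MTTR:  {halved_mttr:.6f}")
```

Under this formula only the ratio of MTTR to MTBF matters, so halving recovery time improves availability as much as doubling the time between failures. Which of the two is cheaper to achieve is a separate question.
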
  30. The answer is: It depends
  31. Failure is inevitable
  32. Ask anyone
  33. License
     This presentation is provided under the Creative Commons Attribution Share Alike 3.0 Unported License.
     You are free:
     • To share: to copy, distribute and transmit the work
     • To remix: to adapt the work
     Under the following conditions:
     • Attribution: You must attribute the work in the manner specified by Energized Work (but not in any way that suggests that Energized Work endorses you or your use of the work).
     • Share Alike: If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
