  1. Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones (6/22/2011)
  2. What a country: entrepreneurial resiliency
  3. (true story) “robust systems: highly fault-tolerant, on or off grid. eg: our culture wrt entrepreneurs, AWS, the BD API”
  4. Boom
  5. Me, previous startup: teams in 3 countries; highly transactional system; MS tech: IIS/MS SQL Server; co-located, leased/owned hardware; 0% in cloud; $75M/yearly rev
  6. Me, current startup: systems 100% on AWS; 99% free/open-source software; standing on the shoulders of giants
  7. What happened: Regions and Zones (US-WEST, US-EAST)
  8. What happened in us-east: it’s all about the EBS (Elastic Block Store). Apologies for the artistic license, AWS. Diagram labels: EBS Cluster, Region US-EAST, Control plane services, Zones
  9. What happened in us-east: it’s all about the EBS (Elastic Block Store). Apologies for the artistic license, AWS. Diagram labels: EBS Cluster, ?, ‘re-mirroring storm’, Control plane services, Thread-starved, Regional API brown-out, Region/Zones
  10. Fault tolerance: 3 to 47 important failearnings and 4,369 less important ones
  11. In the context of our startup, of course. YMMV depending on velocity
  12. Ruger
  13. The Ruger Fault Equivalency: time = money; fault tolerance = time² - risk tolerance. Also known as: ‘Fast, good and cheap: pick two’
  14. System design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation quickly
  15. So how do you achieve the right level of fault tolerance in the cloud? 3 tenets
  16. Tenet #1: Scripted Repeatability. Tenet #2: SPOF Elimination. Tenet #3: Clear-Cut Communication
  17. Tenet #1: prepare a fault-tolerant foundation with scripted repeatability, aka automation
  18. Tenet #1, scripted repeatability: from the start, script the non-interactive install of your tools and OS; custom AMI; Debian: great package management; based on Eric Hammond’s work, http://alestic.com/
  19. Tenet #1, scripted repeatability: which will allow you to script the setup/tear-down of your stack
  20. Tenet #1, scripted repeatability: which will allow you to script system tests: integrity (3-4K tests), performance (30-40K tests), load and capacity (2-4M requests)
  21. Tenet #1, scripted repeatability: A/B system test results: MySQL Percona upgrade
  22. That’s how 1 person set up and managed a network comprised of 90+/- server instances for 1.5 years, while serving various other roles, without having to leave their chair. Try that with real hardware
  23. Tenet #2: SPOF Elimination. We don’t need no stinkin’ single points of failure.
  24. Tenet #2, SPOF elimination: SPOF examples: cloud provider, region, zone, load balancer, app server, database, Fred
  25. Tenet #2, SPOF elimination: cloud provider fail-over? E.g. AWS -> Rackspace
  26. Tenet #2, SPOF elimination: region fail-over? E.g. useast -> uswest within AWS. Nah.
  27. Tenet #2, SPOF elimination: zone fail-over? Yes. (US-WEST, US-EAST)
  28. Tenet #2, SPOF elimination: zone fail-over best practices: are you using auto-scaling? No: distribute server instances evenly between 2 or more zones. Yes: trigger scaling on network I/O or custom metrics
  29. Tenet #2, SPOF elimination: load-balancer (ELB), app server, database fail-over? Yes.
  30. Tenet #2, SPOF elimination: so it’s actually all about reduction of the right SPOFs for your business context. Just adding the ability to fail over and have backups within a region is huge! Probably enough for most. What about Fred?
  31. Tenet #3: Clear-Cut Communication
  32. Tenet #3, clear-cut communication: during an outage, communicating the right things at the right time: hard. But not that hard.
  33. Three tenets revisited. Tenet #1: Scripted Repeatability. Tenet #2: SPOF Elimination. Tenet #3: Clear-Cut Communication
  34. Thank you. Our AWS account rep: "Dylan Peterson" <dylanpet@amazon.com> (notes attached to this slide)
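
Slide 28's advice for stacks that do not use auto-scaling, distributing server instances evenly between two or more zones, can be illustrated with a small sketch. This is not BigDoor's actual tooling: the function names, instance names, and zone names below are hypothetical, and a real deployment would pass the chosen zone to whatever provisioning script launches the instance.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instance_names, zones):
    """Round-robin instances across zones so per-zone counts differ by at most one."""
    if len(zones) < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


def surviving_instances(assignment, failed_zone):
    """Return the instances that remain if one whole zone becomes unavailable."""
    return [name for name, zone in assignment.items() if zone != failed_zone]


# Hypothetical instance and zone names, for illustration only.
placement = assign_zones(
    [f"app-{i}" for i in range(6)],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
zone_counts = Counter(placement.values())
survivors = surviving_instances(placement, "us-east-1a")
```

With six instances over three zones, each zone holds two instances, so losing any single zone leaves four of the six running.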

Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Editor's notes

  1. Who here either works for or has used AWS? RightScale? Who has read and understood the full post-mortem for the April outage? (Post the slides somewhere, make them available, and note it in the presentation.)
  2. ‘What a country’: my dad always says this, and I like it. One of our principal investors, Brad Feld, was in our offices recently and asked how AWS was working out for us. I replied very much in the positive, with a few exceptions regarding their support services. That night at dinner, Brad was talking about how resilient our culture is for entrepreneurs: how we can fail and retry here in the United States, doing things that folks might get strung up for in other countries. The following night, I found myself exploring analogies between that idea and computing systems, and wound up pulling out my phone and typing up a Twitter post.
  3. It went something like this. This was going to be the brilliant culmination of my Twitter career to date. I was almost ready to hit the send button when I started getting alerts from our systems. The alerts were appearing literally right above what I had written: ‘system DOWN’. Oh, the irony. I wish I had a screenshot from my phone.
  4. That was the evening of 4/20, morning of 4/21: the AWS outage. As you can see, it made the NYT. It lasted for a number of days; our API was intermittently affected for about 12 hours, and that could have been mitigated. That outage totally sucked for many reasons. I’m hoping that by sharing some of my experience with AWS, you’ll gain some insights that may help you prepare adequately. I’m also hoping that this can turn into a conversation toward the end, so you can share your experiences as well.
  5. So before I go on, a bit about me: my name is Jeff Malek. I grew up in Colorado and graduated in ’93 from CU Boulder after 6 long years and a suspension, during which time I hitch-hiked around the country, winding up in Hawaii. I graduated, moved around, met some great friends, and helped to start up a company. I was at Zango for 10 years, responsible for engineering, QA and product development teams distributed across three countries: 50+ people who built and maintained the high-transaction system that resulted in $75M yearly revenue at its peak. That system leveraged the client-side software I wrote in the C programming language, which talked to backend systems built on Windows technology (.NET, IIS, MS SQL Server, etc.) sitting on co-located, purchased hardware.
  6. BigDoor: over 2 years old. Funded by Foundry and Brad Feld in 2010. If you’re familiar with airline point systems, you’re familiar with loyalty programs. BigDoor provides a platform that powers social loyalty programs and game mechanics for digital communities. Think of it in terms of sharing your points with your friends and leveling up in the process. It’s a free RESTful API that you can brand any way you want. I did a tech stack pivot: we built in the cloud on AWS using 99.99% free/open-source software, and our backend systems are primarily Django/Python. Even after the outage, I’m still a huge fan of AWS, and generally very impressed with what they’ve built and their speed of innovation. When was the last time you got a newsletter letting you know that a vendor’s pricing was going down?
  7. So what happened? Here’s some quick background. AWS Regions are areas geographically separated by large distances, and they contain Zones. In the US there are two Regions, us-east and us-west. ‘Zone’ is a euphemism for ‘data center’; each Region contains four data centers, in separate buildings.
  8. Here’s the region with four zones again. Within a zone, you can allocate block-level replicated storage that’s optimized for consistency and low-latency read/write access to and from EC2 instances, otherwise known as EBS (Elastic Block Store). These EBS volumes are stored and replicated between nodes within a cluster, multiple times, for durability and availability. If one replica becomes unavailable or out of sync, a new replica is provisioned automatically. This is called re-mirroring, and while it’s happening, access to that data is blocked for consistency. Old replicated blocks aren’t released until the new replica is confirmed. Within a cluster, nodes are connected to each other via two networks: one high-bandwidth backplane, and a lower-bandwidth overflow capacity network. The four zones, or data centers, are connected via control plane services that coordinate user requests for EBS resources.
  9. During scaling maintenance to upgrade primary network capacity, it’s standard practice to shift traffic away from the primary to another router, but someone routed traffic to the lower-capacity network, essentially flooding it. Many nodes got disconnected from other nodes in the cluster and couldn’t connect to their replicas. While the network was down, EBS API requests were queuing up, exacerbated by the fact that you can set a ‘wait-timeout’ on API requests. Then the primary network was restored. Affected nodes began trying to create replicas: the start of the ‘re-mirroring storm’. There was a bug that caused nodes to crash when closing large volumes of requests, resulting in more nodes needing to re-mirror; on top of that, nothing was metering back these requests as they failed repeatedly, with no exponential back-off. This exhausted the capacity of the cluster, putting about 13% of all volumes in a ‘stuck’ state. When it came back up, the regional control plane services were overloaded, and this is what made EBS services unavailable regionally.
  10. So that’s what I’m here to talk about: fault tolerance in the cloud
  11. I want to talk about all of this in the context of our startup, of course. Ultimately the AWS outage didn’t result in any major changes to the way we do things. While there were a few smaller things that we bumped up the priority chain, there’s a certain level of risk that a startup is willing to live with.
  12. My girlfriend Jenny and I got Ruger as a puppy, right when BigDoor started. We raised him from a puppy while building out our operational infrastructure, working out of our house. He’s a great dog, love him to death. So he’s kind of our mascot, and to help put things in context, I came up with a formula: the Ruger Fault Equivalency.
  13. In other words, given a low tolerance for risk, you can create a highly fault-tolerant system if you have a lot of time and/or money. That’s not BigDoor. Conversely, executing with a higher tolerance for risk gets you to market faster with less money, but with lower fault tolerance. For us, scalability is more important than extremely high fault tolerance. Startup = time^2 is low (little time and money). So, fun and interesting, but what does it mean in the context of BigDoor system design? TODO : add another pic, movie of play dead?
  14. I designed the BigDoor systems at a high level with this philosophy in mind. A bit more regarding our context:
     - Django/Python
     - Week-long sprints that end in production code release
     - 260G+ and growing transactional database, so still not that big
     - Peak so far: 18MM API requests/day, so still a ways to go
     - Response times need to be faster than 500ms
  15. OK, given that context, how do you achieve the right amount of fault tolerance in the cloud? Three basic tenets, and in the context of the AWS outage: the first sets a foundation for fault tolerance; the second leverages the first to improve fault tolerance; and the third will help keep your customers around when you are in crisis mode, ultimately also improving fault tolerance.
  16. Scripted repeatability. SPOF elimination. Clear-cut communication. (repeat)
  17. Nothing to see here, move along
  18. AMIs (Amazon Machine Images; install images, OS blueprints) are used to start new server instances.
     - Leverage pre-built AMIs
     - Debian has great package management: packages are verified and tested before making it into the main line, so there’s less to think about
     - Thank you, Eric Hammond
     A good best practice: use a single master AMI.
     - re-build it regularly via automation with new software, new package patches (apt) and your application code
     - we then tag per environment (test, staging, production)
     - switch services (Apache, MySQL) on and off during boot via init scripts
     Another good practice:
     - all app code and software config is checked out via SVN and baked into the AMI
     - svn up during boot via init scripts
     - this enables fast initialization during auto-scaling activities
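One way to picture the 'switch services on and off during boot' step with a single master AMI is the small Python sketch below. The role names, the service names and the `boot_plan` helper are hypothetical; the talk does not show the actual init scripts.

```python
# Every service baked into the single master AMI (assumed names).
SERVICES = ("apache2", "mysql")

# Hypothetical instance roles and the services each one should run.
ROLE_SERVICES = {
    "app": {"apache2"},
    "db": {"mysql"},
    "all-in-one": {"apache2", "mysql"},
}


def boot_plan(role):
    """Return (service, action) pairs an init script could apply at boot."""
    enabled = ROLE_SERVICES[role]  # an unknown role raises KeyError on purpose
    return [
        (service, "start" if service in enabled else "stop")
        for service in SERVICES
    ]
```

For example, `boot_plan("app")` starts Apache and stops MySQL, so the same image can serve as an application server or a database server depending on how the instance is tagged.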
  19. AWS has CloudFormation. They came out with that a few days after I’d finished pretty much doing the same. I wrapped the AWS command line tools in shell scripts. Since we’re a Python shop, we’re likely going to be using boto (which has matured quite a bit in the last two years) and fabric.
  20. Nothing to see here, move along
  21. Nothing to see here, move along
  22. Who knows what this is a picture of? That’s a picture of the IBM RAMAC, built in 1956, which had 5M of storage and weighed a ton. We’ve come a long way, baby!
  23. For anyone unfamiliar: if a system stops working when a part of it fails, that part is a single point of failure. So in every system there’s potential for many single points of failure, proportional to system complexity. Because of the Ruger Fault Equivalency, the idea is to pick the right SPsOF and eliminate (or at least mitigate) them. I used the word ‘elimination’ here, hoping that it would make some folks chuckle; it’s really not possible to eliminate all SPOFs. You can mitigate them, though. So here are some examples, and I’ll drill into which ones are critical in our context.
  24. If your cloud provider goes out of business, you’re hosed. SPOF. In AWS, a region is…etc. If a region disappears, you’re hosed. SPOF. Within regions are zones. If an entire zone fails, you’re hosed. SPOF. Same with load balancers, application servers, databases. And even Fred. If Fred is the only guy who knows your operational systems, and he trips over the extension cord, knocking himself out in the process, you’re hosed. SPOF. The critical ones in our context, and likely in many others: zones and everything below.
  25. Should you attempt to achieve high fault tolerance through cloud-cloud failover? Ruger Fault Equivalency says: cost prohibitive (time squared). RightScale, who provides a very cool cloud management system, apparently has some of this functionality, and will likely be the place to go for cloud-cloud fail-over in the future.
  26. Ruger Fault Equivalency says: ditto, cost prohibitive. If you try to migrate an ELB-balanced tech stack from one region to the next, you’ll learn:
     - Your ELB won’t be able to route traffic between regions
     - EIPs can’t be pointed from an instance in one region to another
     - Your custom useast (for example) AMI can’t be used in the new region
     - Your useast security groups can’t be used in the new region
     - Your snapshots can’t be used to create new volumes in the new region
     Do set up a DB replicant in another region, if possible.
  27. Ruger says : yes, even in light of the recent outage, that affected the entire useast region. It’s not cost-prohibitive, and you get data-center fail-over.What about the recent AWS outage? A human error caused a major problem in one zone that had a ripple effect into the other zones. But ultimately, downtime suffered was in proportion to how well you were already leveraging other zones, and how dependent you were on EBS volumes. If all of your eggs were in the wrong zone, or didn’t have the right backup strategy in place – totally screwed. Otherwise – not so bad!
  28. Our zone scenario, and why we were down intermittently for 12 hours during the AWS outage:
     - before the outage we had auto-scaling groups in two zones within a single region
     - at some point I brought everything into a single zone, while debugging odd performance between the two
     - I consciously de-prioritized revisiting that, in light of other priorities, figuring the single-zone group would at least scale with traffic
     - but I’d configured the groups with a trigger to auto-scale when CPU spiked
     - over time our application grew more resource efficient, which meant CPU wasn’t spiking, which meant we weren’t scaling with traffic
     - this led to the learning that it’s better to scale on network IO or, now that AWS supports them, custom scaling triggers
     - we’re in multiple zones again now; we recently saw the effects of an entire zone’s application server group going dark
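The CPU-versus-network-IO learning above can be shown with a toy Python sketch. All sample values and thresholds are made up for illustration, and `breaches` is a hypothetical stand-in for a scaling trigger, not the AWS API.

```python
def breaches(samples, threshold, periods=3):
    """True when the most recent `periods` samples all exceed `threshold`."""
    recent = samples[-periods:]
    return len(recent) == periods and all(value > threshold for value in recent)


# Hypothetical metrics from the same time window: traffic keeps climbing,
# but the application has become efficient enough that CPU barely moves.
cpu_percent = [35, 38, 40, 41, 42]
network_in_mb = [120, 180, 260, 310, 390]

cpu_trigger = breaches(cpu_percent, threshold=70)          # never fires
network_trigger = breaches(network_in_mb, threshold=250)   # fires
```

With these numbers the CPU trigger stays quiet while the network trigger fires, which is the situation the notes describe: traffic grew, but the group never scaled.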
  29. Ruger says: don’t even think about not doing it. What’s generally worked for us:
     - ELBs for same-region traffic distribution
     - auto-scaling groups to allow application server fail-over, within a zone and across them
     - replication to put secondary fail-over database servers in other zones within a region
  30. What about Fred? Cut Fred some slack for tripping over the extension cord, we all make mistakes. You need Fred. That is, assuming he communicates what happened widely. If he doesn’t, he’s going to suffer the wrath of his internal and external customers.
  31. Customers don’t need a ton of detail; they need status updates and anything actionable. Does open communication increase fault tolerance? I’d argue yes. Your customers will be more tolerant of your faults if you’re open and clear about them
  32. At BigDoor, if there’s a crisis, our standard operating procedure identifies a single person responsible for stopping the team on an hourly basis to get status and determine what should be communicated externally, if anything. As much as we love him, we don’t involve our lawyer in that conversation, by the way.
  33. In summary, these are the three tenets that I’m hoping will help you achieve the right amount of fault tolerance in the cloud: scripted repeatability, SPOF elimination and clear-cut communication. All three of these things are mentioned by AWS in one way or another in their post-mortems as things they planned on doing to mitigate this for themselves going forward, by the way, including the better communication. Thanks again WTIA, I’ll be around if anyone wants to talk more about this stuff later. I also have some notes that describe the good and bad about AWS, available online here : TODO
  34. AWS outage root cause analysis : http://aws.amazon.com/message/65648/
     Net Effects :
     - hours of high EBS API error and latency rates : 11 days before affected data made available again in affected zone : peak ‘stuck’ volumes in other zones : .07%
     - Ultimately .07% of volumes couldn’t be restored due to hardware failures
     - 45% of RDS single-zone instances affected at peak, .04% unrecoverable
     - 2.5% of multi-zone RDS didn’t fail over due to another bug
     Tools : the good and bad
     ELBs
     Good :
     - quick to configure, auto-scaling load-balancers
     - can be used for fail-over within a region
     Bad :
     - no logging
     - return 503s on error - you won't know unless you can monitor every request end to end (e.g. if there aren't instances that can service requests)
     - name servers disregarding ttls + auto-scaling = traffic routing issues
     - best practice : return custom HTTP headers in your response so that you can distinguish calls during support incidents
     - can't be used for failover between AWS regions; need separate DNS solution for funneling traffic
     AMIs (Amazon Machine Images; install images, OS blueprints)
     Good :
     - Leveraged pre-built Debian AMI
     - Debian has great package management, which can be scripted
     - packages are verified, tested before making it into the main line - less to think about
     - Thank you Eric Hammond, http://alestic.com/
     - scripted repeatability : script the non-interactive install of your tools
     - can be used to stand up instances within a region
     - best practice : single master AMI built on top of pre-existing, re-built regularly with new software, app code and patches, via automation. Tagged.
     - best practice : put app code, package configuration into SVN and include in your AMI, svn-up regularly or during instance start-up; faster for things like auto-scaling
     Bad :
     - Can't copy/port AMIs from region to region easily
     - Not having the entire process scripted from kernel means loss of flexibility (regional AMIs) and security
     - pitfall : easy to get off track.
       Didn't start out with a single script that installs everything or stay diligent about including everything? Have fun re-doing all that!
     Security tools
     - Great article : http://trust.cased.de/AMID
     - AMID script : http://code.google.com/p/amid/downloads/detail?name=AMID.py&can=2&q=
     EC2 instances
     Good :
     - Leverages AMIs
     - Obviously, script-able automated instance creation
     - EIPs allow for easy, dependable service re-routing from one instance to another
     - Security groups are an easy way to firewall (and tag, before they came out with those)
     - Zones allow easy fail-over within a geographic region (most of the time)
     - Regions provide the promise of fail-over between data centers more geographically separated (Virginia vs California)
     - Init scripts allow you to create/update on a per-instance basis
     Bad :
     - Security groups can't be added to or removed from an instance once it's running
     - best practice : use a different group for each narrower category, e.g. instead of 'database group', create groups for 'primary transactional db server in production', 'replicant...' etc
     - best practice : use a group that whitelists trusted IPs to give access to otherwise un-needed ports and services
     - Regions don't allow easy failover; EIPs can't be mapped between them (at least not programmatically)
     - Can't port AMIs from region to region easily, so setup to fail region-region is difficult
     EBS
     Good :
     - provides redundant storage for instances that can be snapshot-ed for easy backup and volume duplication within a region
     Bad :
     - volumes from snapshots can't be done between regions
     - data loss : it happened (not to us, fortunately) so be prepared and apply the amount of resources your risk tolerance allows
     - poor I/O in general, specifically writes; typically only has been an issue for us on our primary tx DB servers
     - best practice : RAID 0 array for MySQL data directory, but make sure it's replicated and backed up
     Auto-scaling
     Good :
     - n scaling groups in 1-4 zones behind an ELB; provides same-region fail-over
     - n# of instances in a scaling group
     - CloudWatch monitors provide great stats
     - previously, limited scaling triggers were provided; latest integrate CloudWatch much better, including custom metrics you define
     Bad :
     - learning : we had no baselines for when to scale on anything other than CPU utilization, which at the time was easy to differentiate; we spiked
     - application improvements fixed the spikes, which in return stopped auto-scaling triggers
     - need monitoring/alerting via nagios/other tool? figure out how to (de-)register new instances during scaling activities
     - this is changing - CloudWatch is getting better. do you trust amazon's monitoring/alerting on amazon's monitoring/alerting?
     EMR
     Good :
     - Great for async log analysis
     - what's worked for us : centralized log hosts; apache logs rotated via logrotate and rsync'd via cron, pre-processed, sync'd to S3 and drawn into EMR/Hive cluster for aggregations and reporting
     - Hive/HQL very similar to SQL
     Bad :
     - asynchronous, takes a fair amount of time to munge data
     S3
     Good :
     - Available from anywhere, any region
     - S3cmd is a great tool, for the most part
     Bad :
     - no full support for standard paths and directories…TBD
     CloudWatch
     Good :
     - can monitor various services and trigger/alert when thresholds are crossed (e.g. ELB network in)
     - new : auto-scaling can leverage triggers more broadly, custom metrics (new)
     Bad :
     - no built-in ability to trigger/alert based on % change from previous measurements
     - console reports/graphs need decoder tool and most recently, appear buggy. but they've made big steps forward.
     AWS APIs
     Good :
     - API wrappers provided; allow for cmd-line scripting
     - DRY : Can (and should) script most things that repeat, repeatable
     - All done via scripts : a bit about our process and how the cloud fits well
       - 1 week sprints - lockdown tuesdays, test overnight (uTEST), release wed
       - test first methodology
       - system tests for backend, other big changes, our API changes
       - Tested a new ver of MySQL (Percona, recommended): http://screencast.com/t/yVf5RnaUN9 and http://screencast.com/t/WJaL2qiSR
       - performance, integrity, load, capacity
       - these require full-stack stand-up/tear-down, including a 230G+ db backend
     Bad :
     - Keep your eye out for library updates (why not open-source these things? Verify they’re not already…)
     - Scripts, wrappers trail AWS innovation, which is fast
     - BASH isn't as well-known or readable as Python, for example - maintainability
     - scripted stuff bakes you in a bit; no way around this w/out baking yourself into RightScale or some other solution anyway though
     - API key management : not straight-forward
     - API keys aren't portable between regions; region-region fail-over not as easy as it sounds. not rocket science, either. Bake region 1’s keys into region 2’s new AMI
     APIs - General
     - Build things test first, run integrity tests before pushing out changes to your API
     - Don't version; make it backwards-compatible
     We try to keep away from anything that’s going to lock us in too much
     - We continue to shy away from SQS (simple queuing service), RDS (relational database service), SimpleDB (non-relational datastore)
     - SQS, SimpleDB proprietary; would prefer to avoid lock-in for these things and their need hasn't been high enough for us yet
     - RDS : doesn't provide enough flexibility for us. would love to use it as a replicant pool for reads/reporting though. can't.
     - multi-zone RDS suffered one of the biggest hits during recent AWS outage
     What we're looking forward to leveraging
     - New CW status, PUTs, scaling triggers from them
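The ELB best practice in these notes, returning custom HTTP headers so calls can be told apart during support incidents, could look roughly like the WSGI middleware below. This is a sketch only: the header name `X-Served-By` and the `add_instance_header` helper are assumptions, not something the notes specify.

```python
import socket


def add_instance_header(app, instance_name=None):
    """Wrap a WSGI app so every response names the instance that served it."""
    name = instance_name or socket.gethostname()

    def middleware(environ, start_response):
        def tagged_start_response(status, headers, exc_info=None):
            # Append the (assumed) identifying header to whatever the app set.
            tagged = list(headers) + [("X-Served-By", name)]
            return start_response(status, tagged, exc_info)

        return app(environ, tagged_start_response)

    return middleware
```

Since Django exposes a standard WSGI application object, a wrapper of this kind could sit in front of it; a response that arrives without the header then points at the load balancer rather than an application server.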