Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Jeff Malek

All about the April 2011 AWS outage, its causes, effects and ways to mitigate the same sorts of issues in the future.

Technology Business

Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

1. Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones (6/22/2011)

2. What a country: entrepreneurial resiliency

3. (true story) “robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”

4. Boom

5. me: previous startup. Teams in 3 countries; highly transactional system; MS tech: IIS/MS SQL Server; co-located, leased/owned hardware; 0% in cloud; $75M/yearly rev

6. me: current startup. Systems 100% on AWS; 99% free/open-source software; standing on the shoulders of giants

7. What Happened. Regions and Zones: US-WEST, US-EAST

8. What Happened in us-east. It’s all about the EBS (Elastic Block Store); apologies for the artistic license, AWS. Diagram labels: EBS Cluster, Region US-EAST, Control plane services, Zones

9. What Happened in us-east. It’s all about the EBS (Elastic Block Store); apologies for the artistic license, AWS. Diagram labels: EBS Cluster, ?, ‘re-mirroring storm’, Control plane services, Thread-starved, Regional API brown-out, Region/Zones

10. fault tolerance: 3 to 47 important failearnings and 4,369 less important ones

11. in the context of our startup, of course. YMMV depending on velocity

12. Ruger

13. The Ruger Fault Equivalency: time = money; fault tolerance = time² - risk tolerance. Also known as: ‘Fast, good and cheap: pick two’

14. system design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation quickly

15. So how do you achieve the right level of fault tolerance in the cloud? 3 tenets

16. Tenet #1: Scripted Repeatability. Tenet #2: SPOF Elimination. Tenet #3: Clear-Cut Communication

17. Tenet #1: prepare a fault-tolerant foundation with scripted repeatability, aka automation

18. Tenet #1: scripted repeatability. From the start: script the non-interactive install of your tools and OS; custom AMI; Debian: great package management; based on Eric Hammond’s work, http://alestic.com/

19. Tenet #1: scripted repeatability, which will allow you to script the setup/tear-down of your stack

20. Tenet #1: scripted repeatability, which will allow you to script system tests: integrity (3-4K tests); performance (30-40K tests); load, capacity (2-4M requests)

21. Tenet #1: scripted repeatability. A/B system test results: MySQL Percona Upgrade

22. That’s how 1 person set up and managed a network comprised of 90+/- server instances for 1.5 years, while serving various other roles, without having to leave their chair. Try that with real hardware.

23. Tenet #2: SPOF Elimination. We don’t need no stinkin single points of failure.

24. Tenet #2: SPOF Elimination. SPOF examples: Cloud Provider, Region, Zone, Load Balancer, App Server, Database, Fred

25. Tenet #2: SPOF Elimination. Cloud Provider fail-over? e.g. AWS -> Rackspace

26. Tenet #2: SPOF Elimination. Region fail-over? e.g. useast -> uswest within AWS. Nah.

27. Tenet #2: SPOF Elimination. Zone fail-over? Yes. (US-WEST, US-EAST)

28. Tenet #2: SPOF Elimination. Zone fail-over best practices: are you using auto-scaling? No: distribute server instances evenly between 2 or more zones. Yes: trigger scaling on network I/O or custom metrics.

29. Tenet #2: SPOF Elimination. Load-balancer (ELB), app server, database fail-over? Yes.

30. Tenet #2: SPOF Elimination. So it’s actually all about reduction of the right SPOFs for your business context. Just adding the ability to fail over and have backups within a region is huge! Probably enough for most. What about Fred?

31. Tenet #3: Clear-Cut Communication

32. Tenet #3: Clear-Cut Communication. During an outage, communicating the right things at the right time: hard. But not that hard.

33. Three Tenets Revisited. Tenet #1: Scripted Repeatability. Tenet #2: SPOF Elimination. Tenet #3: Clear-Cut Communication

34. Thank You. Our AWS account rep: "Dylan Peterson" <dylanpet@amazon.com> (notes attached to this slide)

Editor's notes

Who here either works for or has used AWS? RightScale? Who has read and understood the full post mortem for the April outage? Post slides to somewhere, make available and note in preso.
‘What a country’: my dad always says this, I like it. So, one of our principal investors, Brad Feld, was in our offices recently, and was asking how AWS was working out for us. I'd replied very much in the positive, with a few exceptions regarding their support services. That night at dinner Brad was talking about how resilient our culture is for entrepreneurs; how we can fail and retry here in the United States, doing things that folks might get strung up for in other countries. The following night, I found myself exploring analogies between that idea and computing systems, and wound up pulling out my phone and typing up a Twitter post.
It went something like this. This was going to be the brilliant culmination of my Twitter career to date. I was almost ready to hit the send button when I started getting alerts from our systems. The alerts were appearing literally right above what I had written: ‘system DOWN’. Oh, the irony. Wish I had a screenshot from my phone.
That was the evening of 4/20, morning of 4/21: the AWS outage. As you can see, it made the NYT. It lasted for a number of days; our API was intermittently affected for about 12 hours; that could have been mitigated. That outage totally sucked for many reasons. I’m hoping that by sharing some of my experience with AWS, you’ll gain some insights that may help you prepare adequately. Also hoping that this can turn into a conversation toward the end, so you can share your experiences as well.
So before I go on, a bit about me: my name is Jeff Malek. Grew up in Colorado, graduated in ’93 from CU Boulder after 6 long years and a suspension, during which time I hitch-hiked around the country, winding up in Hawaii. Graduated, moved around, met some great friends, helped to start up a company. Was at Zango for 10 years, responsible for engineering, QA and product development teams distributed across three countries: 50+ people who built and maintained the high-transaction system that resulted in $75M yearly revenue at its peak. It leveraged the client-side software I wrote in the C programming language, which talked to backend systems built on Windows technology (.NET, IIS, MS SQL Server, etc.), which were sitting on co-located, purchased hardware.
BigDoor: over 2 years old. Funded by Foundry and Brad Feld in 2010. If you’re familiar with airline point systems, you’re familiar with loyalty programs. BigDoor provides a platform that powers social loyalty programs and game mechanics for digital communities. Think of it in terms of sharing your points with your friends and leveling up in the process. Free RESTful API that you can brand any way you want. I did a tech stack pivot; built in the cloud on AWS using 99.99% free/open-source software; our backend systems are primarily Django/Python. Even after the outage, still a huge fan of AWS, generally very impressed with what they’ve built and their speed of innovation. When was the last time you got a newsletter letting you know that a vendor’s pricing was going down?
So what happened? Here’s some quick background. AWS Regions are areas geographically separated by large distances, and contain Zones. In the US there are two Regions, us-east and us-west. ‘Zone’ is a euphemism for ‘data center’; each Region contains four data centers, in separate buildings.
Here’s the region with four zones again. Within a zone, you can allocate block-level replicated storage that’s optimized for consistency and low-latency read/write access to/from EC2 instances, otherwise known as EBS (Elastic Block Store). These EBS volumes are stored and replicated between nodes within a cluster, multiple times, for durability and availability. If one replica becomes unavailable or out of sync, a new replica is provisioned automatically. This is called re-mirroring, and while it’s happening, access to that data is blocked for consistency. Old replicated blocks aren’t released until the new replica is confirmed. Within a cluster, nodes are connected to each other via two networks: one high-bandwidth backplane, and a lower-bandwidth overflow-capacity network. The four zones, or data centers, are connected via control plane services that coordinate user requests for EBS resources.
During scaling maintenance to upgrade primary network capacity, it’s standard practice to shift traffic away from the primary to another router, but someone routed traffic to the lower-capacity network, essentially flooding it. Many nodes got disconnected from other nodes in the cluster and couldn’t connect to their replicas. While the network was down, EBS API requests were queuing up, exacerbated by the fact that you can set a ‘wait-timeout’ on API requests. Then the primary network was restored. Affected nodes began trying to create replicas: the start of the ‘re-mirroring storm’. There was a bug that caused nodes to crash when closing large volumes of requests, resulting in more nodes needing to re-mirror; on top of that, nothing was metering back these requests as they failed repeatedly: no exponential back-off. This exhausted the capacity of the cluster, putting about 13% of all volumes in a ‘stuck’ state. When it came back up, the regional control plane services were overloaded, and this is what made EBS services unavailable regionally.
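
For anyone who hasn't written one: here is a minimal sketch of the exponential back-off that was missing during the re-mirroring storm, assuming a generic `request` callable. The function name, the default delays and the jitter choice are my own illustration, not anything taken from AWS code.

```python
import random
import time


def call_with_backoff(request, attempts=6, base=0.5, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() until it succeeds or the retry budget is spent.

    The wait ceiling doubles after every failure (base, 2*base, 4*base, ...)
    up to `cap`, and the actual wait is a random fraction of that ceiling so
    that many failing clients do not all retry at the same moment.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request()
        except Exception:  # illustration only: real code should catch retryable errors
            if attempt == attempts - 1:
                raise  # budget spent, surface the last error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters are there only so the behaviour can be exercised without actually waiting.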
So that’s what I’m here to talk about: fault tolerance in the cloud
I want to talk about all of this in the context of our startup, of course. Ultimately the AWS outage didn’t result in any major changes to the way we do things. While there were a few smaller things that we bumped up the priority chain, there’s a certain level of risk that a startup is willing to live with.
My girlfriend Jenny and I got Ruger as a puppy, right when BigDoor started. Raised him from a puppy while building out our operational infrastructure, working out of our house. He’s a great dog, love him to death. So he’s kind of our mascot, and to help put things in context, I came up with a formula: The Ruger Fault Equivalency.
IOW, given a low tolerance for risk, you can create a highly fault-tolerant system if you have a lot of time and/or money. That’s not BigDoor. Conversely, executing with a higher tolerance for risk gets you to market faster with less money, but with lower fault tolerance. For us, scalability is more important than extremely high fault tolerance. Startup = time^2 is low (little time and money). So, fun and interesting, but what does it mean in the context of BigDoor system design? TODO: add another pic, movie of play dead?
I designed the BigDoor systems at a high level with this philosophy in mind. A bit more regarding our context: Django/Python. Week-long sprints that end in production code release. 260G+ and growing transactional database, so still not that big. Peak so far: 18MM API requests/day, so still a ways to go. Response times need to be faster than 500ms.
OK, given that context: how do you achieve the right amount of fault tolerance in the cloud? Three basic tenets, and in the context of the AWS outage: the first sets a foundation for fault tolerance; the second leverages the first to improve fault tolerance; and the third will help keep your customers around when you are in crisis mode, ultimately also improving fault tolerance.
Scripted repeatability. SPOF elimination. Clear-cut communication. (repeat)
Nothing to see here, move along
AMIs (Amazon Machine Images; install images, OS blueprints) are used to start new server instances. Leverage pre-built AMIs. Debian has great package management; packages are verified and tested before making it into the main line, so there’s less to think about. Thank you, Eric Hammond. A good best practice: use a single master AMI, rebuilt regularly via automation with new software, new package patches (apt) and your application code. We then tag per environment (test, staging, production) and switch services (Apache, MySQL) on and off during boot via init scripts. Another good practice: all app code and software config is checked out via SVN and baked into the AMI; svn up during boot via init scripts. This enables fast initialization during auto-scaling activities.
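
To make the single-master-AMI idea concrete, here is a rough sketch of what a boot-time init script might decide. The role names, the role-to-service table and the `/srv/app` checkout path are all made up for illustration; the only things taken from the notes are `svn up` at boot and switching Apache and MySQL on or off.

```python
# Hypothetical role -> services table; the role names are illustrative only.
SERVICES_BY_ROLE = {
    "app": {"apache2"},
    "db": {"mysql"},
    "all-in-one": {"apache2", "mysql"},
}
ALL_SERVICES = {"apache2", "mysql"}


def boot_plan(role):
    """Return (services_to_start, services_to_stop) for an instance role."""
    if role not in SERVICES_BY_ROLE:
        raise ValueError("unknown role: %r" % (role,))
    wanted = SERVICES_BY_ROLE[role]
    return sorted(wanted), sorted(ALL_SERVICES - wanted)


def boot_commands(role, checkout_dir="/srv/app"):
    """Build the commands a boot script would run, in order.

    First refresh the baked-in checkout, then stop what this role does not
    need, then start what it does. Nothing is executed here.
    """
    start, stop = boot_plan(role)
    commands = [["svn", "update", checkout_dir]]
    commands += [["service", name, "stop"] for name in stop]
    commands += [["service", name, "start"] for name in start]
    return commands
```

A real init script would hand each command to something like `subprocess.check_call`; keeping the planning separate from the execution makes the plan easy to check on a laptop.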
AWS has CloudFormation. They came out with that a few days after I’d finished pretty much doing the same. I wrapped the AWS command line tools in shell scripts. Since we’re a Python shop, we’re likely going to be using boto (which has matured quite a bit in the last two years) and Fabric.
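
A sketch of the shape such setup/tear-down scripting can take, assuming a stack described as tiers in dependency order. The tier names and counts are invented, and `run` is a stand-in for whatever actually launches or terminates instances (wrapped command line tools, boto, and so on); no real AWS call is shown.

```python
# Tiers in dependency order; names and counts are made up for illustration.
STACK = [
    ("database", 2),
    ("app", 4),
    ("load-balancer", 1),
]


def setup_steps(stack):
    """Launch tiers in dependency order: databases first, balancer last."""
    return [("launch", tier, count) for tier, count in stack]


def teardown_steps(stack):
    """Terminate tiers in the reverse order they were launched."""
    return [("terminate", tier, count) for tier, count in reversed(stack)]


def execute(steps, run):
    """Pass each step to run(action, tier, count); return the steps completed.

    An exception from run() stops the sequence, so a half-built stack is
    never mistaken for a finished one.
    """
    done = []
    for step in steps:
        run(*step)
        done.append(step)
    return done
```

Passing a `run` that only records its arguments gives a dry run of the whole sequence before anything is touched.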
Nothing to see here, move along
Nothing to see here, move along
Who knows what this is a picture of? That’s a picture of the IBM RAMAC, built in 1956, which had 5M of storage and weighed a ton. We’ve come a long way, baby!
For anyone unfamiliar: if a system stops working when a part of it fails, that part is a single point of failure. So in every system there’s potential for many single points of failure, proportional to system complexity. Because of the Ruger Fault Equivalency, the idea is to pick the right SPOFs and eliminate (or at least mitigate) them. I used the word ‘elimination’ here, hoping that it would make some folks chuckle; it’s really not possible to eliminate all SPOFs. You can mitigate them, though. So here are some examples, and I’ll drill into which ones are critical in our context.
If your cloud provider goes out of business, you’re hosed. SPOF. In AWS, a region is…etc. If a region disappears, you’re hosed. SPOF. Within regions are zones. If an entire zone fails, you’re hosed. SPOF. Same with load balancers, application servers, databases. And even Fred. If Fred is the only guy who knows your operational systems, and he trips over the extension cord, knocking himself out in the process, you’re hosed. SPOF. The critical ones in our context, and likely in many others: zones and everything below.
Should you attempt to achieve high fault tolerance through cloud-to-cloud fail-over? Ruger Fault Equivalency says: cost prohibitive (time squared). RightScale, which provides a very cool cloud management system, apparently has some of this functionality, and will likely be the place to go for cloud-to-cloud fail-over in the future.
Ruger Fault Equivalency says: ditto, cost prohibitive. If you try to migrate an ELB-balanced tech stack from one region to the next, you’ll learn: your ELB won’t be able to route traffic between regions; EIPs can’t be pointed from an instance in one region to another; your custom useast (for example) AMI can’t be used in the new region; your useast security groups can’t be used in the new region; your snapshots can’t be used to create new volumes in the new region. Do set up a DB replicant in another region, if possible.
Ruger says: yes, even in light of the recent outage that affected the entire useast region. It’s not cost-prohibitive, and you get data-center fail-over. What about the recent AWS outage? A human error caused a major problem in one zone that had a ripple effect into the other zones. But ultimately, downtime suffered was in proportion to how well you were already leveraging other zones, and how dependent you were on EBS volumes. If all of your eggs were in the wrong zone, or you didn’t have the right backup strategy in place: totally screwed. Otherwise: not so bad!
Our zone scenario, and why we were down intermittently for 12 hours during the AWS outage: before the outage we had auto-scaling groups in two zones within a single region. At some point I brought everything into a single zone while debugging odd performance between the two. I consciously de-prioritized revisiting that, in light of other priorities, figuring the single-zone group would at least scale with traffic. But I’d configured the groups with a trigger to auto-scale when CPU spiked. Over time our application grew more resource-efficient, which meant CPU wasn’t spiking, which meant we weren’t scaling with traffic. That led to the learning that it’s better to scale on network I/O or, now that AWS supports them, custom scaling triggers. We’re in multiple zones again now; we recently saw the effects of an entire zone’s application server group going dark.
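
The two zone fail-over practices can be sketched in a few lines. This is an illustration under my own assumptions, not our production configuration: the function names and the throughput thresholds in the usage note are invented, and a real setup would express the trigger as an auto-scaling policy rather than a Python function.

```python
def spread_across_zones(instance_count, zones):
    """Split instance_count across zones as evenly as possible.

    Any remainder goes to the first zones in the list, so no zone ever
    holds more than one instance above any other.
    """
    if len(zones) < 2:
        raise ValueError("need at least two zones for zone fail-over")
    share, extra = divmod(instance_count, len(zones))
    return {zone: share + (1 if i < extra else 0)
            for i, zone in enumerate(zones)}


def scaling_decision(network_bytes_per_sec, scale_out_at, scale_in_at):
    """Return +1 (add capacity), -1 (remove capacity) or 0 (hold).

    The input is average network throughput per instance, which keeps
    tracking traffic even after the application stops spiking the CPU.
    """
    if scale_in_at >= scale_out_at:
        raise ValueError("scale_in_at must be below scale_out_at")
    if network_bytes_per_sec >= scale_out_at:
        return 1
    if network_bytes_per_sec <= scale_in_at:
        return -1
    return 0
```

For example, `spread_across_zones(5, ["us-east-1a", "us-east-1b"])` puts three instances in the first zone and two in the second, and with hypothetical thresholds of 8 MB/s out and 2 MB/s in, a reading of 5 MB/s holds steady.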
Ruger says: don’t even think about not doing it. What’s generally worked for us: ELBs for same-region traffic distribution; auto-scaling groups to allow application server fail-over, within a zone and across them; replication to put secondary fail-over database servers in other zones within a region.
What about Fred? Cut Fred some slack for tripping over the extension cord, we all make mistakes. You need Fred. That is, assuming he communicates what happened widely. If he doesn’t, he’s going to suffer the wrath of his internal and external customers.
Customers don’t need a ton of detail; they need status updates and anything actionable. Does open communication increase fault tolerance? I’d argue yes. Your customers will be more tolerant of your faults if you’re open and clear about them
At BigDoor, if there’s a crisis, our standard operating procedure identifies a single person responsible for stopping the team on an hourly basis to get status and determine what should be communicated externally, if anything. As much as we love him, we don’t involve our lawyer in that conversation, by the way.
In summary, these are the three tenets that I’m hoping will help you achieve the right amount of fault tolerance in the cloud: scripted repeatability, SPOF elimination, clear-cut communication. All three of these things are mentioned by AWS in one way or another in their post-mortems as things they planned on doing to mitigate this for themselves going forward, by the way, including the better communication. Thanks again WTIA, I’ll be around if anyone wants to talk more about this stuff later. I also have some notes that describe the good and bad about AWS, available online here: TODO
AWS outage root cause analysis : http://aws.amazon.com/message/65648/

Net effects :
- hours of high EBS API error and latency rates : 11
- days before affected data made available again in affected zone :
- peak 'stuck' volumes in other zones : .07%
- Ultimately .07% of volumes couldn't be restored due to hardware failures
- 45% of RDS single-zone instances affected at peak, .04% unrecoverable
- 2.5% of multi-zone RDS didn't fail over due to another bug

Tools : the good and bad

ELBs
Good :
- quick to configure, auto-scaling load-balancers
- can be used for fail-over within a region
Bad :
- no logging
- return 503s on error - you won't know unless you can monitor every request end to end
  - e.g. if there aren't instances that can service requests
- name servers disregarding TTLs + auto-scaling = traffic routing issues
- best practice : return custom HTTP headers in your response so that you can distinguish calls during support incidents
- can't be used for failover between AWS regions; need separate DNS solution for funneling traffic

AMIs (Amazon Machine Images; install images, OS blueprints)
Good :
- Leveraged pre-built Debian AMI
- Debian has great package management, which can be scripted.
- packages are verified, tested before making it into the main line - less to think about
- Thank you Eric Hammond : http://alestic.com/
- scripted repeatability : script the non-interactive install of your tools
- can be used to stand-up instances within a region
- best practice : single master AMI built on top of a pre-existing one, re-built regularly with new software, app code and patches, via automation. Tagged.
- best practice : put app code, package configuration into SVN and include in your AMI, svn-up regularly or during instance start-up
  - faster for things like auto-scaling
Bad :
- Can't copy/port AMIs from region to region easily
- Not having the entire process scripted from kernel means loss of flexibility (regional AMIs) and security
- pitfall : easy to get off track. Didn't start out with a single script that installs everything or stay diligent about including everything? Have fun re-doing all that!
Security tools
- Great article : http://trust.cased.de/AMID
- AMID script : http://code.google.com/p/amid/downloads/detail?name=AMID.py&can=2&q=

EC2 instances
Good :
- Leverages AMIs
- Obviously, script-able automated instance creation
- EIPs allow for easy, dependable service re-routing from one instance to another
- Security groups are an easy way to firewall (and tag, before they came out with those)
- Zones allow easy fail-over within a geographic region (most of the time)
- Regions provide the promise of fail-over between data centers more geographically separated (Virginia vs California)
- Init scripts allow you to create/update on a per-instance basis
Bad :
- Security groups can't be added to or removed from an instance once it's running
  - best practice : use a different group for each narrower category
    - e.g. instead of 'database group', create groups for 'primary transactional db server in production', 'replicant...' etc
  - best practice : use a group that whitelists trusted IPs to give access to otherwise un-needed ports and services
- Regions don't allow easy failover; EIPs can't be mapped between them (at least not programmatically)
- Can't port AMIs from region to region easily, so setup to fail region-region is difficult.

EBS
Good :
- provides redundant storage for instances that can be snapshot-ed for easy backup and volume duplication within a region
Bad :
- volumes can't be created from snapshots across regions
- data loss : it happened (not to us, fortunately) so be prepared and apply the amount of resources your risk tolerance allows
- poor I/O in general, specifically writes; typically this has only been an issue for us on our primary tx DB servers
  - best practice : RAID 0 array for MySQL data directory, but make sure it's replicated and backed up

Auto-scaling
Good :
- n scaling groups in 1-4 zones behind an ELB; provides same-region fail-over
- n# of instances in a scaling group
- CloudWatch monitors provide great stats
- previously, limited scaling triggers were provided; the latest integrate CloudWatch much better, including custom metrics you define
Bad :
- learning : we had no baselines for when to scale on anything other than CPU utilization, which at the time was easy to differentiate; we spiked
- application improvements fixed the spikes, which in turn stopped auto-scaling triggers
- need monitoring/alerting via nagios/other tool? figure out how to (de-)register new instances during scaling activities
- this is changing - CloudWatch is getting better.
- do you trust amazon's monitoring/alerting on amazon's monitoring/alerting?

EMR
Good :
- Great for async log analysis
- what's worked for us : centralized log hosts
  - apache logs rotated via logrotate and rsync'd via cron, pre-processed, sync'd to S3 and drawn into EMR/Hive cluster for aggregations and reporting
- Hive/HQL very similar to SQL
Bad :
- asynchronous, takes a fair amount of time to munge data

S3
Good :
- Available from anywhere, any region
- S3cmd is a great tool, for the most part
Bad :
- no full support for standard paths and directories
- …TBD

CloudWatch
Good :
- can monitor various services and trigger/alert when thresholds are crossed (e.g. ELB network in)
- new : auto-scaling can leverage triggers more broadly, custom metrics
Bad :
- no built-in ability to trigger/alert based on % change from previous measurements
- console reports/graphs need a decoder tool and, most recently, appear buggy. but they've made big steps forward.

AWS APIs
Good :
- API wrappers provided; allow for cmd-line scripting
- DRY : can (and should) script most things that repeat
- All done via scripts : a bit about our process and how the cloud fits well
  - 1 week sprints - lockdown Tuesdays, test overnight (uTEST), release Wed
  - test first methodology
  - system tests for backend, other big changes, our API changes
  - Tested a new ver of MySQL (Percona, recommended)
    - http://screencast.com/t/yVf5RnaUN9
    - http://screencast.com/t/WJaL2qiSR
  - performance, integrity, load, capacity
    - these require full-stack stand-up/tear-down, including a 230G+ db backend
Bad :
- Keep your eye out for library updates (why not open-source these things? Verify they're not already…)
- Scripts, wrappers trail AWS innovation, which is fast.
- BASH isn't as well-known or readable as Python, for example - maintainability
- scripted stuff bakes you in a bit, no way around this w/out baking yourself into RightScale or some other solution anyway though
- API key management : not straight-forward
- API keys aren't portable between regions; region-region fail-over not as easy as it sounds. not rocket science, either.
  - Bake region 1's keys into region 2's new AMI

APIs - General
- Build things test first, run integrity tests before pushing out changes to your API
- Don't version; make it backwards-compatible

We try to keep away from anything that's going to lock us in too much
- We continue to shy away from SQS (Simple Queue Service), RDS (Relational Database Service), SimpleDB (non-relational datastore)
- SQS, SimpleDB proprietary; would prefer to avoid lock-in for these things and their need hasn't been high enough for us yet
- RDS : doesn't provide enough flexibility for us. would love to use it as a replicant pool for reads/reporting though. can't.
- multi-zone RDS suffered one of the biggest hits during recent AWS outage

What we're looking forward to leveraging
- New CW status, PUTs, scaling triggers from them
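
The CloudWatch notes above list one gap: no built-in way to trigger/alert on % change from the previous measurement. Until custom metrics and triggers cover that, the check is small enough to script yourself. A minimal sketch in plain Python, assuming you already have a metric's datapoints (e.g. ELB network in) as a list of numbers in time order; the function name percent_change_alerts, the sample values and the 50% threshold are all made up for illustration, and nothing here calls an AWS API.

```python
from typing import List, Sequence, Tuple


def percent_change_alerts(
    samples: Sequence[float], threshold_pct: float
) -> List[Tuple[int, float]]:
    """Return (index, pct_change) for every sample that moved at least
    threshold_pct percent (up or down) relative to the sample before it.

    Hypothetical helper: 'samples' is assumed to be one metric's datapoints
    in time order, however you fetched them.
    """
    alerts: List[Tuple[int, float]] = []
    previous = None
    for index, value in enumerate(samples):
        # Skip the first sample (nothing to compare against) and any
        # comparison against zero, where % change is undefined.
        if previous is not None and previous != 0:
            change = (value - previous) * 100.0 / abs(previous)
            if abs(change) >= threshold_pct:
                alerts.append((index, change))
        previous = value
    return alerts


if __name__ == "__main__":
    # Made-up 'network in' readings, one per polling interval.
    network_in = [1000.0, 1050.0, 2100.0, 2000.0, 900.0]
    for index, change in percent_change_alerts(network_in, 50.0):
        print(f"sample {index}: {change:+.1f}% vs previous measurement")
```

Run from cron next to whatever already polls your metrics, this flags the doubling at the third reading and the drop at the last one. Wiring the result into nagios or another alerting tool is left out on purpose.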