Amazon Major Cloud Outage Analysis
Author: Rahul Tyagi
2
The Agenda
• The Issue
• The Goals
• Analysis Methodology
• The Analysis
3
The Issue
• Because the Amazon cloud has spread deep into enterprises, major Amazon
cloud outages cause widespread impact…
• Organizations such as Netflix, Dropbox, AirBnB and Pinterest have been
affected by Amazon cloud outages
4
The Issue
• Major cloud outages have been fairly regular events in the
recent past. Some of the major outages:
• Dec/24/2012
• Oct/22/2012
• Jun/29/2012
• Apr/21/2011
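To make "fairly regular" concrete, the gaps between the four listed dates can be computed. This is a minimal Python sketch; the names `OUTAGE_DATES` and `gaps_in_days` are illustrative, and only the dates themselves come from the slide.

```python
from datetime import date

# Outage dates as listed on the slide, oldest first.
OUTAGE_DATES = [
    date(2011, 4, 21),
    date(2012, 6, 29),
    date(2012, 10, 22),
    date(2012, 12, 24),
]


def gaps_in_days(dates):
    """Return the number of days between each pair of consecutive dates."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    print(gaps_in_days(OUTAGE_DATES))  # [435, 115, 63]
```

By this count, the three 2012 outages all fall within about six months of each other.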
5
The Goals
• We want to analyze the chain of events causing major
Amazon cloud outages (from official Amazon
statements)…
• We analyzed major outages from the past 2 years…
• The goal is to identify probable root causes and
areas with opportunity to improve…
6
Analysis Methodology
We use the “Analytic Hierarchy Process”
to identify root causes…
7
Analysis Methodology
Analyze Amazon’s statements about the outage
→ Identify the “chain of events” causing the outage
→ Categorize the “chain of events”
→ Analysis and conclusion
8
The Analysis > Analyze Amazon’s Statements about Outages
Outage Date    Amazon’s Statement
Dec/24/2012    http://aws.amazon.com/message/680587/
Oct/22/2012    http://aws.amazon.com/message/680342/
Jun/29/2012    http://aws.amazon.com/message/67457/
Apr/21/2011    http://aws.amazon.com/message/65648/
We analyzed the following official Amazon statements…
9
The Analysis > Identify “Chain of Events” causing outages
Outage    Core Issue
Dec-12    “The [ELB State] data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
Oct-12    “The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers”
Jun-12    “In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT”
Apr-11    “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
The statements in double quotes are from Amazon’s press releases…
10
The Analysis > Identify “Chain of Events” causing outages
Outage    Chain of Events
Dec-12
• "Maintenance process inadvertently run against production ELB state data"
• Process for incident approval had loose ends
• Validation of the output of the maintenance process (which ran inadvertently) was missing
• "load balancers that were modified were improperly configured by the control plane"
Oct-12
• "latent bug in an (EBS) operational data collection agent"
• "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent
• "the DNS update did not successfully propagate to all of the internal DNS servers"
• "the (aggressive) throttling policy that was put in place was too aggressive"
Jun-12
• "datacenter that did not successfully transfer to the generator backup"
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"
Apr-11
• “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future"
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"
11
The Analysis > Categorize “Chain of Events”
Outage    Chain of Events    Hardware    Software    Automation    Process
Dec-12
• "Maintenance process inadvertently run against production ELB state data"    X X
• Process for incident approval had loose ends    X
• Validation of the output of the maintenance process (which ran inadvertently) was missing    X X X
• "load balancers that were modified were improperly configured by the control plane"    X
Oct-12
• "latent bug in an (EBS) operational data collection agent"    X X
• "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent    X X
• "the DNS update did not successfully propagate to all of the internal DNS servers"    X X
• "the (aggressive) throttling policy that was put in place was too aggressive"    X X
Jun-12
• "datacenter that did not successfully transfer to the generator backup"    X
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"    X
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"    X X
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"    X X
Apr-11
• “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”    X
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"    X
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future"    X
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"    X X
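As a rough cross-check of the categorization, the "X" marks can be tallied per outage. The sketch below is Python and assumes only what this text shows: the number of marks on each row, not which category column each mark sits in. The names `MARKS_PER_ROW`, `marks_per_outage` and `total_marks` are illustrative.

```python
# Number of "X" marks on each chain-of-events row, in table order.
# Assumption: only the count of marks per row is used, not their columns.
MARKS_PER_ROW = {
    "Dec-12": [2, 1, 3, 1],
    "Oct-12": [2, 2, 2, 2],
    "Jun-12": [1, 1, 2, 2],
    "Apr-11": [1, 1, 1, 2],
}


def marks_per_outage(rows):
    """Sum the category marks recorded for each outage."""
    return {outage: sum(marks) for outage, marks in rows.items()}


def total_marks(rows):
    """Sum the category marks across all outages."""
    return sum(sum(marks) for marks in rows.values())


if __name__ == "__main__":
    print(marks_per_outage(MARKS_PER_ROW))
    # {'Dec-12': 7, 'Oct-12': 8, 'Jun-12': 6, 'Apr-11': 5}
    print(total_marks(MARKS_PER_ROW))  # 26
```

The total of 26 marks equals the sum of the category counts reported in the conclusions (Software 8 + Automation 4 + Process 14).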
12
The Analysis > Analysis and Conclusions
Process issues are a common theme in major
outages of the Amazon cloud…
13
The Analysis > Analysis and Conclusions
[Chart: Amazon Cloud Major Outage - Issues Categories (# of issues): Process, 14; Software, 8; Automation, 4]
Process and Software are the leading contributing
factors to major outages at Amazon…
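The category counts can also be read as shares of all recorded issues. This is a small Python sketch using only the three counts from the chart; the names `ISSUE_COUNTS` and `category_shares` are illustrative.

```python
# Issue counts per category, as shown in the chart.
ISSUE_COUNTS = {"Process": 14, "Software": 8, "Automation": 4}


def category_shares(counts):
    """Return each category's fraction of the total number of issues."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    shares = category_shares(ISSUE_COUNTS)
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {ISSUE_COUNTS[name]} issues ({share:.0%})")
    # Process: 14 issues (54%)
    # Software: 8 issues (31%)
    # Automation: 4 issues (15%)
```

Because a single event can be marked in more than one category, these are shares of category marks, not of distinct events.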
14
The Analysis > Analysis and Conclusions
• The majority of issues contributing to outages are
related to process or software
• “Process” rigor in cloud operations and the
SDLC at Amazon appears to have room to improve
• Culture? We have heard that Amazon has a “just-do-it”
culture; process rigor may require more than
just “just-do-it”
15
Thank You! You are Awesome! You deserve applause!!
