SlideShare ist ein Scribd-Unternehmen logo
1 von 70
S U M M I T
SYDNEY
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
The art of successful failure
Becky Weiss
Senior Principal Engineer
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
This is a talk about failing… successfully
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Agenda
• Never waste a failure: The AWS approach to post-mortems
• Seeing your failures before your customers do
• How AWS can help you fail successfully
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Availability
100
90
80
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
COE: Correction of Error
• Structured analysis of
customer-impacting events
• Reflection of Amazon’s
peculiar culture
• Goes well beyond “How do we
prevent this from happening
again”
COE
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
We take these very seriously
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
COEs start with the customer and work backwards
• Summary
• Narrative description of what happened
• Metrics and graphs
• Primary impact and supporting graphs
• If they don’t exist, that’s something to fix
• Customer impact
• How many customers affected
• What was the impaired experience
AvailabilityLatencyof
dependency
p99
p50
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Areas of focus
• Root cause: Why? (x 5)
• Blast radius: How widespread
was the impact?
• Duration: For how long?
• What can others learn?
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Toyota’s Five-Whys approach to root cause
The vehicle will not start. (the problem)
Why? - The battery is dead. (First why)
Why? - The alternator is not functioning. (Second why)
Why? - The alternator belt has broken. (Third why)
Why? - The alternator belt was well beyond its useful service life and not
replaced. (Fourth why)
Why? - The vehicle was not maintained according to the recommended
service schedule. (Fifth why, a root cause)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius containment as a core design tenet
AWS Cloud
Region Region Region
…
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius containment as a core design tenet
AWS Cloud
Region
Availability Zone Availability Zone Availability Zone
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius containment as a core design tenet
AWS Cloud
Region
Cell Cell Cell
AWS Cloud
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling event duration
Availability
100
90
80
Impact period
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving incident response
• “How was the event
detected?”
• “How could time to detection
be improved? As a thought
experiment, how would you
have cut the time in half?”
Availability
100
90
80
Time to detection
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving incident response
Good
Amazon CloudWatch
Alarm
!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving incident response
Good Bad
???
Amazon CloudWatch
Alarm
!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving time to mitigation
• “How did you reach the point
where you knew how to
mitigate the impact?”
• “How could time to
mitigation be improved? As
a thought experiment, how
would you have cut the time
in half?”
Availability
100
90
80
Determining how to mitigate
Mitigation activity
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving time to mitigation
Example: Alarm-based automatic rollback
AWS CodeDeploy
Alarm
!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving time to mitigation
Example: Alarm-based automatic rollback
AWS CodeDeploy
Alarm
rollback
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling event duration
Availability
100
90
80
Impact period
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling event duration
Availability
100
90
80
Impact period
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Wins are just as important as failures
Latency
deployment
p99
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Metrics are very interesting, and that can be a problem
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health metrics and diagnostic metrics
Health metrics
• Answers the question: Am I
failing?
• Does not answer the
question: Why am I failing?
• Always set alarms on these
• Be conservative in defining
Diagnostic metrics
• Answers the question: What is
the value of this thing I
measured?
• Might answer the question:
Why isn’t my system working?
• Sometimes set alarms on these
• Be liberal in defining
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Responsecount
5XX
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Responsecount
4XX
5XX
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Databasetransactionrollbacks
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Latency(msec)
avg
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Latency(msec)
avg
50th
percentile
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Latency(msec)
avg
99th
percentile
50th
percentile
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Percentiles >> Avg
Time
Latency(msec)
99th
percentile
50th
percentile
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Layout of a Great Dashboard
Health Metrics at the top
Latency percentiles Faults Volume
Key diagnostic metrics below the fold
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
The AWS Ops Wheel
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Volume
Time scale: ~one week
volume
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Time scale: ~one week
volume
p99.9 latency
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Volume
Time scale: Weeks/months
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Higherisworse
Time scale: Weeks/months
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Latency
p99
Alarm threshold
Failures
Failures
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
AWS Lambda
Amazon API Gateway
Amazon DynamoDB
Billing
Amazon Simple
Notification Service
Amazon Simple Storage
Service (S3)
Amazon EC2
Automatically-published metrics
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
Application-specific logging and metrics
Amazon EC2
instances with
CloudWatch agent
AWS Lambda
functions
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
Application-specific logging and metrics
Amazon EC2
instances with
CloudWatch agent
AWS Lambda
functions
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
Application-specific logging and metrics
Amazon EC2
instances with
CloudWatch agent
AWS Lambda
functions
VPC
VPC Endpoint
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
A simple serverless API
Amazon API Gateway:
“Colors”
Lambda function:
“GetColor”
Proxy
integration
GET /blue
GET /red
Amazon CloudWatch:
Dashboards, Logs, and
Alarms
AWS CloudFormation Stack
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon CloudWatch default view
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard: Metric math
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Provisioning dashboards with Cloud Formation
AWS CloudFormation Stack
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard: Breakdown by method
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard: Diagnosing customer issues
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Takeaways
• Never waste a failure:
Effective post-mortems
• Catch failures before your
customers do:
Effective dashboarding and
metrics-reading
• Use AWS tools to gain
visibility and insight into
your application
Thank you!
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Becky Weiss
becky@amazon.com

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerMachine Learning & Amazon SageMaker
Machine Learning & Amazon SageMaker
Amazon Web Services
 

Was ist angesagt? (20)

Innovate - Making Friends with Finance: How to Manage Cost Efficiency and Bud...
Innovate - Making Friends with Finance: How to Manage Cost Efficiency and Bud...Innovate - Making Friends with Finance: How to Manage Cost Efficiency and Bud...
Innovate - Making Friends with Finance: How to Manage Cost Efficiency and Bud...
 
From Strategy to Reality: Better Decisions With Data
From Strategy to Reality: Better Decisions With DataFrom Strategy to Reality: Better Decisions With Data
From Strategy to Reality: Better Decisions With Data
 
AI & Machine Learning at AWS - An Introduction
AI & Machine Learning at AWS - An IntroductionAI & Machine Learning at AWS - An Introduction
AI & Machine Learning at AWS - An Introduction
 
Build-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Build-Train-Deploy-Machine-Learning-Models-at-Any-ScaleBuild-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Build-Train-Deploy-Machine-Learning-Models-at-Any-Scale
 
Scale - Cloud Data Management with Veeam and AWS
Scale - Cloud Data Management with Veeam and AWSScale - Cloud Data Management with Veeam and AWS
Scale - Cloud Data Management with Veeam and AWS
 
Smart-Energy-Connect-Accelerating-Innovation-at-CLP
Smart-Energy-Connect-Accelerating-Innovation-at-CLPSmart-Energy-Connect-Accelerating-Innovation-at-CLP
Smart-Energy-Connect-Accelerating-Innovation-at-CLP
 
Serverless applications with AWS
Serverless applications with AWSServerless applications with AWS
Serverless applications with AWS
 
Building-Event-Driven-Serverless-Apps-with-AWS-Event-Forkines
Building-Event-Driven-Serverless-Apps-with-AWS-Event-ForkinesBuilding-Event-Driven-Serverless-Apps-with-AWS-Event-Forkines
Building-Event-Driven-Serverless-Apps-with-AWS-Event-Forkines
 
Best practices for running Windows workloads on AWS
Best practices for running Windows workloads on AWSBest practices for running Windows workloads on AWS
Best practices for running Windows workloads on AWS
 
Amazon Redshift tips and tricks - Scaling storage and compute - ADB301 - Sant...
Amazon Redshift tips and tricks - Scaling storage and compute - ADB301 - Sant...Amazon Redshift tips and tricks - Scaling storage and compute - ADB301 - Sant...
Amazon Redshift tips and tricks - Scaling storage and compute - ADB301 - Sant...
 
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
 
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
 
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS SummitWork with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
 
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS SummitAsk me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
 
Machine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerMachine Learning & Amazon SageMaker
Machine Learning & Amazon SageMaker
 
Analysing streaming data in real time (AWS)
Analysing streaming data in real time (AWS)Analysing streaming data in real time (AWS)
Analysing streaming data in real time (AWS)
 
Building home security solutions at scale, ft. Comcast - SVC206 - New York AW...
Building home security solutions at scale, ft. Comcast - SVC206 - New York AW...Building home security solutions at scale, ft. Comcast - SVC206 - New York AW...
Building home security solutions at scale, ft. Comcast - SVC206 - New York AW...
 
Optimizing Cost and Capacity for Compute - CMP302 - Santa Clara AWS Summit
Optimizing Cost and Capacity for Compute - CMP302 - Santa Clara AWS SummitOptimizing Cost and Capacity for Compute - CMP302 - Santa Clara AWS Summit
Optimizing Cost and Capacity for Compute - CMP302 - Santa Clara AWS Summit
 
Developing your Cloud Center of Excellence using CloudHealth - DEM04-S - New ...
Developing your Cloud Center of Excellence using CloudHealth - DEM04-S - New ...Developing your Cloud Center of Excellence using CloudHealth - DEM04-S - New ...
Developing your Cloud Center of Excellence using CloudHealth - DEM04-S - New ...
 
Reinventing SAP on AWS: Scale & Simplify SAP Operations on AWS
Reinventing SAP on AWS: Scale & Simplify SAP Operations on AWSReinventing SAP on AWS: Scale & Simplify SAP Operations on AWS
Reinventing SAP on AWS: Scale & Simplify SAP Operations on AWS
 

Ähnlich wie The Art of Successful Failure - AWS Summit Sydney 2019

Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...
Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...
Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...
Amazon Web Services
 

Ähnlich wie The Art of Successful Failure - AWS Summit Sydney 2019 (20)

Threat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS SummitThreat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
 
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
 
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
 
Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...
Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...
Introduction to the Well-Architected Framework and Tool - SVC212 - Chicago AW...
 
Introduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Introduction to AWS Global Accelerator - SVC212 - New York AWS SummitIntroduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Introduction to AWS Global Accelerator - SVC212 - New York AWS Summit
 
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS SummitHow Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
 
The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sy...
The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sy...The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sy...
The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sy...
 
Automated Security Remediation
Automated Security RemediationAutomated Security Remediation
Automated Security Remediation
 
Are you Well Architected?
Are you Well Architected?Are you Well Architected?
Are you Well Architected?
 
Predicting Demand In A Diverse Retail Environment - AWS Summit Sydney
Predicting Demand In A Diverse Retail Environment - AWS Summit SydneyPredicting Demand In A Diverse Retail Environment - AWS Summit Sydney
Predicting Demand In A Diverse Retail Environment - AWS Summit Sydney
 
Castles in Castles - Secure Operational Scale - AWS Summit Sydney
Castles in Castles - Secure Operational Scale - AWS Summit SydneyCastles in Castles - Secure Operational Scale - AWS Summit Sydney
Castles in Castles - Secure Operational Scale - AWS Summit Sydney
 
Creating resiliency through destruction
Creating resiliency through destructionCreating resiliency through destruction
Creating resiliency through destruction
 
Simplify Compliance Through Automation
Simplify Compliance Through AutomationSimplify Compliance Through Automation
Simplify Compliance Through Automation
 
Automated Security Remediation - AWS Summit Sydney
Automated Security Remediation - AWS Summit SydneyAutomated Security Remediation - AWS Summit Sydney
Automated Security Remediation - AWS Summit Sydney
 
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
 
Building the Business Case for Migrating to AWS
Building the Business Case for Migrating to AWSBuilding the Business Case for Migrating to AWS
Building the Business Case for Migrating to AWS
 
Rapid Prototyping with AWS - AWS Summit Sydney
Rapid Prototyping with AWS - AWS Summit SydneyRapid Prototyping with AWS - AWS Summit Sydney
Rapid Prototyping with AWS - AWS Summit Sydney
 
Student Track - AWS Summit 2019 - Introduzione
Student Track - AWS Summit 2019 - IntroduzioneStudent Track - AWS Summit 2019 - Introduzione
Student Track - AWS Summit 2019 - Introduzione
 
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit SydneyInnovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
 
AWS WAF.pptx
AWS WAF.pptxAWS WAF.pptx
AWS WAF.pptx
 

Mehr von Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

The Art of Successful Failure - AWS Summit Sydney 2019

  • 1. S U M M I T SYDNEY
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T The art of successful failure Becky Weiss Senior Principal Engineer Amazon Web Services
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T This is a talk about failing… successfully
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Agenda • Never waste a failure: The AWS approach to post-mortems • Seeing your failures before your customers do • How AWS can help you fail successfully
  • 5. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Availability 100 90 80
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T COE: Correction of Error • Structured analysis of customer-impacting events • Reflection of Amazon’s peculiar culture • Goes well beyond “How do we prevent this from happening again” COE
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T We take these very seriously
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T COEs start with the customer and work backwards • Summary • Narrative description of what happened • Metrics and graphs • Primary impact and supporting graphs • If they don’t exist, that’s something to fix • Customer impact • How many customers affected • What was the impaired experience AvailabilityLatencyof dependency p99 p50
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Areas of focus • Root cause: Why? (x 5) • Blast radius: How widespread was the impact? • Duration: For how long? • What can others learn?
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Toyota’s Five-Whys approach to root cause The vehicle will not start. (the problem) Why? - The battery is dead. (First why) Why? - The alternator is not functioning. (Second why) Why? - The alternator belt has broken. (Third why) Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why) Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
  • 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Blast radius
  • 14. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Blast radius
  • 15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Blast radius containment as a core design tenet AWS Cloud Region Region Region …
  • 16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Blast radius containment as a core design tenet AWS Cloud Region Availability Zone Availability Zone Availability Zone
  • 17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Blast radius containment as a core design tenet AWS Cloud Region Cell Cell Cell AWS Cloud
  • 18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling event duration Availability 100 90 80 Impact period
  • 19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling duration: Improving incident response • “How was the event detected?” • “How could time to detection be improved? As a thought experiment, how would you have cut the time in half?” Availability 100 90 80 Time to detection
  • 20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling duration: Improving incident response Good Amazon CloudWatch Alarm !
  • 21. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling duration: Improving incident response Good Bad ??? Amazon CloudWatch Alarm !
  • 22. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling duration: Improving time to mitigation • “How did you reach the point where you knew how to mitigate the impact?” • “How could time to mitigation be improved? As a thought experiment, how would you have cut the time in half?” Availability 100 90 80 Determining how to mitigate Mitigation activity
  • 23. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling duration: Improving time to mitigation Example: Alarm-based automatic rollback AWS CodeDeploy Alarm !
  • 24. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling duration: Improving time to mitigation Example: Alarm-based automatic rollback AWS CodeDeploy Alarm rollback
  • 25. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling event duration Availability 100 90 80 Impact period
  • 26. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Controlling event duration Availability 100 90 80 Impact period
  • 27. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Wins are just as important as failures Latency deployment p99
  • 28. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 29.
  • 30. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Metrics are very interesting, and that can be a problem
  • 31. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health metrics and diagnostic metrics Health metrics • Answers the question: Am I failing? • Does not answer the question: Why am I failing? • Always set alarms on these • Be conservative in defining Diagnostic metrics • Answers the question: What is the value of this thing I measured? • Might answer the question: Why isn’t my system working? • Sometimes set alarms on these • Be liberal in defining
  • 32. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health or Diagnostic? Time Responsecount 5XX
  • 33. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health or Diagnostic? Time Responsecount 4XX 5XX
  • 34. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health or Diagnostic? Time Databasetransactionrollbacks
  • 35. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health or Diagnostic? Time Latency(msec) avg
  • 36. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health or Diagnostic? Time Latency(msec) avg 50th percentile
  • 37. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Health or Diagnostic? Time Latency(msec) avg 99th percentile 50th percentile
  • 38. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Percentiles >> Avg Time Latency(msec) 99th percentile 50th percentile
  • 39. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Layout of a Great Dashboard Health Metrics at the top Latency percentiles Faults Volume Key diagnostic metrics below the fold
  • 40. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T The AWS Ops Wheel
  • 41. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 42. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
  • 43. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Volume Time scale: ~one week volume
  • 44. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Time scale: ~one week volume p99.9 latency
  • 45. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Volume Time scale: Weeks/months
  • 46. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Higherisworse Time scale: Weeks/months
  • 47. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Latency p99 Alarm threshold
  • 50. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 51. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Key Service: Amazon CloudWatch Amazon CloudWatch: Metrics, Logs and Alarms AWS Lambda Amazon API Gateway Amazon DynamoDB Billing Amazon Simple Notification Service Amazon Simple Storage Service (S3) Amazon EC2 Automatically-published metrics
  • 52. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Key Service: Amazon CloudWatch Amazon CloudWatch: Metrics, Logs and Alarms Application-specific logging and metrics Amazon EC2 instances with CloudWatch agent AWS Lambda functions
  • 53. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Key Service: Amazon CloudWatch Amazon CloudWatch: Metrics, Logs and Alarms Application-specific logging and metrics Amazon EC2 instances with CloudWatch agent AWS Lambda functions
  • 54. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Key Service: Amazon CloudWatch Amazon CloudWatch: Metrics, Logs and Alarms Application-specific logging and metrics Amazon EC2 instances with CloudWatch agent AWS Lambda functions VPC VPC Endpoint
  • 55. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 56. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T A simple serverless API Amazon API Gateway: “Colors” Lambda function: “GetColor” Proxy integration GET /blue GET /red Amazon CloudWatch: Dashboards, Logs, and Alarms AWS CloudFormation Stack
  • 57. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Amazon CloudWatch default view
  • 58. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Custom dashboard
  • 59. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Custom dashboard: Metric math
  • 60. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Provisioning dashboards with Cloud Formation AWS CloudFormation Stack
  • 61. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Custom dashboard: Breakdown by method
  • 62. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Custom dashboard: Diagnosing customer issues
  • 63. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Fast diagnostics with Amazon CloudWatch Logs Insights
  • 64. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Fast diagnostics with Amazon CloudWatch Logs Insights
  • 65. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Fast diagnostics with Amazon CloudWatch Logs Insights
  • 66. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Fast diagnostics with Amazon CloudWatch Logs Insights
  • 67. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Fast diagnostics with Amazon CloudWatch Logs Insights
  • 68. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 69. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Takeaways • Never waste a failure: Effective post-mortems • Catch failures before your customers do: Effective dashboarding and metrics-reading • Use AWS tools to gain visibility and insight into your application
  • 70. Thank you! S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Becky Weiss becky@amazon.com