Weitere ähnliche Inhalte Ähnlich wie The Art of Successful Failure - AWS Summit Sydney 2019 (20) Mehr von Amazon Web Services (20) The Art of Successful Failure - AWS Summit Sydney 20192. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
The art of successful failure
Becky Weiss
Senior Principal Engineer
Amazon Web Services
3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
This is a talk about failing… successfully
4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Agenda
• Never waste a failure: The AWS approach to post-mortems
• Seeing your failures before your customers do
• How AWS can help you fail successfully
5. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Availability
100
90
80
8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
COE: Correction of Error
• Structured analysis of
customer-impacting events
• Reflection of Amazon’s
peculiar culture
• Goes well beyond “How do we
prevent this from happening
again”
COE
9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
We take these very seriously
10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
COEs start with the customer and work backwards
• Summary
• Narrative description of what happened
• Metrics and graphs
• Primary impact and supporting graphs
• If they don’t exist, that’s something to fix
• Customer impact
• How many customers affected
• What was the impaired experience
AvailabilityLatencyof
dependency
p99
p50
11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Areas of focus
• Root cause: Why? (x 5)
• Blast radius: How widespread
was the impact?
• Duration: For how long?
• What can others learn?
12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Toyota’s Five-Whys approach to root cause
The vehicle will not start. (the problem)
Why? - The battery is dead. (First why)
Why? - The alternator is not functioning. (Second why)
Why? - The alternator belt has broken. (Third why)
Why? - The alternator belt was well beyond its useful service life and not
replaced. (Fourth why)
Why? - The vehicle was not maintained according to the recommended
service schedule. (Fifth why, a root cause)
13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius
14. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius
15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius containment as a core design tenet
AWS Cloud
Region Region Region
…
16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius containment as a core design tenet
AWS Cloud
Region
Availability Zone Availability Zone Availability Zone
17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Blast radius containment as a core design tenet
AWS Cloud
Region
Cell Cell Cell
AWS Cloud
18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling event duration
Availability
100
90
80
Impact period
19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving incident response
• “How was the event
detected?”
• “How could time to detection
be improved? As a thought
experiment, how would you
have cut the time in half?”
Availability
100
90
80
Time to detection
20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving incident response
Good
Amazon CloudWatch
Alarm
!
21. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving incident response
Good Bad
???
Amazon CloudWatch
Alarm
!
22. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving time to mitigation
• “How did you reach the point
where you knew how to
mitigate the impact?”
• “How could time to
mitigation be improved? As
a thought experiment, how
would you have cut the time
in half?”
Availability
100
90
80
Determining how to mitigate
Mitigation activity
23. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving time to mitigation
Example: Alarm-based automatic rollback
AWS CodeDeploy
Alarm
!
24. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling duration: Improving time to mitigation
Example: Alarm-based automatic rollback
AWS CodeDeploy
Alarm
rollback
25. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling event duration
Availability
100
90
80
Impact period
26. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Controlling event duration
Availability
100
90
80
Impact period
27. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Wins are just as important as failures
Latency
deployment
p99
28. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
30. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Metrics are very interesting, and that can be a problem
31. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health metrics and diagnostic metrics
Health metrics
• Answers the question: Am I
failing?
• Does not answer the
question: Why am I failing?
• Always set alarms on these
• Be conservative in defining
Diagnostic metrics
• Answers the question: What is
the value of this thing I
measured?
• Might answer the question:
Why isn’t my system working?
• Sometimes set alarms on these
• Be liberal in defining
32. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Responsecount
5XX
33. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Responsecount
4XX
5XX
34. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Databasetransactionrollbacks
35. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Latency(msec)
avg
36. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Latency(msec)
avg
50th
percentile
37. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Health or Diagnostic?
Time
Latency(msec)
avg
99th
percentile
50th
percentile
38. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Percentiles >> Avg
Time
Latency(msec)
99th
percentile
50th
percentile
39. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Layout of a Great Dashboard
Health Metrics at the top
Latency percentiles Faults Volume
Key diagnostic metrics below the fold
40. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
The AWS Ops Wheel
41. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
42. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
43. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Volume
Time scale: ~one week
volume
44. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Time scale: ~one week
volume
p99.9 latency
45. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Volume
Time scale: Weeks/months
46. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Higherisworse
Time scale: Weeks/months
47. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Latency
p99
Alarm threshold
50. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
51. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
AWS Lambda
Amazon API Gateway
Amazon DynamoDB
Billing
Amazon Simple
Notification Service
Amazon Simple Storage
Service (S3)
Amazon EC2
Automatically-published metrics
52. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
Application-specific logging and metrics
Amazon EC2
instances with
CloudWatch agent
AWS Lambda
functions
53. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
Application-specific logging and metrics
Amazon EC2
instances with
CloudWatch agent
AWS Lambda
functions
54. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Key Service: Amazon CloudWatch
Amazon CloudWatch:
Metrics, Logs and Alarms
Application-specific logging and metrics
Amazon EC2
instances with
CloudWatch agent
AWS Lambda
functions
VPC
VPC Endpoint
55. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
56. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
A simple serverless API
Amazon API Gateway:
“Colors”
Lambda function:
“GetColor”
Proxy
integration
GET /blue
GET /red
Amazon CloudWatch:
Dashboards, Logs, and
Alarms
AWS CloudFormation Stack
57. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon CloudWatch default view
58. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard
59. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard: Metric math
60. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Provisioning dashboards with Cloud Formation
AWS CloudFormation Stack
61. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard: Breakdown by method
62. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Custom dashboard: Diagnosing customer issues
63. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
64. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
65. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
66. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
67. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Fast diagnostics with Amazon CloudWatch Logs
Insights
68. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
69. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Takeaways
• Never waste a failure:
Effective post-mortems
• Catch failures before your
customers do:
Effective dashboarding and
metrics-reading
• Use AWS tools to gain
visibility and insight into
your application
70. Thank you!
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Becky Weiss
becky@amazon.com