© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reliability of the Cloud:
How AWS Achieves High Availability
Rodney Lester
Reliability Lead
AWS Well Architected
ARC317
Shaun Ray
Manager
AWS Evangelism
Agenda
Well-Architected Reliability Pillar
Once upon a time … (stories)
Availability design goals
Breakout repeats
Tuesday, November 27
ARC317-R [REPEAT] Reliability of the Cloud: How AWS
Achieves High Availability
3:15 p.m. – 4:15 p.m. | Aria East, Level 1, Joshua 4
Thursday, November 29
ARC317-R [REPEAT 1] Reliability of the Cloud: How AWS
Achieves High Availability
11:30 a.m. – 12:30 p.m. | Mirage, Antigua A
Related breakouts
Wednesday, November 28
ARC335-R1 Failing Successfully in the Cloud: AWS Approach to
Resilient Design
12:15 p.m. – 1:15 p.m. | Aria East, Level 2, Mariposa 8
Thursday, November 29
ARC335-R2 Failing Successfully in the Cloud: AWS Approach to
Resilient Design
4:00 p.m. – 5:00 p.m. | MGM, Level 3, South Concourse 302
Wednesday, November 28
ARC408 Under the Hood of Route 53
11:30 a.m. – 12:30 p.m. | Venetian, Level 4, Lando 4305
AWS Well-Architected Reliability Pillar
• Completely refreshed December 2017
• Additional changes approximately every three months
• Plan is to have it more dynamic in the future, but a new version will be released soon
• Significant changes
• Calculating availability
• Application design primer
• Examples, at different design goals
• Appendix contains design goals of 37 AWS services
• More added in each revision and will continue
• These concepts are used to develop services
https://aws.amazon.com/well-architected/
AWS uses the information in this white paper
How does this relate to how AWS builds services?
• This document was written in consultation with AWS principal
engineers
• The techniques described are quite proven
• All of the techniques described have articles or books written about
them
Ops meetings
• David Lubell and Kevin Miller conducted a chalk talk in 2017 on how
we run our ops meeting
• Review critical services every week in a two-hour meeting
• Charlie Bell (SVP, AWS Operations) leads the meeting
• Senior leaders of the services
• Representation from every AWS service
• Service metrics reviews
• 130+ services * 10 min/service = 22-hr meeting?
• How do we ensure all services are ready every week?
Service review
• Now open source
• http://bit.ly/aws-wheel
The things that happen once in a million happen all the
time in AWS
• Some commonly observed problems:
• Our back-end service was having no problems, now it’s overloaded
• An occasional huge spike in traffic that quickly disappears causes problems
• Average response time to requests is slowly creeping up, but the p99 is growing exponentially
• Observe a rise in failed requests: “The service/region is failing”
• Experienced a failure; on recovery, we’re receiving duplicate requests that are all errors
• Cannot adapt fast enough to the huge changes in demand up or down
• Dependency on a less reliable system
• No problems until a system that was dependent on us went down, then we went down
• Couldn’t get capacity quick enough when a location went down
Common causes of such problems
• Our back-end service was having no problems, now it’s overloaded
• Someone deployed a service that uses our service and the requests are much more than
planned/expected
• Someone in marketing is running a campaign and didn’t tell us; our service is not alone
• A bug exists that causes repeated requests to our service, either a new deployment, or a
latent bug
• We see an occasional huge spike in traffic that quickly disappears
• Some kind of edge case exists where things go normally, then under a condition, some kind
of rebuilding of a data model happens
• Someone in marketing is running a campaign and didn’t tell us; our service is not alone
• A bug exists that causes repeated requests to our service, either a new deployment, or a
latent bug
Common causes of such problems (cont.)
• Average response time to requests is slowly creeping up, but the p99 is
growing exponentially
• This can be an indicator of impending problems
• There is a use case that executes a different path, either on your service, or a dependency
• Observe a rise in failed requests: “The service/region is failing”
• There may be an event (known internally as a Large Scale Event) occurring
• Maybe a transient problem
• Can often be better to wait it out rather than fail over
• Experienced a failure; on recovery, we’re receiving duplicate requests
that are all errors
• Even if you are not distributed, it is possible that the invoking service has no idea you were
successful in processing some requests
• Idempotency tokens can be used
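As a minimal sketch of the idempotency-token idea, assume the caller generates one unique token per logical request and resends the same token on every retry. The `PaymentService` class, its `charge` method, and the in-memory dictionary are hypothetical names for illustration; a real service would store tokens durably and expire them.

```python
import uuid


class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency token."""

    def __init__(self):
        self._results = {}      # token -> result of the first successful call
        self.charges_made = 0   # counts how often the real work was performed

    def charge(self, token: str, amount_cents: int) -> dict:
        # A token seen before means this is a retry: return the stored
        # result instead of performing the work a second time.
        if token in self._results:
            return self._results[token]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[token] = result
        return result


service = PaymentService()
token = str(uuid.uuid4())            # generated once by the caller
first = service.charge(token, 500)
retry = service.charge(token, 500)   # caller never saw the response and retries
```

Because the retry carries the same token, it gets the original result back and the work is performed only once.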
Common causes of such problems (cont.)
• Cannot adapt fast enough to the huge changes in demand up or down
• Need good communication paths with business drivers of traffic
• You can have the system constantly performing tasks that are replaced by requests from
consumers of your service
• Dependency on a less reliable system
• Can turn this into a soft dependency if you can find an acceptable replacement state
• This usually needs to be negotiated with the product owners
• No problems until a system that was dependent on us went down, then
we went down
• Commonly known as a cascading failure
• Not always a failure (see previous examples of spiky traffic)
• Example of “bi-modal behavior”
Common causes of such problems (cont.)
• Couldn’t get capacity quick enough when a location went down
• Pilot light or running at high utilization can cause a brownout when failure occurs
• Need to be able to take a loss of a location and service the traffic immediately
Service Design Goals
• Not SLAs
• Services are managed to these goals in the weekly ops meeting
• Currently documented for 37 services
• Adding more as I work with services to establish them
• Control Plane versus Data Plane
• Control plane mutates resources (bi-modal!) and data plane is the “day job”
• Control plane is often more “dangerous” and therefore less available (not always!)
Thank you!
Rodney Lester
rodneyle@amazon.com
Shaun Ray
shaunray@amazon.com
Software/implementation has an impact on
availability
• Throttling
• Protect your service by refusing requests when out of capacity
• Exponential backoff for retries
• This is an art and a science; built into the AWS SDKs
• Fail fast
• Users will retry on failure, so this can allow your system to recover faster
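The three patterns above can be sketched together as follows. This is an illustration under stated assumptions, not how the AWS SDKs implement retries: `Throttle`, `ThrottledError`, `backoff_delays`, and `call_with_retries` are hypothetical names, the limits are arbitrary, and the random jitter added to each delay is an extra not mentioned on the slide.

```python
import random
import time


class ThrottledError(Exception):
    """Raised when the service refuses a request because it is out of capacity."""


class Throttle:
    """Hypothetical in-flight limiter: refuse work rather than queue it."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def call(self, fn):
        if self.in_flight >= self.max_in_flight:
            raise ThrottledError("out of capacity")   # fail fast
        self.in_flight += 1
        try:
            return fn()
        finally:
            self.in_flight -= 1


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0):
    """Yield one randomized delay per retry, uniform in [0, min(cap, base * 2**n)]."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))


def call_with_retries(fn, attempts: int = 5, sleep=time.sleep):
    """Call fn, retrying on ThrottledError with exponential backoff."""
    for delay in backoff_delays(attempts - 1):
        try:
            return fn()
        except ThrottledError:
            sleep(delay)
    return fn()   # final attempt: let any error propagate to the caller
```

The server side rejects immediately when it is full, and the client side spreads its retries out over time so that a recovering service is not hit by every caller at once.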
More advanced implementation patterns
• Idempotency
• You have a choice: “at most once” semantics, or “at least once.” Choose the latter.
• Constant work
• If you have a system that is always performing work, and you replace that work with user
requests, you have a system that is much more predictable
• Colm MacCarthaigh has a tweet thread on this:
https://twitter.com/colmmacc/status/1039228121327648768
• Circuit breaker
• Can be used to remove hard dependencies in your availability calculation
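To illustrate the circuit-breaker point: if every hard dependency must be up for a request to succeed, and failures are assumed independent, the availabilities multiply. A breaker with an acceptable fallback takes the dependency out of that product. The sketch below is a simplified illustration; `CircuitBreaker`, `serial_availability`, and all thresholds are hypothetical.

```python
import time


def serial_availability(*parts: float) -> float:
    """Availability when all hard dependencies must be up (independence assumed)."""
    total = 1.0
    for availability in parts:
        total *= availability
    return total


class CircuitBreaker:
    """Hypothetical breaker: after repeated failures, skip the dependency and use a fallback."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # time the breaker opened, or None when closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: do not touch the dependency
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

For example, a 99.99% service with a 99.9% hard dependency gives `serial_availability(0.9999, 0.999)`, roughly 99.89%. If the fallback is acceptable to the product owners, the dependency becomes soft and no longer appears in that product.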
Bi-modal behavior and static stability
• Cascading failures are often from “bi-modal” behavior
• I’ve seen this often: an anomaly causes a huge change in system behavior
• Static stability
• On loss of capacity, you want to be able to handle your current load with no need to acquire
resources
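A rough way to size for static stability is to provision each location so that the survivors can carry peak load on their own. The sketch below assumes load spreads evenly across locations and capacity is counted in identical instances; `instances_per_location` and the numbers are hypothetical.

```python
import math


def instances_per_location(peak_instances: int, locations: int,
                           tolerated_losses: int = 1) -> int:
    """Instances to run in each location so the survivors can carry peak load."""
    survivors = locations - tolerated_losses
    if survivors < 1:
        raise ValueError("need at least one surviving location")
    return math.ceil(peak_instances / survivors)


# Peak load needs 12 instances; the service runs in 3 locations and
# must survive the loss of any one of them without acquiring capacity.
per_location = instances_per_location(peak_instances=12, locations=3)
total = per_location * 3
```

Here each location runs 6 instances, 18 in total, so the service carries 50% more capacity than peak load requires in exchange for needing no new resources when a location is lost.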
It’s a danger to stay on old versions of operating
systems, frameworks, or third-party software
• More than just operating systems
• Operating systems
• Frameworks like Spring, Angular, and more
• Other third-party software like libraries
• Ensure you keep up to date
• Can be more than an availability concern: Equifax had an old version of Struts that exposed their
customer data
• This is part of the corporate-wide topics communicated in the Ops
meetings

Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AWS re:Invent 2018
