Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reliability of the Cloud:
How AWS Achieves High Availability
Rodney Lester
Reliability Lead
AWS Well Architected
ARC317
Shaun Ray
Manager
AWS Evangelism

Agenda
Well-Architected Reliability Pillar
Once upon a time … (stories)
Availability design goals

Breakout repeats
Tuesday, November 27
ARC317-R [REPEAT] Reliability of the Cloud: How AWS
Achieves High Availability
3:15 p.m. – 4:15 p.m. | Aria East, Level 1, Joshua 4
Thursday, November 29
ARC317-R [REPEAT 1] Reliability of the Cloud: How AWS
Achieves High Availability
11:30 a.m. – 12:30 p.m. | Mirage, Antigua A

Related breakouts
Wednesday, November 28
ARC335-R1 Failing Successfully in the Cloud: AWS Approach to
Resilient Design
12:15 p.m. – 1:15 p.m. | Aria East, Level 2, Mariposa 8
Thursday, November 29
ARC335-R2 Failing Successfully in the Cloud: AWS Approach to
Resilient Design
4:00 p.m. – 5:00 p.m. | MGM, Level 3, South Concourse 302
Wednesday, November 28
ARC408 Under the Hood of Route 53
11:30 a.m. – 12:30 p.m. | Venetian, Level 4, Lando 4305

AWS Well-Architected Reliability Pillar
• Completely refreshed December 2017
• Additional changes approximately every three months
• The plan is to make it more dynamic in the future; a new version will be released soon
• Significant changes
• Calculating availability
• Application design primer
• Examples, at different design goals
• Appendix contains design goals of 37 AWS services
• More added in each revision and will continue
• These concepts are used to develop services
https://aws.amazon.com/well-architected/
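The "Calculating availability" item can be illustrated with standard reliability arithmetic. The sketch below is not taken from the slides: it assumes hard dependencies fail independently (so their availabilities multiply) and that redundant copies also fail independently (so the service is down only when every copy is down). The function names and the sample figures are made up for illustration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*availabilities):
    """Availability of a service that needs every hard dependency to be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of independent redundant copies where one copy is enough."""
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical service with two hard dependencies.
    chained = serial_availability(0.9999, 0.9999, 0.999)
    print(f"service with hard dependencies: {chained:.6f}")
    print(f"two independent copies of a 99% component: "
          f"{redundant_availability(0.99, 2):.6f}")
    print(f"downtime at 99.99%: {downtime_minutes_per_year(0.9999):.2f} min/year")
```

The point of the arithmetic is that every hard dependency lowers the ceiling on the availability a service can reach, while independent redundancy raises it.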

AWS uses the information in this white paper

How does this relate to how AWS builds services?
• This document was written in consultation with AWS principal
engineers
• The techniques described are quite proven
• All of the techniques described have articles or books written about
them

Ops meetings
• David Lubell and Kevin Miller conducted a chalk talk in 2017 on how
we run our ops meeting
• Review critical services every week in a two-hour meeting
• Charlie Bell (SVP, AWS Operations) leads the meeting
• Senior leaders of the services
• Representation from every AWS service
• Service metrics reviews
• 130+ services * 10 min/service = 22-hr meeting?
• How do we ensure all services are ready every week?

Service review
• Now open source
• http://bit.ly/aws-wheel
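The slides do not describe how the wheel chooses a service. A per-service review of 130+ services would take roughly 22 hours, so one plausible reading is that a random pick decides who presents, which forces every team to arrive prepared. The sketch below assumes a weighted random selection that steers away from recently chosen services; the weighting rule, the function name `spin`, and the service names are assumptions, not the tool's actual algorithm.

```python
import random


def spin(weights, rng):
    """Pick one service at random, then bias future spins away from it.

    `weights` maps service name -> weight and is updated in place.
    Assumed rule: the chosen service drops to weight 0, all others gain 1.
    Needs at least two services so the total weight stays above zero.
    """
    names = list(weights)
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    for name in names:
        weights[name] = 0.0 if name == chosen else weights[name] + 1.0
    return chosen


if __name__ == "__main__":
    rng = random.Random(2018)  # fixed seed only to make the demo repeatable
    services = {"service-a": 1.0, "service-b": 1.0, "service-c": 1.0}
    for week in range(1, 5):
        print(f"week {week}: {spin(services, rng)} presents")
```

Because the previous pick has weight 0, the same service is never chosen twice in a row under this assumed rule, yet no team can predict when its turn will come.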

The things that happen once in a million happen all the
time in AWS
• Some commonly observed problems:
• Our back end service was having no problems, now it’s overloaded
• An occasional huge spike in traffic that quickly disappears causes problems
• Average response time to requests is slowly creeping up, but the p99 is exponential
• Observe a rise in failed requests “The service/region is failing”
• Experienced a failure, on recovery, we’re receiving duplicate requests that are all errors
• Cannot adapt fast enough to the huge changes in demand up or down
• Dependency on a less reliable system
• No problems until a system that was dependent on us went down, then we went down
• Couldn’t get capacity quick enough when a location went down
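The point about average response time versus p99 can be made concrete with a small calculation. This is a sketch on made-up latency samples, using a nearest-rank percentile; the helper names `percentile` and `summarize` are illustrative and not from the talk.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100  # integer ceil(n * pct / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return the mean, median, and p99 of a list of latencies."""
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    healthy = [100] * 1000                 # every request takes 100 ms
    degraded = [100] * 985 + [2000] * 15   # 1.5% of requests hit a slow path
    print("healthy: ", summarize(healthy))
    print("degraded:", summarize(degraded))
```

In the degraded sample only 1.5% of requests take a slow path, so the mean moves modestly while the p99 lands on the slow requests. Watching only the average would hide the problem.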

Common causes of such problems (cont.)
• Our back end service was having no problems, now it’s overloaded
• Someone deployed a service that uses our service and the requests are much more than
planned/expected
• Someone in marketing is running a campaign and didn’t tell us; our service is not alone
• A bug exists that causes repeated requests to our service, either a new deployment, or a
latent bug
• We see an occasional huge spike in traffic that quickly disappears
• Some kind of edge case exists where things go normally, then under a condition, some kind
of rebuilding of a data model happens
• Someone in marketing is running a campaign and didn’t tell us; our service is not alone
• A bug exists that causes repeated requests to our service, either a new deployment, or a
latent bug

• Average response time to requests is slowly creeping up, but the p99 is
exponential
• This can be an indicator of impending problems
• There is a use case that executes a different path, either on your service, or a dependency
• Observe a rise in failed requests “The service/region is failing”
• There may be an event (known internally as a Large Scale Event) occurring
• Maybe a transient problem
• Can often be better to wait it out rather than fail over
• Experienced a failure, on recovery, we’re receiving duplicate requests
that are all errors
• Even if you are not distributed, it is possible that the invoking service has no idea you were
successful in processing some requests
• Idempotency tokens can be used

• Cannot adapt fast enough to the huge changes in demand up or down
• Need good communication paths with business drivers of traffic
• You can have the system constantly performing tasks that are replaced by requests from
consumers of your service
• Dependency on a less reliable system
• Can turn this into a soft dependency if you can find an acceptable replacement state
• This usually needs to be negotiated with the product owners
• No problems until a system that was dependent on us went down, then
we went down
• Commonly known as a cascading failure
• Not always a failure (see previous examples of spiky traffic)
• Example of “bi-modal behavior”
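Turning a hard dependency into a soft one can be sketched as "serve the last known good value when the dependency fails". The class name `SoftDependency`, the `fetch` callable, and the default value are all hypothetical; whether a stale or default value is an acceptable replacement state is the product decision the slide mentions.

```python
class SoftDependency:
    """Wrap a dependency call so that a failure degrades instead of failing."""

    def __init__(self, fetch, default):
        self._fetch = fetch          # callable that talks to the dependency
        self._last_good = default    # replacement state agreed with the owners

    def get(self):
        """Return (value, is_fresh); never raises because of the dependency."""
        try:
            value = self._fetch()
        except Exception:
            return self._last_good, False   # stale or default, but usable
        self._last_good = value
        return value, True


if __name__ == "__main__":
    healthy = {"up": True}

    def fetch_discounts():
        if not healthy["up"]:
            raise ConnectionError("discount service unavailable")
        return {"discount_percent": 10}

    discounts = SoftDependency(fetch_discounts, default={"discount_percent": 0})
    print(discounts.get())    # fresh value
    healthy["up"] = False
    print(discounts.get())    # last good value, flagged as not fresh
```

Returning the freshness flag lets the caller decide whether to surface a degraded experience rather than hiding the fact that the data may be old.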

• Couldn’t get capacity quick enough when a location went down
• Pilot light or running at high utilization can cause a brown out when failure occurs
• Need to be able to take a loss of a location and service the traffic immediately
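The brownout risk can be checked with simple arithmetic. The sketch assumes traffic is spread evenly across identical locations and is redistributed evenly to the survivors after a failure; the function names and the example utilizations are illustrative.

```python
def load_after_failure(locations, utilization, failures=1):
    """Utilization of each surviving location once `failures` locations are lost."""
    survivors = locations - failures
    if survivors <= 0:
        raise ValueError("no locations left to serve traffic")
    return utilization * locations / survivors


def max_safe_utilization(locations, failures=1):
    """Highest steady-state utilization that still fits on the survivors."""
    return (locations - failures) / locations


if __name__ == "__main__":
    for util in (0.60, 0.80):
        after = load_after_failure(3, util)
        verdict = "fits" if after <= 1.0 else "brownout"
        print(f"3 locations at {util:.0%}: survivors run at {after:.0%} ({verdict})")
    print(f"ceiling for 3 locations: {max_safe_utilization(3):.1%}")
```

Under these assumptions, three locations can each run at no more than about two thirds of capacity if the remaining two must absorb the load immediately, without waiting to acquire new resources.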

Service Design Goals
• Not SLAs
• Managed to in the weekly ops meeting
• Currently document 37 services
• Adding more as I work with services to establish them
• Control Plane versus Data Plane
• Control plane mutates resources (bi-modal!) and data plane is the “day job”
• Control plane is often more “dangerous” and therefore less available (not always!)

Thank you!
Rodney Lester
rodneyle@amazon.com
Shaun Ray
shaunray@amazon.com

Software/implementation has an impact on
availability
• Throttling
• Protect your service by refusing requests when out of capacity
• Exponential back off for retries
• This is an art and a science; built into the AWS SDKs
• Fail fast
• Users will retry on failure, so this can allow your system to recover faster
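The three bullets above can be sketched together: a server-side token bucket that refuses work when out of capacity, and a client that retries with exponential backoff and then fails fast. This is not the AWS SDK implementation. The token-bucket choice, the "full jitter" randomization, and every name and default value here are assumptions made for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Raised when a request is refused because capacity is exhausted."""


class TokenBucket:
    """Allow `rate_per_sec` requests per second with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should reject quickly, e.g. raise ThrottledError


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Retry throttled calls, waiting a random time up to an exponential bound."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise   # retry budget spent: fail fast instead of piling on
            sleep(rng() * min(cap, base * 2 ** attempt))


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=5, burst=2)

    def handler():
        if not bucket.allow():
            raise ThrottledError("out of capacity")
        return "ok"

    print([call_with_retries(handler) for _ in range(4)])
```

The random component spreads retries out so that clients throttled at the same moment do not all return at the same moment. The clock, sleep, and random sources are parameters only so the behavior can be tested without waiting.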

More advanced implementation patterns
• Idempotency
• You have a choice: “at most once” semantics, or “at least once.” Choose the latter.
• Constant work
• If you have a system that is always performing work, and you replace that work with user
requests, you have a system that is much more predictable
• Colm MacCarthaigh has a tweet thread on this:
https://twitter.com/colmmacc/status/1039228121327648768
• Circuit breaker
• Can be used to remove hard dependencies in your availability calculation

Bi-modal behavior and static stability
• Cascading failures are often from “bi-modal” behavior
• I’ve seen this often: an anomaly causes a huge change in system behavior
• Static stability
• On loss of capacity, you want to be able to handle your current load with no need to acquire
resources

It’s a danger to stay on old versions of operating
systems, frameworks, or third-party software
• More than just operating systems
• Operating systems
• Frameworks like Spring, Angular, and more
• Other third-party software like libraries
• Ensure you keep up to date
• Can be more than an availability concern: Equifax had an old version of Struts that exposed their
customer data
• This is part of the corporate-wide topics communicated in the Ops
meetings
