Lambda gives you a lot of scalability and multi-AZ redundancy out of the box, but things can still go wrong in production.
There are region-wide outages, and performance degradation in services your function depends on can cause it to time out or error. And what if you're dealing with downstream systems that just aren't as scalable and can't handle the load you put on them?
The bottom line is that many things can go wrong, and they often do at the worst times. The goal of building resilient systems is not to prevent failures, but to build systems that can withstand them. In this talk, we will look at a number of practices and architectural patterns that can help you build more resilient serverless applications, such as multi-region active-active deployments, DLQs and surge queues, and chaos experiments that identify failure modes before they manifest in production.
The recording is available here: https://www.youtube.com/watch?v=elVeOYYtLM0
6. You Shall Not Fail!
in the face of turbulent conditions
what is RESILIENCE
chaos ENGINEERING
multi-region STRATEGIES
retries & TIMEOUTS
lambda SCALING
decoupled INVOCATION
7. PRODUCERS
Yan Cui, @theburningmonk
Sara Gerion, @sarutule
SPEAKERS
Yan Cui, @theburningmonk
Sara Gerion, @sarutule
SPEAKING AT
AWS Community Summit Online
SPECIAL THANKS
Phil Horn
Joe Park
12. SARA GERION
Italian living in Amsterdam, The Netherlands
Passionate about cloud, scalability, resilience
Twitter: @Sarutule
Backend engineer at DAZN
@dazneng
Director of Tech at SheSharp
@SheSharpNL
17. @theburningmonk
@sarutule
REST API - Lambda autoscaling
Concurrency limits:
3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
500 – Other Regions
Later bursts: 500 new containers each minute
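To make the numbers above concrete, here is a rough sketch of how long a function would take to scale to a given concurrency. It assumes a simplified linear model (an initial burst, then 500 new containers each minute) and ignores account-level concurrency limits and provisioned concurrency. The function name `minutesToReach` and its parameters are made up for illustration.

```javascript
// Simplified model of the scaling behaviour on the slide: an initial burst,
// then a fixed number of new containers per minute.
// Assumption: account-level limits and provisioned concurrency are ignored.
function minutesToReach(targetConcurrency, initialBurst, perMinute = 500) {
  if (targetConcurrency <= initialBurst) return 0;
  return Math.ceil((targetConcurrency - initialBurst) / perMinute);
}

console.log(minutesToReach(5000, 3000)); // initial burst of 3000, e.g. Europe (Ireland)
console.log(minutesToReach(5000, 1000)); // initial burst of 1000, e.g. Europe (Frankfurt)
console.log(minutesToReach(5000, 500));  // initial burst of 500, other regions
```

Under this model, reaching 5000 concurrent executions takes about 4, 8, and 9 minutes respectively, which is why a sudden spike can still be throttled.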
18. REST API - Lambda autoscaling
X number of execution environments pre-initialized (ready to respond to invocations)
Note: standard burst concurrency limits apply when over the provisioned capacity
19. REST API - Lambda autoscaling
Adjustable provisioned capacity based on CloudWatch metrics
33. Possible mitigations for REST APIs
Use 1 Lambda for each endpoint
Optimise performance
Offload computing operations to an async flow (SQS, SNS, …)
Raise limits with an AWS support ticket
36. Possible mitigations for REST APIs
Use 1 Lambda for each endpoint
Optimise performance
Offload computing operations to an async flow (SQS, SNS, …)
Use provisioned capacity (plus autoscaling)
Raise limits with an AWS support ticket
37. Reminder: beware of long timeouts
API Gateway integration timeout: default 29s
Lambda timeout: max 15 minutes
SQS visibility timeout: default 30s, min 0s, max 12 hours
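These timeouts interact, so it can help to check them against each other. The sketch below is one possible sanity check, not an AWS tool: `checkTimeouts` and its parameter names are hypothetical. The factor of 6 for the SQS visibility timeout follows AWS's guidance for Lambda functions that consume from SQS, and is exposed as a parameter in case you use a different rule.

```javascript
const API_GATEWAY_INTEGRATION_TIMEOUT_SEC = 29; // value from the slide

// Returns a list of warnings for timeout combinations that tend to cause trouble.
function checkTimeouts({
  lambdaTimeoutSec,
  behindApiGateway = false,
  sqsVisibilityTimeoutSec, // leave undefined if the function has no SQS trigger
  visibilityFactor = 6,    // assumption: AWS guidance for SQS event sources
}) {
  const warnings = [];
  if (behindApiGateway && lambdaTimeoutSec > API_GATEWAY_INTEGRATION_TIMEOUT_SEC) {
    // The caller gets an error at 29s while the function may keep running.
    warnings.push("Lambda timeout exceeds the API Gateway integration timeout");
  }
  if (
    sqsVisibilityTimeoutSec !== undefined &&
    sqsVisibilityTimeoutSec < visibilityFactor * lambdaTimeoutSec
  ) {
    // Messages may become visible again while they are still being processed.
    warnings.push("SQS visibility timeout is short relative to the Lambda timeout");
  }
  return warnings;
}

console.log(checkTimeouts({ lambdaTimeoutSec: 60, behindApiGateway: true }));
console.log(checkTimeouts({ lambdaTimeoutSec: 30, sqsVisibilityTimeoutSec: 30 }));
```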
47. “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
70. chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region
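The latency monkey idea can be approximated in application code by delaying a fraction of calls. The sketch below is not the tool named on the slide, only an illustration of the technique; `withInjectedLatency`, `shouldInjectLatency`, `rate`, and `delayMs` are hypothetical names.

```javascript
// Decide whether this call should be delayed. `rand` is injectable for testing.
function shouldInjectLatency(rate, rand = Math.random) {
  return rand() < rate;
}

// Wrap an async handler so that roughly `rate` of the calls wait `delayMs`
// before the real handler runs. The handler's result is returned unchanged.
function withInjectedLatency(handler, { rate = 0.1, delayMs = 1000 } = {}, rand = Math.random) {
  return async (...args) => {
    if (shouldInjectLatency(rate, rand)) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    return handler(...args);
  };
}

// Usage: delay about half of the calls by 200ms.
const slowHandler = withInjectedLatency(async (event) => ({ statusCode: 200 }), {
  rate: 0.5,
  delayMs: 200,
});
```

Keeping `rate` and `delayMs` configurable makes it easy to switch the experiment off or limit its blast radius.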
83. TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s fav formula!
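The formula above is exponential backoff with full jitter. As a rough illustration of what those defaults add up to, the sketch below computes the upper bound of the cumulative delay across all retries. It assumes `retryCount` starts at 0; `fullJitterDelay` and `worstCaseTotalDelay` are names made up for this example and are not part of the SDK.

```javascript
// Full-jitter backoff: a random delay between 0 and base * 2^retryCount.
// `rand` is injectable so the upper bound can be computed deterministically.
function fullJitterDelay(retryCount, baseMs = 50, rand = Math.random) {
  return rand() * (Math.pow(2, retryCount) * baseMs);
}

// Upper bound of the total time spent waiting between attempts.
// Assumption: retryCount runs from 0 to maxRetries - 1.
function worstCaseTotalDelay(maxRetries = 10, baseMs = 50) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += fullJitterDelay(retryCount, baseMs, () => 1);
  }
  return total;
}

console.log(worstCaseTotalDelay()); // upper bound in milliseconds
```

With 10 retries and a 50ms base, the bound is 51150ms (about 51 seconds) of waiting alone, well beyond API Gateway's 29s integration timeout, so it is worth setting the retry count explicitly.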
92. TIL: most HTTP client libraries have default timeout of 60s.
API Gateway has an integration timeout of 29s.
Most Lambda functions default to timeout of 3-6s.
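One way to close the gap between these defaults is to derive the timeout for outbound requests from the time the invocation has left. The sketch below is a minimal version of that idea. `context.getRemainingTimeInMillis()` is provided by the Lambda Node.js runtime, while `requestTimeoutMs`, `reserveMs`, and `maxMs` are hypothetical names and values chosen for illustration.

```javascript
// Compute a timeout for an outbound call from the invocation's remaining time.
// reserveMs: time held back so the function can still respond or log on failure.
// maxMs: upper cap so a single slow dependency cannot use the whole budget.
function requestTimeoutMs(context, reserveMs = 500, maxMs = 3000) {
  const remaining = context.getRemainingTimeInMillis() - reserveMs;
  return Math.max(0, Math.min(remaining, maxMs));
}

// Usage with a stand-in context object (the real one is passed to the handler).
const fakeContext = { getRemainingTimeInMillis: () => 1200 };
console.log(requestTimeoutMs(fakeContext));
```

The result can then be passed to whatever timeout option your HTTP client exposes, for example `AbortSignal.timeout(ms)` with `fetch` on recent Node.js versions.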