Applying principles of chaos engineering to Serverless (CodeMotion Berlin)

Yan Cui @theburningmonk
Berlin | November 20 - 21, 2018

Agenda
What is chaos engineering?
New challenges with serverless
Applying latency injection to serverless
Applying error injection to serverless

Chaos Engineering is the discipline of experimenting on a distributed system 
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
- principlesofchaos.org

Smallpox
Earliest evidence of disease in 3rd Century BC Egyptian Mummy.
est. 400K deaths per year in 18th Century Europe.

History of vaccination
First vaccine was developed in
1798 by Edward Jenner.
WHO certified global eradication
in 1980.
https://en.wikipedia.org/wiki/Edward_Jenner

https://en.wikipedia.org/wiki/Vaccine

Vaccination is the most effective method to prevent infectious diseases.
Vaccines stimulate the immune system to recognise and destroy the
disease before contracting it for real.

Chaos engineering
Use controlled experiments to inject failures into our system.

Chaos engineering
Help us learn about our system’s behaviour and uncover unknown failure
modes, before they manifest like wildfire in production.
Lets us build confidence in its ability to withstand turbulent conditions.

Chaos Engineering is the vaccine to frailties in modern software.

Who am I?
Principal Engineer at DAZN.
AWS Serverless Hero.
Author of Production-Ready Serverless* course by Manning.
Blogger**, speaker.
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com

https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914

https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn

About DAZN
Available in 7 countries - Austria, Switzerland, Germany,
Japan, Canada, Italy and USA.
Available on 30+ platforms.
Around 1,000,000 concurrent viewers at peak.

follow @dazneng for
updates about the
engineering team
We’re hiring! Visit
engineering.dazn.com
to learn more.

Chaos engineering has an image problem

Too much emphasis is on breaking things.
Easy to conflate the action of injecting failures with the payback.

The goal is to learn about the system and build confidence.
The goal is not to break things.

Chaos in practice
4 steps to start running chaos
experiments yourself.

STEP 1. Define “steady state”
aka. what does a normal, working
condition look like?

STEP 2. Hypothesise steady state will
continue in both the control group
& the experiment group
i.e. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment

Chaos in practice
Explore unknown unknowns away from production.
Experiments that graduate to production should be carefully
considered and planned.
You should have reasonable confidence in the system before
running experiments in production.

Chaos in practice
Treat production with the care it deserves.

Chaos in practice
If you know the system would break and you did it anyway,
then it’s not a chaos experiment!
It’s called being irresponsible.

STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.

https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn

STEP 4. Disprove hypothesis
i.e. look for a difference in steady state

Chaos in practice
Look for evidence that steady state was impacted by the
injected failure.
Address weaknesses before failures happen for real.
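One way to look for that evidence is to compare the same steady-state metrics across the control group and the experiment group. The sketch below is only an illustration: the metric names and tolerance values are assumptions, not something the talk prescribes.

```python
def steady_state_violations(control, experiment, tolerances):
    """Return the metrics whose experiment value drifted beyond its tolerance.

    control / experiment: dicts of metric name -> measured value.
    tolerances: dict of metric name -> largest acceptable absolute difference
    (hypothetical thresholds, to be agreed before the experiment starts).
    """
    violations = {}
    for metric, tolerance in tolerances.items():
        delta = experiment[metric] - control[metric]
        if abs(delta) > tolerance:
            violations[metric] = delta
    return violations


# An empty result means the hypothesis survived this run;
# anything else points at a weakness worth addressing.
control = {"p99_ms": 200, "error_rate": 0.01}
experiment = {"p99_ms": 900, "error_rate": 0.012}
print(steady_state_violations(control, experiment, {"p99_ms": 100, "error_rate": 0.01}))
```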

Containment
Experiments need to be controlled.

Containment
Ensure everyone knows what you are doing.
Don’t surprise your teammates.

Containment
Run experiments during office hours.
Avoid important dates.

Containment
Make the smallest change necessary to prove or disprove hypothesis.
Have a rollback plan.
Stop the experiment right away if things start to go wrong.

Containment
Don’t start in production.
Can learn a lot by running experiments in staging.

by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361

New challenges with serverless

chaos monkey kills an EC2
instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an AWS
Availability Zone
chaos kong kills an entire
AWS region

Serverless challenges
There are no servers that you can access and kill.

There is more inherent chaos and complexity in
a serverless architecture.

Smaller unit of deployment, but a lot more of them.

Serverless challenges
[Diagram: serverful vs. serverless]

Every function needs to be correctly configured and secured.

[Diagram: Kinesis, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES]

A lot of managed, intermediate services.
Each with its own set of failure modes.

Unknown failure modes in the infrastructure we don’t control.
Often there’s little we can do when an outage occurs in the platform.

Common weaknesses
Improperly tuned timeouts.
Missing error handling.
Missing fallback.
Missing regional failover.

Latency injection with serverless

Defining steady state
What metrics do you use?
p95/p99 latencies, error count, backlog size, yield*, harvest**
* percentage of requests completed
** completeness of the returned response
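As a rough illustration of these metrics, the sketch below derives them from a batch of request records. The `Request` shape (including the field counts used for harvest) is an assumption made for the example; in practice the numbers would come from your monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    completed: bool        # did the caller get a response at all?
    fields_returned: int   # how much of the response was populated
    fields_expected: int


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(requests):
    latencies = [r.latency_ms for r in requests]
    completed = [r for r in requests if r.completed]
    expected = sum(r.fields_expected for r in completed)
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(requests) - len(completed),
        # yield: percentage of requests completed (kept here as a fraction)
        "yield": len(completed) / len(requests),
        # harvest: completeness of the returned responses
        "harvest": sum(r.fields_returned for r in completed) / expected if expected else 0.0,
    }
```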

API Gateway
Serverless considerations

Serverless considerations
Consider the effect of cold starts.
How does it affect your strategy
for handling slow responses?

Request timeouts
Strategy should:
1. Give requests the best chance to succeed
2. Not allow a slow response to time out the calling function

Request timeouts
Finding the right timeout value is tricky.

Request timeouts
Too short: requests are not given the best chance to succeed.

Request timeouts
Too long: risk timing out the calling function.
Even more complicated when you have multiple integration points.

Approach 1: split invocation time equally
(e.g. 3 requests, 6s function timeout = 2s timeout per request)

Approach 2: every request is given nearly all the invocation time
(e.g. 3 requests, 6s function timeout = 5s timeout per request)
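The arithmetic behind the two approaches can be sketched as follows. The function names are made up for the example, and the 1s reserve in the second approach is simply read off the 6s/5s figures above.

```python
def equal_split_timeout_ms(function_timeout_ms, request_count):
    """Approach 1: divide the invocation time equally between requests."""
    return function_timeout_ms // request_count


def near_full_timeout_ms(function_timeout_ms, reserve_ms=1000):
    """Approach 2: give every request nearly all of the invocation time."""
    return function_timeout_ms - reserve_ms


def worst_case_total_ms(per_request_timeout_ms, request_count):
    """Time spent if every sequential request runs right up to its timeout."""
    return per_request_timeout_ms * request_count


# Approach 1: 2s per request, so a request that needs 3s fails even when
# the other two finish quickly and leave invocation time unused.
print(equal_split_timeout_ms(6000, 3))
# Approach 2: 5s per request, so three slow requests can add up to 15s
# and time out the 6s function.
print(worst_case_total_ms(near_full_timeout_ms(6000), 3))
```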

Request timeouts
Proposal: set request timeouts dynamically based on
invocation time left

Set timeout based on remaining invocation time
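A minimal sketch of this idea for a Python Lambda function, where the context object exposes `get_remaining_time_in_millis()`. The 500ms reserve, the `do_request` callable (assumed to take a timeout in seconds and raise `TimeoutError`) and the `fallback` callable are all assumptions for illustration.

```python
def request_timeout_ms(context, reserve_ms=500):
    """Timeout for the next request: whatever invocation time is left,
    minus a reserve for the function's own recovery steps."""
    remaining_ms = context.get_remaining_time_in_millis()
    return max(remaining_ms - reserve_ms, 0)


def call_with_budget(context, do_request, fallback, reserve_ms=500):
    """Run do_request within the remaining budget, or fall back.

    do_request(timeout_s) and fallback() are hypothetical callables
    supplied by the caller.
    """
    timeout_ms = request_timeout_ms(context, reserve_ms)
    if timeout_ms == 0:
        # No budget left: do not even attempt the request.
        return fallback()
    try:
        return do_request(timeout_ms / 1000)
    except TimeoutError:
        return fallback()
```

Because the budget is recomputed before each call, earlier requests that finish quickly leave more time for later ones.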

Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, etc.

Recovery steps
Record custom metrics.
Use fallbacks.
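The recovery steps might be combined along these lines. This is a sketch only: the `metrics` dict stands in for whatever custom-metrics client you use, and the helper name and log fields are illustrative.

```python
import json
import logging

logger = logging.getLogger("timeouts")


def with_fallback(call, fallback, *, api, timeout_ms, correlation_id, metrics):
    """Run call(); on timeout, log the context, count it, and use the fallback."""
    try:
        return call()
    except TimeoutError:
        # Log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api,
            "timeout_ms": timeout_ms,
            "correlation_id": correlation_id,
        }))
        # Record a custom metric (a plain counter in this sketch).
        key = api + ".timeout"
        metrics[key] = metrics.get(key, 0) + 1
        # Use the fallback, e.g. a cached or default response.
        return fallback()
```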

Recovery steps
Be mindful when you sacrifice precision for availability.
User experience is king.

hypothesis:
Function has appropriate timeout on its HTTP communications
and can degrade gracefully when these requests time out

Where to inject latency?
Should be applied to 3rd party services too.
DynamoDB, Twilio, Auth0, …

Be mindful of the blast radius of the experiment.

http client
public-api-a
http client
public-api-b
internal-api

hypothesis:
All functions have appropriate timeout on their HTTP
communications to this internal API, and can degrade
gracefully when requests are timed out

Large blast radius, can cause cascade failures unintentionally

Priming (psychology):
Priming is a technique whereby exposure to one stimulus
influences a response to a subsequent stimulus, without
conscious guidance or intention.
It is a technique in psychology used to train a person's
memory both in positive and negative ways.

Use failure injection to programme your colleagues into
thinking about failure modes early.

Make X% of all requests slow
in the dev environment.
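A very small sketch of the sampling decision, assuming the stage name is available to the function (for example from an environment variable) and that a fixed delay is good enough for priming purposes.

```python
import random
import time


def maybe_delay(stage, rate, delay_s, rng=random.random, sleep=time.sleep):
    """Slow down roughly `rate` (0.0 to 1.0) of requests, in dev only.

    Returns True when a delay was injected. rng and sleep are injectable
    so the behaviour can be tested without waiting.
    """
    if stage != "dev":
        return False
    if rng() < rate:
        sleep(delay_s)
        return True
    return False
```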

hypothesis:
The client app has appropriate timeout on its HTTP
communication with the server, and can degrade gracefully
when requests are timed out

How to inject latency?
Static weavers (e.g. PostSharp, AspectJ).
Dynamic proxies.

https://theburningmonk.com/2015/04/design-for-latency-issues/

How to inject latency?
Manually crafted wrapper libraries.

configured in SSM Parameter Store

factory wrapper function
(think bluebird’s promisifyAll function)
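The same wrapper idea can be sketched in Python as a decorator factory. The config shape (`is_enabled`, `probability`, `min_delay_ms`, `max_delay_ms`) is made up for this example, and `get_config` stands in for a cached read from SSM Parameter Store.

```python
import functools
import random
import time


def inject_latency(get_config, rng=random, sleep=time.sleep):
    """Build a decorator that delays calls according to get_config().

    get_config is a hypothetical callable returning a dict such as
    {"is_enabled": True, "probability": 0.5,
     "min_delay_ms": 100, "max_delay_ms": 500}.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            config = get_config()
            if config.get("is_enabled") and rng.random() < config.get("probability", 0):
                delay_ms = rng.uniform(config["min_delay_ms"], config["max_delay_ms"])
                sleep(delay_ms / 1000)
            # The wrapped call always runs; only its latency changes.
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Reading the config on every call means the experiment can be switched off without redeploying, which helps with containment.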

Error injection with serverless

Common errors
HTTP 5XX
DynamoDB provisioned throughput exceeded
Throttled Lambda invocations

hypothesis:
Function has appropriate error handling on its HTTP communications
and can degrade gracefully when downstream dependencies fail

hypothesis:
Function has appropriate error handling on DynamoDB operations and
can degrade gracefully when DynamoDB provisioned throughput is exceeded
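At the application level, both hypotheses can be exercised by wrapping the downstream call so that it sometimes raises. In this sketch `InjectedError` is a stand-in exception, not a real SDK error, and `get_profile` is a made-up handler showing what graceful degradation could look like.

```python
import random


class InjectedError(Exception):
    """Stand-in for a downstream failure, e.g. an HTTP 5XX or a
    DynamoDB provisioned-throughput-exceeded error."""


def inject_errors(func, rate, error_factory, rng=random.random):
    """Wrap func so that roughly `rate` (0.0 to 1.0) of calls raise instead."""
    def wrapper(*args, **kwargs):
        if rng() < rate:
            raise error_factory()
        return func(*args, **kwargs)
    return wrapper


def get_profile(user_id, fetch):
    """Hypothetical handler: return a reduced profile when the fetch fails."""
    try:
        return fetch(user_id)
    except InjectedError:
        return {"user_id": user_id, "degraded": True}
```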

Where to inject errors?
Induce Lambda throttling by temporarily setting reserved concurrency.
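A sketch of doing this with boto3's Lambda client, using its `put_function_concurrency` and `delete_function_concurrency` calls. The context manager name and the `previous` parameter are assumptions for illustration; the client is passed in so the rollback logic can be tried against a fake first.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, limit=0, previous=None):
    """Temporarily set reserved concurrency, then roll back.

    lambda_client: e.g. boto3.client("lambda") (not created here).
    limit: reserved concurrency during the experiment; 0 throttles
    every invocation.
    previous: the function's existing reserved concurrency, if it had
    one, so it can be restored instead of deleted.
    """
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    try:
        yield
    finally:
        # Roll back even if the experiment is aborted with an exception.
        if previous is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=previous,
            )
```

Wrapping the change in a context manager keeps the rollback plan next to the change itself, in line with the containment advice earlier.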

the only way to truly know your system’s
resilience against failures is to test it
through CONTROLLED experiments

the goal of chaos engineering is NOT to
actually break production

CONTAINMENT should be front and
centre of your thinking

there is more inherent chaos and
complexity in a serverless application

even without servers, you can still inject
CONTROLLED failures at the application level

API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/prod-ready-serverless
get 40% off with:
ytcui

@theburningmonk
theburningmonk.com
github.com/theburningmonk
