Applying principles of chaos engineering to Serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Principal Engineer @
“Netflix for sports”
offices in London, Leeds, Katowice and Tokyo
We’re hiring ;-)
http://engineering.dazn.com
history of Smallpox
est. 400K deaths per year in 18th Century Europe.
earliest evidence of disease in 3rd Century BC Egyptian Mummy
1798
first vaccine developed
Edward Jenner
1980
WHO certified
global eradication
Vaccination is the most effective method of
preventing infectious diseases
stimulates the immune system to recognize
and destroy the disease before contracting
the disease for real
Chaos Engineering
controlled experiments to help us learn about
our system’s behaviour and build confidence
in its ability to withstand turbulent conditions
it’s about building confidence,
NOT breaking things
I’m gonna inject
you with a deadly
disease now
http://principlesofchaos.org
STEP 1. define “Steady State”
a.k.a. what does a normal, working
condition look like?
this is not a
steady state
STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
explore unknown unknowns
away from production
treat production with the
care it deserves
the goal is NOT
to actually hurt production
If you know the system would break,
and you did it anyway…
then it’s NOT a chaos experiment.
It’s called being IRRESPONSIBLE.
STEP 3.
inject realistic failures
e.g. server crash, network error,
HD malfunction, etc.
https://github.com/Netflix/SimianArmy
http://oreil.ly/2tZU1Sn
STEP 4.
disprove hypothesis
i.e. look for difference with steady state
if a WEAKNESS is uncovered,
IMPROVE it before the behaviour
manifests in the system at large
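A minimal sketch of what "look for difference with steady state" could mean in practice: compare the same metrics for the control group and the experiment group, and flag any that drift beyond a tolerance. The metric names and the 10% tolerance are assumptions for illustration, not values from the talk.

```python
from typing import Dict, List


def steady_state_deviations(
    control: Dict[str, float],
    experiment: Dict[str, float],
    tolerance: float = 0.10,  # assumed: 10% relative drift is acceptable
) -> List[str]:
    """Return the names of metrics where the experiment group differs
    from the control group by more than the relative tolerance."""
    deviating = []
    for name, baseline in control.items():
        observed = experiment[name]
        allowed = abs(baseline) * tolerance
        if abs(observed - baseline) > allowed:
            deviating.append(name)
    return deviating


# hypothetical numbers for one experiment window
control = {"p99_latency_ms": 120.0, "error_rate": 0.01}
experiment = {"p99_latency_ms": 400.0, "error_rate": 0.01}
print(steady_state_deviations(control, experiment))
```

A non-empty result means the hypothesis is disproved and a weakness has been found.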
communication
ensure everyone knows what you’re doing
NO surprises!
Timing
run experiments during office hours
AVOID important dates
contain Blast radius
smallest change that allows
you to detect a signal that
steady state is disrupted
rollback at the first sign of
TROUBLE!
don’t try to run before you
know how to walk.
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
chaos monkey kills an
EC2 instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an
AWS Availability Zone
chaos kong kills an
entire AWS region
there is no server…
that you can kill
there is more inherent chaos and
complexity in a Serverless architecture
smaller units of deployment
but A LOT more of them!
more difficult to harden
around boundaries
serverful
serverless
SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
more intermediary services,
and greater variety too
each with its own set of
failure modes
more configurations,
more opportunities for misconfiguration
more unknown failure modes in
infrastructure that we don’t control
often there’s little we can do when an
outage occurs in the platform
improperly tuned timeouts
missing error handling
missing fallback when downstream is unavailable
LATENCY INJECTION
STEP 1. define “Steady State”
a.k.a. what does a normal, working
condition look like?
what metrics do you monitor?
9X-percentile latency
error count
yield (% of requests completed)
harvest (completeness of results)
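A small sketch of how yield and harvest could be computed from raw counts for one monitoring window. The field names and the sample numbers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    requests_received: int   # all requests seen in the window
    requests_completed: int  # requests that were answered successfully
    items_expected: int      # items a fully complete set of answers would contain
    items_returned: int      # items actually present in the answers


def yield_pct(w: Window) -> float:
    """Yield: % of requests completed."""
    if w.requests_received == 0:
        return 100.0
    return 100.0 * w.requests_completed / w.requests_received


def harvest_pct(w: Window) -> float:
    """Harvest: completeness of the results that were returned."""
    if w.items_expected == 0:
        return 100.0
    return 100.0 * w.items_returned / w.items_expected


# hypothetical window: 10 requests failed, and some responses were partial
w = Window(requests_received=1000, requests_completed=990,
           items_expected=5000, items_returned=4500)
print(yield_pct(w), harvest_pct(w))
```

A fallback that returns partial data keeps yield high at the cost of harvest, which is why both are worth tracking.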
STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
API Gateway
consider the effect of cold-starts
& API Gateway overhead
use short timeout for API calls
the goal of a timeout strategy is to give HTTP
requests the best chance to succeed,
provided that doing so does not cause the
calling function itself to err
fixed timeouts are tricky to get right…
too short and you don’t
give requests the best
chance to succeed
too long and you run the
risk of letting the request
time out the calling function
and it gets worse when you make multiple
API calls in one function…
set the request timeout based on the
amount of invocation time left
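A minimal sketch of this idea for a Python Lambda function, assuming a 500 ms reserve for the function's own recovery work and a 3 s cap per request (both numbers, the URL and the fallback response are hypothetical). `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object.

```python
import urllib.request


def request_timeout_seconds(
    remaining_ms: int,
    reserve_ms: int = 500,       # assumed: time kept back for fallback + logging
    max_timeout_ms: int = 3000,  # assumed: never wait longer than this
    min_timeout_ms: int = 100,   # assumed: floor so the timeout is never <= 0
) -> float:
    """Derive the HTTP timeout from the invocation time that is left."""
    budget_ms = remaining_ms - reserve_ms
    timeout_ms = max(min_timeout_ms, min(budget_ms, max_timeout_ms))
    return timeout_ms / 1000.0


def handler(event, context):
    timeout = request_timeout_seconds(context.get_remaining_time_in_millis())
    try:
        # hypothetical downstream endpoint
        url = "https://internal-api.example.com/items"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"statusCode": resp.status, "body": resp.read().decode("utf-8")}
    except OSError:
        # URLError and socket timeouts are both OSError subclasses;
        # degrade gracefully with a hypothetical empty fallback
        return {"statusCode": 200, "body": "[]"}
```

With several API calls in one function, recomputing the timeout before each call lets later calls see the smaller budget that is left.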
log the timeout incident with
as much context as possible
e.g. timeout value, correlation IDs,
request object, …
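One way this could look: a single structured JSON log line per timeout, so the incident can be searched by correlation ID later. The field names are assumptions, not a format prescribed by the talk.

```python
import json
import time
from typing import Any, Dict


def build_timeout_log(timeout_ms: int, correlation_id: str,
                      request: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the context that makes a timeout incident debuggable."""
    return {
        "level": "WARN",
        "message": "request timed out",
        "timeoutMs": timeout_ms,
        "correlationId": correlation_id,
        "request": request,
        "timestamp": int(time.time() * 1000),
    }


def log_timeout(timeout_ms: int, correlation_id: str,
                request: Dict[str, Any]) -> None:
    # one JSON object per line keeps the entry easy to parse downstream
    print(json.dumps(build_timeout_log(timeout_ms, correlation_id, request)))


# hypothetical incident
log_timeout(1500, "abc-123", {"method": "GET", "path": "/items"})
```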
report custom metrics
be mindful when you sacrifice precision for
availability; user experience is king
STEP 3.
inject realistic failures
e.g. server crash, network error,
HD malfunction, etc.
where to inject latency?
hypothesis:
the function has appropriate timeouts on its HTTP
communications and can degrade gracefully
when these requests time out
should also be applied to 3rd-party
services we depend on, e.g. DynamoDB
what’s the blast radius?
[diagram labels: http client, public-api-a, public-api-b, internal-api]
hypothesis:
all functions have appropriate timeouts on
their HTTP communications to this internal
API, and can degrade gracefully when
requests time out
large blast radius, risky…
could be effective when used away from
production environment, to weed out
weaknesses quickly
not priming developers to
build more resilient systems
development
production
Priming (psychology):
Priming is a technique whereby exposure to one
stimulus influences a response to a subsequent
stimulus, without conscious guidance or intention.
It is a technique in psychology used to train a
person's memory both in positive and negative ways.
make dev environments better resemble the
turbulent conditions you should realistically
expect your system to survive in production
hypothesis:
the client app has appropriate timeouts on
its HTTP communication with the server,
and can degrade gracefully when requests
time out
STEP 4.
disprove hypothesis
i.e. look for difference with steady state
how to inject latency?
static weaver (e.g. AspectJ, PostSharp),
or dynamic proxies
https://theburningmonk.com/2015/04/design-for-latency-issues/
manually crafted wrapper library
configured in SSM Parameter Store
no injected latency
with injected latency
factory wrapper function
(think bluebird’s promisifyAll function)
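A rough sketch of such a factory wrapper in Python: it takes a function and returns one that may sleep before calling through. The config keys (`isEnabled`, `probability`, `minDelay`, `maxDelay`) are hypothetical, and `get_config` stands in for whatever loads the settings, e.g. from SSM Parameter Store. Unlike promisifyAll, this wraps one function at a time.

```python
import random
import time
from typing import Any, Callable, Dict


def with_injected_latency(
    fn: Callable[..., Any],
    get_config: Callable[[], Dict[str, Any]],
    sleep: Callable[[float], Any] = time.sleep,
    rand: Callable[[], float] = random.random,
) -> Callable[..., Any]:
    """Return a wrapper around fn that adds artificial delay when enabled."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        cfg = get_config()
        # probability limits the blast radius to a fraction of calls
        if cfg.get("isEnabled") and rand() < cfg.get("probability", 1.0):
            min_ms = cfg.get("minDelay", 0)
            max_ms = cfg.get("maxDelay", min_ms)
            sleep(random.uniform(min_ms, max_ms) / 1000.0)
        return fn(*args, **kwargs)

    return wrapped


# hypothetical usage: delay 10% of calls by 100-300 ms
config = {"isEnabled": True, "probability": 0.1, "minDelay": 100, "maxDelay": 300}
get_items = with_injected_latency(lambda: ["a", "b"], lambda: config)
print(get_items())
```

Reading the config on every call means the experiment can be switched off without a redeploy, which helps with rolling back at the first sign of trouble.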
ERROR INJECTION
failures are INEVITABLE
the only way to truly know your system’s
resilience against failures is to test it
through controlled experiments
vaccinate your serverless
architecture against failures
@theburningmonk
theburningmonk.com
github.com/theburningmonk
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/production-ready-serverless
get 40% off with code:
ytcui

Applying principles of chaos engineering to serverless (ServerlessCPH)