Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.
Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.
But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?
These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.
Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?
Join us in this talk as Yan Cui shares his thought experiments, and actual experiments, in his pursuit to understand how we can apply the principles of chaos to a serverless architecture.
5. history of Smallpox
est. 400K deaths per year in 18th Century Europe.
earliest evidence of disease in 3rd Century BC Egyptian Mummy
6. history of Smallpox
1798
first vaccine developed
Edward Jenner
7. history of Smallpox
1980
WHO certified
global eradication
9. Vaccination is the most effective method of
preventing infectious diseases
10. stimulates the immune system to recognize
and destroy the disease before contracting
the disease for real
20. STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
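One way to make the step 2 hypothesis checkable is to compare an error metric between the two groups. The sketch below is only an illustration: the names `GroupStats` and `steady_state_holds` and the 1% tolerance are assumptions, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request counts observed for one group during the experiment window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat an idle group as healthy rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def steady_state_holds(control: GroupStats, experiment: GroupStats,
                       tolerance: float = 0.01) -> bool:
    """Return True if the experiment group's error rate is at most
    `tolerance` (an assumed threshold) above the control group's."""
    return experiment.error_rate - control.error_rate <= tolerance
```

If the check fails partway through an experiment, that is the signal to stop and roll back rather than carry on.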
46. by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
47. chaos monkey kills an
EC2 instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an
AWS Availability Zone
chaos kong kills an
entire AWS region
70. STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
74. the goal of a timeout strategy is to give HTTP
requests the best chance to succeed,
provided that doing so does not cause the
calling function itself to err
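A minimal sketch of one such timeout strategy, assuming a Python Lambda function: derive the HTTP timeout from the time left in the invocation, minus a reserve for the function's own recovery steps. The helper `http_timeout_seconds`, the URL, and the reserve, cap and floor values are all hypothetical; `context.get_remaining_time_in_millis()` is the method the Python Lambda runtime exposes on the context object.

```python
import urllib.request

# Hypothetical endpoint, used purely for illustration.
INTERNAL_API_URL = "https://internal-api.example.com/items"


def http_timeout_seconds(remaining_ms, reserve_ms=500,
                         max_timeout_ms=3000, min_timeout_ms=100):
    """Derive an HTTP timeout from the time left in the invocation.

    reserve_ms is kept back so the function can still run its fallback
    and respond; the cap and floor are assumed values.
    """
    budget_ms = remaining_ms - reserve_ms
    timeout_ms = max(min_timeout_ms, min(budget_ms, max_timeout_ms))
    return timeout_ms / 1000.0


def handler(event, context):
    timeout_s = http_timeout_seconds(context.get_remaining_time_in_millis())
    try:
        with urllib.request.urlopen(INTERNAL_API_URL, timeout=timeout_s) as resp:
            return {"statusCode": 200, "body": resp.read().decode("utf-8")}
    except OSError:
        # URLError and socket timeouts are both OSError subclasses:
        # degrade gracefully with an empty result instead of erroring.
        return {"statusCode": 200, "body": "[]"}
```

With 10 seconds left the request gets the 3 second cap; with 2 seconds left it gets 1.5 seconds, leaving the reserve for the fallback path.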
100. hypothesis:
all functions have appropriate timeouts on
their HTTP communications to this internal
API, and can degrade gracefully when
requests time out
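A hypothesis like this can be exercised by injecting artificial delay into the call and checking that the caller falls back instead of failing. The following is a rough sketch under assumptions: `inject_latency`, `call_with_timeout` and `get_recommendations` are hypothetical names, and a thread-based timeout stands in for a real HTTP client timeout.

```python
import concurrent.futures
import functools
import random
import time


def inject_latency(delay_s, rate=1.0):
    """Decorator that sleeps for delay_s before a fraction `rate` of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                time.sleep(delay_s)  # the injected failure: artificial delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_timeout(func, timeout_s, fallback):
    """Run func; return fallback if it does not finish within timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block on the slow call; let it finish in the background.
        pool.shutdown(wait=False)


@inject_latency(delay_s=0.3)
def get_recommendations():
    # Stand-in for the HTTP call to the internal API.
    return ["item-1", "item-2"]


# The caller's timeout is shorter than the injected delay, so the
# fallback (an empty list) is returned instead of an error.
result = call_with_timeout(get_recommendations, timeout_s=0.05, fallback=[])
```

The experiment passes if callers keep returning a degraded but valid response while the delay is active.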
109. Priming (psychology):
Priming is a technique whereby exposure to one
stimulus influences a response to a subsequent
stimulus, without conscious guidance or intention.
It is a technique in psychology used to train a
person's memory both in positive and negative ways.
112. make dev environments better resemble the
turbulent conditions you should realistically
expect your system to survive in production
113. hypothesis:
the client app has appropriate timeouts on
its HTTP communication with the server,
and can degrade gracefully when requests
time out