Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
Applying principles of chaos
engineering to Serverless
Yan Cui @theburningmonk
Berlin | November 20 - 21, 2018
Agenda
What is chaos engineering?
New challenges with serverless
Applying latency injection to serverless
Applying error i...
What is chaos engineering?
Chaos Engineering is the discipline of experimenting on a distributed system

in order to build confidence in the system’s ...
Smallpox
Earliest evidence of disease in 3rd Century BC Egyptian Mummy.
est. 400K deaths per year in 18th Century Europe.
History of vaccination
First vaccine was developed in
1798 by Edward Jenner.
https://en.wikipedia.org/wiki/Edward_Jenner
History of vaccination
First vaccine was developed in
1798 by Edward Jenner.
WHO certified global eradication
in 1980.
htt...
https://en.wikipedia.org/wiki/Vaccine
History of vaccination
History of vaccination
Vaccination is the most effective method to prevent infectious diseases.
History of vaccination
Vaccination is the most effective method to prevent infectious diseases.
Vaccines stimulate the imm...
Chaos engineering
Use controlled experiments to inject failures into our system.
Chaos engineering
Use controlled experiments to inject failures into our system.
Help us learn about our system’s behaviou...
Chaos engineering
Use controlled experiments to inject failures into our system.
Help us learn about our system’s behaviou...
Chaos Engineering is the vaccine to frailties in modern software.
Who am I?
Principal Engineer at DAZN.
AWS Serverless Hero.
Author of Production-Ready Serverless* course by Manning.
Blogg...
https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914
https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn
About DAZN
Available in 7 countries - Austria, Switzerland, Germany,
Japan, Canada, Italy and USA.
Available on 30+ platfo...
About DAZN
Available in 7 countries - Austria, Switzerland, Germany,
Japan, Canada, Italy and USA.
Available on 30+ platfo...
follow @dazneng for
updates about the
engineering team
We’re hiring! Visit
engineering.dazn.com
to learn more.
WE’RE HIRIN...
Chaos engineering has an image problem
Chaos engineering has an image problem
Chaos engineering has an image problem
Too much emphasis is on breaking things.
Chaos engineering has an image problem
Too much emphasis is on breaking things.
Easy to conflate the action of injecting f...
Why did you break
production?
Because I can!
Chaos engineering has an image problem
The goal is to learn about the system and build confidence.
Chaos engineering has an image problem
The goal is to learn about the system and build confidence.
The goal is not to brea...
Chaos in practice
4 steps to start running chaos
experiments yourself.
STEP 1. Define “steady state”
aka. what does normal, working
condition looks like?
this is not a
steady state
Hypothesise steady state will
continue in both control group
& the experiment group
ie. you should have a reasonable degre...
Chaos in practice
Explore unknown unknowns away from production.
Chaos in practice
Explore unknown unknowns away from production.
Experiments that graduate to production should be careful...
Chaos in practice
Treat production with the care it deserves.
The goal is not to break things.
Chaos in practice
If you know the system would break and you did it anyway,
then it’s not a chaos experiment!
It’s called ...
STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn
STEP 4. Disprove hypothesis
i.e. look for difference in steady state
Chaos in practice
Look for evidence that steady state was impacted by the
injected failure.
Chaos in practice
Look for evidence that steady state was impacted by the
injected failure.
Address weaknesses before fail...
Containment
Experiments needs to be controlled.
The goal is not to break things.
Containment
Ensure everyone knows what you are doing.
Don’t surprise your teammates.
Containment
Run experiments during office hours.
Containment
Run experiments during office hours.
Avoid important dates.
Containment
Make the smallest change necessary to prove or disprove hypothesis.
Containment
Make the smallest change necessary to prove or disprove hypothesis.
Have a rollback plan.
Stop the experiment ...
Containment
Don’t start in production.
Can learn a lot by running experiments in staging.
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
New challenges with serverless
chaos monkey kills an EC2
instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an AWS
Availability ...
Serverless challenges
There are no servers that you can access and kill.
There are more inherent chaos and complexity in
a serverless architecture.
Serverless challenges
Smaller unit of deployment, but a lot more of them.
serverful
serverlessServerless challenges
Serverless challenges
Every function needs to be correctly configured and secured.
Kinesis
?
SNS
CloudWatch
Events
CloudWatch
LogsIoT
Core
DynamoDB
S3 SES
Serverless challenges
Serverless challenges
A lot of managed, intermediate services.
Each with its own set of failure modes.
Serverless challenges
Unknown failure modes in the infrastructure we don’t control.
Serverless challenges
Unknown failure modes in the infrastructure we don’t control.
Often there’s little we can do when an...
Common weaknesses
Common weaknesses
Improperly tuned timeouts.
Common weaknesses
Missing error handling.
Common weaknesses
Missing fallback.
Common weaknesses
Missing regional failover.
Latency injection with serverless
STEP 1. Define “steady state”
aka. what does normal, working
condition looks like?
Defining steady state
What metrics do you use?
Defining steady state
What metrics do you use?
p95/p99 latencies, error count, backlog size, yield*, harvest**
* percentag...
Hypothesise steady state will
continue in both control group
& the experiment group
ie. you should have a reasonable degre...
API Gateway
Serverless considerations
Serverless considerations
Consider the effect of cold starts.
How does it affect your strategy
for handling slow responses.
Request timeouts
Strategy should:
Request timeouts
Strategy should:
1. Give requests the best chance to succeed
Request timeouts
Strategy should:
1. Give requests the best chance to succeed
2. Do not allow slow response to timeout the...
Request timeouts
Finding the right timeout value is tricky.
Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
Too...
Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
Too...
Approach 1: split invocation time equally
(e.g. 3 requests, 6s function timeout = 2s timeout per request)
Approach 2: every request is given nearly all the invocation time
(e.g. 3 requests, 6s function timeout = 5s timeout per r...
Request timeouts
Proposal: set request timeouts dynamically based on
invocation time left
Request timeouts
Set timeout based on remaining invocation time
Set timeout based on remaining invocation time
Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, ...
Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, ...
Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, ...
Recovery steps
Be mindful when you sacrifice precision for availability.
User experience is the king.
STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
Where to inject latency?
hypothesis:
Function has appropriate timeout on its HTTP communications
and can degrade gracefully when these requests tim...
Where to inject latency?
Where to inject latency?
Should be applied to 3rd party services too.
DynamoDB, Twillio, Auth0, …
Where to inject latency?
Where to inject latency?
Be mindful of the blast radius of the experiment.
The goal is not to break things.
http client
public-api-a
http client
public-api-b
internal-api
Where to inject latency?
hypothesis:
All functions have appropriate timeout on their HTTP
communications to this internal API, and can degrade
grac...
Where to inject latency?
Where to inject latency?
Where to inject latency?
Large blast radius, can cause cascade failures unintentionally
development
development production
Priming (psychology):
Priming is a technique whereby exposure to one stimulus
influences a response to a subsequent stimulu...
Use failure injection to programme your colleagues into
thinking about failure modes early.
Where to inject latency?
Make X% of all requests slow
in the dev environment.
hypothesis:
The client app has appropriate timeout on their HTTP
communication with the server, and can degrade gracefully...
Where to inject latency?
Where to inject latency?
Where to inject latency?
STEP 4. Disprove hypothesis
i.e. look for difference in steady state
How to inject latency?
How to inject latency?
Static weavers (e.g. PostSharp, AspectJ).
Dynamic proxies.
https://theburningmonk.com/2015/04/design-for-latency-issues/
How to inject latency?
Manually crafted wrapper libraries.
configured in SSM Parameter Store
no injected latency
with injected latency
factory wrapper function
(think bluebird’s promisifyAll function)
Error injection with serverless
Common errors
HTTP 5XX
DynamoDB provisioned throughput exceeded
Throttled Lambda invocations
hypothesis:
Function has appropriate error handling on its HTTP communications
and can degrade gracefully when downstream ...
Where to inject errors?
hypothesis:
Function has appropriate error handling on DynamoDB operations and
can degrade gracefully when DynamoDB throug...
Where to inject errors?
Where to inject errors?
Induce Lambda throttling by temporarily setting reserved concurrency.
Recap
failures are INEVITABLE
the only way to truly know your system’s
resilience against failures is to test it
through CONTROLLED experiments
the goal of chaos engineering is NOT to
actually break production
CONTAINMENT should be front and
centre of your thinking
STEP 1. Define “steady state”
aka. what does normal, working
condition looks like?
Hypothesise steady state will
continue in both control group
& the experiment group
ie. you should have a reasonable degre...
STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
STEP 4. Disprove hypothesis
i.e. look for difference in steady state
there are more inherent chaos and
complexity in a serverless application
even without servers, you can still inject
CONTROLLED failures at the application level
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log ag...
@theburningmonk
theburningmonk.com
github.com/theburningmonk
API Gateway and Kinesis
Authentication & authorisation (IAM, ...
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Nächste SlideShare
Wird geladen in …5
×

Applying principles of chaos engineering to Serverless (CodeMotion Berlin)

270 Aufrufe

Veröffentlicht am

Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.

Netflix is undoubtedly the leader in this field, but much of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community has been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and, we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?

Veröffentlicht in: Technologie
  • .DOWNLOAD THIS BOOKS INTO AVAILABLE FORMAT ......................................................................................................................... ......................................................................................................................... .DOWNLOAD PDF EBOOK here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... .DOWNLOAD EPUB Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... .DOWNLOAD doc Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... .DOWNLOAD PDF EBOOK here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... .DOWNLOAD EPUB Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... .DOWNLOAD doc Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... .............. Browse by Genre Available eBooks ......................................................................................................................... Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult,
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier

Applying principles of chaos engineering to Serverless (CodeMotion Berlin)

  1. 1. Applying principles of chaos engineering to Serverless Yan Cui @theburningmonk Berlin | November 20 - 21, 2018
  2. 2. Agenda What is chaos engineering? New challenges with serverless Applying latency injection to serverless Applying error injection to serverless
  3. 3. What is chaos engineering?
  4. 4. Chaos Engineering is the discipline of experimenting on a distributed system
 in order to build confidence in the system’s capability
 to withstand turbulent conditions in production. - principlesofchaos.org
  5. 5. Smallpox Earliest evidence of disease in 3rd Century BC Egyptian Mummy. est. 400K deaths per year in 18th Century Europe.
  6. 6. History of vaccination First vaccine was developed in 1798 by Edward Jenner. https://en.wikipedia.org/wiki/Edward_Jenner
  7. 7. History of vaccination First vaccine was developed in 1798 by Edward Jenner. WHO certified global eradication in 1980. https://en.wikipedia.org/wiki/Edward_Jenner
  8. 8. https://en.wikipedia.org/wiki/Vaccine History of vaccination
  9. 9. History of vaccination Vaccination is the most effective method to prevent infectious diseases.
  10. 10. History of vaccination Vaccination is the most effective method to prevent infectious diseases. Vaccines stimulate the immune system to recognise and destroy the disease before contracting it for real.
  11. 11. Chaos engineering Use controlled experiments to inject failures into our system.
  12. 12. Chaos engineering Use controlled experiments to inject failures into our system. Help us learn about our system’s behaviour and uncover unknown failure modes, before they manifest like wildfire in production.
  13. 13. Chaos engineering Use controlled experiments to inject failures into our system. Help us learn about our system’s behaviour and uncover unknown failure modes, before they manifest like wildfire in production. Lets us build confidence in its ability to withstand turbulent conditions.
  14. 14. Chaos Engineering is the vaccine to frailties in modern software.
  15. 15. Who am I? Principal Engineer at DAZN. AWS Serverless Hero. Author of Production-Ready Serverless* course by Manning. Blogger**, speaker. * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  16. 16. https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914
  17. 17. https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn
  18. 18. About DAZN Available in 7 countries - Austria, Switzerland, Germany, Japan, Canada, Italy and USA. Available on 30+ platforms.
  19. 19. About DAZN Available in 7 countries - Austria, Switzerland, Germany, Japan, Canada, Italy and USA. Available on 30+ platforms. Around 1,000,000 concurrent viewers at peak.
  20. 20. follow @dazneng for updates about the engineering team We’re hiring! Visit engineering.dazn.com to learn more. WE’RE HIRING!
  21. 21. Chaos engineering has an image problem
  22. 22. Chaos engineering has an image problem
  23. 23. Chaos engineering has an image problem Too much emphasis is on breaking things.
  24. 24. Chaos engineering has an image problem Too much emphasis is on breaking things. Easy to conflate the action of injecting failures with the payback.
  25. 25. Why did you break production?
  26. 26. Because I can!
  27. 27. Chaos engineering has an image problem The goal is to learn about the system and build confidence.
  28. 28. Chaos engineering has an image problem The goal is to learn about the system and build confidence. The goal is not to break things.
  29. 29. Chaos in practice 4 steps to start running chaos experiments yourself.
  30. 30. STEP 1. Define “steady state” aka. what does normal, working condition looks like?
  31. 31. this is not a steady state
  32. 32. Hypothesise steady state will continue in both control group & the experiment group ie. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment STEP 2.
  33. 33. Chaos in practice Explore unknown unknowns away from production.
  34. 34. Chaos in practice Explore unknown unknowns away from production. Experiments that graduate to production should be carefully considered and planned. You should have reasonable confidence in the system before running experiments in production.
  35. 35. Chaos in practice Treat production with the care it deserves. The goal is not to break things.
  36. 36. Chaos in practice If you know the system would break and you did it anyway, then it’s not a chaos experiment! It’s called being irresponsible.
  37. 37. STEP 3. Inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  38. 38. https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn
  39. 39. STEP 4. Disprove hypothesis i.e. look for difference in steady state
  40. 40. Chaos in practice Look for evidence that steady state was impacted by the injected failure.
  41. 41. Chaos in practice Look for evidence that steady state was impacted by the injected failure. Address weaknesses before failures happen for real.
  42. 42. Containment Experiments needs to be controlled. The goal is not to break things.
  43. 43. Containment Ensure everyone knows what you are doing. Don’t surprise your teammates.
  44. 44. Containment Run experiments during office hours.
  45. 45. Containment Run experiments during office hours. Avoid important dates.
  46. 46. Containment Make the smallest change necessary to prove or disprove hypothesis.
  47. 47. Containment Make the smallest change necessary to prove or disprove hypothesis. Have a rollback plan. Stop the experiment right away if things start to go wrong.
  48. 48. Containment Don’t start in production. Can learn a lot by running experiments in staging.
  49. 49. by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  50. 50. New challenges with serverless
  51. 51. chaos monkey kills an EC2 instance latency monkey induces artificial delay in APIs chaos gorilla kills an AWS Availability Zone chaos kong kills an entire AWS region
  52. 52. Serverless challenges There are no servers that you can access and kill.
  53. 53. There are more inherent chaos and complexity in a serverless architecture.
  54. 54. Serverless challenges Smaller unit of deployment, but a lot more of them.
  55. 55. serverful serverlessServerless challenges
  56. 56. Serverless challenges Every function needs to be correctly configured and secured.
  57. 57. Kinesis ? SNS CloudWatch Events CloudWatch LogsIoT Core DynamoDB S3 SES Serverless challenges
  58. 58. Serverless challenges A lot of managed, intermediate services. Each with its own set of failure modes.
  59. 59. Serverless challenges Unknown failure modes in the infrastructure we don’t control.
  60. 60. Serverless challenges Unknown failure modes in the infrastructure we don’t control. Often there’s little we can do when an outage occurs in the platform.
  61. 61. Common weaknesses
  62. 62. Common weaknesses Improperly tuned timeouts.
  63. 63. Common weaknesses Missing error handling.
  64. 64. Common weaknesses Missing fallback.
  65. 65. Common weaknesses Missing regional failover.
  66. 66. Latency injection with serverless
  67. 67. STEP 1. Define “steady state” aka. what does normal, working condition looks like?
  68. 68. Defining steady state What metrics do you use?
  69. 69. Defining steady state What metrics do you use? p95/p99 latencies, error count, backlog size, yield*, harvest** * percentage of requests completed ** completeness of the returned response
  70. 70. Hypothesise steady state will continue in both control group & the experiment group ie. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment STEP 2.
  71. 71. API Gateway Serverless considerations
  72. 72. Serverless considerations Consider the effect of cold starts. How does it affect your strategy for handling slow responses.
  73. 73. Request timeouts Strategy should:
  74. 74. Request timeouts Strategy should: 1. Give requests the best chance to succeed
  75. 75. Request timeouts Strategy should: 1. Give requests the best chance to succeed 2. Do not allow slow response to timeout the caller function
  76. 76. Request timeouts Finding the right timeout value is tricky.
  77. 77. Request timeouts Finding the right timeout value is tricky. Too short : requests not given the best chance to succeed.
  78. 78. Request timeouts Finding the right timeout value is tricky. Too short : requests not given the best chance to succeed. Too long : risk timing out the calling function.
  79. 79. Request timeouts Finding the right timeout value is tricky. Too short : requests not given the best chance to succeed. Too long : risk timing out the calling function. Even more complicated when you have multiple integration points.
  80. 80. Approach 1: split invocation time equally (e.g. 3 requests, 6s function timeout = 2s timeout per request)
  81. 81. Approach 2: every request is given nearly all the invocation time (e.g. 3 requests, 6s function timeout = 5s timeout per request)
  82. 82. Request timeouts Proposal: set request timeouts dynamically based on invocation time left
  83. 83. Request timeouts
  84. 84. Set timeout based on remaining invocation time
  85. 85. Set timeout based on remaining invocation time
  86. 86. Recovery steps Log the timeout with as much context as possible. The API, timeout value, correlation IDs, request object, etc.
  87. 87. Recovery steps Log the timeout with as much context as possible. The API, timeout value, correlation IDs, request object, etc. Record custom metrics.
  88. 88. Recovery steps Log the timeout with as much context as possible. The API, timeout value, correlation IDs, request object, etc. Record custom metrics. Use fallbacks.
  89. 89. Recovery steps Be mindful when you sacrifice precision for availability. User experience is the king.
  90. 90. STEP 3. Inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  91. 91. Where to inject latency?
  92. 92. hypothesis: Function has appropriate timeout on its HTTP communications and can degrade gracefully when these requests time out
  93. 93. Where to inject latency?
  94. 94. Where to inject latency? Should be applied to 3rd party services too. DynamoDB, Twillio, Auth0, …
  95. 95. Where to inject latency?
  96. 96. Where to inject latency? Be mindful of the blast radius of the experiment. The goal is not to break things.
  97. 97. http client public-api-a http client public-api-b internal-api Where to inject latency?
  98. 98. hypothesis: All functions have appropriate timeout on their HTTP communications to this internal API, and can degrade gracefully when requests are timed out
  99. 99. Where to inject latency?
  100. 100. Where to inject latency?
  101. 101. Where to inject latency? Large blast radius, can cause cascade failures unintentionally
  102. 102. development
  103. 103. development production
  104. 104. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory both in positive and negative ways.
  105. 105. Use failure injection to programme your colleagues into thinking about failure modes early.
  106. 106. Where to inject latency? Make X% of all requests slow in the dev environment.
  107. 107. hypothesis: The client app has appropriate timeout on their HTTP communication with the server, and can degrade gracefully when requests are timed out
  108. 108. Where to inject latency?
  109. 109. Where to inject latency?
  110. 110. Where to inject latency?
  111. 111. STEP 4. Disprove hypothesis i.e. look for difference in steady state
  112. 112. How to inject latency?
  113. 113. How to inject latency? Static weavers (e.g. PostSharp, AspectJ). Dynamic proxies.
  114. 114. https://theburningmonk.com/2015/04/design-for-latency-issues/
  115. 115. How to inject latency? Manually crafted wrapper libraries.
  116. 116. configured in SSM Parameter Store
  117. 117. no injected latency
  118. 118. with injected latency
  119. 119. factory wrapper function (think bluebird’s promisifyAll function)
  120. 120. Error injection with serverless
  121. 121. Common errors HTTP 5XX DynamoDB provisioned throughput exceeded Throttled Lambda invocations
  122. 122. hypothesis: Function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail
  123. 123. Where to inject errors?
  124. 124. hypothesis: Function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB throughputs are exceeded
  125. 125. Where to inject errors?
  126. 126. Where to inject errors? Induce Lambda throttling by temporarily setting reserved concurrency.
  127. 127. Recap
  128. 128. failures are INEVITABLE
  129. 129. the only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments
  130. 130. the goal of chaos engineering is NOT to actually break production
  131. 131. CONTAINMENT should be front and centre of your thinking
  132. 132. STEP 1. Define “steady state” aka. what does normal, working condition looks like?
  133. 133. Hypothesise steady state will continue in both control group & the experiment group ie. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment STEP 2.
  134. 134. STEP 3. Inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  135. 135. STEP 4. Disprove hypothesis i.e. look for difference in steady state
  136. 136. there are more inherent chaos and complexity in a serverless application
  137. 137. even without servers, you can still inject CONTROLLED failures at the application level
  138. 138. API Gateway and Kinesis Authentication & authorisation (IAM, Cognito) Testing Running & Debugging functions locally Log aggregation Monitoring & Alerting X-Ray Correlation IDs CI/CD Performance and Cost optimisation Error Handling Configuration management VPC Security Leading practices (API Gateway, Kinesis, Lambda) Canary deployments http://bit.ly/prod-ready-serverless get 40% off with: ytcui
  139. 139. @theburningmonk theburningmonk.com github.com/theburningmonk API Gateway and Kinesis Authentication & authorisation (IAM, Cognito) Testing Running & Debugging functions locally Log aggregation Monitoring & Alerting X-Ray Correlation IDs CI/CD Performance and Cost optimisation Error Handling Configuration management VPC Security Leading practices (API Gateway, Kinesis, Lambda) Canary deployments http://bit.ly/prod-ready-serverless get 40% off with: ytcui

×