
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017

Chaos Engineering is described as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Going beyond Chaos Monkey, this session covers the specifics of designing a Chaos Engineering solution, how to increment your solution technically and culturally, the socialization and evangelism pieces that tend to get overlooked in the process, and how to get developers excited about purposefully injected failure. This session provides examples of getting started with Chaos Engineering at startups, performing chaos at Netflix scale, integrating your tools with AWS, and the road to cultural acceptance within your company. There are several different “levels” of chaos you can introduce before unleashing a full-blown chaos solution. We provide a focus on each of these levels, so you can leave this session with a game plan you can culturally and technically introduce.

  1. AWS re:Invent: Chaos Engineering at Netflix Scale. Nora Jones, Senior Chaos Engineer, @nora_js. DEV334, November 29, 2017. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. KNOWN WAYS TO INCREASE CONFIDENCE
  3. UNIT TESTS Input Output Component A
  4. INTEGRATION TESTS Input Output COMPONENT A COMPONENT B Service C Service D
  5. CHAOS ENGINEERING
  6. NEW WAY TO INCREASE CONFIDENCE
  7. CHAOS EXPERIMENTS Service C Service D
  8. WHY IS THERE A FEAR OF CHAOS WHEN IT’S INEVITABLE?
  9. THE “IT DOESN’T APPLY TO ME” FALLACY
  10. IT APPLIES TO YOU MORE
  11. FORCES OF CHAOS
  12. FORCE 0: SOCIALIZATION & MONITORING
  13. FORCE 0: SOCIALIZATION
      • Acknowledge complexity
      • Define the steady state
  14. “I WORK AT A STARTUP, THERE IS NO STEADY STATE”
  15. TIPS FOR DEFINING STEADY STATE
      • Start with non-critical services
      • Start in a staging environment, if possible
      • Only include services that want to be Chaos’ed
  17. “THE ARMIES OF CHAOS ARE COMING!”
  18. FORCE 0: SOCIALIZATION Part of your job as a Chaos Engineer is to understand the customer and their needs
  19. FORCE 0: MONITORING WHAT ARE YOUR KEY BUSINESS METRICS?
  20. FORCE 0: MONITORING
  21. SPS: NETFLIX’S KEY BUSINESS METRIC
  24. FORCE 0: MONITORING DON’T LOSE SIGHT OF YOUR COMPANY’S CUSTOMERS
  25. FORCE 1: GRACEFUL RESTARTS AND DEGRADATION
  28. FORCE 2: TARGETED CHAOS
  29. FORCE 3: CAN WE CAUSE A CASCADING FAILURE?
  30. NOT IF THIS FAILS, BUT WHEN IT FAILS
  32. LATENCY MONKEY: “A LEARNING OPPORTUNITY”
  33. FORCE 4: FAILURE INJECTION
  34. SAMPLE FAILURE INJECTION LIBRARY
  35-36. TYPES OF CHAOS FAILURES
  37. Criteria & API
  38. NETFLIX FAILURE INJECTION POINTS HYSTRIX
  39. Service A Service B Routing Failure injection Service Injection Points NETFLIX FAILURE INJECTION
  40. FORCE 5: CHAOS AUTOMATION PLATFORM
  41. Service A Service B Routing 100%
  42. Service A Control Service A Service B Routing 98% 1%
  43. Service A Control Service A Experiment Service A Service B Routing 98% 1% 1%
  44. FORCE 0: MONITORING
  45. SPS: NETFLIX’S KEY BUSINESS METRIC
  46. ChAP MONITORING (chart, time axis 10:27 to 10:48)
  47. ChAP MONITORING (chart, time axis 10:27 to 10:48): SHORTED
  48. ChAP GOAL: CHAOS ALL THE THINGS AND RUN ALL THE TIME
  49. FORCE 6: WHAT’S NEXT?
  50. NETFLIX FAILURE INJECTION POINTS HYSTRIX
  51. AUTOMATE EXPERIMENT CREATION AND PRIORITIZATION
  52-54. ChAP’S MONOCLE
  56. CRITICALITY SCORE
  57-60. CRITICALITY SCORE: RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
  61. FORCES OF CHAOS
      0. Socialization and Monitoring
      1. Graceful Restarts and Degradation
      2. Targeted Chaos
      3. Causing a Cascading Failure
      4. Failure Injection
      5. Chaos Automation Platform
  62. RECORD CHAOS SUCCESS STORIES
  63. “We ran a Chaos Experiment that verifies that our fallback path works and it successfully caught an issue in the fallback path before it resulted in an availability incident”
  64. “While failing calls, we discovered an increase in requests for the experiment cluster (even though fallbacks were successful)…”
  66. “…this likely means whoever was consuming the fallback was retrying the call, causing an increase in requests.”
  67. TAKEAWAYS
      • Everyone can and should be doing Chaos Engineering
      • The road to chaos is a learning opportunity
      • Safety is critical, involve your business
  68. CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.
  69. THANK YOU! Nora Jones @nora_js
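
The criticality score slides give the formula as RPS Stats Range bucket * number of retries * number of Hystrix Commands. The slides do not say how RPS is bucketed or how the inputs are collected, so the following is only a minimal sketch of how such a score could be computed and used to rank experiment candidates. The bucket boundaries, the `Dependency` type, and the `prioritize` helper are all assumptions made for illustration, not Netflix's implementation.

```python
from dataclasses import dataclass

# Hypothetical bucket table: (lower RPS bound, bucket value).
# The talk does not give the real ranges; these are illustrative only.
RPS_BUCKETS = [(0, 1), (100, 2), (1_000, 3), (10_000, 4)]


def rps_bucket(rps: float) -> int:
    """Map a requests-per-second figure to its (assumed) range bucket."""
    bucket = RPS_BUCKETS[0][1]
    for lower_bound, value in RPS_BUCKETS:
        if rps >= lower_bound:
            bucket = value
    return bucket


@dataclass(frozen=True)
class Dependency:
    """Hypothetical record of what is known about one dependency."""
    name: str
    rps: float
    retries: int
    hystrix_commands: int


def criticality_score(dep: Dependency) -> int:
    """RPS range bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(dep.rps) * dep.retries * dep.hystrix_commands


def prioritize(deps: list[Dependency]) -> list[Dependency]:
    """Order dependencies so the highest-scoring ones are tested first."""
    return sorted(deps, key=criticality_score, reverse=True)
```

Because the score is a plain product, a dependency with zero configured retries scores zero under this sketch; whether that is the intended behavior is not stated in the slides.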

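The Chaos Automation Platform slides (41 to 47) show a small share of traffic being routed to a control cluster and an experiment cluster, with monitoring that can mark an experiment as shorted. As a rough sketch of that idea, the code below assigns requests to clusters using the 1% shares from the slides and stops the experiment when the experiment cluster's key metric falls too far below the control's. The function names, the 5% threshold, and the comparison rule are assumptions; the slides do not describe how ChAP actually decides to short an experiment.

```python
import random

# Traffic shares taken from the slide diagram: 1% control, 1% experiment.
CONTROL_SHARE = 0.01
EXPERIMENT_SHARE = 0.01


def assign_cluster(rng: random.Random) -> str:
    """Pick a cluster for one request according to the configured shares."""
    roll = rng.random()
    if roll < CONTROL_SHARE:
        return "control"
    if roll < CONTROL_SHARE + EXPERIMENT_SHARE:
        return "experiment"
    return "baseline"


def should_short(control_sps: float, experiment_sps: float,
                 max_relative_drop: float = 0.05) -> bool:
    """Return True when the experiment should be stopped early.

    Assumed rule: stop if the experiment cluster's SPS is more than
    max_relative_drop below the control cluster's SPS.
    """
    if control_sps <= 0:
        # No usable reference signal, so stop rather than guess.
        return True
    drop = (control_sps - experiment_sps) / control_sps
    return drop > max_relative_drop
```

Comparing the experiment cluster against a same-sized control cluster, rather than against all traffic, keeps the two samples comparable; that reading of the diagram is an interpretation, not something the slides state outright.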