Chaos Engineering with Kubernetes

  1. Using Chaos to Bring Resiliency to Your Applications in Kubernetes. Arun Gupta, @arungupta, Principal Open Source Technologist, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark.
  2. Failures are a given and everything will eventually fail over time. https://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html
  3. Amazon 2006 GameDay: Creating Resiliency Through Destruction. Jesse Robbins. https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  4. Chaos Monkeys. https://github.com/Netflix/SimianArmy
  5. Chaos Engineering
  6. Resilience: Ability of a system to adapt to changes, failures, and disturbances
  7. Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. https://principlesofchaos.org/ Credit: https://www.flickr.com/photos/loseryouthcrew/8775130600/
  8. Bad things will happen to your system, no matter how well designed it is. You cannot remain ignorant of it.
  9. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  11. Chaos doesn’t cause problems. It reveals them.
  12. • Application level
      • Host failure
      • Resource attacks (CPU, memory, …)
      • Network attacks (dependencies, latency, …)
      • Region attacks!
  13. Where do you inject Chaos?
  14. Phases of chaos engineering
  15. Phases of chaos engineering
  16. “Normal” behavior of your system. https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  17. Business metric. https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  18. Phases of chaos engineering
  19. Phases of chaos engineering
  20. • a service gives 404 or 503?
      • latency increases by 300ms?
      • the port is not accessible?
      • security group rules changed?
      • the database stops?
      • excessive number of requests come?
      • iptables are wiped out?
  21. Phases of chaos engineering
  22. Phases of chaos engineering
  23. • Pick hypothesis
      • Scope the experiment
      • Identify metrics
      • Notify the organization
  24. • Start very small
      • As close as possible to production
      • Minimize the blast radius
      • Have an emergency STOP!
  25. Start with... Canary deployment: users split 99% / 1%.
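The 99% / 1% split on the canary slide can be sketched as deterministic bucketing. This is a minimal sketch, assuming users are identified by a string ID and that hashing the ID into 100 buckets is an acceptable way to choose the 1%. The function name `in_canary`, the salt value, and the user IDs are hypothetical and do not come from the talk.

```python
import hashlib


def in_canary(user_id: str, percent: int = 1, salt: str = "chaos-exp-1") -> bool:
    """Return True if this user falls into the canary slice.

    Hashing (salt + user_id) gives a stable bucket in [0, 100), so the same
    user always lands in the same group for a given experiment salt.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical user IDs, only to show the approximate size of the slice.
    users = [f"user-{i}" for i in range(10_000)]
    canary = [u for u in users if in_canary(u)]
    print(f"{len(canary)} of {len(users)} users are in the canary group")
```

Changing the salt per experiment reshuffles the buckets, so the same users are not always the ones exposed to the fault.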
  26. Phases of chaos engineering
  27. Phases of chaos engineering
  28. • Time to detect?
      • Time for notification? And escalation?
      • Time to public notification?
      • Time for graceful degradation to kick in?
      • Time for self healing to happen?
      • Time to recovery, partial and full?
      • Time to all-clear and stable?
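The timing questions on this slide can be answered from a recorded timeline of the experiment. The following is a minimal sketch, assuming each milestone is logged with a timestamp and measured from the moment the fault was injected. The milestone names and the sample timestamps are hypothetical.

```python
from datetime import datetime
from typing import Dict


def incident_durations(timeline: Dict[str, datetime]) -> Dict[str, float]:
    """Seconds elapsed from fault injection to each recorded milestone."""
    start = timeline["injected"]
    return {
        name: (moment - start).total_seconds()
        for name, moment in timeline.items()
        if name != "injected"
    }


# Hypothetical timeline for one experiment run.
EXAMPLE_TIMELINE = {
    "injected": datetime(2018, 3, 10, 14, 42, 39),
    "detected": datetime(2018, 3, 10, 14, 42, 45),
    "notified": datetime(2018, 3, 10, 14, 43, 10),
    "recovered": datetime(2018, 3, 10, 14, 47, 39),
}

if __name__ == "__main__":
    for milestone, seconds in incident_durations(EXAMPLE_TIMELINE).items():
        print(f"time to {milestone}: {seconds:.0f}s")
```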
  29. DON’T blame that one person…
  30. Postmortems: COE (Correction of Errors). The 5 WHYs.
  31. Phases of chaos engineering
  32. Phases of chaos engineering
  33. Fix
  34. Failure free operations require experience with failure. http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
  35. Kubernetes cluster
  36. • Reconciles desired and actual state for pods
      • Distributes pods across AZs
      • Automatic health-check based restarts
      • Rolling deployment of a service
  37. Kubernetes cluster with Amazon EKS (diagram labels: AWS managed, Customer account)
  38. Kubernetes cluster with Amazon EKS (diagram labels: kubectl, mycluster.eks.amazonaws.com, Availability Zone 1, Availability Zone 2, Availability Zone 3)
  39. • Region and Availability Zones
      • Control Plane is highly available
      • Master and Workers are configured in ASG
      • Master instance type auto-scaling
      • Etcd is HA and backed up every hour
  40. Chaos in a Kubernetes cluster (diagram labels: kubectl, mycluster.eks.amazonaws.com, Availability Zones 1 to 3, with failed components marked "x"). Health check? Dead node?
  41. • Istio
      • Chaos Toolkit
      • Kube Monkey
      • PowerfulSeal
      • Gremlin
      • Simian Army
  42. Istio
      • Intelligent routing and load balancing
      • Resilience across languages and platforms
      • Fleet-wide policy enforcement
      • In-depth telemetry
  43. • Timeouts
      • Bounded retries with timeout budget
      • Concurrent connections limit and request load
      • Active health checks (periodic)
      • Passive health checks (circuit breakers)
      • AZ-aware load balancing with automatic failover
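"Bounded retries with timeout budget" means that retries stop either when the attempt limit is reached or when an overall time budget runs out, whichever comes first. The sketch below illustrates that idea in application code. It is not how Istio or Envoy implement it, and the function name `call_with_retries` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[float], T],
    max_attempts: int = 3,
    budget_seconds: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `operation` at most `max_attempts` times within one time budget.

    `operation` receives the time left in the budget so it can use that value
    as its own per-attempt timeout.
    """
    deadline = clock() + budget_seconds
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget exhausted: stop even if attempts remain
        try:
            return operation(remaining)
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise TimeoutError(f"retries or time budget exhausted: {last_error!r}")
```

Passing the remaining budget into each attempt keeps the total latency seen by the caller bounded, which is the point of pairing retries with a budget.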
  44. • Timing failures
      • Increased network latency
      • Overloaded upstream service
      • Crashes
      • HTTP error codes
      • TCP connection failures
  45. Fault injection using Istio: timeout

          apiVersion: networking.istio.io/v1alpha3
          kind: VirtualService
          metadata:
            name: greeting
          spec:
            hosts:
            - greeting
            http:
            - fault:
                delay:
                  fixedDelay: 10s
                  percent: 100
              route:
              - destination:
                  host: greeting
                  subset: greeting-hello
          ---
          apiVersion: networking.istio.io/v1alpha3
          kind: DestinationRule
          metadata:
            name: greeting-destination-rule
          spec:
            host: greeting
            subsets:
            - name: greeting-hello
              labels:
                greeting: hello
  46. Fault injection using Istio: HTTP abort

          apiVersion: networking.istio.io/v1alpha3
          kind: VirtualService
          metadata:
            name: greeting
          spec:
            hosts:
            - greeting
            http:
            - fault:
                abort:
                  httpStatus: 500
                  percent: 100
              route:
              - destination:
                  host: greeting
                  subset: greeting-hello
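The two fault types on these slides, a fixed delay and an HTTP abort applied to a percentage of requests, can be illustrated with a small in-process simulation. This is only a sketch of the observable behavior, not of how the Istio sidecar implements it. The wrapper name `with_faults` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def with_faults(
    handler: Callable[[], Response],
    delay_seconds: float = 0.0,
    delay_percent: float = 0.0,
    abort_status: Optional[int] = None,
    abort_percent: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], Response]:
    """Wrap `handler` so a percentage of calls is delayed and/or aborted."""
    rng = rng or random.Random()

    def faulty() -> Response:
        # Delay fault: hold the request before it reaches the handler.
        if rng.random() * 100 < delay_percent:
            sleep(delay_seconds)
        # Abort fault: answer with an error without calling the handler.
        if abort_status is not None and rng.random() * 100 < abort_percent:
            return abort_status, "fault injected"
        return handler()

    return faulty
```

With `delay_percent=100` every call is delayed, mirroring `percent: 100` in the manifests; lowering the percentage limits the blast radius of the experiment.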
  47. Istio traffic management

          apiVersion: networking.istio.io/v1alpha3
          kind: VirtualService
          metadata:
            name: greeting-virtual-service
          spec:
            hosts:
            - greeting
            http:
            - route:
              - destination:
                  host: greeting
                  subset: greeting-hello
                weight: 75
              - destination:
                  host: greeting
                  subset: greeting-howdy
                weight: 25
          ---
          apiVersion: networking.istio.io/v1alpha3
          kind: DestinationRule
          metadata:
            name: greeting-destination-rule
          spec:
            host: greeting
            subsets:
            - name: greeting-hello
              labels:
                greeting: hello
            - name: greeting-howdy
              labels:
                greeting: howdy
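The 75 / 25 weights describe a proportion of requests, not a strict alternation. A minimal simulation, assuming each request is routed independently at random according to the weights (the helper name `pick_subset` is hypothetical), shows what to expect when checking the split in telemetry:

```python
import random
from collections import Counter
from typing import Dict


def pick_subset(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose one subset name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random()
    weights = {"greeting-hello": 75, "greeting-howdy": 25}
    counts = Counter(pick_subset(weights, rng) for _ in range(10_000))
    for subset, count in counts.most_common():
        print(f"{subset}: {count / 100:.1f}% of requests")
```

Over a small number of requests the observed split can deviate noticeably from 75 / 25, so a steady-state check on the ratio needs a sufficiently large sample.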
  48. Istio circuit breaker

          apiVersion: networking.istio.io/v1alpha3
          kind: DestinationRule
          metadata:
            name: greeting-destination-rule
          spec:
            host: greeting
            subsets:
            - name: greeting-hello
              labels:
                greeting: hello
              trafficPolicy:
                connectionPool:
                  tcp:
                    maxConnections: 100
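The idea behind a connection limit such as `maxConnections: 100` is to fail fast once a destination is saturated instead of letting callers pile up. The sketch below shows that idea with a semaphore. It is a deliberate simplification: the real proxy also has pending-request queues and related settings that are not modeled here, and the class name `ConnectionLimiter` is hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class ConnectionLimiter:
    """Reject new connections once `max_connections` are already in use."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection(self) -> Iterator[None]:
        # Non-blocking acquire: fail immediately rather than queue the caller.
        if not self._slots.acquire(blocking=False):
            raise ConnectionRefusedError("connection limit reached")
        try:
            yield
        finally:
            self._slots.release()


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_connections=1)
    with limiter.connection():
        try:
            with limiter.connection():
                pass
        except ConnectionRefusedError as error:
            print(f"second connection rejected: {error}")
```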
  49. https://istio.io/docs/
  50. Chaos Toolkit: Open API for Chaos Engineering
  51. • CLI-driven
      • Experiments declared in JSON/YAML files
      • Open specification
      • Extensible: Kubernetes, AWS, Spring, others
  52. Chaos Toolkit follows the principles of chaos
  53. Probes query a system to observe a behavior
      • Check state of a pod with a specific label
      • Multiple probes to define steady state
      Actions simulate real-world events
      • Terminate a deployment
      • Multiple actions simulate events
      Types of probe and method
      • Process: Run a binary
      • HTTP: Invoke an HTTP endpoint
      • Python: Call a Python function to perform richer operations
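A Python probe is an ordinary function that the experiment file references by module and function name. The sketch below shows one possible custom probe for the "port is not accessible?" question raised earlier in the deck. The module name `my_probes`, the function `port_is_open`, the host `greeting`, and port `8080` are all hypothetical; only the `type` / `module` / `func` / `arguments` layout follows the experiment fragments shown in this deck.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# How such a probe could be declared in an experiment, written as a Python
# dict for illustration. "my_probes" is a hypothetical module that would
# have to be importable by the Chaos Toolkit process.
PROBE_DECLARATION = {
    "type": "probe",
    "name": "greeting-port-is-open",
    "tolerance": True,
    "provider": {
        "type": "python",
        "module": "my_probes",
        "func": "port_is_open",
        "arguments": {"host": "greeting", "port": 8080},
    },
}
```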
  54. Chaos Toolkit metadata

          {
            "version": "1.0.0",
            "title": "Terminating the greeting service should not impact users",
            "description": "How does the greeting service's unavailability impact our users? Do they see an error or does the webapp get slower?",
            "tags": [
              "kubernetes",
              "aws"
            ],
            "configuration": {
              "web_app_url": {
                "type": "env",
                "key": "WEBAPP_URL"
              }
            },
  55. Chaos Toolkit steady state & hypothesis

          "steady-state-hypothesis": {
            "title": "Services are all available and healthy",
            "probes": [
              {
                "type": "probe",
                "name": "alive-and-healthy",
                "tolerance": true,
                "provider": {
                  "type": "python",
                  "module": "chaosk8s.pod.probes",
                  "func": "pods_in_phase",
                  "arguments": {
                    "label_selector": "app=webapp-pod",
                    "phase": "Running",
                    "ns": "default"
                  }
                }
              },
              {
                "type": "probe",
                "name": "application-must-respond-normally",
                "tolerance": 200,
                "provider": {
                  "type": "http",
                  "url": "${web_app_url}",
                  "timeout": 3
                }
              }
            ]
          },
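Each probe in the hypothesis carries a `tolerance`: `true` for the pod-phase probe and `200` for the HTTP probe. The sketch below models only this scalar case, where the observed value must equal the tolerance, and treats a probe that timed out as having observed `None`. It is a simplified illustration and not the Chaos Toolkit implementation; the function names are hypothetical.

```python
from typing import Any, List, Tuple


def within_tolerance(tolerance: Any, observed: Any) -> bool:
    """Scalar tolerance only: the observed value must match exactly.

    The type check keeps True and 1 from being treated as equal.
    """
    return type(observed) is type(tolerance) and observed == tolerance


def failed_probes(results: List[Tuple[str, Any, Any]]) -> List[str]:
    """Names of probes whose observed value is outside their tolerance.

    Each entry is (probe name, tolerance, observed value).
    """
    return [
        name
        for name, tolerance, observed in results
        if not within_tolerance(tolerance, observed)
    ]


if __name__ == "__main__":
    # Hypothetical observations: pods are Running, but the HTTP probe timed out.
    observations = [
        ("alive-and-healthy", True, True),
        ("application-must-respond-normally", 200, None),
    ]
    print("probes outside tolerance:", failed_probes(observations))
```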
  56. Chaos Toolkit experiment & verify

          "method": [
            {
              "type": "action",
              "name": "terminate-greeting-service",
              "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                  "label_selector": "app=greeter-pod",
                  "ns": "default"
                }
              }
            },
            {
              "type": "probe",
              "name": "fetch-application-logs",
              "provider": {
                "type": "python",
                "module": "chaosk8s.pod.probes",
                "func": "read_pod_logs",
                "arguments": {
                  "label_selector": "app=webapp-pod",
                  "last": "20s",
                  "ns": "default"
                }
              }
            }
          ],
  57. Chaos Toolkit run

          $ chaos run experiments/experiment.json
          [2018-03-10 14:42:38 INFO] Validating the experiment's syntax
          [2018-03-10 14:42:38 INFO] Experiment looks valid
          [2018-03-10 14:42:38 INFO] Running experiment: Terminate the greeting service should not impact users
          [2018-03-10 14:42:38 INFO] Steady state hypothesis: Services are all available and healthy
          [2018-03-10 14:42:38 INFO] Probe: application-should-be-alive-and-healthy
          [2018-03-10 14:42:38 INFO] Probe: application-must-respond-normally
          [2018-03-10 14:42:39 INFO] Steady state hypothesis is met!
          [2018-03-10 14:42:39 INFO] Action: terminate-greeting-service
          [2018-03-10 14:42:40 INFO] Probe: fetch-application-logs
          [2018-03-10 14:42:41 INFO] Steady state hypothesis: Services are all available and healthy
          [2018-03-10 14:42:41 INFO] Probe: application-should-be-alive-and-healthy
          [2018-03-10 14:42:42 INFO] Probe: application-must-respond-normally
          [2018-03-10 14:42:45 ERROR] => failed: activity took too long to complete
          [2018-03-10 14:42:45 CRITICAL] Steady state probe 'application-must-respond-normally' is not in the given tolerance so failing this experiment
          [2018-03-10 14:42:45 INFO] Let's rollback...
          [2018-03-10 14:42:45 INFO] No declared rollbacks, let's move on.
          [2018-03-10 14:42:45 INFO] Experiment ended with status: failed
  58. https://github.com/chaostoolkit/chaostoolkit/
  59. Kube-Monkey
      • Implementation of Netflix’s Chaos Monkey for Kubernetes
      • Randomly deletes pods in the cluster
      • Applications opt in using labels
  60. Run Kube-Monkey: create configuration

          apiVersion: v1
          kind: ConfigMap
          metadata:
            name: kube-monkey-config-map
            namespace: kube-system
          data:
            config.toml: |
              [kubemonkey]
              run_hour = 8
              start_hour = 10
              end_hour = 16
              blacklisted_namespaces = ["kube-system"]
              whitelisted_namespaces = [""]
  61. Kube-Monkey application opt-in (label values must be strings, so the numeric ones are quoted)

          apiVersion: apps/v1
          kind: Deployment
          . . .
            template:
              metadata:
                labels:
                  app: greeting
                  kube-monkey/enabled: enabled
                  kube-monkey/identifier: monkey-victim-pods
                  kube-monkey/mtbf: '2'
                  kube-monkey/kill-mode: random-max-percent
                  kube-monkey/kill-value: '40'
              spec:
                containers:
                - name: greeting
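With `kill-mode: random-max-percent` and `kill-value: 40`, at most 40% of the matching pods are candidates for termination in one run. The sketch below illustrates that selection under two assumptions that are not confirmed by the slides: the percentage is drawn uniformly between 0 and the maximum, and the pod count is rounded down. The function name `pick_victims` and the pod names are hypothetical.

```python
import math
import random
from typing import List


def pick_victims(pods: List[str], max_percent: float, rng: random.Random) -> List[str]:
    """Select a random share of pods, never more than max_percent of them."""
    percent = rng.uniform(0, max_percent)  # assumption: uniform draw
    count = math.floor(len(pods) * percent / 100)  # assumption: round down
    return rng.sample(pods, count)


if __name__ == "__main__":
    rng = random.Random()
    pods = [f"greeting-{i}" for i in range(10)]  # hypothetical pod names
    victims = pick_victims(pods, max_percent=40, rng=rng)
    print(f"terminating {len(victims)} of {len(pods)} pods: {victims}")
```

Capping the share keeps part of the deployment running during the experiment, which is one way to limit the blast radius.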
  62. https://github.com/asobti/kube-monkey
  63. Chaos Engineering working group @ CNCF. https://github.com/chaoseng/wg-chaoseng
  64. Chaos Engineering mind map. https://bit.ly/2uKOJMQ
  65. You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.
  66. Thank you!
