Chaos Engineering with Containers

Video and slides synchronized; mp3 and slide download available at https://bit.ly/2USKOWb.

Ana Medina discusses the benefits of using Chaos Engineering to inject failures in order to make our container infrastructure more reliable. She also shares how to improve container monitoring and observability and lessons learned from running Chaos Engineering GameDays with Gremlin customers. Filmed at qconsf.com.

Ana Medina is a Chaos Engineer at Gremlin, where she helps companies avoid outages by running proactive chaos engineering experiments. She previously worked at Uber as an engineer on the SRE and Infrastructure teams, focusing specifically on chaos engineering and cloud computing.

  1. #QConSF @ana_m_medina Chaos Engineering with Containers. Ana Medina
 Chaos Engineer at
  4. Ana Medina (@ana_m_medina)
 Chaos Engineer @ Gremlin. Previously Software Engineer / SRE @ Uber. Also worked/interned @ SFEFCU, Google, Quicken Loans, Stanford University and Miami Dade College. College dropout. Self-taught engineer.
  5. How many of you have heard of Chaos Engineering?
  6. How many of you have run a Chaos Engineering experiment?
  7. Chaos Engineering: Thoughtful, planned experiments designed to reveal the weakness in our systems.
  8. Chaos Engineering: Inject something harmful to build an immunity. -@KoltonAndrus, Gremlin Founder and CEO
  9. Why?
     ● Microservices
     ● Systems are scaling fast
     ● Downtime is really expensive
     ● Our dependencies will fail
     ● Pager fatigue and burnout really hurt
  10. “Chaos Engineering Without Observability ... Is Just Chaos” -@mipsytipsy, Charity Majors, CEO of Honeycomb
  11. Prerequisites of Chaos Engineering
     ● Monitoring/Observability
     ● On-Call and Incident Management
     ● Cost of Downtime Per Hour
  12. Use Cases for Chaos Engineering
     ● Outage reproduction
     ● On-call training
     ● Strengthen new products
     ● Battle test new infrastructure and services
  13. Use Cases for Chaos Engineering - Containers
     ● Testing Provider Specific Reliability (e.g. EKS vs AKS vs GKE)
     ● Auto Scaling
     ● Logs, Disk failure
  14. Minimize the Blast Radius
  15. Monitoring / Observability
  16. What to measure and monitor?
     ● System Metrics: CPU, Disk, I/O
     ● Availability
     ● Service specific KPIs
     ● Customer complaints
  17. Demo
  18. #1 - Battle Test Cloud Infrastructure
     Real World Scenario: a company / user is evaluating cloud provider managed Kubernetes. Which one is more reliable?
     The Hypothesis: shutting down a container (1/1) should cause only a small delay before the app is reachable again
     The Experiment: shut down the Kubernetes dashboard container
     Abort Conditions: app is unreachable after 60 seconds
  23. #2 - Shutdown of a Container
     Real World Scenario: a company / user is evaluating containers. Are they as reliable as promised?
     The Hypothesis: yes, they will come back up
     The Experiment: shut down a container, wait a few seconds and check whether it is up
     Abort Conditions: app is unreachable after 60 seconds
  25. #3 - Blackholing Traffic to Catalog
     Real World Scenario: a company / user is working with their UI team to provide a good user experience when there are API/DB issues
     The Hypothesis: images will not load, but the product listing will
     The Experiment: blackhole all traffic from the front end to the REST API and DB ports
     Abort Conditions: app is unreachable after 60 seconds
  27. Case Study
  28. Companies doing Chaos Engineering
  29. Tools You Can Use: Gremlin, Chaos Toolkit
  30. Break Things Together: bit.ly/chaos-eng-slack, 2,000+ members across the world
  31. THANKS! @ana_m_medina ana@gremlin.com
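
The three demo experiments in the slides share one shape: inject a failure, watch whether the app becomes reachable again, and abort if it is still unreachable after 60 seconds. A minimal Python sketch of that loop is below. It is only an illustration under stated assumptions: `run_experiment`, `ExperimentResult`, `FakeClock`, and the `inject_failure` / `is_reachable` callbacks are hypothetical names, not part of Gremlin, Chaos Toolkit, or the talk's demo. In a real run, `inject_failure` would trigger the attack (for example a container shutdown) and `is_reachable` would probe the app.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    recovered: bool          # True if the app became reachable before the abort condition
    seconds_elapsed: float   # time from injection until recovery or abort


def run_experiment(
    inject_failure: Callable[[], None],
    is_reachable: Callable[[], bool],
    abort_after_seconds: float = 60.0,   # abort condition from the slides
    poll_interval_seconds: float = 1.0,  # assumption: probe once per second
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> ExperimentResult:
    """Inject a failure, then poll until the app recovers or the abort condition is met."""
    inject_failure()
    started = clock()
    while True:
        elapsed = clock() - started
        if is_reachable():
            return ExperimentResult(recovered=True, seconds_elapsed=elapsed)
        if elapsed >= abort_after_seconds:
            # Abort condition hit: stop the experiment and treat the hypothesis as rejected.
            return ExperimentResult(recovered=False, seconds_elapsed=elapsed)
        sleep(poll_interval_seconds)


class FakeClock:
    """Simulated clock so the loop can be exercised without waiting in real time."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds
```

With a `FakeClock`, a simulated container that comes back after five seconds can be checked by passing `clock=fake.time` and `sleep=fake.sleep` together with a probe such as `lambda: fake.now >= 5.0`. Injecting the clock keeps the sketch testable; against real infrastructure the defaults (`time.monotonic`, `time.sleep`) apply.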
