This presentation discusses chaos engineering and running chaos experiments with containers. It gives an overview of chaos engineering and why it matters, especially for microservices and rapidly scaling systems. It then demonstrates three sample chaos experiments: (1) shutting down containers to test cloud provider reliability, (2) shutting down a container to test container reliability, and (3) blackholing traffic to a catalog to test the user experience during API/database issues. Contact information is provided for tools that can be used to conduct chaos experiments and for a Slack community for sharing chaos engineering experiences.
Watch the video with slide synchronization on InfoQ.com:
https://www.infoq.com/presentations/chaos-engineering-gamedays
Presented at QCon San Francisco
www.qconsf.com
4. #QConSF @ana_m_medina
Ana Medina
@ana_m_medina
Chaos Engineer @ Gremlin
Previously Software Engineer / SRE @ Uber.
Also worked/interned @ SFEFCU, Google, Quicken Loans, Stanford University and Miami Dade College.
College dropout.
Self-taught engineer.
12. Use Cases for Chaos Engineering
● Outage reproduction
● On-call training
● Strengthen new products
● Battle test new infrastructure and services
13. Use Cases for Chaos Engineering - Containers
● Testing provider-specific reliability (e.g., EKS vs. AKS vs. GKE)
● Auto Scaling
● Logs, Disk failure
16. What to measure and monitor?
● System Metrics: CPU, Disk, I/O
● Availability
● Service specific KPIs
● Customer complaints
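As a rough illustration of the availability metric listed above, the sketch below computes availability as the fraction of successful health probes over an observation window. The `Probe` record and the `availability` function are hypothetical names introduced here, and the definition itself (successful probes divided by total probes) is an assumption; the talk does not prescribe a formula.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    """One health-check result (hypothetical record, not from the talk)."""
    timestamp: float  # seconds since the start of the observation window
    ok: bool          # True if the service answered the probe successfully


def availability(probes: Sequence[Probe]) -> float:
    """Return the fraction of probes that succeeded (assumed definition)."""
    if not probes:
        raise ValueError("no probes recorded")
    return sum(1 for p in probes if p.ok) / len(probes)


# Example window: four probes, one of which failed during the experiment.
window = [Probe(0.0, True), Probe(1.0, False), Probe(2.0, True), Probe(3.0, True)]
ratio = availability(window)
```

Comparing this ratio before, during, and after an experiment gives a simple baseline for judging whether the injected failure was visible to users.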
18. #1 - Battle Test Cloud Infrastructure
Real World Scenario: a company or user is evaluating cloud providers' managed Kubernetes offerings. Which one is more reliable?
The Hypothesis: shutting down a container (1/1) should cause only a small delay before the app is reachable again.
The Experiment: shut down the Kubernetes dashboard container.
Abort Conditions: the app is unreachable after 60 seconds.
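One way the 60-second abort condition above could be automated is a small polling loop that measures how long the app takes to become reachable again. This is only a sketch: `wait_until_reachable` and its parameters are hypothetical names, and the reachability check is passed in as a callable because the slides do not say how the app is probed. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


def wait_until_reachable(
    is_reachable: Callable[[], bool],
    timeout_s: float = 60.0,   # abort condition from the slide
    interval_s: float = 1.0,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the app is reachable.

    Returns the recovery time in seconds, or None if the app is still
    unreachable once timeout_s has elapsed (the experiment should abort).
    """
    start = clock()
    while True:
        if is_reachable():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


class FakeClock:
    """Deterministic clock for dry runs: sleeping just advances time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds
```

A return value of `None` means the abort condition fired; any other value is the observed delay, which can be compared across providers.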
23. #2 - Shutdown of a Container
Real World Scenario: a company or user is evaluating containers. Are they as reliable as promised?
The Hypothesis: yes, they will come back up.
The Experiment: shut down a container, wait a few seconds, and check whether it is up.
Abort Conditions: the app is unreachable after 60 seconds.
25. #3 - Blackholing Traffic to Catalog
Real World Scenario: a company or user is working with their UI team to provide a good user experience when there are API/DB issues.
The Hypothesis: images will not load, but the product listing will.
The Experiment: blackhole all traffic from the front end to the REST API and DB ports.
Abort Conditions: the app is unreachable after 60 seconds.
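The hypothesis above describes graceful degradation: the listing still renders while images are missing. A minimal sketch of that behavior follows, under loud assumptions: the talk does not show the demo app's code, so `render_listing`, `PLACEHOLDER_IMAGE`, and the product fields are all hypothetical, and the image lookup is modeled as a callable that raises `TimeoutError` or `ConnectionError` when its traffic is blackholed.

```python
from typing import Callable, Dict, List

# Hypothetical fallback asset served by the front end itself.
PLACEHOLDER_IMAGE = "/static/placeholder.png"


def render_listing(
    products: List[Dict[str, str]],
    fetch_image_url: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build the product listing, degrading to a placeholder image
    whenever the image lookup times out or cannot connect."""
    rendered = []
    for product in products:
        try:
            image = fetch_image_url(product["id"])
        except (TimeoutError, ConnectionError):
            # Dependency is unreachable: keep the listing, drop the image.
            image = PLACEHOLDER_IMAGE
        rendered.append({"name": product["name"], "image": image})
    return rendered


def blackholed_lookup(product_id: str) -> str:
    """Stand-in for a lookup whose traffic is being dropped."""
    raise TimeoutError(f"no response for product {product_id}")


listing = render_listing([{"id": "1", "name": "Mug"}], blackholed_lookup)
```

Because a blackhole drops packets rather than refusing connections, calls tend to hang instead of failing fast, so this kind of fallback only helps if the client enforces its own timeout.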