Talk given by Lyndsay Prewer, Technical Delivery Manager at Equal Experts, at ExpertTalks Leeds on June 11, 2019.
Embracing Collaborative Chaos
Today’s systems are inherently complex, with some component parts often operating in or close to suboptimal or failure modes. Left unchecked, as complexity increases, the compounding of failure modes will inevitably lead to catastrophic system failure. Chaos Days help us address this risk by spending time deliberately inducing failures, then analysing the response.
This session summarises our experience of running Chaos Days on a large-scale platform. We’ll explore the what, why, how and when of running a Chaos Day.
1. Making Software. Better.
Simple solutions to big business problems.
Equal Experts is a network of talented, experienced software consultants, specialising in agile delivery.
3. What is chaos engineering and why should we care?
Photo by Darius Bashar on Unsplash
4. Look at what I built today!
Google Cloud Dataflow In the Smart Home Data Pipeline
5. Operating on the edge of chaos
http://bit.ly/2ZavoyP
http://bit.ly/2QVeWzA
“Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage”
6. Predicting failure
Google Cloud Dataflow In the Smart Home Data Pipeline
● How many component parts does your system have?
● How are they connected?
● How reliable is each part?
● How reliable are the connections?
● What happens when X fails?
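The questions above can be made concrete with a little arithmetic. As a rough sketch, assuming every part and connection on a request path fails independently (a simplification, since real failures are often correlated), the availability of the whole path is the product of the individual availabilities. The figures and the function name below are illustrative, not measurements from the talk.

```python
from math import prod


def path_availability(availabilities):
    """Availability of a request path that needs every listed part to work.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical figures: each part or connection is up 99.9% of the time.
for count in (1, 10, 100):
    combined = path_availability([0.999] * count)
    print(f"{count:>3} parts in series: {combined:.4f}")
```

Under these assumptions, a path through 100 parts that are each 99.9% available is itself only about 90% available, which is why the number of parts and connections matters as much as the reliability of any one of them.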
7. Addressing the risk of unexpected failure
[Diagram: two component graphs, one with nodes A and B, the other with nodes A to I and Z]
● Address risk by deliberately inducing failure
● Observe, reflect and improve
● Build resilience in (like quality)
● Think about production (and failure) all the time
Simples Hard
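Deliberately inducing failure can start very small. The following is a minimal sketch, not the tooling used in the talk: a hypothetical `with_injected_failure` wrapper makes a chosen fraction of calls to a dependency raise an error, so that the caller's fallback behaviour can be observed. All names and the failure rate are assumptions for illustration.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_injected_failure(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Stand-in for a call to a downstream service (hypothetical).
    return {"id": user_id, "name": "example"}


def profile_or_default(fetch, user_id):
    # The behaviour under test: degrade gracefully when the dependency fails.
    try:
        return fetch(user_id)
    except InjectedFailure:
        return {"id": user_id, "name": "unknown"}


flaky_fetch = with_injected_failure(fetch_profile, failure_rate=0.3,
                                    rng=random.Random(42))
results = [profile_or_default(flaky_fetch, n) for n in range(1000)]
degraded = sum(1 for r in results if r["name"] == "unknown")
print(f"{degraded} of {len(results)} calls used the fallback")
```

Seeding the random generator keeps a run repeatable, which makes it easier to observe, reflect and compare behaviour before and after an improvement.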
11. In process chaos
● Part of normal engineering process
● Focus for all roles in the team
● Production thinking / building resilience in
Product
Owner
Dev QA Dev Ops
Focus on: Quality AND Production AND Resilience
Define Build Explore Deploy
12. Unplanned chaos
● Every day is a school day
● Handle incidents well
● Learn from incidents - post-incident reviews
● AWS podcast: http://bit.ly/31oQfAf
[Diagram: component graph with nodes A to I and Z]
13. How does it help?
● People: Knowledge, Behaviour, Expertise
● Process: Managing incidents, Learning from incidents, Engineering approach
● Product: Observability, Simplification, Alerting, Runbooks, Resilience
14. Running a Chaos Day - when and how?
15. Our context
Legacy systems
X00 million internal requests (busiest day)
X00 million log messages (busiest day)
x850 microservices
XXm Customers
60 Delivery teams
~1000 Microservices
6 Platform teams (AWS PaaS)
16. When were we ready for chaos?
2013 2014
Cloud, Docker, Scala, Mongo, ELK
Fast growth (teams, services, traffic)
17. When were we ready for chaos?
2013 2014 2015 2016
Cloud, Docker, Scala, Mongo, ELK
Fast growth (teams, services, traffic)
Multi active WIP
Multi active
18. When were we ready for chaos?
2013 2014 2015 2016 2017 2018
Cloud, Docker, Scala, Mongo, ELK
Fast growth (teams, services, traffic)
Multi active WIP
Multi active
More multi active (to AWS)
Self serve deploys
AWS
Ready for Chaos
19. When are you ready for chaos?
Manual
In process
Automated
Unplanned
20. Who, where and exactly how?
21. Agents of chaos
● Virtual, closed team
● Draw from component teams
● Experts / veterans
● Highest bus factor
22. Chaos scope - know thyself
● Know your architecture
● Know your steady state
● Know your constraints
○ What’s in your control?
○ What’s not?
○ What needs protecting?
X00 million internal requests (busiest day)
X00 million log messages (busiest day)
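Knowing the steady state is what lets an experiment say whether an induced failure actually hurt. One way to sketch this, assuming a handful of hypothetical metrics and tolerances that are not from the talk, is to record a baseline before the experiment and flag any metric that drifts beyond its tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Baseline value for a metric and the relative drift we will tolerate."""
    baseline: float
    tolerance: float  # e.g. 0.10 allows a 10% deviation either way


def deviations(steady_states, observed):
    """Return the names of metrics that are missing or outside tolerance."""
    breached = []
    for name, state in steady_states.items():
        value = observed.get(name)
        if value is None:
            # A metric we cannot see is treated as a breach.
            breached.append(name)
            continue
        allowed = abs(state.baseline) * state.tolerance
        if abs(value - state.baseline) > allowed:
            breached.append(name)
    return breached


# Hypothetical steady state for one service.
steady = {
    "requests_per_second": SteadyState(baseline=200.0, tolerance=0.20),
    "p95_latency_ms": SteadyState(baseline=120.0, tolerance=0.25),
    "error_rate": SteadyState(baseline=0.01, tolerance=1.00),
}

# Hypothetical readings taken while a failure is being induced.
during_experiment = {
    "requests_per_second": 190.0,
    "p95_latency_ms": 310.0,
    "error_rate": 0.015,
}

print(deviations(steady, during_experiment))
```

With these made-up numbers only the latency metric falls outside its tolerance. Treating a missing metric as a breach is a deliberate choice in this sketch: losing telemetry during an experiment is itself a finding.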
23. Chaos scope - trust the brains-storm
http://bit.ly/2XzR7Q9
24. Chaos scope - brainstorm, then plan the detail
Team X Team Y Team Z
25. Chaos scope - hack in amongst the chaos
Team X Team Y Team Z
26. Deciding where
● Production or closest to it
● Production (like) load
● Production (like) telemetry
● Decide the blast radius
● Decide comms channel(s)
Production
Staging
QA
Development
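Deciding the blast radius can be written down as an explicit rule rather than left to judgement on the day. The sketch below is one possible shape, assuming a hypothetical inventory of instance names, a set of protected instances that must never be touched, and a cap on the fraction of the estate that an experiment may disrupt. None of these names or numbers come from the talk.

```python
import random


def choose_targets(instances, protected, max_fraction, rng=None):
    """Pick instances to disrupt, never touching protected ones.

    At most max_fraction of all instances are selected (rounded down).
    """
    if rng is None:
        rng = random.Random()
    candidates = [name for name in instances if name not in protected]
    limit = min(int(len(instances) * max_fraction), len(candidates))
    return sorted(rng.sample(candidates, limit))


# Hypothetical inventory for a staging environment.
instances = [f"service-{n:02d}" for n in range(20)]
protected = {"service-00", "service-01"}  # e.g. things that need protecting

targets = choose_targets(instances, protected, max_fraction=0.25,
                         rng=random.Random(7))
print(targets)
```

Starting with a small `max_fraction` in a production-like environment and widening it over successive Chaos Days is one way to limit the pain while still learning something.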
32. Divide and conquer, then regroup
● Major on engineering improvements (people, process, product)
● Minor on chaos day improvements
● Component teams’ retros / incident reviews first
● Then team-of-teams retro
People, Process, Product
Team X
Team Y
Team Z
Team of teams
33. What did we learn?
● Manage/limit the pain
● Start small
● Production is a tough step
● Production-like is also hard!
● Have fun!
35. What’s your next chaos step?
Manual
In process
Automated
Unplanned
● Where are you at in the journey?
● What’s the next (baby) step?
● Need any help?
36. Thank You
United Kingdom
+44 203 603 7830
helloUK@equalexperts.com
Equal Experts UK Ltd
30 Brock Street
London NW1 3FG
India
+91 20 6607 7763
helloIndia@equalexperts.com
Equal Experts India Private Ltd
Office No. 4-C
Cerebrum IT Park No. B3
Kumar City, Kalyani Nagar
Pune, 411006
Canada
+1 403 775 4861
helloCanada@equalexperts.com
Equal Experts Devices Inc
205 - 279 Midpark way S.E.
T2X 1M2
Calgary, Alberta
Portugal
+351 211 378 414
helloPortugal@equalexperts.com
Equal Experts Portugal
Avenida Dom João II, Nº35
Edificio Infante 11ºA
1990-083 Parque das Nações
Lisboa – Portugal
USA
+1 866-943-9737
helloUSA@equalexperts.com
Equal Experts Inc
1460 Broadway
New York
NY 10036
LinkedIn
linkedin.com/company/equal-experts
Twitter
@EqualExperts
Web
www.equalexperts.com