2. «Failures are a given, and
everything will eventually fail
over time»
(Werner Vogels – CTO Amazon)
3. Why?
1. Growth of microservices and distributed cloud
architectures
2. The web has grown increasingly complex
3. We all depend on these systems more than ever
4. Failures have become much harder to predict
5. These failures cause costly outages for companies
4. From On-Premises ...
1. Before the Cloud, users were connected to our application through the Company’s local network
2. A server’s downtime was planned and involved stopping production
3. Monolithic architecture
5. ... To Cloud
1. Now our users are connected through the Internet
2. The workload on our services increases significantly,
because the applications themselves reach a much wider audience
3. Many microservices replace one monolith
6. Microservices: is it really a matter of size?
Common Characteristics
Componentisation via services
Organised around business capabilities
Decentralised data management
Products not projects
Decentralised governance
Smart endpoints and dumb pipes
Evolutionary design
Infrastructure automation
Designed for failure
We cannot say there is a formal definition of the microservices
architectural style, but we can attempt to describe what we see
as common characteristics for architectures that fit the label.
(Martin Fowler, James Lewis)
10. Change Mindset
Building a reliable application in the cloud is different
than building a reliable application in an enterprise setting
11. Reactive Manifesto
1. Jonas Bonér, Dave Farley, Roland Kuhn, Martin Thompson – 16.01.2014
2. The single most important thing is for the system to be responsive.
This means that a reactive system needs to remain responsive even when a failure occurs.
• https://www.reactivemanifesto.org/it
12. Resilient
System
• Networks
• Servers
• Applications
• Processes
• People
Resilience is the ability of a system to adapt to changes, failures & disturbances
Resilience is a function of People & Culture
13. Failures are a given
Availability          Downtime per year
95% (1-nine)          18 days 6 hours
99% (2-nines)         3 days 15 hours
99.9% (3-nines)       8 hours 45 minutes
99.99% (4-nines)      52 minutes
99.999% (5-nines)     5 minutes
99.9999% (6-nines)    31 seconds
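The downtime column follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name downtime_per_year is only illustrative):

```python
from datetime import timedelta

DAYS_PER_YEAR = 365  # assumption: a non-leap year


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(days=DAYS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    # Same availability levels as the table above.
    for percent in (95, 99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{percent}% -> {downtime_per_year(percent)}")
```

The table rounds each result down to the nearest convenient unit, so small differences from the printed values are expected.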
14. The beauty of Math at work
Component             Availability         Downtime
X                     99% (2-nines)        3 days 15 hours
Y                     99.99% (4-nines)     52 minutes
X and Y Combined      98.99%               3 days 16 hours 33 minutes

Component             Availability         Downtime
X                     99% (2-nines)        3 days 15 hours
Two X in parallel     99.99% (4-nines)     52 minutes
Three X in parallel   99.9999% (6-nines)   31 seconds
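The availability figures in both tables can be reproduced with two small formulas. This is a sketch that assumes component failures are independent, which real systems only approximate. When every component is required, the availabilities multiply. When components are redundant, the system is down only if all replicas are down at once. The names series and parallel are illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are required, so their availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components: down only when every replica is down."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    x, y = 0.99, 0.9999
    print(f"X and Y combined:    {series(x, y):.4%}")
    print(f"Two X in parallel:   {parallel(x, x):.4%}")
    print(f"Three X in parallel: {parallel(x, x, x):.4%}")
```

Chaining dependencies lowers availability below that of the weakest component, while redundancy raises it above that of any single replica.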
15. Chaos Engineering
Chaos Engineering is the discipline of experimenting on a system in
order to build confidence in the system’s capability to withstand turbulent
conditions in production.
https://principlesofchaos.org
• Instead of trying to avoid failure, chaos engineering
embraces it
• Provide evidence of system weaknesses through scientific
chaos engineering experiments
• Which kind of weaknesses? Dark Debt
17. History
1. 1564-1642: Galileo Galilei introduces the experimental scientific method
2. 1879-1955: «No amount of experiments will prove me right; a single experiment will prove me wrong» (A. Einstein)
3. 2000: Game Day by Jesse Robbins, the Master of Disaster
4. 2010: Chaos Monkey by Netflix. Why? To support the move from physical infrastructure to cloud infrastructure
5. 2011: Simian Army. We have to design a cloud architecture where individual components can fail without affecting the availability of the entire system
6. 2012: Netflix shared Chaos Monkey on GitHub
7. 2014: A new role. Chaos Engineer
18. Once upon a time in Seattle
«You don’t choose the moment, the moment chooses you»
«You only choose how prepared you are when it does»
Jesse Robbins, the Master of Disaster at Amazon
19. Chaos Experiment vs Testing
Testing
• Several sets of inputs and predicted outputs
• Limited scopes
• Is a programming practice that instructs developers
• Testing, strictly speaking, does not create new
knowledge
Chaos Experiment
• Discover weaknesses through experiments
• Limited scopes
• Experimentation creates new knowledge
20. Game Day
• An exercise designed to increase Resilience through large-scale fault
injection across critical systems.
• The goal of a Game Day is to practice how you, your team, and your
supporting systems deal with real-world turbulent conditions.
• Creating Resiliency through destruction
21. Sociotechnical System
Before starting your journey into chaos engineering, make sure you’ve done your homework and have built resiliency into every
level of your organization. Building resilient systems isn’t all about software. It starts at the infrastructure layer, progresses to
the network and data, influences application design, and extends to people and culture.
Adrian Hornsby
22. Notifications and Approvals
Table of notifications and approvals
Name           Role          Approved?
Bob Jennifer   Owner (CEO)   Yes
• Remember Conway’s Law
24. Dark Debt
• Dark Debt is not recognizable at the time of creation.
• Dark Debt arises from the unforeseen interactions of hardware or software
with other parts of the framework.
• Dark Debt is invisible until an anomaly reveals its presence.
• Platform
• Applications
• People, practices, and processes
25. The Phases of Chaos Engineering
Chaos engineering is NOT about letting monkeys loose or allowing them to break things randomly without a purpose.
Chaos engineering is about breaking things in a controlled environment.
26. Start with Experiments
• Get your team together and come up with a picture of your system (including people, practices, processes)
• Ask the right questions:
Where would it be most valuable to create an experiment that helps us build trust and confidence in our system
under turbulent conditions?
What could possibly go wrong?
• Chaos Engineering doesn’t guarantee you have the perfect system
• Chaos Engineering never ends
• Likelihood and Impact
27. Checkmate in three moves
Preparation
• Identification and mitigation of risks and impact from failure
• Reduces the frequency of failures (increases MTBF)
• Reduces the duration of recovery (decreases MTTR)
Participation
• Builds confidence & competence in responding to failure and under stress
• Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from
failures of all types
Exercises
• Trigger and expose «latent defects»
• Choose to discover them, instead of letting that be determined by the next real disaster.
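MTBF and MTTR, mentioned under Preparation, are tied to availability by the standard steady-state formula: availability = MTBF / (MTBF + MTTR). A minimal sketch follows. The figures used are hypothetical and only show the direction of the effect.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures, chosen only for illustration.
    baseline = availability(mtbf_hours=1000, mttr_hours=1.0)
    faster_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)
    fewer_failures = availability(mtbf_hours=2000, mttr_hours=1.0)
    print(f"baseline:        {baseline:.5%}")
    print(f"faster recovery: {faster_recovery:.5%}")
    print(f"fewer failures:  {fewer_failures:.5%}")
```

Both levers improve availability: failing less often raises MTBF, and recovering faster lowers MTTR.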
28. Likelihood-Impact Map
• The likelihood that a failure may occur
• The potential impact your system will experience if it does
[Figure: likelihood-impact map with an example failure, «API products become unavailable»; labels: Contribution, Availability]
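One way to turn a likelihood-impact map into a shortlist of experiments is to score each potential failure and sort. The sketch below rests on assumptions: the 1 to 5 scales, the product used as a priority score, and every failure except the API example are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureCard:
    description: str
    likelihood: int  # 1 (rare) to 5 (frequent); hypothetical scale
    impact: int      # 1 (minor) to 5 (severe); hypothetical scale

    @property
    def priority(self) -> int:
        # Assumption: likelihood times impact is a good enough ranking score.
        return self.likelihood * self.impact


def rank(cards: Iterable[FailureCard]) -> List[FailureCard]:
    """Order failures so the most likely and most damaging come first."""
    return sorted(cards, key=lambda card: card.priority, reverse=True)


if __name__ == "__main__":
    cards = [
        FailureCard("API products become unavailable", likelihood=4, impact=5),
        FailureCard("Report export is slow", likelihood=3, impact=2),      # hypothetical
        FailureCard("Primary database is lost", likelihood=1, impact=5),   # hypothetical
    ]
    for card in rank(cards):
        print(card.priority, card.description)
```

The failures at the top of the list are natural candidates for the first chaos experiments.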
29. Describe Your Experiment
• A steady-state hypothesis: A set of measurements that indicates that the system is working in an expected way
from a business perspective, and within a given set of tolerances
• A method: The set of activities you’re going to use to inject the turbulent conditions into the target system
• Rollbacks: A set of remediating actions through which you will attempt to repair what you have done
knowingly in your experiment’s method
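The three parts of an experiment description can be modelled as plain data. This is an illustrative sketch, not the format of any particular chaos tool: the Probe and Experiment classes, the in-memory system dictionary and the replica counts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Probe:
    """One steady-state measurement with its tolerance range."""
    name: str
    measure: Callable[[], float]
    low: float
    high: float

    def within_tolerance(self) -> bool:
        return self.low <= self.measure() <= self.high


@dataclass
class Experiment:
    title: str
    steady_state_hypothesis: List[Probe]
    method: List[Callable[[], None]]          # actions that inject turbulence
    rollbacks: List[Callable[[], None]] = field(default_factory=list)


# Hypothetical target system: a service with three replicas, kept in memory.
system = {"replicas": 3}


def healthy_replicas() -> float:
    return float(system["replicas"])


def kill_one_replica() -> None:
    system["replicas"] -= 1


def restore_replicas() -> None:
    system["replicas"] = 3


experiment = Experiment(
    title="Losing one replica still leaves at least two serving traffic",
    steady_state_hypothesis=[Probe("healthy replicas", healthy_replicas, low=2, high=3)],
    method=[kill_one_replica],
    rollbacks=[restore_replicas],
)
```

Keeping the hypothesis, method and rollbacks as separate fields makes it explicit what is measured, what is broken on purpose, and how the damage is undone.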
[Figure: experiment cycle: Explore → Discover → Analyze → Improve → Validate]
31. Demo
1. Using a chaos experiment to explore and discover weaknesses in the target system
2. Using a chaos experiment to discover and begin to analyze any weaknesses surfaced in the system
3. Once the challenge of analysis is done, it’s time to apply an improvement to the system (if needed)
4. Your chaos experiment becomes a chaos test to detect whether the weakness has indeed been overcome.
35. Under the skin of chaos run
1. Start
2. Experiment valid? No: experiment aborted. Yes: continue
3. Steady-state hypothesis. Not within tolerances: experiment aborted. Within tolerances: continue
4. Execute method
5. Steady-state hypothesis. Within tolerances: no deviations found. Not within tolerances: deviations found
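The flow on this slide can be sketched as a small function: validate the experiment, check the steady state, execute the method, then check the steady state again. The names below are hypothetical. Running the rollbacks after the final check is an assumption of this sketch, not something the slide states.

```python
from enum import Enum
from typing import Callable, Sequence


class Outcome(Enum):
    ABORTED = "experiment aborted"
    NO_DEVIATIONS = "no deviations found"
    DEVIATIONS = "deviations found"


def run_experiment(
    is_valid: Callable[[], bool],
    steady_state_ok: Callable[[], bool],
    method: Sequence[Callable[[], None]],
    rollbacks: Sequence[Callable[[], None]] = (),
) -> Outcome:
    """Run one chaos experiment following the flow described above."""
    if not is_valid():
        return Outcome.ABORTED
    # The system must be healthy before any turbulence is injected.
    if not steady_state_ok():
        return Outcome.ABORTED
    try:
        for action in method:
            action()
        # The second check is evaluated before the rollbacks below run.
        return Outcome.NO_DEVIATIONS if steady_state_ok() else Outcome.DEVIATIONS
    finally:
        # Assumption: rollbacks always run once the method has started.
        for rollback in rollbacks:
            rollback()
```

A "deviations found" outcome is the useful one: it is evidence of a weakness that can then be analyzed and improved.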
36. Steady-state hypothesis
Model that characterizes the steady-state of the system based on expected values of
the business metrics.
37. Canary Deployment
Start small and slowly build confidence within your team and your organization.
- How many customers are affected?
- What functionality is impaired?
- Which locations are impacted?
38. Benefits of Chaos Engineering
- First, chaos engineering helps you uncover the unknowns
in your system and fix them before they cause an incident in
production at 3am during the weekend. So, first:
improved resiliency and sleep.
- Second, a successful chaos engineering practice always
generates a lot more changes than anticipated, and these
are mostly cultural. Probably the most important of these
changes is a natural evolution towards a “non-blaming”
culture: the “Why did you do that?” turns into a “How
can we avoid doing that in the future?”, resulting in
happier and more efficient, empowered, engaged and
successful teams. And that’s gold!