This introductory slide deck discusses the challenge facing modern production systems: they must absorb increasing feature velocity and change while being more business-critical and reliable than ever.
13. 2010
The Netflix Eng Tools team created Chaos Monkey. Chaos
Monkey was created in response to Netflix’s move from
physical infrastructure to cloud infrastructure provided by
Amazon Web Services, and the need to be sure that a loss
of an Amazon instance wouldn’t affect the Netflix
streaming experience.
https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
17. “she caused a ‘mission’ to crash by selecting
the DSKY keys in an unexpected way, alerting
the team as to what would happen if the
prelaunch program, P01, were inadvertently
selected by a real astronaut during a real
mission, during real midcourse.”
Murphy, Niall Richard; Beyer, Betsy; Jones,
Chris; Petoff, Jennifer. Site Reliability
Engineering: How Google Runs Production
Systems. O'Reilly Media. Kindle Edition.
38. TBD
Wouldn’t it be
great if there was a
proactive practice for
exploring and diminishing
system weaknesses
before they affected
users?
Probably a pipe
dream…
45. 1. Form a hypothesis.
2. Communicate to your team.
3. Run experiments.
4. Analyze the results.
5. Increase the scope.
6. Automate experiments.
https://blog.codeship.com/embracing-the-chaos-of-chaos-engineering/
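The six steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a real chaos tool: `Cluster`, `run_experiment`, and `automate` are hypothetical names, and the "fault" is a simulated replica loss in memory rather than a real instance termination.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a service (hypothetical): healthy while enough replicas are up."""
    replicas: int
    min_healthy: int
    down: set = field(default_factory=set)

    def kill(self, n):
        # Inject the fault: mark the first n replicas as down.
        self.down = set(range(min(n, self.replicas)))

    def restore(self):
        # Roll the fault back so the next experiment starts from a clean state.
        self.down = set()

    def is_healthy(self):
        return self.replicas - len(self.down) >= self.min_healthy


def run_experiment(cluster, blast_radius, notify=print):
    """Steps 1-4 for a single blast radius; returns True if the hypothesis held."""
    # 1. Hypothesis: the cluster stays healthy with `blast_radius` replicas down.
    # 2. Communicate to the team before injecting anything.
    notify(f"Chaos experiment starting: killing {blast_radius} replica(s)")
    # 3. Run the experiment.
    cluster.kill(blast_radius)
    try:
        # 4. Analyze the result against the hypothesis.
        return cluster.is_healthy()
    finally:
        cluster.restore()


def automate(cluster, max_radius, notify=print):
    """Steps 5-6: widen the scope until the hypothesis fails.

    Returns the largest blast radius the cluster tolerated.
    """
    largest_ok = 0
    for radius in range(1, max_radius + 1):
        if not run_experiment(cluster, radius, notify):
            break
        largest_ok = radius
    return largest_ok
```

For example, `automate(Cluster(replicas=5, min_healthy=3), max_radius=5)` widens the blast radius one replica at a time and stops at the first radius where the health check fails. In a real system the fault injection and health check would be replaced by calls into the actual infrastructure and monitoring.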