5. CHAOS ENGINEERING
“[T]he discipline of experimenting on a
distributed system in order to build confidence
in the system’s capability to withstand
turbulent conditions in production.”
Principles of Chaos Engineering
http://principlesofchaos.org
#OSCON
8. PagerDuty Simian Human Army
FAILURE FRIDAY
Time-boxed recurring meeting
Pre-announced agenda
Break things
Sign-off from service owners
Attendance
GROUND RULES
Keep monitoring & alerting
Abort if needed
13. 2 Years Later
ACCOMPLISHMENTS
Whole DC outages
Target multiple services at once
Distribute failure testing to teams
Automation (in progress)
14. Automation: Rationale
“MANY” HOSTS
- Distribute tasks to multiple people and keep executing manually.
- Watch the Operations team with envy while they use Chef and knife.
- Start automating.
20. PagerDuty/smoothie
UNICORN SUSPEND & RESUME RECIPES
# Pause the unicorn master process with SIGSTOP (it stays alive but stops running).
def recipe__unicorn_suspend_master(hosts)
  ssh_task 'suspend unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s STOP `cat /u/.../pids/unicorn.pid`'
  end
end

# Let the paused unicorn master continue with SIGCONT.
def recipe__unicorn_resume_master(hosts)
  ssh_task 'resume unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s CONT `cat /u/.../pids/unicorn.pid`'
  end
end
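The suspend and resume recipes come as a pair, and the ground rules call for aborting when needed, so the resume step should run even if the experiment is cut short. The following is a minimal sketch of one way to guarantee that, not part of PagerDuty/smoothie: `with_failure`, `demo_abort`, and the lambdas standing in for the recipes are hypothetical names that only record what was called.

```ruby
# Hypothetical helper (not part of PagerDuty/smoothie): run an injection,
# yield to the experiment, and always run the revert step afterwards.
def with_failure(inject:, revert:)
  inject.call
  yield
ensure
  # Runs whether the block finished normally or raised (for example on abort).
  revert.call
end

# Demo: lambdas stand in for the suspend/resume recipes and only log calls.
def demo_abort
  log = []
  begin
    with_failure(inject: -> { log << :suspend }, revert: -> { log << :resume }) do
      log << :observe
      raise 'abort: customer impact detected'
    end
  rescue RuntimeError
    log << :aborted
  end
  log
end

p demo_abort # prints [:suspend, :observe, :resume, :aborted]
```

In this sketch the revert runs before the abort is handled, so the host is back to normal by the time anyone reacts to the error.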
21. PagerDuty/smoothie
LATENCY RECIPE
# Add 500ms delay (with 100ms jitter) and 20% packet loss on eth0 via netem.
def recipe__tc_add_latency(hosts)
  ssh_task 'add network latency using tc' do
    members hosts
    execute 'sudo tc qdisc add dev eth0 root netem delay 500ms 100ms loss 20%'
  end
end

# Delete the netem qdisc from eth0, restoring normal network behavior.
def recipe__tc_remove_latency(hosts)
  ssh_task 'remove network latency using tc' do
    members hosts
    execute 'sudo tc qdisc del dev eth0 root netem'
  end
end
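In the netem command above, `delay 500ms 100ms` means a 500ms base delay with 100ms of jitter, and `loss 20%` drops a fifth of the packets. As an illustration only, the hard-coded strings could be built from named parameters. `netem_add_command` and `netem_del_command` are hypothetical helpers, not part of smoothie, and they only build the command strings without running anything.

```ruby
# Hypothetical helper: build the "tc ... netem" add command from parameters.
# Jitter and loss are omitted from the command when they are zero.
def netem_add_command(dev:, delay_ms:, jitter_ms: 0, loss_pct: 0)
  cmd = "sudo tc qdisc add dev #{dev} root netem delay #{delay_ms}ms"
  cmd += " #{jitter_ms}ms" if jitter_ms.positive?
  cmd += " loss #{loss_pct}%" if loss_pct.positive?
  cmd
end

# Hypothetical helper: build the matching command that removes the netem qdisc.
def netem_del_command(dev:)
  "sudo tc qdisc del dev #{dev} root netem"
end

puts netem_add_command(dev: 'eth0', delay_ms: 500, jitter_ms: 100, loss_pct: 20)
# prints: sudo tc qdisc add dev eth0 root netem delay 500ms 100ms loss 20%
puts netem_del_command(dev: 'eth0')
# prints: sudo tc qdisc del dev eth0 root netem
```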
26. Future
CHATOPS
Inject failures by invoking chat commands.
Share metrics and graphs to help people follow along.
Collect TODOs during Failure Fridays and generate a report.
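To make the first item concrete, here is a minimal sketch of how a chat message might be mapped to one of the recipes. The `!fail` prefix, the verb names, the `RECIPES` table, and `parse_chat_command` are all assumptions made for illustration. The slides do not describe an actual command syntax or chat integration.

```ruby
# Hypothetical mapping from chat verbs to the recipe method names in the slides.
RECIPES = {
  'latency add'     => :recipe__tc_add_latency,
  'latency remove'  => :recipe__tc_remove_latency,
  'unicorn suspend' => :recipe__unicorn_suspend_master,
  'unicorn resume'  => :recipe__unicorn_resume_master
}.freeze

# Parse a message such as "!fail latency add web1 web2" into a recipe name
# and a host list. Returns nil for anything that is not a recognized command.
def parse_chat_command(message)
  words = message.strip.split
  return nil unless words.shift == '!fail'

  recipe = RECIPES[words.shift(2).join(' ')]
  return nil if recipe.nil? || words.empty?

  { recipe: recipe, hosts: words }
end

p parse_chat_command('!fail latency add web1 web2')
p parse_chat_command('hello everyone') # not a command, so nil
```

Returning a plain hash keeps parsing separate from execution, so a bot could log or confirm the request before any recipe runs.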
27. Future
NEW TYPES OF FAILURES
Distributed Denial of Service (DDoS) attacks against services.
Impediments that come up during Incident Response.
29. PROPOSED EDIT
“Experiments that aren’t introducing new
insights should be automated and used to
monitor ongoing health of the system. New
experiments should be devised to continue to
push the bounds of the system.”
Culture From Chaos by @dougbarth
https://speakerdeck.com/dougbarth/culture-from-chaos