Failure is inevitable. In our modern world of continuously delivered and increasingly complex distributed architectures (looking at you, microservices), it is important to be able to test and improve our systems under a range of failure conditions.
In this talk, Matt discusses these complexities and the forces they exert on development teams, presenting some simple strategies and practical advice to deal with them.
7. Problems with traditional approaches
■ Integration test hell
■ Need to get by without E2E environments
■ Learnings are non-representative anyway
■ Slower
■ Costly (effort + $$)
8. Alternative?
Create an isolated, simulated environment
■ Run locally or in a CI environment
■ Fast - no need to set up complex test data, scenarios etc.
■ Enables single-variable hypothesis testing
■ Automatable
9. Lab Testing with Docker Compose
Hypothesis testing in simulated environments
10. Docker Compose
■ Docker container orchestration tool
■ Run locally or remotely
■ Works across platforms (Windows, Mac, *nix)
■ Easy to use (see the sketch below)
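To make that concrete, here is a minimal sketch of a compose file: an application container with an Nginx proxy in front of it. Service and image names here are illustrative placeholders, not the talk's actual stack.

```yaml
# docker-compose.yml - minimal illustrative sketch.
# Service and image names are placeholders, not the talk's actual stack.
version: "2"
services:
  api:
    image: my-org/api:latest      # hypothetical application under test
    ports:
      - "8080:8080"
  nginx:
    image: nginx:1.11
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - api
    ports:
      - "80:80"
```

`docker-compose up` brings the whole environment up locally or on CI; `docker-compose down` tears it down again.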
12. Nginx
Let’s take a practical, real-world example: Nginx as an API Proxy.
16. Nginx Testing
H0 = Introducing network latency does not cause errors
Test setup (see the compose sketch below):
● Nginx running locally, with Production configuration
● DNSMasq used to resolve production URLs to other Docker containers
● Muxy container set up, proxying the API
● A test harness to hit the API via Nginx n times, expecting 0 failures
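A sketch of how that wiring could look in compose. This is a hypothetical illustration: the images, static addresses and the example URL are all assumptions; only the shape matters (Nginx resolves the "production" URL via dnsmasq, which points it at Muxy, which proxies a stub API).

```yaml
# Illustrative wiring for the Nginx latency experiment. Everything here
# is an assumption except the shape: Nginx resolves the 'production' URL
# via dnsmasq, which points it at Muxy, which proxies a stub API.
version: "2"
networks:
  lab:
    ipam:
      config:
        - subnet: 172.16.238.0/24
services:
  dnsmasq:
    image: andyshinn/dnsmasq
    command: --address=/api.example.com/172.16.238.20   # 'prod' URL -> Muxy
    cap_add: [NET_ADMIN]
    networks:
      lab: { ipv4_address: 172.16.238.10 }
  muxy:
    image: mefellows/muxy           # mount your Muxy config per its README
    volumes:
      - ./muxy:/opt/muxy/conf
    networks:
      lab: { ipv4_address: 172.16.238.20 }
  api:
    image: my-org/stub-api          # stand-in for the real upstream
    networks:
      lab: { ipv4_address: 172.16.238.30 }
  nginx:
    image: nginx:1.11
    volumes:
      - ./prod/nginx.conf:/etc/nginx/nginx.conf:ro  # unmodified Production config
    dns: 172.16.238.10              # send DNS lookups to dnsmasq
    networks: [lab]
    ports:
      - "8000:80"
```

The harness then hits Nginx on port 8000 n times and asserts zero failures; any failure means we reject H0.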
19. Knobs and Levers
We now have a number of levers to pull. What if we...
● Want to improve on our SLA?
● Want to see how it performs if the API is hard down?
● ...
25. Chaos Engineering
● We expect our teams to build resilient applications
○ Fault tolerance across and within service boundaries
● We expect servers and dependent services to fail
● Let’s make that normal
● Production is a playground
● Levelling up
26. Chaos Engineering - Principles
1. Build a hypothesis around Steady State Behavior
2. Vary real-world events
3. Run experiments in production
4. Automate experiments to run continuously
Requires the ability to measure - you need metrics!!
http://www.principlesofchaos.org/
27. Production Hypothesis Testing
H0 = Loss of an AWS region does not result in errors
Test setup:
● Multi-region application setup for the video playing API
● Apply Chaos Kong to us-west-2
● Measure aggregate production traffic for ‘normal’ levels
28. Kill an AWS region
http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
29. Go/Hystrix API Demo
H0 = Introducing network latency does not cause API errors
Test setup (see the compose sketch below):
● API1 running with a Hystrix circuit breaker that trips if API2 does not respond within its SLA
● Muxy container set up, proxying upstream API2
● A test harness to hit API1 n times, expecting 0 failures
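Again as a hedged sketch (service names, images and the environment variable are assumptions, not the talk's actual repo): API1 talks to API2 only via Muxy, so we can degrade the link between them.

```yaml
# Illustrative wiring for the Go/Hystrix demo: API1 -> Muxy -> API2.
# Service and image names are assumptions, not the talk's actual repo.
version: "2"
services:
  api2:
    image: my-org/api2            # hypothetical upstream dependency
  muxy:
    image: mefellows/muxy         # injects latency in front of API2
    volumes:
      - ./muxy:/opt/muxy/conf     # latency middleware config, per Muxy's README
    depends_on:
      - api2
  api1:
    image: my-org/api1            # wraps calls to API2 in a Hystrix command
    environment:
      API2_BASE_URL: http://muxy:8181   # hypothetical env var; point at the proxy
    depends_on:
      - muxy
    ports:
      - "8080:8080"
```

With the circuit breaker configured correctly, API1 should keep returning (possibly degraded) responses while Muxy delays API2, so the harness still sees 0 failures and we fail to reject H0.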
32. Chernobyl
● Worst nuclear disaster of all time (1986)
● Public information sketchy
● Estimated > 3M Ukrainians affected
● Radioactive clouds sent over Europe
● Combination of system + human errors
● Series of seemingly logical steps -> catastrophe
33. What we know about human factors
● Accidents happen
● 1am - 8am = higher incidence of human errors
● Humans will ignore directions
○ They sometimes need to (e.g. override)
○ Other times they think they need to (mistake)
● Computers are better at following processes
34. Let’s use a Production deployment as a key example:
● CI -> CD pipeline used to deploy
● Production incident occurs 6 hours later (2am)
● ...what do we do?
● We trust the build pipeline, avoid non-standard actions
These events help us understand and improve our systems
35. “A game day exercise is where we intentionally try to break our system, with the goal of being able to understand it better and learn from it”
Game Day Exercises
36. Prerequisites:
● A game plan
● All team members and affected staff aware of it
● Close collaboration between Dev, Ops, Test, Product people etc.
● An open mind
● Hypotheses
● Metrics
● Bravery
Game Day Exercises
37. ● Get entire team together
● Make a simple diagram of the system on a whiteboard
● Come up with ~5 failure scenarios
● Write down hypotheses for each scenario
● Backup any data you can’t lose
● Induce each failure and observe the results
Game Day Exercises
https://stripe.com/blog/game-day-exercises-at-stripe
38. Examples of things that fail:
● Application dies
● Hard disk fails
● Machine dies < AZ < Region…
● GitHub/Source control goes down
● Build server dies
● Lost or degraded network connectivity
● Loss of dependent API
● ...
Game Day Exercises
40. ■ Apply the scientific method
■ Use metrics to learn and make decisions
■ Docker-compose + Muxy to automate failure
■ Build resilience into software & architecture
■ Regularly exercise Production resilience until it’s normal
■ Production outages are opportunities to learn
■ Start small!
Wrapping up
DiUS - who we are
100 or so developers, testers, UXers, BAs, IMs etc. in Melbourne and Sydney
We help businesses get their ideas to market - from software, to hardware and everything in between.
We’re a lot more than that, so if you’re interested in hearing more about us and what we do, come and chat after.
We are always on the lookout for talent
YOW? Hands?
OK, we have HEAPS to cover.
Inevitably with these things I get too excited and could go on for hours; it’s a really interesting topic, one that could be the subject of 100 of these sessions, but we only have 20 or so minutes.
It’s also a depressing talk, at least to begin with. We’re going to talk about failures and catastrophes... A lot.
I’m also going to tempt the Demo Gods which seems rather ironic, so if you could all please pray to your respective Gods that would be great.
My hope is to pique your interest in a few areas and provide you with some materials to go further
If I do my job well, we’ll have some tools/practices at the end that you can take back to your teams and talk about
WHY
Why this talk?
Approaches too labour intensive
No simple way to exercise failure in a lab environment - needs to be repeatable, automatable and so on
Put your hands up if you’ve never been involved with any sort of failure?
Context: User service, well defined boundaries
1 external collaborator, 0 dependencies
Few places where things can go wrong
Well defined practices to test / remediate
Chaos
Initial starting conditions dramatically change outcomes
Testing
Integration test hell
Need to be able to get by without E2E environments
It’s not Prod anyway, learnings will be non-representative
Genuine example of a Netflix architecture (mapped with Spigo)
Lots of places where failure can occur
If you’re using AWS/Cloud for this, you will be paying for all of the services you provision for this E2E test
Not to mention management of them (tools, people, process etc.)
One alternative...
Still non-representative, but cheaper, faster etc.
And one tool we have in the kit is Docker Compose
Hands up - Docker?
Docker Compose?
We want to be able to test this failover scenario
Now, it’s a terrible name. (Mux router, multiplexing and so on).
Unlike all of my other GH projects this one somehow got popular and it was too late to change!
It lets me (see the config sketch below):
Act as a proxy between 2 endpoints and intercept requests
Alter the network behaviour on a machine at Layer 4 - configuration for the network devices.
Alter the HTTP request/response cycle at Layer 7
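For flavour, here is roughly what a Muxy config looks like. I'm writing the field names from memory of the README, so treat the exact keys, units and middleware names as assumptions and check the Muxy docs before relying on them.

```yaml
# Approximate Muxy config - exact keys may differ from the current README.
proxy:
  - name: http_proxy            # Layer 7: sit between client and API
    config:
      host: 0.0.0.0
      port: 8181
      proxy_host: api           # the container/service being proxied
      proxy_port: 8080

middleware:
  - name: http_delay            # Layer 7 symptom: slow every response
    config:
      request_delay: 1000       # assumed to be milliseconds
  - name: network_shape         # Layer 4 symptom: shape the device via tc/netem
    config:
      latency: 250ms
      packet_loss: 0.1
```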
Another really nice tool is Toxy. Lots of bells/whistles, however it can’t screw with the actual network and requires Node.
Null hypothesis: “The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance. Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause.”
In plain English: the thing/variable we’re changing isn’t the cause of any change in behaviour we observe.
Out of the lab, and into real life
The Titanic was over-engineered for its day, costing about $7.5M, which is between $200M and $400M today.
But it failed and was a terrible catastrophe, mostly because we can now no longer get onto a boat without somebody impersonating Leonardo DiCaprio
I struggled to find a really good ‘opposite’, and almost talked about the ancient Roman state enemy Mithradates.
But the more I thought on it, the more I felt that we humans are a great example of the opposite.
Whether you think we’ve been designed by a God, or crafted serendipitously by evolution, you can’t argue that once we are thrown into the real world - we get better.
Hormesis (Mithridatism)
Emerging field in Computing, initially from the world of Economics, where asymmetric payoffs + increased uncertainty = greater results.
When applied to Engineering, the take-home point is that we need to subject our systems to failure more often, and in increasingly brutal ways, to make them better
Fragile < Resilient < Antifragile
Netflix are the pioneers in this space, in fact they have a dedicated Chaos Team
As per our lab experiments, we now take the same principles, but apply them to PRODUCTION
Drop this if it looks likely we’ll be running low on time
Here is Netflix, testing out their Chaos Kong (a bigger version of their Chaos Monkey) which takes out an entire AWS region.
Drop this if it looks likely we’ll be running low on time
Abnormal power surge in Reactor 4
Emergency shutdown (bad action)
Huge power surge, resulting in steam explosions exposing the graphite core which then ignited
Mistakes - Anyone at the Uber talk at YOW will note that their biggest incident happened between 12-1am (from memory)
We jump into the AWS console, see if we can SSH onto a box... holy crap, the boxes are dead. Let’s check DNS settings, see if we can point it at the inactive failover environment….
Incorrect.
We redeploy from CI.
At 3am, you run a serious risk of delegating your top-level domain to your personal blog.
Manual actions = catastrophe
Why wait until 3am? Let’s break our system more often!
Everything is up for grabs:
Process
Technology
People*
This is where we all get to learn, not just the ones deploying or managing the infra.