Practical Chaos Engineering shows how to start running chaos experiments in your infrastructure and guides you through the principles of chaos.
5. Motivations
Modern architectures and businesses demand:
- Performance
- Availability
- Fault tolerance
- Velocity of feature releases
These four aspects should carry equal weight in technological and architectural choices.
This balance is what eventually led to the rise of microservice-oriented architectures: each team should be able to work and ship independently.
8. Core principle: Experiment
1. Identify the Steady State
2. Real World Events
3. Run Experiments in Production
4. Automate your Experiments
5. Identify and Minimize Blast Radius
10. Core principle: Experiment - Real World Events
Trigger actual real-world events and measure how your system reacts.
There's also the human factor to consider: test and measure the response from your team.
11. Core principle: Experiment - Run Experiments in Production
With testing, you want to catch bugs as far as possible from production. With Chaos Engineering it's the opposite: you want to run your experiments as close as possible to production, because you are working within the unknown and evaluating the system as a whole.
The environment and interactions in production are unique and almost impossible to replicate reliably.
12. Core principle: Experiment - Automate your Experiments
Once you have successfully run an experiment, automate it. Run it routinely to build up confidence over time.
13. Core principle: Experiment - Identify and Minimize Blast Radius
Chaos Engineering is not about breaking production.
Chaos experiments should take careful, measured risks that build upon each other to increase confidence.
Start small.
[Diagram: a scale from concentrated experiments (small scale) to automated tests (large scale)]
15. A practical Approach - Start Small!
Introduce Chaos in your infrastructure
- Network latency / Packet loss - Network Emulation
- IO latency / Kernel Failure Injection - SystemTap
Reverse Engineering
- What’s the app doing - tracing
- What’s the bottleneck - profiling
Focus on a single host / application using standard (Linux) tools
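As a minimal sketch of the network latency / packet loss item above, the Linux `tc` tool can attach a `netem` (network emulation) queueing discipline to an interface. The helper names, the `eth0` device and the delay and loss values below are assumptions chosen for illustration; the function only builds the command and runs it when explicitly asked to, since changing a qdisc requires root.

```python
import shlex
import subprocess


def netem_command(device, delay_ms=0, loss_pct=0.0, action="add"):
    """Build the `tc` argv that adds, changes or deletes a root netem qdisc."""
    argv = ["tc", "qdisc", action, "dev", device, "root", "netem"]
    if action != "del":
        if delay_ms:
            argv += ["delay", f"{delay_ms}ms"]
        if loss_pct:
            argv += ["loss", f"{loss_pct:g}%"]
    return argv


def apply(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    print(shlex.join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical experiment: 100 ms of extra latency and 1% packet loss on eth0.
    apply(netem_command("eth0", delay_ms=100, loss_pct=1))
    # Always plan the rollback together with the injection.
    apply(netem_command("eth0", action="del"))
```

Keeping the rollback command next to the injection is one way to keep the blast radius of a single-host experiment small.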
16. Scaling our Chaos Experiments: The Chaos Toolkit
1. Defines a clear Open API to write and run your chaos engineering experiments
2. Integrates natively with cloud and cloud-native infrastructures
3. Ships with a number of predefined testing scenarios
4. Provides a simple CLI
5. Is easy to automate
17. $ chaos discover && chaos init
{
    "version": "0.1.0",
    "title": "Moving a file from under our feet is forgivable",
    "description": "Our application should re-create a file that was removed",
    "steady-state-hypothesis": {
        "title": "The file must be around first",
        "probes": [
            {
                "type": "python",
                "name": "file-must-exist",
                "tolerance": true,
                "provider": {
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {
                        "path": "some/file"
                    }
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "file-be-gone",
            "provider": {
                "module": "os",
                "func": "remove",
                "arguments": {
                    "path": "some/file"
                }
            },
            "pauses": {
                "after": 5
            }
        },
        {
            "ref": "file-must-exist"
        }
    ]
}
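To make the flow of such an experiment concrete, here is a minimal sketch of how a runner could interpret the structure: check the steady-state hypothesis, apply the actions in the method, then check the hypothesis again. This is not the Chaos Toolkit's implementation; `call_provider`, `steady_state_holds`, `run_experiment` and `demo` are hypothetical names, and the sketch ignores `pauses` and `ref` entries. The demo uses `os.remove` for the action and a temporary file instead of `some/file`.

```python
import importlib
import os
import tempfile


def call_provider(provider):
    """Import provider['module'] and call provider['func'] with its arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


def steady_state_holds(probes):
    """Return True when every probe result equals its declared tolerance."""
    return all(call_provider(p["provider"]) == p["tolerance"] for p in probes)


def run_experiment(experiment):
    """Check the hypothesis, apply the method's actions, then check again."""
    probes = experiment["steady-state-hypothesis"]["probes"]
    if not steady_state_holds(probes):
        return "aborted: steady state not met before the experiment"
    for step in experiment["method"]:
        if step.get("type") == "action":
            call_provider(step["provider"])
    return "steady state held" if steady_state_holds(probes) else "deviation found"


def demo():
    """Run the file-removal experiment against a throwaway temporary file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    experiment = {
        "steady-state-hypothesis": {
            "probes": [{
                "name": "file-must-exist",
                "tolerance": True,
                "provider": {"module": "os.path", "func": "exists",
                             "arguments": {"path": path}},
            }],
        },
        "method": [{
            "type": "action",
            "name": "file-be-gone",
            "provider": {"module": "os", "func": "remove",
                         "arguments": {"path": path}},
        }],
    }
    try:
        return run_experiment(experiment)
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Nothing re-creates the file in the demo, so the second hypothesis check reports a deviation. That is the kind of finding an experiment is meant to surface: the application under test would have to restore the file for the steady state to hold.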
21. The Cloud Native Approach
“Cloud Native is structuring teams, culture and technology to utilize
automation and architectures to manage complexity and
unlock velocity” - Joe Beda
22. Learning from Mistakes the Cloud-Native way
1. Application probing
2. Rolling updates (done right)
3. Workload scheduling and anti-affinity rules
4. Ingresses and LoadBalancers
5. Write monitoring exporters for your KPIs
24. k8s rolling updates
1. Use multiple replicas when possible (stating the obvious…)
2. Leverage labels and annotations to test canary deployments
3. Set your rollout strategy with maxUnavailable and maxSurge
4. Leverage history and plan your emergency rollbacks
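To see what maxUnavailable and maxSurge mean for a given Deployment, the following sketch computes the pod-count bounds during a rolling update. It relies on the documented Kubernetes behaviour that percentage values round up for maxSurge and down for maxUnavailable, and that both default to 25%. The function names are hypothetical, and the sketch does not cover every edge case the Deployment controller handles.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count envelope a rolling update stays within."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


if __name__ == "__main__":
    # 10 replicas with the defaults: up to 13 pods in total, at least 8 available.
    print(rollout_bounds(10))
    # A conservative strategy: one extra pod at a time, never below full capacity.
    print(rollout_bounds(4, max_surge=1, max_unavailable=0))
```

Running the numbers before a chaos experiment shows how much capacity remains if a failure is injected in the middle of a rollout.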
25. k8s scheduling
Node selectors and pod affinity and anti-affinity rules let you spread your workloads across the cluster, leverage multi-AZ architectures and control where workloads are executed.
[Diagram: Node 1, Node 2, Node 3]
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
26. k8s ingresses
Leverage Ingresses and LoadBalancers to implement a logical separation between services/applications and infrastructure edges (the access layer) - stating the obvious.
[Diagram: Users reach the Pods through Ingresses]
27. k8s monitoring
1. Instrument the application and the chaos probes to get a clear view of your business metrics (metrics should represent your business KPIs and user behaviour, not just ops availability).
2. When working with distributed architectures, use tracing to correlate actions and events!
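As a rough sketch of an exporter for business KPIs, the snippet below renders a dictionary of values in the Prometheus text exposition format and serves it over HTTP using only the standard library. The choice of Prometheus is an assumption, since the slides do not name a monitoring system, and the KPI name, its value and the port are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical KPI: in a real exporter this value would come from the application.
KPIS = {"checkout_success_ratio": ("Share of checkouts that succeed", 0.98)}


def render_metrics(kpis):
    """Render {name: (help_text, value)} as Prometheus-style gauge lines."""
    lines = []
    for name, (help_text, value) in sorted(kpis.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current KPI values as plain text on every GET request.
        body = render_metrics(KPIS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


if __name__ == "__main__":
    print(render_metrics(KPIS), end="")
```

A chaos probe can then compare such a business metric before and after an experiment, rather than relying on host availability alone.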
29. Wrap Up
Chaos Engineering is not a new technology or methodology
Chaos Engineering is a powerful tool to
- improve confidence in managing complex distributed infrastructure
- meet business requirements on scalability and agility
Cloud Native infrastructures are a natural fit for Chaos Engineering
- they natively provide APIs to implement fault-tolerant and elastic computing
- they reduce the blast radius of our chaos experiments, minimizing the risks
Experiment - Learn - Improve - Automate