Practical Chaos Engineering

Andrea Tosatto, Jacopo Nardiello
IDI 2018

$ who
Andrea Tosatto (@_hilbert_)
(PowerDNS) Solution Engineer
Open-Xchange - open-xchange.com
Jacopo Nardiello (@jnardiello)
Founder & DevOps Engineer
SIGHUP - sighup.io

Agenda
- Chaos Engineering, what the heck!?
- A Practical Approach
- Starting with Simple Tools
- Reverse Engineering
- Predictive Failures
- The Cloud Native Approach
- Reproducibility, Build/Ship/Run
- Kubernetes
- The Chaos Engineer Toolkit
- Wrap Up

Motivations
Modern architectures and businesses demand:
- Performance
- Availability
- Fault tolerance
- Velocity of feature releases
These four aspects should be balanced, and each should carry equal weight in technological and architectural choices.
This need is what eventually led to the rise of microservice-oriented architectures: each team
should be able to work and ship independently.

Motivations
The drawback of distributed architectures is complexity

Building Confidence
despite the Unknown

Core principle: Experiment
1. Identify the Steady State
2. Real World Events
3. Run Experiments in Production
4. Automate your Experiments
5. Identify and Minimize Blast Radius
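
The principles above can be read as a loop: check the steady state, inject one real-world event, check again, and always restore. The following is a minimal sketch in Python, assuming a toy in-memory service; `ToyService`, `run_experiment`, and the replica names are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: a set of healthy replicas."""
    replicas: set = field(default_factory=lambda: {"a", "b", "c"})

    def is_serving(self) -> bool:
        # Steady state: at least one replica can answer requests.
        return len(self.replicas) > 0

    def kill(self, name: str) -> None:
        # Real-world event: one replica disappears.
        self.replicas.discard(name)


def run_experiment(steady_state: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Check the steady state, inject one fault, then check again."""
    if not steady_state():
        return "aborted"  # never experiment on an already unhealthy system
    try:
        inject()
        return "held" if steady_state() else "deviated"
    finally:
        rollback()  # limit the blast radius: always restore


service = ToyService()
result = run_experiment(
    steady_state=service.is_serving,
    inject=lambda: service.kill("a"),  # small blast radius: one replica
    rollback=lambda: service.replicas.add("a"),
)
print(result)
```

Because `run_experiment` is a plain function, it could be scheduled as a routine job once the first manual run has succeeded.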

Trigger actual real world events and measure how
your system reacts.
There’s also the human factor to consider. Test and
measure the response from your team.

With testing, you want to catch bugs as far as possible
from production; with Chaos Engineering it’s the opposite.
You want to run your experiments as close as possible to
production, because we are working within the unknown and
evaluating the system as a whole.
The environment and the interactions in production are unique and almost
impossible to replicate reliably.

Once you have successfully run an experiment, automate it. Run it
as a routine to build up confidence over time.

Chaos Engineering is not about breaking production.
Chaos experiments should take careful, measured risks that
build upon each other to increase confidence.
Start small.
[Figure: from concentrated experiments at small scale to automated tests at large scale]

Chaos Engineering
is not about new tools

A Practical Approach - Start Small!
Introduce Chaos in your infrastructure
- Network latency / Packet loss - Network Emulation
- IO latency / Kernel Failure Injection - SystemTap
Reverse Engineering
- What’s the app doing - tracing
- What’s the bottleneck - profiling
Focus on a single host / application using standard (Linux) tools
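
As an illustration of the network emulation item, the sketch below builds the Linux `tc` commands that add latency and packet loss with the `netem` queueing discipline. The device name `eth0`, the 100 ms delay, and the 1% loss are assumptions chosen for the example; the helper names are hypothetical. The commands are only printed unless `dry_run=False` is passed, since applying them requires root.

```python
import shlex
import subprocess


def netem_add(dev: str, delay_ms: int, loss_pct: float) -> list:
    """Build the tc command that adds latency and packet loss on a device."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"]


def netem_clear(dev: str) -> list:
    """Build the tc command that removes the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root", "netem"]


def run(cmd: list, dry_run: bool = True) -> str:
    """Print the command; execute it only when dry_run is False (needs root)."""
    line = shlex.join(cmd)
    print(line)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


run(netem_add("eth0", delay_ms=100, loss_pct=1))
run(netem_clear("eth0"))
```

Keeping the clean-up command next to the injection command makes it harder to leave a host degraded after the experiment.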

Scaling our Chaos Experiments: The Chaos Toolkit
1. Defines a clear Open API to write and run your chaos engineering experiments
2. Integrates natively with cloud and cloud-native infrastructures
3. Ships with a number of predefined testing scenarios
4. Provides a simple CLI
5. Is easy to automate

$ chaos discover && chaos init
    {
        "version": "0.1.0",
        "title": "Moving a file from under our feet is forgivable",
        "description": "Our application should re-create a file that was removed",
        "steady-state-hypothesis": {
            "title": "The file must be around first",
            "probes": [
                {
                    "type": "probe",
                    "name": "file-must-exist",
                    "tolerance": true,
                    "provider": {
                        "type": "python",
                        "module": "os.path",
                        "func": "exists",
                        "arguments": {
                            "path": "some/file"
                        }
                    }
                }
            ]
        },
        "method": [
            {
                "type": "action",
                "name": "file-be-gone",
                "provider": {
                    "type": "python",
                    "module": "os",
                    "func": "remove",
                    "arguments": {
                        "path": "some/file"
                    }
                },
                "pauses": {
                    "after": 5
                }
            },
            {
                "ref": "file-must-exist"
            }
        ]
    }
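
To make the experiment above more concrete, here is a rough sketch of how a Python provider could be resolved and called: import the module, look up the function, pass the arguments. This is a simplified illustration, not the Chaos Toolkit's actual implementation. `call_provider` is a hypothetical helper, the action is assumed to point at `os.remove`, and a temporary file stands in for `some/file`.

```python
import importlib
import os
import tempfile


def call_provider(provider: dict):
    """Resolve module.func from a provider and call it with its arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


# Use a throwaway file instead of the experiment's "some/file".
fd, path = tempfile.mkstemp()
os.close(fd)

probe = {"module": "os.path", "func": "exists", "arguments": {"path": path}}
action = {"module": "os", "func": "remove", "arguments": {"path": path}}

before = call_provider(probe)   # steady state: the file is there
call_provider(action)           # method: remove the file
after = call_provider(probe)    # nothing re-creates the file in this sketch

print(before, after)
```

In the real experiment the second probe is expected to pass because the application under test re-creates the file during the pause; here no application is running, so the second check reports the file as missing.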

The Cloud Native Approach
“Cloud Native is structuring teams, culture and technology to utilize
automation and architectures to manage complexity and
unlock velocity” - Joe Beda

Learning from Mistakes the Cloud-Native way
1. Application probing
2. Rolling updates (done right)
3. Workload scheduling and anti-affinity rules
4. Ingresses and LoadBalancers
5. Write monitoring exporters for your KPIs

k8s probing
Liveness and readiness probes check the µservice through one of three handlers: exec, http, tcp.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 3
      periodSeconds: 3
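
On the application side, the http probe above only needs an endpoint that answers `/healthz`. A minimal sketch using only the Python standard library follows; `HealthHandler` and `probe` are hypothetical names, and the server binds to a free local port for the demonstration rather than to 8080.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Kubernetes treats any status from 200 to 399 as a passing probe.
        status = 200 if self.path == "/healthz" else 404
        body = b"ok" if status == 200 else b"not found"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def probe(port: int, path: str = "/healthz") -> int:
    """Issue the same kind of GET the kubelet would and return the status."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}{path}",
        headers={"X-Custom-Header": "Awesome"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.status


server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
status = probe(server.server_port)
server.shutdown()
server.server_close()
print(status)
```

A real health endpoint would usually check its own dependencies before answering, so that a failing probe reflects a service that actually cannot do its work.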

k8s rolling updates
1. Use multiple replicas when possible (stating the obvious…)
2. Leverage labels and annotations to test canary deployments
3. Set your rollout strategy with maxUnavailable and maxSurge
4. Leverage history and plan your emergency rollbacks
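
The effect of maxUnavailable and maxSurge is easiest to see with numbers. When given as percentages, Kubernetes rounds maxUnavailable down and maxSurge up. The sketch below computes the resulting bounds; `rollout_bounds` is a hypothetical helper and the replica counts are made up for the example.

```python
def rollout_bounds(replicas: int, max_unavailable_pct: int, max_surge_pct: int):
    """Return (minimum available pods, maximum total pods) during a rollout."""
    unavailable = replicas * max_unavailable_pct // 100   # rounds down
    surge = -(-replicas * max_surge_pct // 100)           # rounds up
    return replicas - unavailable, replicas + surge


print(rollout_bounds(10, 25, 25))
```

With 10 replicas and 25% for both settings, at least 8 pods stay available and at most 13 pods exist at any time during the rollout.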

k8s scheduling
Node selectors and pod (container) affinity and anti-affinity rules let you schedule your
workloads across your cluster to leverage multi-AZ architectures and take control of where
workloads are executed.
[Figure: workloads spread across Node 1, Node 2, Node 3]
    apiVersion: v1
    kind: Pod
    metadata:
      name: with-node-affinity
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/e2e-az-name
                operator: In
                values:
                - e2e-az1
                - e2e-az2
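
To illustrate how such a rule selects nodes: the entries in nodeSelectorTerms are ORed, while the matchExpressions inside one term are ANDed. The sketch below applies that logic to the rule above. It covers only the `In`, `NotIn`, and `Exists` operators, and the node names and labels are hypothetical.

```python
def matches_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry; only a few operators are sketched."""
    key, op = expr["key"], expr["operator"]
    if op == "In":
        return labels.get(key) in expr["values"]
    if op == "NotIn":
        return labels.get(key) not in expr["values"]
    if op == "Exists":
        return key in labels
    raise ValueError(f"operator not covered by this sketch: {op}")


def node_matches(labels: dict, terms: list) -> bool:
    """Terms are ORed; the expressions inside one term are ANDed."""
    return any(
        all(matches_expression(labels, e) for e in term["matchExpressions"])
        for term in terms
    )


terms = [{"matchExpressions": [{
    "key": "kubernetes.io/e2e-az-name",
    "operator": "In",
    "values": ["e2e-az1", "e2e-az2"],
}]}]

nodes = {  # hypothetical node labels
    "node-1": {"kubernetes.io/e2e-az-name": "e2e-az1"},
    "node-2": {"kubernetes.io/e2e-az-name": "e2e-az2"},
    "node-3": {"kubernetes.io/e2e-az-name": "e2e-az3"},
}
eligible = [name for name, labels in nodes.items() if node_matches(labels, terms)]
print(eligible)
```

Only the nodes labelled with one of the two listed zones remain eligible for the pod.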

k8s ingresses
Leverage Ingresses and LoadBalancers to implement a logical separation between
services/applications and the infrastructure edge (access layer) - stating the obvious.
[Figure: Users -> Ingress, Ingress -> Pod, Pod, Pod]

k8s monitoring
1. Instrument the application and the chaos probes to get a clear view of your
business metrics (metrics should represent your business KPIs and user behaviour,
not just ops availability).
2. When working with distributed architectures, use tracing to correlate actions and
events!
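
As a sketch of what exporting a business KPI could look like, the function below renders two metrics in the Prometheus text exposition format. Prometheus is an assumption here, since the slides do not name a monitoring system; the metric names `shop_orders_total` and `shop_checkout_seconds` are hypothetical.

```python
def render_metrics(orders_total: int, checkout_seconds: float) -> str:
    """Render two hypothetical business KPIs as Prometheus text exposition."""
    lines = [
        "# HELP shop_orders_total Orders completed since start.",
        "# TYPE shop_orders_total counter",
        f"shop_orders_total {orders_total}",
        "# HELP shop_checkout_seconds Duration of the last checkout.",
        "# TYPE shop_checkout_seconds gauge",
        f"shop_checkout_seconds {checkout_seconds}",
    ]
    return "\n".join(lines) + "\n"


print(render_metrics(orders_total=42, checkout_seconds=0.35))
```

Watching a metric such as completed orders during an experiment shows whether users were affected, which plain host availability cannot tell.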

Chaos Engineering naturally fits
Cloud Native infrastructures

Wrap Up
Chaos Engineering is not a new technology / methodology
Chaos Engineering is a powerful tool to
- improve confidence in managing complex distributed infrastructure
- meet business requirements on scalability and agility
Cloud Native Infrastructures are a natural fit for Chaos Engineering
- they natively provide APIs to implement fault-tolerant and elastic computing
- they reduce the blast radius of our Chaos Experiments, minimizing the risks
Experiment - Learn - Improve - Automate

Embrace the Chaos
principlesofchaos.org

Useful links
- The Chaos Toolkit, http://chaostoolkit.org/
- Bloomberg/powerfulseal, https://github.com/bloomberg/powerfulseal
- Kube-monkey, https://github.com/asobti/kube-monkey

Thanks
Andrea Tosatto (@_hilbert_), Jacopo Nardiello (@jnardiello)
