18. CHAOS ENGINEERING
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in production.”
https//principlesofchaos.org
19. CHAOS ENGINEERING
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in production.”
https//principlesofchaos.org
35. DEFINE HYPOTHESIS
BRAINSTORM WHAT CAN GO WRONG
BRING EVERYONE
DEVELOPERS
SRE /OPERATIONS
NETWORKS
BUSINESS
INFRASTRUCTURE
TESTERS
WHAT CAN GO WRONG?
WHAT IFDATABASE IS DOWN?
WHAT IFSERVICE RESPONDS SLOWER?
WHAT IFMY CACHE RESPONDS SLOW?
WHAT IFA POD DIES?
WHAT IF LOADBALANCER STOPS?
WHAT IF….?
46. MULTI PARALELLISM
PARALLELISM AVAILABILITY DOWNTIME PER YEAR
1 99% 3 DAYS 16 HOURS
2 99,99% 53 MINUTES
3 99,9999% 32 SECONDS
HOW PARALEL IS YOUR CLOUD COMPONENT ?
REGIONSAVAILABILITY ZONES
56. WRAP UP
BIG CULTURE CHANGE
FULL CYCLE DEVELOPERSPRODUCTION ACCESS
START EXPERIMENTING
START SMALL CHECK OUT TOOLSOBSERVABILITY
57. “CHAOS ENGINEERING DOESN’T CAUSE
PROBLEMS, IT JUST REVEALS THEM”
NORA JONES – CHAOS ENGINEERING LEAD SLACK
58. GEERT VAN DER CRUIJSEN
@GEERTVDC
THANK YOU!ALL PICTURES USED ARE FROM UNSPLASHED.COM
RESOURCES
BOOKS:
Chaosengineering-O’Reilly
Chaosengineeringobservability -O’Reilly
TOOLS:
chaostoolkit.org
gremlin.com
github.com/netflix/simianarmy
github.com/asobti/kube-monkey
RESOURCES:
principlesofchaos.org
github.com/dastergon/awesome-chaos-engineering
docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
Editor's Notes
Why? Easy you can put “chaos engineer” as function title on your resume
Who did fail over test to other data center?
We’ve started using the cloud and building distributed applications
Deathstar of amazon
Who did fail over test to other data center?
How do we monitor this kind of stuff?
Netflix SPS
Netflix SPS
It has been reported that every 100ms of latency costs Amazon 1% of profit
INFRA: cloud is providing this for us right?
NETWORK: The network is always reliable? We know it is not. How about switching over? How do we test that? It’s often one of the easiest
APPLICATION: how do applications hold up when errors occur? What if the database is not accesible?
PEOPLE: People intervention. Is that making it wose? Fire drills. Do we fire drill for IT?
Partial failure mode
You have to think TOGETHER with business of the impact of failure.
Fail open example
Partial failure mode
Partial failure mode
Partial failure mode
Chaos engineering is like vaccination.
We add small amounts of harm to make the full system more immune to the effects
Why? Easy you can put “chaos engineer” as function title on your resume
Why? Easy you can put “chaos engineer” as function title on your resume
Partial failure mode
incident-response learnin
incident-response learning
MTTR
incident-response learnin
MULTI REGION, MULTI AVAILABILITY ZONE
Idempotency: can i do the same thing with same effect?
Safety: does it change the end state?
Example in kubernetes fixed CPU / memory reservations
Not 1 application can kill others by using max CPU/MEMORY
Fusebox
When things go wrong stop retrying
Polly
Command and Query Responsibility Segregation
Chaos engineering is like vaccination.
We add small amounts of harm to make the full system more immune to the effects
Why? Easy you can put “chaos engineer” as function title on your resume