Chaos Engineering
Amazon Web Services
1.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Chaos Engineering: Why Breaking Things Should Be Practiced.
Adrian Hornsby, Cloud Infrastructure – Amazon Web Services
@adhorn
2.
Been there?
3.
“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO – Amazon.com
4.
… at the Edge
5.
Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough?
6.
Jesse Robbins
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
7.
Netflix 2013 https://medium.com/netflix-techblog
8.
Chaos Monkeys https://github.com/Netflix/SimianArmy
9.
10.
https://bit.ly/2uKOJMQ
11.
Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
“While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”
12.
What “really” is Chaos Engineering?
13.
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
http://principlesofchaos.org
14.
Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
15.
Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
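The application-level and latency items in the list above can be sketched in a few lines. This is a minimal illustration, not something shown in the talk: `inject_failure`, `InjectedFailure`, and `get_profile` are hypothetical names, and the `rng` and `sleep` parameters exist only so the behavior can be controlled in tests.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised when the wrapper deliberately fails a call."""


def inject_failure(failure_rate=0.0, latency_s=0.0,
                   rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed and a fraction of them fail.

    Hypothetical helper for illustration only. `failure_rate` is the
    probability of a forced error; `latency_s` is extra delay added to
    every call.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated network latency
            if rng() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Start small: 10% of calls fail and every call is 50 ms slower.
@inject_failure(failure_rate=0.1, latency_s=0.05)
def get_profile(user_id):
    return {"id": user_id}
```

Keeping the rate and delay as parameters makes it easy to begin with a very small blast radius and widen it as confidence grows.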
16.
17.
“CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.”
Nora Jones, Senior Chaos Engineer, Netflix
18.
Before breaking things …
19.
• People
• Application
• Network & Data
• Infrastructure
20.
Infrastructure
21.
Availability
Availability          Downtime per year
99% (2-nines)         3 days 15 hours
99.99% (4-nines)      52 minutes
99.999% (5-nines)     5 minutes
99.9999% (6-nines)    31 seconds
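The downtime figures in the table follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year (the function names are illustrative, and the results match the table up to rounding):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_seconds_per_year(availability_percent):
    """Return the downtime budget in seconds per year for an availability level."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def format_duration(seconds):
    """Format seconds as days/hours/minutes/seconds, omitting zero-valued units."""
    seconds = round(seconds)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (secs, "s")]
    return " ".join(f"{value}{unit}" for value, unit in parts if value) or "0s"


for pct in (99, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {format_duration(downtime_seconds_per_year(pct))}")
```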
22.
System Availability
Availability = Normal Operation Time / Total Time = MTBF / (MTBF + MTTR)
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures
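As a quick worked example of the formula, the sketch below compares two repair times for the same failure rate. The numbers are hypothetical and chosen only to show that shortening MTTR raises availability even when failures are just as frequent.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or (mtbf_hours + mttr_hours) == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: one failure every 1,000 hours on average.
slow_repair = availability(mtbf_hours=1000, mttr_hours=10)  # about 0.990
fast_repair = availability(mtbf_hours=1000, mttr_hours=1)   # about 0.999
print(f"{slow_repair:.4f} {fast_repair:.4f}")
```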
23.
Availability in Parallel
Component              Availability          Downtime
X                      99% (2-nines)         3 days 15 hours
Two X in parallel      99.99% (4-nines)      52 minutes
Three X in parallel    99.9999% (6-nines)    31 seconds
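The parallel figures come from the fact that a redundant group is down only when every copy is down at once. A minimal sketch, assuming the copies fail independently (an assumption that real systems with shared dependencies often violate):

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components with independent failures.

    The group is unavailable only when all copies are unavailable together,
    so the result is 1 - (1 - a) ** n.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1 - (1 - component_availability) ** copies


for n in (1, 2, 3):
    print(n, f"{parallel_availability(0.99, n):.6f}")
```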
24.
[Diagram labels: Availability Zone 1, Availability Zone 2, Availability Zone n, Multi-AZ Support, Instance Failure, Application]
25.
Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs
26.
Immutable Infrastructure
• No updates on live systems
• Always start from a new resource being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong
27.
Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!
28.
Network & Data
29.
Read / Write Sharding
[Diagram labels: App Instance (×3), RDS DB Instance Master (Multi-AZ), RDS DB Instance Read Replica (×3)]
30.
Database Federation
[Diagram labels: Users DB, Products DB, App Instance (×3)]
31.
Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
[Diagram labels: shards C, B, A; App Instance (×3)]
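The user-to-shard table on this slide is a directory lookup: the application consults the mapping and then talks only to the shard that owns the user. The sketch below reproduces that lookup. The hash-based fallback for users missing from the directory is an assumption added for illustration, not something the slide shows.

```python
import hashlib

# Directory copied from the slide: user ID -> shard ID.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}
SHARDS = ("A", "B", "C")


def shard_for(user_id):
    """Return the shard that owns `user_id`.

    Known users come from the directory. Unknown users fall back to a
    stable hash (an assumption for this sketch), so the same ID always
    maps to the same shard.
    """
    if user_id in SHARD_DIRECTORY:
        return SHARD_DIRECTORY[user_id]
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


print(shard_for("002347"))
```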
32.
Message passing for async patterns
[Diagram labels: A, Queue, B; A, Queue, B, Listener; Pub-Sub]
33.
[Diagram labels: Web Instances, API Instance (×3), Queue, Worker Instance (×2)]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Cache Result: {JobID: 0001, Result: bar}
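The message flow on this slide can be mimicked in a single process to make the hand-offs concrete. This is only a sketch: `queue.Queue` and a plain dict stand in for a managed queue and a cache, and the function names `api_submit`, `worker_step`, and `api_result` are hypothetical.

```python
import itertools
import queue

job_queue = queue.Queue()       # stands in for the queue between API and workers
result_cache = {}               # stands in for the cache holding finished results
_job_ids = itertools.count(1)


def api_submit(task):
    """API instance: enqueue the task and return a job ID immediately."""
    job_id = f"{next(_job_ids):04d}"
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def worker_step(handler):
    """Worker instance: take one job, run it, and cache the result."""
    job = job_queue.get()
    result_cache[job["JobID"]] = {
        "JobID": job["JobID"],
        "Result": handler(job["Task"]),
    }
    job_queue.task_done()


def api_result(job_id):
    """API instance: return the cached result, or None while the job is pending."""
    return result_cache.get(job_id)


ticket = api_submit("DO foo")
pending = api_result(ticket["JobID"])   # no worker has run yet
worker_step(lambda task: "bar")
done = api_result(ticket["JobID"])
```

Because the API returns a job ID straight away, a slow or failed worker delays the result but does not block the caller.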
34.
[Diagram labels: API Instance (×3), Queue, Worker Instance (×2), Cache, Amazon SNS, Push Notification, User]
35.
Exponential Backoff
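The slide names the technique without detail, so the following is one common way to implement it rather than the speaker's version: each retry waits up to twice as long as the previous one, capped at a maximum, with random jitter so that many clients do not retry in lockstep. All names and default values here are illustrative assumptions.

```python
import random
import time


def backoff_delays(base_s=0.1, factor=2.0, cap_s=10.0, attempts=5):
    """Yield capped exponential delays: base, base*factor, base*factor**2, ..."""
    for attempt in range(attempts):
        yield min(cap_s, base_s * factor ** attempt)


def call_with_retries(func, attempts=5, base_s=0.1, cap_s=10.0,
                      sleep=time.sleep, rng=random.uniform):
    """Call func(), retrying on any exception with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(base_s=base_s, cap_s=cap_s, attempts=attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return func()
        except Exception:  # a real client would catch only retryable errors
            if attempt == attempts:
                raise              # out of attempts: surface the last error
            sleep(rng(0, delay))   # wait a random time up to the current delay
```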
36.
Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
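The two bullets above translate into a small wrapper class. This is a minimal sketch under stated assumptions: the class and parameter names are hypothetical, and the reset timeout with a single trial call (often called the half-open state) is an addition beyond what the slide states.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class CircuitBreaker:
    def __init__(self, func, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.func = func
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a dependency that is already known to be failing.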
37.
Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
[Diagram labels: Amazon Route53, Resource A in US, Resource B in EU, User in US]
38.
Dynamic Routing
1. Improve Latency for end-users
2. Disaster Recovery
[Diagram labels: Users from San Francisco, Users from New York, Applications in US West, Applications in US East; each application group lists Service 1, Service 2, Service 3, Service 4]
39.
Application
40.
Stateless Services
[Diagram labels: AWS Region, AZ1, AZ2, Auto-Scaling Group, Data Store, Cache, User]
41.
Transient state does not belong in the database.
42.
CAP Theorem
• Consistency: Data is consistent. All nodes see the same state.
• Availability: Every request is non-failing.
• Partition Tolerance: Service still responds as expected if some nodes crash.
Distributed System
In the presence of a network partition, you must choose between consistency and availability!
43.
Eventual Consistency
… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
https://en.wikipedia.org/wiki/Eventual_consistency
An eventually consistent system can return any value before it converges!
[Diagram labels: Distributed System; Availability: Every request is non-failing.]
44.
[Diagram labels: Synchronous, Asynchronous, Process A, Process B, Waiting, Working, Continues, Get result, Get or fetch result]
45.
Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
46.
Exception Handling
47.
Service Degradation & Fallbacks
48.
People
49.
“It is not failure itself that holds you back; it is the fear of failure that paralyses you.”
Brian Tracy
50.
Conway’s Law
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
[Diagram labels: User, UI Team, Application Team, DBA Team, Siloed Teams, Siloed Applications]
51.
Conway’s Law
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
[Diagram labels: Services, Cross-Functional Teams]
52.
Fire Drills
53.
Phases of Chaos Engineering
54.
• Steady State
• Hypothesis
• Design Experiment
• Verify & Learn
• Fix
55.
What is Steady State?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
56.
What is Steady State?
• “normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
57.
Business Metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
58.
Steady State
Important:
• Know the value range of Healthy State!
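Knowing the healthy value range means it can be written down and checked automatically. The sketch below shows one way to do that. The metric names and thresholds are made-up examples, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """A metric and the value range considered healthy for it."""
    metric: str
    low: float
    high: float

    def is_healthy(self, value):
        return self.low <= value <= self.high


def violations(steady_states, observed):
    """Return the metrics that are missing or outside their healthy range."""
    return [
        state.metric
        for state in steady_states
        if state.metric not in observed
        or not state.is_healthy(observed[state.metric])
    ]


# Hypothetical definitions: one business metric and one technical metric.
STEADY_STATES = [
    SteadyState("orders_per_minute", low=900, high=1500),
    SteadyState("p99_latency_ms", low=0, high=300),
]

print(violations(STEADY_STATES, {"orders_per_minute": 1200, "p99_latency_ms": 450}))
```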
59.
Hypothesis: What if…?
• “What if this load balancer breaks?”
• “What if Redis becomes slow?”
• “What if a host on Cassandra goes away?”
• “What if latency increases by 300ms?”
• “What if the database stops?”
Make it everyone’s problem!
60.
Disclaimer!
Don’t make a hypothesis that you know will break you!
61.
Designing Experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
62.
Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
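The emergency stop can be built into the experiment runner itself. The sketch below is one possible shape, assuming the caller supplies three hypothetical callables: `inject` starts the fault, `rollback` removes it, and `read_metric` returns the current value of the steady-state metric.

```python
def run_experiment(inject, rollback, read_metric, healthy_range, max_steps=10):
    """Inject a fault, watch one metric, and abort if it leaves the healthy range.

    Returns "completed" if the metric stayed healthy for `max_steps` readings,
    or "aborted" if the emergency stop fired. The rollback always runs.
    """
    low, high = healthy_range
    inject()
    try:
        for _ in range(max_steps):
            value = read_metric()
            if not low <= value <= high:
                return "aborted"  # emergency stop: metric left the healthy range
        return "completed"
    finally:
        rollback()  # runs on completion, abort, or an unexpected error
```

Putting the rollback in a `finally` block means the fault is removed even if reading the metric itself fails.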
63.
Run the Experiment: Canary deployment
[Diagram labels: Users, Dynamic Routing (Route53), Old Version, New Version, 99% Users, 1% Users, Start with ..]
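The weighted split behind a canary can be illustrated without any DNS service. In this sketch the 99/1 weights are read from the slide, with the assumption that the 1% goes to the new version. `choose_version` is a hypothetical helper that imitates weighted routing in process and is not a Route53 API.

```python
import random


def choose_version(weights, rng=random.random):
    """Pick a version name with probability proportional to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    threshold = rng() * total
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return version
    return version  # floating-point guard: fall back to the last version


# Assumed split: 99% of users stay on the old version, 1% see the new one.
CANARY_WEIGHTS = {"old": 99, "new": 1}
print(choose_version(CANARY_WEIGHTS))
```

Increasing the weight of the new version step by step widens the blast radius only after the smaller step has held steady.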
64.
Verify & Learn: Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
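Each question above is a duration measured from the moment the fault was injected, so recording timestamps during the experiment is enough to answer them. A small sketch with a made-up timeline (event names and times are hypothetical):

```python
from datetime import datetime, timedelta


def seconds_since_injection(events):
    """Return seconds from fault injection to every other recorded event."""
    start = events["fault_injected"]
    return {
        name: (moment - start).total_seconds()
        for name, moment in events.items()
        if name != "fault_injected"
    }


# Hypothetical timeline for one experiment run.
t0 = datetime(2018, 6, 1, 10, 0, 0)
timeline = {
    "fault_injected": t0,
    "detected": t0 + timedelta(seconds=45),
    "team_notified": t0 + timedelta(minutes=2),
    "degradation_active": t0 + timedelta(minutes=3),
    "fully_recovered": t0 + timedelta(minutes=12),
}
timings = seconds_since_injection(timeline)
```

Comparing these numbers across repeated runs shows whether fixes actually shorten detection and recovery.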
65.
DON’T blame that one person …
66.
PostMortems
The 5 WHYs
Outage
• Because of …
• Because of …
• Because of …
• Because of …
67.
More questions to ask.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on this?
• Did others step in to help, to advise, or to intercede?
http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
68.
Fixing the issues!
69.
Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
70.
Changing Culture takes time! Be patient…
71.
More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
72.
Thank you!
@adhorn
https://medium.com/@adhorn