Embracing Failure 
Fault Injection and Service Resilience at Netflix 
Josh Evans – Director of Operations Engineering 
Naresh Gopalani – Software Engineer and Architect 
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Netflix Ecosystem 
• ~50 million members, ~50 countries 
• > 1 billion hours per month 
• > 1000 device types 
• 3 AWS Regions, hundreds of services 
• Hundreds of thousands of requests/second 
• CDN serves petabytes of data at terabits/second 
Static Content 
Akamai 
Netflix CDN 
Service Partners 
AWS/Netflix Control Plane 
Internet
Availability means that members can 
● sign up 
● activate a device 
● browse 
● watch
What keeps us up at night
Failures can happen any time 
• Disks fail 
• Power outages 
• Natural disasters 
• Software bugs 
• Human error
We design for failure 
• Exception handling 
• Fault tolerance and isolation 
• Fall-backs and degraded experiences 
• Auto-scaling clusters 
• Redundancy
Testing for failure is hard 
• Web-scale traffic 
• Massive, changing data sets 
• Complex interactions and request patterns 
• Asynchronous, concurrent requests 
• Complete and partial failure modes 
Constant innovation and change
What if we regularly inject failures into our systems under controlled circumstances?
Blast Radius 
• Unit of isolation 
• Scope of an outage 
• Scope of a chaos exercise 
Instance 
Zone 
Region 
Global
An Instance Fails 
Edge Cluster 
Cluster A 
Cluster B 
Cluster C 
Cluster D
Chaos Monkey 
• Monkey loose in your DC 
• Run during business hours 
• What we learned 
– Auto-replacement works 
– State is problematic
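
The instance-level exercise on this slide can be sketched roughly as follows: pick one random instance from a cluster, and only act during business hours so that engineers are around to respond. This is a minimal illustration, not the Simian Army implementation; the function names and the weekday 09:00 to 17:00 window are assumptions, and a real tool would call a cloud API to terminate the chosen instance.

```python
import random
from datetime import datetime


def in_business_hours(now: datetime) -> bool:
    """Assumed window: Monday to Friday, 09:00 up to (not including) 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances, now, rng=random):
    """Return one instance id to terminate, or None if nothing should be killed.

    `instances` is a list of instance ids for a single cluster (hypothetical
    input); `rng` is injectable so the choice can be made reproducible.
    """
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))
```

Keeping the time check separate from the selection makes it easy to widen or narrow the window without touching the part that chooses the instance.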
A State of Xen - Chaos Monkey & Cassandra 
Out of our 2700+ Cassandra nodes 
• 218 rebooted 
• 22 did not reboot successfully 
• Automation replaced failed nodes 
• 0 downtime due to reboot
An Availability Zone Fails 
EU-West 
US-West US-East 
AZ1 
AZ2
Chaos Gorilla 
Simulate an availability zone outage 
• 3-zone configuration 
• Eliminate one zone 
• Ensure that others can handle the load and nothing breaks 
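
The "others can handle the load" condition can be checked with simple arithmetic before running a zone exercise. The sketch below assumes that load from the failed zone is redistributed across the surviving zones and that capacity and load are expressed in the same unit (for example requests per second); both function names are hypothetical.

```python
def survives_zone_loss(capacity, load, failed_zone):
    """True if the zones other than `failed_zone` can carry the total load.

    `capacity` and `load` map zone name -> number, in the same unit.
    """
    remaining = sum(c for zone, c in capacity.items() if zone != failed_zone)
    return remaining >= sum(load.values())


def required_headroom(zones: int) -> float:
    """Extra capacity each zone needs, as a fraction of its normal load,
    to absorb one lost zone when load is spread evenly."""
    return 1 / (zones - 1)
```

With three evenly loaded zones, each survivor picks up half of a lost zone's traffic, so every zone needs 50% headroom: `required_headroom(3)` is `0.5`.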
Challenges 
• Rapidly shifting traffic 
– LBs must expire connections quickly 
– Lingering connections to caches must be addressed 
• Service configuration 
– Not all clusters auto-scaled or pinned 
– Services not configured for cross-zone calls 
– Mismatched timeouts – fallbacks prevented fail-over
A Region Fails 
US-West US-East EU-West
Regional Load Balancers 
Zuul – Traffic Shaping/Routing 
AZ1 AZ2 AZ3 
Data Data Data 
Geo-located 
Chaos Kong 
Chaos Kong 
Regional Load Balancers 
Zuul – Traffic Shaping/Routing 
AZ1 AZ2 AZ3 
Data Data Data 
Customer 
Device
Challenges 
● Rapidly shifting traffic 
○ Auto-scaling configuration 
○ Static configuration/pinning 
○ Instance start time 
○ Cache fill time
Challenges 
● Service Configuration 
○ Timeout configurations 
○ Fallbacks fail or don’t provide the desired experience 
● No minimal (critical) stack 
○ Any service may be critical!
A Service Fails 
Service 
Zone 
Region 
Global
Services Slow Down and Fail 
Simulate latent/failed service calls 
• Inject arbitrary latency and errors at the service level 
• Observe for effects 
Latency Monkey
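
Injecting latency and errors at the service level can be illustrated with a small wrapper around a service call. This is a sketch under stated assumptions, not Latency Monkey's code: `inject_faults`, `InjectedFailure`, and the parameter names are invented for the example, and the delay and error rate are fixed values rather than anything taken from the talk.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real service error (hypothetical name)."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=random, sleep=time.sleep):
    """Wrap a call so it is delayed by `latency_s` seconds and fails with
    probability `error_rate`. `rng` and `sleep` are injectable for testing."""
    def decorator(call):
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow dependency
            if rng.random() < error_rate:
                raise InjectedFailure(call.__name__)  # simulate a failed call
            return call(*args, **kwargs)
        return wrapper
    return decorator
```

A caller would decorate a client method, for example `@inject_faults(latency_s=0.5, error_rate=0.1)`, and then observe whether timeouts and fallbacks upstream behave as designed.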
Latency Monkey 
Device ELB Zuul Edge Service B 
Service C 
Internet 
Service A
Challenges 
• Startup resiliency is an issue 
• Service owners don’t know all dependencies 
• Fallbacks can fail too 
• Second order effects not easily tested 
• Dependencies are in constant flux 
• Latency Monkey tests function and scale 
– Not a staged approach 
– Lots of opt-outs
More Precise and Continuous
Service Failure Testing: FIT
Distributed Systems Fail 
● Complex interactions at scale 
● Variability across services 
● Byzantine failures 
● Combinatorial complexity
Any service can cause cascading failures 
ELB
Fault Injection Testing (FIT) 
Device Service B 
Service C 
Internet Zuul 
Edge 
Device or Account Override 
Service A 
Request-level simulations 
ELB
Failure Injection Points 
IPC Cassandra Client Memcached Client Service Container Fault Tolerance
FIT Details 
● Common Simulation Syntax 
● Single Simulation Interface 
● Transported via HTTP request header
Integrating Failure 
[sendRequestHeader] >>fit.failure: 1|fit.Serializer|2|[[{"name":"failSocial","whitelist":false,"injectionPoints":["SocialService"]},{}]],{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
[Diagram: request enters Service A and Service B, each shown as Filter, Service, Ribbon; hooks labeled ServerRcv, ClientSend, ServerRcv; response returned]
Failure Scenarios 
● Set of injection points to fail 
● Defined based on 
○ Past outages 
○ Specific dependency interactions 
○ Whitelist of a set of critical services 
○ Dynamic tracing of dependencies
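
A scenario as a set of injection points, carried on the request and checked at each point, might look like the sketch below. The header name `fit.failure` and the fields `name` and `injectionPoints` appear on the slides; everything else is an assumption. In particular, plain JSON stands in for FIT's own serializer, the `whitelist` field is left out because the slides do not define its semantics, and the function names are hypothetical.

```python
import json

FIT_HEADER = "fit.failure"  # header name as shown on the slide


def encode_scenario(name, injection_points):
    """Serialize a scenario to a header value (plain JSON, an assumption)."""
    return json.dumps({"name": name, "injectionPoints": sorted(injection_points)})


def should_fail(headers, injection_point):
    """True if the scenario attached to this request names `injection_point`."""
    raw = headers.get(FIT_HEADER)
    if raw is None:
        return False  # no scenario attached: behave normally
    return injection_point in json.loads(raw)["injectionPoints"]
```

Because the decision depends only on the request's own headers, a failure can be scoped to a single device or account without affecting other traffic.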
FIT Insights: Salp 
● Distributed tracing inspired by Dapper paper 
● Provides insight into dependencies 
● Helps define & visualize scenarios
Dialing Up Failure 
Functional Validation 
● Isolated synthetic transactions 
○ Set of devices 
Validation at Scale 
● Dial up customer traffic - % based 
● Simulation of full service failure 
Chaos!
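
Percentage-based dial-up is commonly done by hashing a stable identifier into buckets, so that the same accounts stay in the experiment as the percentage grows. The sketch below shows that general technique; it is not taken from FIT, and `in_experiment`, the SHA-256 hash, and the 10,000-bucket granularity are all assumptions.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: 0.01% steps


def in_experiment(account_id: str, percent: float) -> bool:
    """Deterministically decide whether an account falls in the first
    `percent` percent of buckets (0 <= percent <= 100)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return bucket < percent * (BUCKETS / 100)
```

An account included at 10% remains included at 50%, which keeps the affected population stable while the dial is turned up toward a full service failure.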
Continuous Validation 
Critical Services 
Non-critical Services 
Synthetic Transactions
Don’t Fear The Monkeys
Take-aways 
• Don’t wait for random failures 
– Cause failure to validate resiliency 
– Remove uncertainty by forcing failures regularly 
– Better to fail at 2pm than 2am 
• Test design assumptions by stressing them 
Embrace Failure
The Simian Army is part of the Netflix open source cloud platform 
http://netflix.github.com
Netflix talks at re:Invent 
Talk Time Title 
BDT-403 Wednesday, 2:15pm Next Generation Big Data Platform at Netflix 
PFC-306 Wednesday, 3:30pm Performance Tuning EC2 
DEV-309 Wednesday, 3:30pm From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services 
ARC-317 Wednesday, 4:30pm Maintaining a Resilient Front-Door at Massive Scale 
PFC-304 Wednesday, 4:30pm Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures 
ENT-209 Wednesday, 4:30pm Cloud Migration, Dev-Ops and Distributed Systems 
APP-310 Friday 9:00am Scheduling using Apache Mesos in the Cloud
Please give us your feedback on this presentation 
Josh Evans 
jevans@netflix.com 
@josh_evans_nflx 
Naresh Gopalani 
ngopalani@netflix.com 
Join the conversation on Twitter with #reinvent 

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Embracing Failure - Fault Injection and Service Resilience at Netflix

  • 1. Embracing Failure Fault Injection and Service Resilience at Netflix Josh Evans – Director of Operations Engineering Naresh Gopalani – Software Engineer and Architect © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Netflix Ecosystem • ~50 million members, ~50 countries • > 1 billion hours per month • > 1000 device types • 3 AWS Regions, hundreds of services • Hundreds of thousands of requests/second • CDN serves petabytes of data at terabits/second Static Content Akamai Netflix CDN Service Partners AWS/Netflix Control Plane Internet
  • 3. Availability means that members can ● sign up ● activate a device ● browse ● watch
  • 4. What keeps us up at night
  • 5. Failures can happen any time • Disks fail • Power outages • Natural disasters • Software bugs • Human error
  • 6. We design for failure • Exception handling • Fault tolerance and isolation • Fall-backs and degraded experiences • Auto-scaling clusters • Redundancy
  • 7. Testing for failure is hard • Web-scale traffic • Massive, changing data sets • Complex interactions and request patterns • Asynchronous, concurrent requests • Complete and partial failure modes Constant innovation and change
  • 8. What if we regularly inject failures into our systems under controlled circumstances?
  • 9.
  • 10. Blast Radius • Unit of isolation • Scope of an outage • Scope a chaos exercise Instance Zone Region Global
  • 11. An Instance Fails Edge Cluster Cluster A Cluster B Cluster C Cluster D
  • 12. Chaos Monkey • Monkey loose in your DC • Run during business hours • What we learned – Auto-replacement works – State is problematic
  • 13. A State of Xen - Chaos Monkey & Cassandra Out of our 2700+ Cassandra nodes • 218 rebooted • 22 did not reboot successfully • Automation replaced failed nodes • 0 downtime due to reboot
  • 14. An Availability Zone Fails EU-West US-West US-East AZ1 AZ2
  • 15. Chaos Gorilla Simulate an availability zone outage • 3-zone configuration • Eliminate one zone • Ensure that others can handle the load and nothing breaks Chaos Gorilla
  • 16. Challenges • Rapidly shifting traffic – LBs must expire connections quickly – Lingering connections to caches must be addressed • Service configuration – Not all clusters auto-scaled or pinned – Services not configured for cross-zone calls – Mismatched timeouts – fallbacks prevented fail-over
  • 17. A Region Fails US-West US-East EU-West
  • 18. Regional Load Balancers Zuul – Traffic Shaping/Routing AZ1 AZ2 AZ3 Data Data Data Geo-located Chaos Kong Chaos Kong Regional Load Balancers Zuul – Traffic Shaping/Routing AZ1 AZ2 AZ3 Data Data Data Customer Device
  • 19. Challenges ● Rapidly shifting traffic ○ Auto-scaling configuration ○ Static configuration/pinning ○ Instance start time ○ Cache fill time
  • 20. Challenges ● Service Configuration ○ Timeout configurations ○ Fallbacks fail or don’t provide the desired experience ● No minimal (critical) stack ○ Any service may be critical!
  • 21. A Service Fails Service Zone Region Global
  • 22. Services Slow Down and Fail Simulate latent/failed service calls • Inject arbitrary latency and errors at the service level • Observe for effects Latency Monkey
  • 23. Latency Monkey Device ELB Zuul Edge Service B Service C Internet Service A
  • 24. Challenges • Startup resiliency is an issue • Service owners don’t know all dependencies • Fallbacks can fail too • Second-order effects not easily tested • Dependencies are in constant flux • Latency Monkey tests function and scale – Not a staged approach – Lots of opt-outs
  • 25. More Precise and Continuous
  • 27. Distributed Systems Fail ● Complex interactions at scale ● Variability across services ● Byzantine failures ● Combinatorial complexity
  • 28. Any service can cause cascading failures ELB
  • 29. Fault Injection Testing (FIT) Device Service B Service C Internet Zuul Edge Device or Account Override Service A Request-level simulations ELB
  • 30. Failure Injection Points IPC Cassandra Client Memcached Client Service Container Fault Tolerance
  • 31. FIT Details ● Common Simulation Syntax ● Single Simulation Interface ● Transported via Http Request header
  • 32. Integrating Failure request [sendRequestHeader] >>fit.failure: 1|fit.Serializer| 2|[[{"name":"failSocial", "whitelist":false, "injectionPoints": ["SocialService"]},{} ]], {"Id": "252c403b-7e34-4c0b-a28a-3606fcc38768"}]] Service A: Filter, Service, Ribbon (ServerRcv, ClientSend) Service B: Filter, Service, Ribbon (ServerRcv) response
  • 33. Failure Scenarios ● Set of injection points to fail ● Defined based on ○ Past outages ○ Specific dependency interactions ○ Whitelist of a set of critical services ○ Dynamic tracing of dependencies
  • 34. FIT Insights : Salp ● Distributed tracing inspired by Dapper paper ● Provides insight into dependencies ● Helps define & visualize scenarios
  • 35. Dialing Up Failure Functional Validation ● Isolated synthetic transactions ○ Set of devices Validation at Scale ● Dial up customer traffic - % based ● Simulation of full service failure Chaos!
  • 36. Continuous Validation Critical Services Non-critical Services Synthetic Transactions
  • 37. Don’t Fear The Monkeys
  • 38. Take-aways • Don’t wait for random failures – Cause failure to validate resiliency – Remove uncertainty by forcing failures regularly – Better to fail at 2pm than 2am • Test design assumptions by stressing them Embrace Failure
  • 39. The Simian Army is part of the Netflix open source cloud platform http://netflix.github.com
  • 40. Netflix talks at re:Invent (Talk, Time, Title)
      BDT-403, Wednesday, 2:15pm: Next Generation Big Data Platform at Netflix
      PFC-306, Wednesday, 3:30pm: Performance Tuning EC2
      DEV-309, Wednesday, 3:30pm: From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
      ARC-317, Wednesday, 4:30pm: Maintaining a Resilient Front-Door at Massive Scale
      PFC-304, Wednesday, 4:30pm: Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
      ENT-209, Wednesday, 4:30pm: Cloud Migration, Dev-Ops and Distributed Systems
      APP-310, Friday, 9:00am: Scheduling using Apache Mesos in the Cloud
  • 41. Please give us your feedback on this presentation Josh Evans jevans@netflix.com @josh_evans_nflx Naresh Gopalani ngopalani@netflix.com Join the conversation on Twitter with #reinvent
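
Slides 29 to 35 describe request-level fault injection: a failure scenario names a set of injection points, travels with the request as an HTTP header, and is dialed up as a percentage of traffic. The Python sketch below illustrates that idea only; it is not Netflix's FIT code. The header name `x-fit-failure`, the JSON payload layout, and the names `FailureScenario`, `tag_request`, and `call_dependency` are all assumptions made for this example.

```python
import json
import random
from dataclasses import dataclass

# Hypothetical header name; the slides show "fit.failure" with a different wire format.
FIT_HEADER = "x-fit-failure"


class InjectedFailure(Exception):
    """Raised when a simulated failure fires at an injection point."""


@dataclass(frozen=True)
class FailureScenario:
    """A named set of injection points that should fail for a tagged request."""
    name: str
    injection_points: frozenset

    def to_header(self) -> str:
        # Serialize to JSON so the scenario can ride along in a request header.
        return json.dumps({"name": self.name,
                           "injectionPoints": sorted(self.injection_points)})

    @classmethod
    def from_headers(cls, headers):
        # Return the scenario carried by the request, or None if it is untagged.
        raw = headers.get(FIT_HEADER)
        if raw is None:
            return None
        data = json.loads(raw)
        return cls(data["name"], frozenset(data["injectionPoints"]))


def tag_request(headers, scenario, percent, rng=random.random):
    """Edge step: attach the scenario to roughly `percent` of requests."""
    tagged = dict(headers)
    if rng() * 100.0 < percent:
        tagged[FIT_HEADER] = scenario.to_header()
    return tagged


def call_dependency(point, headers, real_call, fallback):
    """Service step: fail at `point` if the request's scenario lists it.

    Only the injected failure is routed to the fallback here; a real client
    would normally send genuine errors and timeouts down the same path.
    """
    scenario = FailureScenario.from_headers(headers)
    try:
        if scenario is not None and point in scenario.injection_points:
            raise InjectedFailure(f"{scenario.name}: simulated failure at {point}")
        return real_call()
    except InjectedFailure:
        return fallback()


# Usage: fail SocialService for every request (percent=100).
scenario = FailureScenario("failSocial", frozenset({"SocialService"}))
headers = tag_request({}, scenario, percent=100.0)
print(call_dependency("SocialService", headers,
                      real_call=lambda: "friends row",
                      fallback=lambda: "empty social row"))  # prints "empty social row"
```

Keeping the scenario in the request rather than in service configuration means that only tagged requests see the failure, which is what allows a test to start with a few synthetic devices and then widen the percentage.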