@atseitlin
Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
About Netflix
Netflix is the world’s leading Internet television network with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series[1]
[1] http://ir.netflix.com/
@atseitlin
A complex distributed system
@atseitlin
How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Browse
Play
Watch
@atseitlin
Highly Available Architecture
Micro-services, redundancy, resiliency
@atseitlin
Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group chooser
Each icon is three to a few hundred instances across three AWS zones
@atseitlin
Component Micro-Services
Test With Chaos Monkey, Latency Monkey
@atseitlin
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache Replicas, Zone A
Cassandra and Evcache Replicas, Zone B
Cassandra and Evcache Replicas, Zone C
Load Balancers
@atseitlin
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and Evcache Replicas, Zone A
Cassandra and Evcache Replicas, Zone B
Cassandra and Evcache Replicas, Zone C
Load Balancers
@atseitlin
Isolated Regions
Will someday test with Chaos Kong
Cassandra Replicas, Zone A
Cassandra Replicas, Zone B
Cassandra Replicas, Zone C
US-East Load Balancers
Cassandra Replicas, Zone A
Cassandra Replicas, Zone B
Cassandra Replicas, Zone C
EU-West Load Balancers
@atseitlin
Failure Modes and Effects
Failure Mode          Probability   Current Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive
Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
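The "continue to run on 2 out of 3 zones" mitigation implies capacity headroom in every zone. A minimal sketch of that arithmetic follows. It assumes traffic is balanced evenly across zones and redistributes evenly onto the survivors; those assumptions and the function names are illustrative and not taken from the slides.

```python
def max_safe_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest steady-state utilization per zone that still leaves room
    to absorb the traffic of `zones_lost` failed zones.

    Assumes even balancing before and after the failure.
    """
    if zones_lost < 0 or zones_lost >= zones:
        raise ValueError("need at least one surviving zone")
    return (zones - zones_lost) / zones


def load_multiplier(zones: int, zones_lost: int = 1) -> float:
    """Factor by which load on each surviving zone grows after the failure."""
    if zones_lost < 0 or zones_lost >= zones:
        raise ValueError("need at least one surviving zone")
    return zones / (zones - zones_lost)


if __name__ == "__main__":
    # Three zones, one lost: each survivor takes 1.5x its normal load,
    # so each zone should normally run at no more than about 67% of capacity.
    print(load_multiplier(3), round(max_safe_utilization(3), 2))
```

Under these assumptions, a zone that normally runs above two thirds of its capacity cannot absorb the loss of a sibling zone without scaling up first.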
@atseitlin
Application Resilience
Run what you wrote
Rapid detection
Rapid Response
Fail often
@atseitlin
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
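One way to picture "keep timeouts short, fail fast" is a caller that waits only briefly for a dependency and then returns a fallback instead of letting the wait stack up. The sketch below is an illustration only; `call_with_timeout` and its parameters are hypothetical names, not part of any Netflix library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool for dependency calls (size chosen arbitrarily here).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread and wait at most timeout_s seconds.

    On a timeout or any error from fn, return `fallback` immediately.
    Note: the worker thread is not cancelled; the caller just stops waiting.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        return fallback


if __name__ == "__main__":
    def slow_dependency():
        time.sleep(0.3)
        return "fresh data"

    # The caller gives up after 50 ms and serves the fallback.
    print(call_with_timeout(slow_dependency, 0.05, "cached data"))
```

Because each caller bounds its own wait, a slow service several hops away cannot hold every upstream thread for the sum of all downstream timeouts.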
@atseitlin
Rapid Detection
• If your pilot had no instrument panel, would you ever board the plane?
– Never run your service blind
• Monitor services, not instances
– Make instance failure a non-event
• Don’t pay people to watch screens
– Instead pay them to build alerting
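"Monitor services, not instances" can be sketched as alerting on an aggregate across the whole cluster rather than on any single machine. The data shape, threshold, and function names below are assumptions made for illustration.

```python
def service_error_rate(instance_stats):
    """Aggregate error rate across all instances of one service.

    instance_stats maps an instance id to a (requests, errors) tuple.
    """
    total_requests = sum(requests for requests, _ in instance_stats.values())
    total_errors = sum(errors for _, errors in instance_stats.values())
    return total_errors / total_requests if total_requests else 0.0


def should_alert(instance_stats, threshold=0.05):
    """Alert only when the service as a whole exceeds the error threshold."""
    return service_error_rate(instance_stats) > threshold


if __name__ == "__main__":
    stats = {
        "i-a": (1000, 2),
        "i-b": (1000, 3),
        "i-c": (10, 10),  # one instance is failing every request
    }
    # The single bad instance barely moves the service-level rate,
    # so nobody is paged for it.
    print(should_alert(stats))
```

With this shape, a single failing instance is a non-event for the on-call engineer unless it drags the service-level number over the threshold.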
@atseitlin
Edda
AWS
Instances, ASGs, etc.
Eureka Services
metadata
AppDynamics
Request flow
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
@atseitlin
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
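The two queries above share one URL shape: a collection path, an optional resource id, and `;key=value` or bare `;flag` modifiers. The helper below is a small sketch that only assembles strings in that shape. It does not contact a server, and the name `edda_url` and the `http://edda` base are placeholders.

```python
def edda_url(base, collection, resource_id=None, **matrix):
    """Build an Edda-style URL with matrix arguments.

    Keyword arguments become ';key=value'; a value of True becomes a
    bare ';key' flag. Argument order is preserved.
    """
    url = f"{base}/api/v2/{collection}"
    if resource_id is not None:
        url += f"/{resource_id}"
    for key, value in matrix.items():
        url += f";{key}" if value is True else f";{key}={value}"
    return url


if __name__ == "__main__":
    print(edda_url("http://edda", "view/instances",
                   publicIpAddress="1.2.3.4", _since=0))
    print(edda_url("http://edda", "aws/securityGroups", "sg-0123456789",
                   _diff=True, _all=True, _limit=2))
```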
@atseitlin
Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
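The push-and-rollback flow above can be modeled as a traffic pointer that moves between Auto Scaling Groups while the old group stays in place. This is a toy model with hypothetical names, not Asgard's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Toy model: one ASG takes traffic, older ASGs stay up without traffic."""
    live: str
    standby: list = field(default_factory=list)

    def push(self, new_asg):
        """Bring up a new ASG and switch traffic to it; keep the old one."""
        self.standby.append(self.live)
        self.live = new_asg

    def rollback(self):
        """Switch traffic back to the previous ASG; return the failed one."""
        if not self.standby:
            raise RuntimeError("no previous ASG to roll back to")
        failed = self.live
        self.live = self.standby.pop()
        return failed

    def cleanup(self):
        """Delete old ASGs once the new one has proven itself."""
        removed, self.standby = self.standby, []
        return removed


if __name__ == "__main__":
    d = Deployment(live="api-v041")
    d.push("api-v042")
    print(d.live, d.standby)
    d.rollback()  # "whoops": traffic returns to the old group
    print(d.live)
```

Rollback is fast in this model because nothing has to be rebuilt: the previous group is still running and only the traffic pointer changes.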
@atseitlin
Asgard
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
@atseitlin
Our goal is availability
• Members can stream Netflix whenever they want
• New users can explore and sign up for the service
• New members can activate their service and add new devices
@atseitlin
Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes
Failure is unavoidable
@atseitlin
We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure
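The fall-back bullet can be illustrated with a small circuit breaker: after repeated failures it stops calling the dependency for a while and serves a degraded response instead. This sketch shows the general pattern only. It is not the Hystrix API, and the class and parameter names are made up for the example.

```python
import time


class CircuitBreaker:
    """Serve a fallback when a dependency keeps failing."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: skip the dependency entirely
            self.opened_at = None   # cool-down over: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # success closes the circuit
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2)

    def personalized_rows():
        raise RuntimeError("personalization service is down")

    for _ in range(3):
        # Users still get a page, just a less personalized one.
        print(breaker.call(personalized_rows, lambda: "popular titles"))
```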
Is that enough?
@atseitlin
It’s not enough
• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?
The typical answer is…
@atseitlin
More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all failure modes
Can we effectively simulate a large-scale distributed system?
@atseitlin
Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features
Prohibitively expensive, if not impossible, for most large-scale systems
@atseitlin
What if we could reduce variability of failures?
@atseitlin
There is another way
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its uncertainty by forcing it periodically
@atseitlin
And that’s exactly what we did
@atseitlin
Instances fail
@atseitlin
Chaos Monkey taught us…
• State is bad
• Clusters are good
• Surviving single instance failure is not enough
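The core idea of forcing instance failure can be sketched as choosing, on a schedule, one random instance per cluster to terminate. The function below only selects victims; the actual termination call, the scheduling, and all names here are assumptions for illustration and do not describe the real Chaos Monkey code.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Pick at most one instance per group to terminate.

    groups maps a group name to a list of instance ids.
    probability is the chance that a given group is hit on this run.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in groups.items():
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    groups = {"api": ["i-1", "i-2", "i-3"], "web": ["i-4", "i-5"]}
    for group, instance in pick_victims(groups).items():
        # A real tool would call the cloud provider's terminate API here.
        print(f"would terminate {instance} in {group}")
```

Running a selection like this regularly during business hours turns a rare, surprising event into a routine one that every service has to tolerate.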
@atseitlin
Lots of instances fail
@atseitlin
Chaos Gorilla
@atseitlin
Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
@atseitlin
What about larger catastrophes?
Anyone remember Sandy?
@atseitlin
Chaos Kong (*some day soon*)
@atseitlin
The Sick and Wounded
@atseitlin
Latency Monkey
@atseitlin
Resilient Design – Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
@atseitlin
Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fallbacks can fail too
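Latency injection can be approximated in a test harness by wrapping a dependency call so that it responds slowly instead of failing outright. The decorator below is a hedged sketch with hypothetical names; it says nothing about how Latency Monkey itself is built.

```python
import functools
import time


def with_injected_latency(delay_s, sleep=time.sleep):
    """Decorator that delays every call by delay_s seconds.

    `sleep` is injectable so tests can record the delay instead of waiting.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            sleep(delay_s)  # simulate a sick but not dead dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator


if __name__ == "__main__":
    @with_injected_latency(0.2)
    def lookup_ratings(member_id):
        return {"member": member_id, "ratings": []}

    start = time.monotonic()
    lookup_ratings(42)
    print(f"call took {time.monotonic() - start:.2f}s")
```

Wrapping a dependency this way during startup as well as steady state is one way to check that both the primary path and its fallback behave when the neighbor is slow.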
@atseitlin
Entropy
@atseitlin
Clutter accumulates
• Complexity
• Cruft
• Vulnerabilities
• Cost
@atseitlin
Janitor Monkey
@atseitlin
Janitor Monkey taught us…
• Label everything
• Clutter builds up
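"Label everything" matters because cleanup has to decide what is safe to remove. A minimal sketch of such a rule: flag resources that have no owner label or have sat idle too long. The record fields, the 30-day limit, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def find_cleanup_candidates(resources, now, max_idle=timedelta(days=30)):
    """Return ids of resources that are unlabeled or idle for too long.

    Each resource is a dict with 'id', 'owner', and 'last_used' keys
    (a hypothetical record shape).
    """
    candidates = []
    for resource in resources:
        unlabeled = not resource.get("owner")
        idle = now - resource["last_used"] > max_idle
        if unlabeled or idle:
            candidates.append(resource["id"])
    return candidates


if __name__ == "__main__":
    now = datetime(2013, 7, 1)
    resources = [
        {"id": "vol-1", "owner": "team-a", "last_used": datetime(2013, 6, 30)},
        {"id": "vol-2", "owner": "", "last_used": datetime(2013, 6, 30)},
        {"id": "vol-3", "owner": "team-b", "last_used": datetime(2013, 1, 1)},
    ]
    print(find_cleanup_candidates(resources, now))
```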
@atseitlin
Ranks of the Simian Army
• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey
@atseitlin
Observability is key
• Don’t exacerbate real customer issues with failure exercises
• Deep system visibility is key to root-cause failures and understand the system
@atseitlin
Organizational elements
• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture
Goal is to create a learning organization
@atseitlin
Assembling the Puzzle
@atseitlin
Netflix Highly Available Platform
now open
@NetflixOSS
@atseitlin
Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
Priam: Cassandra as a Service
Astyanax: Cassandra client for Java
CassJMeter: Cassandra test suite
Cassandra: Multi-region EC2 datastore support
Aegisthus: Hadoop ETL for Cassandra
Ice: Spend analytics
Governator: Library lifecycle and dependency injection
Odin: Cloud orchestration
Blitz4j: Async logging
Exhibitor: Zookeeper as a Service
Curator: Zookeeper Patterns
EVCache: Memcached as a Service
Eureka / Discovery: Service Directory
Archaius: Dynamic Properties Service
Edda: Config state with history
Denominator
Ribbon: REST Client + mid-tier LB
Karyon: Instrumented REST Base Server
Servo and Autoscaling Scripts
Genie: Hadoop PaaS
Hystrix: Robust service pattern
RxJava: Reactive Patterns
Asgard: AutoScaleGroup based AWS console
Chaos Monkey: Robustness verification
Latency Monkey
Janitor Monkey
Bakeries / Aminator
@atseitlin
How does it all fit together?
@atseitlin
Our Current Catalog of Releases
Free code available at http://netflix.github.com
@atseitlin
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Edge Services
• Many, many more
jobs.netflix.com
@atseitlin
Takeaways
Create fine-grained micro-services. Don’t trust your dependencies.
Regularly inducing failure in your production environment validates resiliency and increases availability
Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS)
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin @NetflixOSS
@atseitlin
Thank you!
Any questions?
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin

 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Resiliency through Failure @ OSCON 2013

  • 1. Resiliency through failure: Netflix's Approach to Extreme Availability in the Cloud. Ariel Tseitlin, http://www.linkedin.com/in/atseitlin, @atseitlin
  • 2. About Netflix: Netflix is the world’s leading Internet television network with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]. [1] http://ir.netflix.com/
  • 4. How Netflix Streaming Works (diagram labels): Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch
  • 6. Web Server Dependencies Flow (home page business transaction as seen by AppDynamics). Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Personalization movie group chooser. Each icon is three to a few hundred instances across three AWS zones.
  • 7. Component Micro-Services: Test with Chaos Monkey, Latency Monkey
  • 8. Three Balanced Availability Zones: Test with Chaos Gorilla. Diagram labels: Load Balancers; Cassandra and Evcache Replicas in each of Zone A, Zone B, and Zone C
  • 9. Triple Replicated Persistence: Cassandra maintenance affects individual replicas. Diagram labels: Load Balancers; Cassandra and Evcache Replicas in each of Zone A, Zone B, and Zone C
  • 10. Isolated Regions: Will someday test with Chaos Kong. Diagram labels: US-East Load Balancers with Cassandra Replicas in Zone A, Zone B, and Zone C; EU-West Load Balancers with Cassandra Replicas in Zone A, Zone B, and Zone C
  • 11. Failure Modes and Effects
      Failure Mode        | Probability | Current Mitigation Plan
      Application Failure | High        | Automatic degraded response
      AWS Region Failure  | Low         | Wait for region to recover
      AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
      Datacenter Failure  | Medium      | Migrate more functions to cloud
      Data store failure  | Low         | Restore from S3 backups
      S3 failure          | Low         | Restore from remote archive
      Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  • 12. Application Resilience: Run what you wrote; Rapid detection; Rapid response; Fail often
  • 13. Run What You Wrote
      • Make developers responsible for failures
          – Then they learn and write code that doesn’t fail
      • Use Incident Reviews to find gaps to fix
          – Make sure it’s not about finding “who to blame”
      • Keep timeouts short, fail fast
          – Don’t let cascading timeouts stack up
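The "keep timeouts short, fail fast" advice on slide 13 can be illustrated with a small sketch. This is not Netflix code: `call_with_timeout`, `DependencyTimeout`, and the two sample dependencies are hypothetical names chosen for the example, and a thread pool stands in for whatever RPC client a real service would use.

```python
import concurrent.futures
import time


class DependencyTimeout(Exception):
    """Raised when a dependency does not answer within its time budget."""


# Hypothetical shared pool; a real service would size this per dependency.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) and give up after timeout_s seconds instead of waiting."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a thread that is already running cannot be interrupted.
        future.cancel()
        raise DependencyTimeout(f"{fn.__name__} exceeded {timeout_s}s") from None


def fast_dependency():
    # Stand-in for a healthy downstream call.
    return "ok"


def slow_dependency():
    # Stand-in for a downstream call that has become slow.
    time.sleep(0.5)
    return "late"
```

The caller gets a quick, explicit error from `call_with_timeout(slow_dependency, 0.05)` rather than a request that hangs, which is what keeps timeouts from stacking up across layers.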
  • 14. Rapid Detection
      • If your pilot had no instrument panel, would you ever fly on the plane?
          – Never run your service blind
      • Monitor services, not instances
          – Make instance failure a non-event
      • Don’t pay people to watch screens
          – Instead pay them to build alerting
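One way to read "monitor services, not instances" from slide 14 is to alert on an aggregate across all instances, so that a single bad instance does not page anyone. The sketch below assumes a hypothetical input shape (instance id mapped to a `(requests, errors)` pair) and an arbitrary 5% threshold; neither comes from the talk.

```python
def service_error_rate(instance_stats):
    """Return errors / requests summed over every instance of a service.

    instance_stats maps an instance id to a (requests, errors) tuple.
    """
    total_requests = sum(requests for requests, _ in instance_stats.values())
    total_errors = sum(errors for _, errors in instance_stats.values())
    return total_errors / total_requests if total_requests else 0.0


def should_alert(instance_stats, threshold=0.05):
    """Alert on the service-wide error rate, not on any single instance."""
    return service_error_rate(instance_stats) > threshold
```

With this shape, one instance failing every request while serving little traffic leaves the service-level rate low, which matches the slide's goal of making instance failure a non-event.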
  • 15. Edda – Configuration History. Diagram labels: Edda (AWS instances, ASGs, etc.); Eureka (services metadata); AppDynamics (request flow). http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
  • 16. Edda Query Examples
      Find any instances that have ever had a specific public IP address:
        $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
        ["i-0123456789","i-012345678a","i-012345678b"]
      Show the most recent change to a security group:
        $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
        --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
        +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
        @@ -1,33 +1,33 @@
         {
         …
           "ipRanges" : [
             "10.10.1.1/32",
             "10.10.1.2/32",
        +    "10.10.1.3/32",
        -    "10.10.1.4/32"
         …
         }
  • 17. Rapid Rollback
      • Use a new Autoscale Group to push code
      • Leave existing ASG in place, switch traffic
      • If OK, auto-delete old ASG a few hours later
      • If “whoops”, switch traffic back in seconds
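The rollback flow on slide 17 can be modeled in a few lines. This is an in-memory sketch only: `Asg`, `Service`, and their methods are hypothetical names, and no AWS API is called. The point it shows is that rollback is cheap because the old ASG is kept until the new one has proven itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asg:
    name: str
    version: str


@dataclass
class Service:
    live: Asg                       # ASG currently receiving traffic
    previous: Optional[Asg] = None  # old ASG kept around as the rollback target

    def push(self, new_asg: Asg) -> None:
        """Switch traffic to a new ASG and keep the old one in place."""
        self.previous, self.live = self.live, new_asg

    def rollback(self) -> None:
        """Switch traffic back to the ASG that was live before the push."""
        if self.previous is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.live, self.previous = self.previous, self.live

    def retire_previous(self) -> None:
        """Drop the old ASG once the new one has run cleanly for a while."""
        self.previous = None
```

A push followed by `rollback()` only swaps two references, which mirrors why switching traffic back takes seconds while building a fresh deployment would not.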
  • 21. Our goal is availability
      • Members can stream Netflix whenever they want
      • New users can explore and sign up for the service
      • New members can activate their service and add new devices
  • 22. Failure is all around us
      • Disks fail
      • Power goes out. And your generator fails.
      • Software bugs are introduced
      • People make mistakes
      Failure is unavoidable
  • 23. We design around failure
      • Exception handling
      • Clusters
      • Redundancy
      • Fault tolerance
      • Fall-back or degraded experience (Hystrix)
      • All to insulate our users from failure
      Is that enough?
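The "fall-back or degraded experience" item on slide 23 can be sketched as a very small circuit breaker. This is an illustration of the general pattern, not Hystrix's API: the `CircuitBreaker` class and its `failure_threshold` parameter are assumptions made for the example, and recovery (closing the circuit again after a cool-down) is deliberately left out.

```python
class CircuitBreaker:
    """Serve a fallback when the primary call fails or has failed repeatedly."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        # Open circuit: stop calling the primary and go straight to the fallback.
        return self.consecutive_failures >= self.failure_threshold

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        # A success resets the failure count.
        self.consecutive_failures = 0
        return result
```

A caller might pass a personalization lookup as `primary` and a generic, cached list as `fallback`, so users see a degraded page instead of an error.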
  • 24. It’s not enough
      • How do we know if we’ve succeeded?
      • Does the system work as designed?
      • Is it as resilient as we believe?
      • How do we prevent drifting into failure?
      The typical answer is…
  • 25. More testing!
      • Unit testing
      • Integration testing
      • Stress testing
      • Exhaustive test suites to simulate and test all failure modes
      Can we effectively simulate a large-scale distributed system?
  • 26. Building distributed systems is hard. Testing them exhaustively is even harder.
      • Massive data sets and changing shape
      • Internet-scale traffic
      • Complex interaction and information flow
      • Asynchronous nature
      • 3rd party services
      • All while innovating and building features
      Prohibitively expensive, if not impossible, for most large-scale systems
  • 27. What if we could reduce variability of failures?
  • 28. There is another way
      • Cause failure to validate resiliency
      • Test design assumptions by stressing them
      • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
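The idea on slide 28 of forcing failure periodically can be sketched as a selection step: on each run, every group of instances has some chance of losing one member. This is a simplified illustration, not the actual Chaos Monkey implementation; `pick_victims`, its `probability` parameter, and the group layout are assumptions for the example, and actually terminating the chosen instances is left to the caller.

```python
import random


def pick_victims(groups, probability, rng):
    """Choose at most one instance per group to terminate on this run.

    groups maps a group name to a list of instance ids.
    probability is the chance that a given group is hit on this run.
    rng is a random.Random, passed in so runs can be reproduced.
    """
    victims = {}
    for group, instances in groups.items():
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims
```

Running this on a schedule during working hours turns instance loss from a rare surprise into a routine event that every service has to survive.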
  • 32. Chaos Monkey taught us…
      • State is bad
      • Clusters are good
      • Surviving single instance failure is not enough
  • 35. Chaos Gorilla taught us…
      • Hidden assumptions on deployment topology
      • Infrastructure control plane can be a bottleneck
      • Large scale events are hard to simulate
      • Rapidly shifting traffic is error prone
      • Smooth recovery is a challenge
      • Cassandra works as expected
  • 36. What about larger catastrophes? Anyone remember Sandy?
  • 41. Resilient Design – Hystrix, RxJava http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  • 42. Latency Monkey taught us…
      • Startup resiliency is often missed
      • An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
      • Know thy neighbor (unknown dependencies)
      • Fallbacks can fail too
  • 44. Clutter accumulates
      • Complexity
      • Cruft
      • Vulnerabilities
      • Cost
  • 46. Janitor Monkey taught us…
      • Label everything
      • Clutter builds up
  • 47. Ranks of the Simian Army
      • Chaos Monkey
      • Chaos Gorilla
      • Latency Monkey
      • Janitor Monkey
      • Conformity Monkey
      • Circus Monkey
      • Doctor Monkey
      • Howler Monkey
      • Security Monkey
      • Chaos Kong
      • Efficiency Monkey
  • 48. Observability is key
      • Don’t exacerbate real customer issues with failure exercises
      • Deep system visibility is key to root-causing failures and understanding the system
  • 49. Organizational elements
      • Every engineer is an operator of the service
      • Each failure is an opportunity to learn
      • Blameless culture
      Goal is to create a learning organization
  • 51. Netflix Highly Available Platform now open @NetflixOSS
  • 52. Open Source Projects
      Legend: Github / Techblog; Apache Contributions; Techblog Post Coming Soon
      • Priam: Cassandra as a Service
      • Astyanax: Cassandra client for Java
      • CassJMeter: Cassandra test suite
      • Cassandra: Multi-region EC2 datastore support
      • Aegisthus: Hadoop ETL for Cassandra
      • Ice: Spend analytics
      • Governator: Library lifecycle and dependency injection
      • Odin: Cloud orchestration
      • Blitz4j: Async logging
      • Exhibitor: Zookeeper as a Service
      • Curator: Zookeeper Patterns
      • EVCache: Memcached as a Service
      • Eureka / Discovery: Service Directory
      • Archaius: Dynamic Properties Service
      • Edda: Config state with history
      • Denominator
      • Ribbon: REST Client + mid-tier LB
      • Karyon: Instrumented REST Base Server
      • Servo and Autoscaling Scripts
      • Genie: Hadoop PaaS
      • Hystrix: Robust service pattern
      • RxJava: Reactive Patterns
      • Asgard: AutoScaleGroup based AWS console
      • Chaos Monkey: Robustness verification
      • Latency Monkey
      • Janitor Monkey
      • Bakeries / Aminator
  • 53. How does it all fit together?
  • 55. Our Current Catalog of Releases. Free code available at http://netflix.github.com
  • 56. We’re hiring!
      • Simian Army
      • Cloud Tools
      • NetflixOSS
      • Cloud Operations
      • Reliability Engineering
      • Edge Services
      • Many, many more
      jobs.netflix.com
  • 57. Takeaways
      Create fine-grained micro-services. Don’t trust your dependencies.
      Regularly inducing failure in your production environment validates resiliency and increases availability.
      Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS).
      http://netflix.github.com
      http://techblog.netflix.com
      http://slideshare.net/Netflix
      http://www.linkedin.com/in/atseitlin
      @atseitlin @NetflixOSS
  • 58. Thank you! Any questions? Ariel Tseitlin, http://www.linkedin.com/in/atseitlin, @atseitlin

Editor's Notes

  1. The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.