How Netflix Leverages Multiple Regions to Increase Availability
Ruslan Meshenberg
10 March 2014
Netflix
Who am I?
• Director, Platform Engineering
• Led Multi-Regional Resiliency
• Leading NetflixOSS Program
• @rusmeshenberg
Failure
Assumptions
(Quadrant of speed of change vs. scale:)
• Telcos: slowly changing, large scale
• Web scale: rapid change, large scale
• Enterprise IT: slowly changing, small scale
• Startups: rapid change, small scale
Hardware will fail; software will fail. At small scale and slow change everything works; at web scale and rapid change everything is broken.
Incidents – Impact and Mitigation
Impact levels:
• PR: X incidents (public relations / media impact)
• CS: XX incidents (high customer service call volume)
• Metrics impact, feature disable: XXX incidents (affects A/B test results)
• No impact, fast retry or automated failover: XXXX incidents
Mitigation:
• Y incidents mitigated by Active-Active and game day practicing
• YY incidents mitigated by better tools and practices
• YYY incidents mitigated by better data tagging
Does an Instance Fail?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
Does a Zone Fail?
• Rarely, but it has happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
Does a Region Fail?
• Full region: unlikely, very rare
• Individual services can fail region-wide
• Most likely cause: a region-wide configuration issue
• Test with Chaos Kong
Everything Fails… Eventually
• Keep your services running by embracing isolation and redundancy
• Construct a highly agile and highly available service from ephemeral and assumed-broken components
Isolation
• Changes in one region should not affect others
• A regional outage should not affect others
• Network partitioning between regions should not affect functionality / operations
Redundancy
• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability Zones and regions
History: X-mas Eve 2012
• Netflix multi-hour outage
• US-East-1 regional Elastic Load Balancing issue
• "...data was deleted by a maintenance process that was inadvertently run against the production ELB state data"
Isthmus – Normal Operation
(Diagram: an ELB and Zuul layer in US-West-2 connected by a tunnel to the ELB and backend infrastructure in US-East, where Cassandra replicas run in Zones A, B, and C.)
Isthmus – Failover
(Diagram: the same topology during failover, with user traffic entering through the US-West-2 ELB and Zuul and reaching the US-East infrastructure over the tunnel.)
Zuul Overview
(Diagram: traffic flows through Elastic Load Balancing endpoints into the Zuul routing layer.)
Zuul
Denominator – Abstracting the DNS Layer
(Diagram: Denominator drives DNS changes across UltraDNS, DynECT DNS, and Amazon Route 53, which direct users to ELBs / regional load balancers in front of each region's Zone A, B, and C Cassandra replicas.)
Isthmus – Only for ELB Failures
• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one
Active-Active – Full Regional Resiliency
(Diagram: regional load balancers in each of two regions, each fronting Cassandra replicas in Zones A, B, and C; users are served by both regions.)
Active-Active – Failover
(Diagram: the same topology with all user traffic routed to the surviving region's load balancers.)
Can't we just deploy in 2 regions?
Active-Active Architecture
2 main challenges
• Routing the users to the services
• Replicating the data
Separating the Data – Eventual Consistency
• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency
Benchmarking Global Cassandra
Write-intensive test of cross-region replication capacity:
• 16 x hi1.4xlarge SSD nodes per zone = 96 total
• 192 TB of SSD in six locations, up and running Cassandra in 20 minutes
• Test load in US-West-2 (Oregon) and US-East-1 (Virginia), each with Cassandra replicas in Zones A, B and C; a validation load checks the cross-region replication
• 1 million writes at CL.ONE (wait for one replica to ack)
• 1 million reads after 500 ms at CL.ONE, with no data loss
• Inter-region traffic: up to 9 Gbit/s at 83 ms
• 18 TB of backups from S3
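The write half of this benchmark comes down to issuing writes at consistency level ONE and letting Cassandra replicate across regions asynchronously. Below is a minimal sketch of such a write using the DataStax Java driver; the deck does not say which client Netflix used, and the contact point, keyspace, and table here are hypothetical.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CrossRegionWrite {
    public static void main(String[] args) {
        // Hypothetical contact point; in a multi-region cluster each region has local coordinators.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("test_ks")) {
            // CL.ONE: acknowledge as soon as one replica has the write;
            // replication to the other region happens asynchronously.
            SimpleStatement write = new SimpleStatement(
                "INSERT INTO kv (key, value) VALUES (?, ?)", "user:42", "payload");
            write.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(write);
        }
    }
}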
Propagating EVCache Invalidations
(Cross-region cache invalidation flow between US-EAST-1 and US-WEST-2:)
1. App server sets data in EVCache via the EVCache client
2. Write events go to SQS and to the EVCACHE_REGION_REPLICATION metadata cache
3. The replication service's Drainer reads from SQS in batches
4. The Drainer calls the remote Writer with key, write time, TTL and value, after checking that this is the latest event for the key in the current batch; the call goes cross-region through an ELB over HTTPS
5. The Writer checks the write time against replication metadata to ensure this is the latest operation for the key
6. The Writer deletes the value for the key in the local EVCache
7. Keys that were processed successfully are returned
8. Successful keys are deleted from SQS
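Steps 5 and 6 amount to a last-write-wins check: the Writer applies an invalidation only if its write time is newer than anything already replicated for that key. The sketch below is a hypothetical illustration of that check, not the actual EVCache replication service; the MetadataStore and CacheClient interfaces are assumptions made for the example.

import java.util.Optional;

/** Hypothetical sketch of the replication Writer's last-write-wins check (steps 5-6). */
public class ReplicationWriter {

    /** Assumed interface: stores the last replicated write time per key. */
    interface MetadataStore {
        Optional<Long> lastWriteTime(String key);
        void recordWriteTime(String key, long writeTime);
    }

    /** Assumed interface: the local region's EVCache-like client. */
    interface CacheClient {
        void delete(String key);
    }

    private final MetadataStore metadata;
    private final CacheClient localCache;

    ReplicationWriter(MetadataStore metadata, CacheClient localCache) {
        this.metadata = metadata;
        this.localCache = localCache;
    }

    /** Apply a cross-region invalidation only if it is the latest operation for the key. */
    boolean applyInvalidation(String key, long writeTime) {
        long lastSeen = metadata.lastWriteTime(key).orElse(Long.MIN_VALUE);
        if (writeTime <= lastSeen) {
            return false; // stale event: a newer write already replicated, skip it
        }
        localCache.delete(key);                   // step 6: delete the value for the key
        metadata.recordWriteTime(key, writeTime); // remember the newest write time
        return true;
    }
}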
Announcing Dinomyte
• Cross-AZ & region replication to existing key-value stores
  – Memcached
  – Redis
• Thin Dynamo implementation for replication
• Keep native protocol – no code refactoring
• OSS soon!
Archaius – Region-isolated Configuration
Running Isthmus and Active-Active
Multiregional Monkeys
• Detect failure to deploy
• Differences in configuration
• Resource differences
Operating in N Regions
• Best practices: avoiding peak times for deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery
Application deployment
6 deployments
Automating deployment
Monitoring and Alerting
• Per-region metrics
• Global aggregation
• Anomaly detection
Failover and Fallback
• Directional DNS changes (or Route 53)
• For fallback, ensure data consistency
• Some challenges
  – Cold cache
  – Autoscaling
• (Iterate, Automate)+
Validating the Whole Thing Works
Does Isolation Work?
Does Failover Work?
Steady State
Wii.FM?
We're here to help you get to global scale…
Apache-licensed Cloud Native OSS Platform
http://netflix.github.com
Technical Indigestion – what do all these do?
Asgard - Manage Red/Black Deployments
Ice – Slice and dice detailed costs and usage
Automatically Baking AMIs with Aminator
• AutoScaleGroup instances should be identical
• Base plus code/config
• Immutable instances
• Works for 1 or 1000…
• Aminator Launch
  – Use Asgard to start AMI or
  – CloudFormation Recipe
Discovering your Services - Eureka
• Map applications by name to
  – AMI, instances, Zones
  – IP addresses, URLs, ports
  – Keep track of healthy, unhealthy and initializing instances
• Eureka Launch
  – Use Asgard to launch AMI or use CloudFormation Template
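Because applications register by name, clients resolve instances through the Eureka client instead of hard-coded hosts. A minimal sketch against the classic Eureka 1.x client API follows; the service name is hypothetical and the snippet assumes eureka-client.properties already points at your Eureka servers.

import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryClient;
import com.netflix.discovery.DiscoveryManager;

public class EurekaLookup {
    public static void main(String[] args) {
        // Initialize the Eureka client from eureka-client.properties (registration + discovery).
        DiscoveryManager.getInstance().initComponent(
            new MyDataCenterInstanceConfig(), new DefaultEurekaClientConfig());

        // Resolve a healthy instance of a hypothetical service registered under this VIP address.
        DiscoveryClient client = DiscoveryManager.getInstance().getDiscoveryClient();
        InstanceInfo instance = client.getNextServerFromEureka("movie-service", false);
        System.out.println("Call " + instance.getHostName() + ":" + instance.getPort());
    }
}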
Deploying Eureka Service – 1 per Zone
Edda – Searchable state history for a Region / Account
• Feeds: AWS (instances, ASGs, etc.), Eureka (services metadata), your own custom state (monkeys)
• Timestamped delta cache of JSON describe call results for anything of interest…
• Edda Launch: use Asgard to launch AMI or use CloudFormation Template
Edda Query Examples

Find any instances that have ever had a specific public IP address:

$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]

Show the most recent change to a security group:

$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
  …
  "ipRanges" : [
    "10.10.1.1/32",
    "10.10.1.2/32",
+   "10.10.1.3/32",
-   "10.10.1.4/32"
  …
}
Archaius – Property Console
Archaius library – configuration management
Based on Pytheas. Not open sourced yet.
SimpleDB or DynamoDB for NetflixOSS. Netflix uses Cassandra for multi-region…
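Archaius surfaces configuration as dynamic properties that are re-read at runtime, which is what makes region-isolated configuration practical without redeploying. A minimal sketch using the Archaius 1.x API follows; the property name and default value are hypothetical.

import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

public class RegionConfigExample {
    // Hypothetical property name; each region's configuration source can hold a different value.
    private static final DynamicStringProperty FAILOVER_MODE =
        DynamicPropertyFactory.getInstance().getStringProperty("myapp.failover.mode", "normal");

    public static void main(String[] args) {
        // get() returns the current value, so changes pushed to the region's
        // configuration source take effect without restarting the instance.
        System.out.println("Failover mode: " + FAILOVER_MODE.get());
    }
}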
Denominator: DNS for Multi-Region Availability
(Diagram: Denominator manages traffic via UltraDNS, DynECT DNS, and AWS Route53, routing users through the Zuul API router and regional load balancers to each region's Zone A, B, and C Cassandra replicas.)
Denominator – manage traffic via multiple DNS providers with Java code
Zuul – Smart and Scalable Routing Layer
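Zuul's routing behavior is built from filters that run before, while, or after a request is proxied to its origin. A minimal pre-filter sketch against the Zuul 1.x filter API follows; the header name and the choice of what to do in run() are hypothetical, shown only to illustrate the filter shape.

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

/** Hypothetical pre-filter: tag requests with the region that handled them. */
public class RegionTagFilter extends ZuulFilter {

    @Override
    public String filterType() {
        return "pre";          // run before the request is routed to the origin
    }

    @Override
    public int filterOrder() {
        return 10;             // relative ordering among pre-filters
    }

    @Override
    public boolean shouldFilter() {
        return true;           // apply to every request in this sketch
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        // Hypothetical header; a real deployment would make routing or shedding decisions here.
        ctx.addZuulRequestHeader("X-Handled-By-Region", "us-west-2");
        return null;           // the return value is ignored by Zuul 1.x
    }
}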
Ribbon library for internal request routing
Ribbon – Zone Aware LB
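Underneath the zone-aware behavior, Ribbon is a client-side load balancer that picks a server from a list for each internal request. A minimal sketch using Ribbon's load-balancer API follows; the server names are hypothetical, and in a real deployment the list would be fed from Eureka and filtered by zone.

import java.util.Arrays;

import com.netflix.loadbalancer.BaseLoadBalancer;
import com.netflix.loadbalancer.Server;

public class RibbonExample {
    public static void main(String[] args) {
        // Client-side load balancer with a static (hypothetical) server list;
        // in production the list would come from Eureka and be zone-aware.
        BaseLoadBalancer lb = new BaseLoadBalancer();
        lb.addServers(Arrays.asList(
            new Server("host-a.example.com", 7001),
            new Server("host-b.example.com", 7001)));

        // Each call picks the next server according to the configured rule.
        Server chosen = lb.chooseServer(null);
        System.out.println("Routing request to " + chosen.getHostPort());
    }
}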
Karyon - Common server container
• Bootstrapping
  o Dependency & lifecycle management via Governator
  o Service registry via Eureka
  o Property management via Archaius
  o Hooks for Latency Monkey testing
  o Preconfigured status page and healthcheck servlets
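Karyon's bootstrapping sits on Governator, which layers lifecycle management and classpath scanning on top of Guice. The following is a minimal sketch of that underlying Governator bootstrap rather than Karyon's own entry point; the base package is hypothetical.

import com.google.inject.Injector;
import com.netflix.governator.guice.LifecycleInjector;
import com.netflix.governator.lifecycle.LifecycleManager;

public class ServerBootstrap {
    public static void main(String[] args) throws Exception {
        // Build a Guice injector with Governator's lifecycle support,
        // scanning a (hypothetical) base package for bindings and auto-bound singletons.
        Injector injector = LifecycleInjector.builder()
            .usingBasePackages("com.example.myservice")
            .build()
            .createInjector();

        // Start managed components (@PostConstruct, warm-up, etc.).
        LifecycleManager lifecycle = injector.getInstance(LifecycleManager.class);
        lifecycle.start();
        // ... serve requests; call lifecycle.close() on shutdown
    }
}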
Karyon
• Embedded Status Page Console
  o Environment
  o Eureka
  o JMX
Karyon
• Sample service using Karyon available as "Hello-netflix-oss" on github
Hystrix Circuit Breaker: Fail Fast -> recover fast
Hystrix Circuit Breaker State Flow
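Each call to a dependency is wrapped in a command with a fallback; when failures or timeouts accumulate, the circuit opens and subsequent calls fail fast to the fallback until the dependency recovers. A minimal sketch using the Hystrix 1.x API follows; the command group, the simulated remote call, and the fallback value are hypothetical.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

/** Hypothetical command wrapping a call to a downstream service. */
public class GetMovieTitleCommand extends HystrixCommand<String> {

    private final String movieId;

    public GetMovieTitleCommand(String movieId) {
        super(HystrixCommandGroupKey.Factory.asKey("MovieService"));
        this.movieId = movieId;
    }

    @Override
    protected String run() {
        // The real remote call would go here; throwing or timing out counts as failure.
        return remoteLookup(movieId);
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, or the circuit is open.
        return "Unknown title";
    }

    private String remoteLookup(String movieId) {
        throw new RuntimeException("downstream unavailable"); // simulate a failing dependency
    }

    public static void main(String[] args) {
        // execute() runs the command; here the simulated failure triggers the fallback.
        System.out.println(new GetMovieTitleCommand("tt0133093").execute());
    }
}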
Turbine Dashboard
Per-second updates of circuit breakers in a web browser
Either you break it, or users will
Add some Chaos to your system
Clean up your room! – Janitor Monkey
Works with Edda history to clean up after Asgard
Conformity Monkey
Track and alert for old code versions and known issues
Walks Karyon status pages found via Edda
In use by many companies
netflixoss@netflix.com
http://www.meetup.com/Netflix-Open-Source-Platform/
Takeaways
By embracing isolation and redundancy for availability, we transformed our cloud infrastructure to be resilient against region-wide outages.
Use NetflixOSS to scale your enterprise.
Contribute to existing github projects and add your own.
http://netflix.github.com
@NetflixOSS @rusmeshenberg
Watch the video with slide synchronization on InfoQ.com:
http://www.infoq.com/presentations/netflix-availability-regions
