5. With more than 30 million streaming
members in the United States, Canada,
Latin America, the United Kingdom, Ireland
and the Nordics, Netflix is the world's
leading internet subscription service for
enjoying movies and TV programs
streamed over the internet to PCs, Macs
and TV.
Source: http://ir.netflix.com
Tweet @jedberg with feedback!
6. The Netflix Way
• Everything is “built for three”
• Fully automated build tools to test
and make packages
• Fully automated machine image
bakery
• Fully automated image deployment
• Independent teams responsible for
both Dev and Ops
9. Automate all the things!
• Application startup
• Configuration
• Code deployment
• System deployment
10. Automation
• Standard base image
• Tools to manage all the
systems
• Automated code deployment
11. Shared state should be
stored in a shared
service
Data on an instance
should be replicated to
other instances
12. “Build for Three”
We hold a boot camp for new engineers to teach them
how to build for a highly distributed environment.
14. Netflix on AWS
[Timeline graphic; surviving labels: 2012, IPv6, Open Connect]
15. Highly aligned, loosely
coupled
• Services are built by different
teams who work together to figure
out what each service will provide.
• The service owner publishes an
API that anyone can use.
16. Advantages to a Service
Oriented Architecture
• Easier auto-scaling
• Easier capacity planning
• Identify problematic code-paths more
easily
• Narrow down the effects of a change
• More efficient local caching
17. Freedom and Responsibility
• Developers deploy when they
want
• They also manage their own
capacity and autoscaling
• And fix anything that breaks at
4am!
18. All systems choices assume
some part will fail at some
point.
19. The Monkey Theory
• Simulate
things that go
wrong
• Find things
that are
different
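"Find things that are different" can be illustrated with a small outlier check. This is only a sketch under assumptions: the function name `find_outliers` and the sample instance and AMI identifiers are hypothetical and are not taken from any Netflix tool.

```python
from collections import Counter


def find_outliers(attribute_by_instance):
    """Return instance IDs whose attribute differs from the most common value.

    attribute_by_instance: dict mapping instance ID -> some hashable attribute
    (for example the AMI ID the instance was launched from).
    """
    if not attribute_by_instance:
        return []
    # The most common value is treated as the expected one.
    expected, _count = Counter(attribute_by_instance.values()).most_common(1)[0]
    return sorted(
        instance_id
        for instance_id, value in attribute_by_instance.items()
        if value != expected
    )


# Hypothetical fleet: one instance runs a different image than its peers.
fleet = {"i-1": "ami-a", "i-2": "ami-a", "i-3": "ami-b"}
outliers = find_outliers(fleet)
```

Treating the majority as "normal" is a design shortcut for the sketch; a real check would more likely compare against a declared expected state.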
20. Execution
Photo from I, Robot, copyright 20th Century Fox
21. Netflix built a global PaaS
• Service Oriented
Architecture
• HTTP/REST interfaces
between services
22. Netflix PaaS features
• Supports all regions and zones
• Multiple accounts
• Cross region/account replication
• Internationalized, localized and GeoIP
routed
• Advanced key management
• Autoscaling with 1000s of instances
• Monitoring and alerting on millions of
metrics
23. What AWS Provides
• Instances
• Machine Images
• Elastic IPs
• Load Balancers
• Security groups / Autoscaling groups
• Availability zones and regions
24. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Java (JDK 6 or 7)
• Appdynamics App Agent monitoring
• Tomcat
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
• Monitoring: Log Rotation to S3, Appdynamics Machine Agent
• GC and thread dump logging
25. The Netflix Platform
• Discovery (Eureka)
• Circuit Breakers (Hystrix)
• Entrypoints (Edda)
• Configuration (Archaius)
• Cassandra (Priam & Astyanax & CassJMeter)
• Zookeeper (Exhibitor)
• Cryptex
• Logging (Blitz4j & Honu)
• AKMS
• EVCache
• NIWS
• Proxies
• i18n
• Geo
• L10n
• Base
[Legend label: Open Source]
27. Open Source at Netflix
Governator
Blitz4j
Edda
Hystrix
28. Finding things
• Discovery (Eureka)
• Application to instance mapping
• Heartbeat to keep track of health
• Entrypoints (Edda)
• Local database of AWS resources
• NIWS (Netflix Internal Web Service)
• On instance software load balancer
• Handles retry logic
• Geo (Geolocation library)
• Provides IP to Lat/Lon mapping for any service that
needs it.
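The Discovery bullets (application-to-instance mapping plus heartbeats) can be sketched as a toy in-memory registry. This is not Eureka's API: the class name `Registry`, its methods, and the 90-second lease are assumptions made purely for illustration.

```python
import time


class Registry:
    """Toy application-to-instance registry with heartbeat expiry (hypothetical)."""

    def __init__(self, lease_seconds=90.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # assumed lease length
        self.clock = clock                  # injectable so expiry can be tested
        self._last_seen = {}                # (app, instance_id) -> last heartbeat time

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew its lease if it is already known."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return instance IDs of `app` whose last heartbeat is within the lease."""
        now = self.clock()
        return sorted(
            instance_id
            for (name, instance_id), seen in self._last_seen.items()
            if name == app and now - seen <= self.lease_seconds
        )
```

An instance that stops sending heartbeats simply ages out of the results, so callers never need an explicit "instance died" notification.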
29. Entrypoints (Edda)
• REST API
• GET /REST/v2/instance/$id
• Keeps track of all resources
• Autoscaling groups, EIPs,
Instances, Applications, Clusters,
History
30. Entrypoints Exploration
• Find all active instances:
GET /REST/v2/view/instances
• Find all instances in a cluster:
GET /REST/v2/group/clusters
• Show only ASG name, instance ID and health:
/v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
• Which ASG contains a particular instance?
/v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
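The last two queries on this slide append arguments to the path with `;`. A small helper can compose paths of that shape. This is a sketch only: `edda_path` and its parameters are hypothetical names, and it reproduces just the path forms shown on the slide rather than the full query syntax.

```python
def edda_path(collection, name=None, filters=None, fields=None):
    """Compose an Edda-style path shaped like the slide's examples (hypothetical helper)."""
    path = "/v2/" + collection
    if name:
        path += "/" + name
    # Arguments such as ;instances.instanceId=<id> are appended with ';'.
    for key, value in (filters or {}).items():
        path += ";{}={}".format(key, value)
    # Field list suffix, written exactly as on the slide: ;_pp:(...)
    if fields:
        path += ";_pp:(" + fields + ")"
    return path


# Which ASG contains a particular instance?
by_instance = edda_path(
    "aws/autoScalingGroups",
    filters={"instances.instanceId": "i-96f3ca3a"},
)

# Show only ASG name, instance ID and health for one group.
summary = edda_path(
    "aws/autoScalingGroups",
    name="edda-v123",
    fields="autoScalingGroupName,instances:(instanceId,lifecycleState)",
)
```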
31. Keeping it all Straight
• Configuration (Archaius)
• Global variables (Fast properties)
• Base
• Base system. Prod vs. Test, etc
• Zookeeper (Curator)
• Locks, other similar coordination
• Logging (Blitz4j and Honu)
• Keep track of what happened and store it
for post analysis.
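The idea of a "fast property" (a global value that can change while the application runs) can be sketched as below. This is not the Archaius API: `FastProperty`, `on_change`, and the notion of a config poller calling `set` are assumptions for illustration.

```python
class FastProperty:
    """Toy dynamic property: readers always see the latest value (hypothetical)."""

    def __init__(self, name, default):
        self.name = name
        self._value = default
        self._callbacks = []

    def get(self):
        """Return the current value; callers re-read it instead of caching it."""
        return self._value

    def on_change(self, callback):
        """Register callback(old, new) to run whenever the value changes."""
        self._callbacks.append(callback)

    def set(self, new_value):
        """Called by an (assumed) config poller when the central value changes."""
        old, self._value = self._value, new_value
        if old != new_value:
            for callback in self._callbacks:
                callback(old, new_value)


# Usage: a timeout that operators can change without a redeploy.
changes = []
timeout_ms = FastProperty("client.timeout_ms", 100)
timeout_ms.on_change(lambda old, new: changes.append((old, new)))
timeout_ms.set(250)
```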
32. Keeping it Secure
• Cryptex
• Service for key management
• High, medium and low value keys
• AKMS (Amazon Key Management System)
• Hands out keys to instances (and dev boxes)
so they don’t have to store the key on the
instance
For more info, see SEC201: Security Panel
33. Storing it
• Cassandra (Priam, Astyanax)
• Configure and access Cassandra
• Provide OO abstractions; handle
connection pooling and discovery of
hosts
• EVCache (Eccentric Volatile Cache)
• Wrapper for memcached to handle
zone awareness and replication
• Proxies
• Get data out of the datacenter and
into the cloud.
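"Zone awareness and replication" for a cache wrapper can be sketched as: write to every zone, read from the local zone first. This is not the EVCache API. The class name `ZoneAwareCache` is hypothetical, and plain dicts stand in for per-zone memcached clusters.

```python
class ZoneAwareCache:
    """Toy zone-aware cache: replicate writes to all zones, prefer local reads.

    Plain dicts stand in for per-zone memcached clusters (an assumption).
    """

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self.local_zone = local_zone
        self.stores = {zone: {} for zone in zones}

    def set(self, key, value):
        # Replicate the write to every zone so any zone can serve reads.
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        # Prefer the local zone; fall back to the other zones on a miss.
        ordered = [self.local_zone] + [z for z in self.stores if z != self.local_zone]
        for zone in ordered:
            if key in self.stores[zone]:
                return self.stores[zone][key]
        return None
```

With this layout, losing the local zone's cache contents costs a slower cross-zone read rather than a miss.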
34. Data
What do we do with it all?
35. We store it!
• Cache
(memcached)
• Cassandra
• RDS (MySQL)
51. Netflix has moved the
granularity from the
instance to the cluster
52. Why Bake?
Traditional (Generic AMI -> Instance):
• launch OS
• install packages
• install app
Netflix (App AMI -> Instance):
• launch OS+app
53. Getting Baked
[Build pipeline diagram; recoverable labels:]
• Source: Perforce / Git
• Jenkins: sync, resolve, compile, build, test, publish, report
• Ant targets, Groovy all over
• Ivy: snapshot / release libraries
• Artifactory: libraries / apps, app bundles
54. Base Image Baking
[Diagram; recoverable labels:]
• Foundation AMI (S3 / EBS), Linux: CentOS, Fedora, Ubuntu
• Bakery on AWS (ec2 slave instances): mount, Yum / Apt install, snapshot
• RPMs: Apache, Java...
• Base AMI: ready for app bake
55. App Image Baking
[Diagram; recoverable labels:]
• Base AMI (S3 / EBS): Linux, Apache, Java, Tomcat
• Bakery on AWS (ec2 slave instances): mount, Yum install, snapshot
• App bundle from Jenkins / Artifactory
• App AMI: ready to launch!
56. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Java (JDK 6 or 7)
• Appdynamics App Agent monitoring
• Tomcat
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
• Monitoring: Log Rotation to S3, Appdynamics Machine Agent
• GC and thread dump logging
57. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Java (JDK 6 or 7)
• Appdynamics App Agent monitoring
• JBoss
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
• Monitoring: Log Rotation to S3, Appdynamics Machine Agent
• GC and thread dump logging
58. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Python
• monitoring
• Django
• Application file, base server, platform, interface libs for dependent services
• Monitoring: Log Rotation to S3, Appdynamics Machine Agent
• logging
59. The Monkey Theory
• Simulate
things that go
wrong
• Find things
that are
different
60. The simian army
• Chaos -- Kills random instances
• Chaos Gorilla -- Kills zones
• Chaos Kong -- Kills regions
• Latency -- Degrades network and injects faults
• Conformity -- Looks for outliers
• Circus -- Kills and launches instances to maintain zone
balance
• Doctor -- Fixes unhealthy resources
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things like Amazon limit
violations
• Security -- Finds security issues and expiring certificates
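The "kills random instances" idea can be sketched as a selection step that picks at most one victim per group. This is not the Chaos Monkey implementation: `pick_victims`, the per-group probability, and the sample group names are hypothetical, and the sketch only chooses instances without terminating anything.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Pick at most one instance per group to terminate (hypothetical sketch).

    groups: dict mapping group name -> list of instance IDs.
    probability: chance in [0, 1] that a given group loses an instance this run.
    rng: optional random.Random, injectable for repeatable runs.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in groups.items():
        # Empty groups are skipped; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


# Usage with hypothetical groups; a real tool would then call the cloud API.
chosen = pick_victims({"api": ["i-1", "i-2"], "empty": []}, 1.0, random.Random(0))
```

Limiting the damage to one instance per group per run keeps the exercise survivable for any service that is built to tolerate a single instance failure.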
For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
65. Alert Systems
[Diagram; recoverable labels: Atlas alerting, Appdynamics, api, CORE Agent, Other Team's Agent, CORE Event Gateway, Paging Service, Amazon SES]
68. Data Collection Pipeline
Data Processing Pipeline
For more info, see BDT303: Data Science with Elastic MapReduce
71. Incident Reviews
Ask the key questions:
• What went wrong?
• How could we have detected it
sooner?
• How could we have prevented it?
• How can we prevent this class of
problem in the future?
• How can we improve our behavior
for next time?
72. Best Practices for Data
• Have multiple copies of all data
• Keep those copies in multiple AZs
• Avoid keeping state on a single instance
• Take frequent snapshots of EBS disks
• No secret keys on the instance
73. Netflix autoscaling
[Chart; annotations: 1 = Traffic Peak, 2 = Deployment]
74. AWS Usage
Dollar amounts have been carefully removed
78. Leveraging Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
79. Circuit Breakers (Hystrix)
Be liberal in what you accept, strict in what you send
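The circuit breaker pattern itself can be sketched in a few lines: after repeated failures, stop calling the dependency for a while and serve a fallback instead. This is not the Hystrix API. `CircuitBreaker`, its thresholds, and the injectable clock are assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (hypothetical, not the Hystrix API)."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_seconds = reset_seconds  # how long to stay open
        self.clock = clock                  # injectable so timing can be tested
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, fallback):
        """Run func(); return fallback() if the circuit is open or func raises."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()           # open: fail fast without calling func
            # Cool-down is over: allow one trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                   # success closes the circuit fully
        return result
```

Failing fast while the circuit is open protects the caller's threads and gives the struggling dependency room to recover.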
80. Just a quick reminder...
• (Some of) Netflix is open source:
• https://github.com/netflix
81. We are sincerely eager to
hear your feedback on this
presentation and on re:Invent.
Please fill out an evaluation
form when you have a
chance.