My @TriangleDevops talk from 2013-10-17. I covered the work that led us to @NetflixOSS (Acme Air), the work we did on the Cloud Prize (NetflixOSS on IBM SoftLayer/RightScale), and the @NetflixOSS platform (Karyon, Archaius, Eureka, Ribbon, Asgard, Hystrix, Turbine, Zuul, Servo, Edda, Ice, Denominator, Aminator, and the Janitor/Conformity/Chaos Monkeys of the Simian Army).
2. Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
3. About me …
• IBM STSM, Performance Architecture and Strategy
• Eleven years working on WebSphere performance
– Led the App Server Performance team for years
– Small sabbatical focused on IBM XML technology
– Work in the Emerging Technology Institute and CTO Office
– Starting to look at cloud service operations
• Email: aspyker@us.ibm.com
– Blog: http://ispyker.blogspot.com/
– LinkedIn: http://www.linkedin.com/in/aspyker
– Twitter: http://twitter.com/aspyker
– GitHub: http://www.github.com/aspyker
• Triangle dad who enjoys technology as well as running, wine and poker
4. Develop or maintain a service today?
• Develop – starting
• Maintain – starting
• More on this later ….
http://www.flickr.com/photos/stevendepolo/
5. What qualifies me to talk?
• My shirt?
• Of the ~25 Cloud Prize nominees
– Personally
• Best example mash-up sample
– My IBM team
• Best portability enhancement
– More on this coming …
• http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html
6. Seriously, how did I get here?
• Plenty of experience with performance and scale on
standardized benchmarks (SPEC/TPC)
– Not representative of how to (web) scale
• Pinning, biggest monolithic DB “wins”, hand-tuned for fixed size
– Out of date on modern architecture for mobile/cloud
• Created Acme Air
– http://bit.ly/acmeairblog
• Demonstrated that we could achieve (web) scale runs
– 4B+ mobile/browser requests/day
– With modern mobile and cloud best practices
8. What was shown?
• Peak performance and scale – You betcha!
• Operational visibility – Only during the run via
nmon collection and post-run visualization
• True operational visibility – nope
• Devops – nope
• HA and DR – nope
• Manual and automatic elastic scaling – nope
9. What next?
• Went looking for existing industry best practices around
devops and high availability at web scale
– Many are documented via research papers and on
highscalability.com – Google, Twitter, Facebook, LinkedIn,
etc.
• Why Netflix?
– Documented practices not only on their tech blog, but also
released working OSS on GitHub
– Also, given their dependence on Amazon, they are a clear
bellwether of web-scale public cloud availability
10. Steps to NetflixOSS understanding
• Recoded Acme Air application to make use of NetflixOSS
runtime components
• Worked to implement a NetflixOSS devops and high
availability setup around Acme Air (on EC2), running at
previous levels of scale and performance
• Worked to port NetflixOSS runtime and devops/high
availability servers to IBM Cloud (SoftLayer) and RightScale
• Through public collaboration with the Netflix technical team
– Google Groups, GitHub and meetups
11. Why?
• To prove that an advanced cloud high availability
and devops platform wasn’t “tied” to Amazon
• To understand how we can advance IBM cloud
platforms for our customers
• To understand how we can host our IBM
public cloud services better
12. Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
13. My view of Netflix goals
• As a business
– Be the best streaming media provider in the world
– Make best content deals based on real data/analysis
• Technology wise
– Have the most availability possible
– Measure all things by “stream starts per unit of time”
• Any dip in that relates back to the business
– Do this at web scale
14. Standing on the shoulders of giants
• Public Cloud (Amazon)
– When adding streaming, Netflix decided they
• Shouldn’t invest in building data centers worldwide
• Had to plan for the streaming business to be very big
– Embraced cloud architecture paying only for what they need
• Open Source
– Many parts of runtime depend on open source
• Linux, Apache Tomcat, Apache Cassandra, etc.
– Realized that Amazon wasn’t enough
• Started a cloud platform on top that would
eventually be open sourced - NetflixOSS
http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg
15. Failure
• What is failing?
– Underlying IaaS problems
• Instances, racks, availability zones, regions
– Software issues
• Operating system, servers, application code
– Surrounding services
• Other application services, DNS, user registries, etc.
• How is a component failing?
– Fails and disappears altogether
– Intermittently fails
– Works, but is responding slowly
– Works, but is causing users a poor experience
16. Overview of Amazon EC2
• Amazon launches instances into availability zones
– Instances of various sizes (compute, storage, etc.)
• Organized into regions and availability zones
– Regions are independent of each other
– Regions are only connected over the Internet
– Regions contain availability zones
– Availability zones are isolated from each other
– Availability zones are connected w/ low-latency links
• This gives a high level of resilience to outages
– Unlikely to affect multiple availability zones or regions
• Amazon requires customers to be aware of this topology to take
advantage of its benefits within their application
[Diagram: EC2 Region (US East) and EC2 Region (US West), each containing
multiple availability zones, connected to each other over the Internet]
18. NetflixOSS – for today
• For today
– Focus on mid-tier web app and micro-service servers
– Devops servers and tools
– Skipping some just for simplicity
• For another time
– Big data
– Data tier
– Caching
19. Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
20. Acme Air As A Sample
[Diagram: ELB → Web App Front End (REST services) → App Service
(Authentication) → Data Tier]
Greatly simplified …
21. Micro-services architecture
• Decompose system into isolated services that can be developed
separately
• Why?
– They can fail independently vs. failing together monolithically
– They can be developed and released at different velocities by
different teams
• To show this we created separate “auth service” for Acme Air
• In a typical customer-facing application, any single front end
invocation could spawn 20-30 calls to services and data sources
22. How do services advertise themselves?
• Upon web app startup, the Karyon server is started
– Karyon will configure the application (via Archaius)
– Karyon will register the location of the instance with Eureka
• Others can know of the existence of the service
• Lease based, so instances continue to check in, updating the list of available instances
– Karyon will also expose a JMX console and a healthcheck URL
• Devops can change things about the service via JMX
• The system can monitor the health of the instance
[Diagram: an App Service (Authentication) instance running Karyon on
Tomcat registers its name, port, IP address, and healthcheck URL with the
Eureka server(s); configuration comes from config.properties /
auth-service.properties or remote Archaius stores]
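As a rough sketch of what Karyon automates, direct registration against the 2013-era Eureka 1.x client API looks something like the following (the bootstrap class and UP-status handling here are illustrative assumptions, not Acme Air code):

import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.InstanceInfo.InstanceStatus;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryManager;

public class AuthServiceBootstrap {
    public static void main(String[] args) {
        // Reads eureka-client.properties (instance name, port, healthcheck URL,
        // Eureka server URLs) via Archaius and starts the registration lease
        DiscoveryManager.getInstance().initComponent(
                new MyDataCenterInstanceConfig(),
                new DefaultEurekaClientConfig());

        // ... start Tomcat and the auth-service endpoints here ...

        // Mark the instance UP so Eureka advertises it to consumers
        ApplicationInfoManager.getInstance().setInstanceStatus(InstanceStatus.UP);
    }
}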
23. How do consumers find services?
• Service consumers query Eureka at startup and
periodically to determine the location of dependencies
– Can query within an availability zone and across
availability zones
[Diagram: the Web App Front End (REST services), with an embedded Eureka
client on Tomcat, asks the Eureka server(s) “what auth-service instances
exist?”]
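On the consumer side, a minimal lookup sketch against the same Eureka 1.x client API (the helper class is hypothetical; “auth-service” is the name registered above):

import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.DiscoveryManager;

public class AuthServiceLocator {
    // Ask Eureka for the next available "auth-service" instance;
    // the client round-robins across the instances it knows about
    public static String authServiceEndpoint() {
        InstanceInfo instance = DiscoveryManager.getInstance()
                .getDiscoveryClient()
                .getNextServerFromEureka("auth-service", false); // false = non-secure
        return "http://" + instance.getHostName() + ":" + instance.getPort();
    }
}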
25. How does the consumer call the service?
• Protocol implementations have Eureka-aware load balancing support built in
– In-client load balancing – does not require a separate LB tier
• Ribbon – REST client
– Pluggable load balancing scheme
– Built-in failure recovery support (retry next server, mark instance as failing, etc.)
• Other Eureka-enabled clients – memcached (EVCache), Astyanax coming
(Priam and Cassandra)
[Diagram: the Web App Front End (REST services) calls “auth-service”
through the Ribbon REST client and Eureka client, which pick among the
available App Service (Authentication) instances]
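A hedged sketch of the Ribbon client path (assuming the Ribbon 1.x RestClient API; the named client, REST path, and the configuration pointing its server list at Eureka are illustrative):

import java.net.URI;
import com.netflix.client.ClientFactory;
import com.netflix.client.http.HttpRequest;
import com.netflix.client.http.HttpResponse;
import com.netflix.niws.client.http.RestClient;

public class AuthServiceCaller {
    public static HttpResponse callAuthService(String user) throws Exception {
        // Named client configured (via Archaius properties) to source its
        // server list for "auth-service" from Eureka
        RestClient client = (RestClient) ClientFactory.getNamedClient("auth-service");
        HttpRequest request = HttpRequest.newBuilder()
                .uri(new URI("/rest/api/login?user=" + user)) // illustrative path
                .build();
        // Ribbon picks an instance and retries the next server on failure
        return client.executeWithLoadBalancer(request);
    }
}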
26. How to deploy this with HA?
Instances?
• Deploy across AZs
• Using AutoScalingGroups in
EC2 managed by Asgard
Eureka?
• Deployed across AZs
– ASG manages recovery
• DNS and Elastic IP trickery
• For clients to find eureka servers
– DNS TXT record for the domain lists AZ TXT
records
– AZ TXT records have the list of Eureka servers
• For new eureka servers
– Look for the list of eureka server IPs for the AZ
it’s coming up in
– Look for unassigned elastic IPs, grab one and
assign it to itself
– Sync with other already assigned IPs that
likely are hosting Eureka server instances
• Simpler configurations with less HA are
available
27. Protect yourself from unhealthy services
• Wrap all calls to services with the Hystrix command pattern
– Hystrix implements the circuit breaker pattern
– Executes the command using a semaphore or separate thread
pool to guarantee return within a finite time to the caller
– If an unhealthy service is detected, start to call the fallback
implementation (broken circuit) and periodically check if the
main implementation works (reset circuit)
[Diagram: the Web App Front End (REST services) executes the
“auth-service” call as a Hystrix command wrapping the Ribbon REST client;
when the circuit is broken, Hystrix routes to the fallback implementation
instead of the App Service (Authentication) instances]
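A minimal Hystrix command sketch for the auth-service call (the group key, return type, and fallback value are illustrative; the actual Ribbon call is elided):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class AuthServiceCommand extends HystrixCommand<String> {
    private final String user;

    public AuthServiceCommand(String user) {
        super(HystrixCommandGroupKey.Factory.asKey("AuthService"));
        this.user = user;
    }

    @Override
    protected String run() throws Exception {
        // Normal path: call auth-service (e.g., via the Ribbon client above);
        // runs in a Hystrix thread pool with a timeout
        return callAuthService(user);
    }

    @Override
    protected String getFallback() {
        // Broken circuit, timeout, or failure: degrade gracefully
        return "anonymous";
    }

    private String callAuthService(String user) throws Exception {
        throw new UnsupportedOperationException("elided - see Ribbon sketch");
    }
}

// Usage: String session = new AuthServiceCommand("fred").execute();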
28. Does Hystrix do more?
• The main reason for Hystrix is to protect yourself from dependencies, but …
• Once you have a layer of indirection, take advantage of it; Hystrix can provide
– Caching
– Visualization
• Aggregated via Turbine
– Request collapsing
• Programming models
– Sync, Async, Reactive (RxJava)
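For example, request caching only needs a getCacheKey() override on the command plus an active request context; a sketch (the cache key choice is illustrative):

import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;

public class CachingExample {
    public static void main(String[] args) {
        // Request caching requires a HystrixRequestContext (normally set up
        // once per request by a servlet filter)
        HystrixRequestContext context = HystrixRequestContext.initializeContext();
        try {
            // With getCacheKey() overridden in AuthServiceCommand to return
            // the user name, the second execute() is served from the cache
            new AuthServiceCommand("fred").execute();
            new AuthServiceCommand("fred").execute();
        } finally {
            context.shutdown();
        }
    }
}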
29. Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
30. Ability to reconfigure - Archaius
• Using dynamic properties, you can easily change properties
across a cluster of applications, either
– NetflixOSS named props
• Hystrix timeouts for example
– Custom dynamic props
• High throughput achieved by a polling approach
• HA of the configuration source is dependent on what source
you use
– HTTP server, database, etc.
[Diagram: the Archaius runtime property hierarchy – application props
layered over libraries and container defaults; sources include URLs, a
persisted DB, and JMX via the Karyon console]
DynamicIntProperty prop =
DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration
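Beyond polling a property's current value, code can react to changes; a small sketch using the property wrapper's callback hook (the property name and default are illustrative):

import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class TimeoutConfig {
    private static final DynamicIntProperty TIMEOUT_MS =
            DynamicPropertyFactory.getInstance()
                    .getIntProperty("acmeair.auth.timeoutMs", 1000);

    public static void init() {
        // Runs whenever the polled configuration source changes the value
        TIMEOUT_MS.addCallback(new Runnable() {
            public void run() {
                System.out.println("auth timeout now " + TIMEOUT_MS.get() + " ms");
            }
        });
    }
}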
31. Asgard
[Diagram: Asgard tells EC2 to start instances and keep a set number of
instances running; Web App Front End (REST services) and App Service
(Authentication) instances are spread across three availability zones in
the EC2 US East region]
• Asgard is the missing EC2 console for AutoScalingGroup mgmt.
– EC2 only has a CLI for ASG management
32. Asgard creates an “application”
• Enforces common practices for deploying code
– Common approach to linking auto scaling groups to launch configs,
ELBs, security groups, scaling policies, and AMIs
• Adds a missing concept to the EC2 domain model – “application”
– Extends clustering to applications vs. AMIs
• Example
– Application – app1
– Cluster – app1-env
– Autoscaling group version n – app1-env-v009
– Autoscaling group version n+1 – app1-env-v010
33. Asgard devops procedures
• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
– Ad hoc processes, but enforced through the Asgard model
• More coming using Glisten and Amazon SWF
35. Augmenting the ELB tier - Zuul
• Zuul adds devops support in the front tier routing
– Stress testing (squeeze testing)
– Canary testing
– Dynamic routing
– Load shedding
– Debugging
• And some common function
– Authentication
– Security
– Static response handling
– Multi-region resiliency (DR for the ELB tier)
– Insight
[Diagram: Amazon ELB → Zuul instances running dynamically loaded
filters → edge services]
• Through dynamically deployable filters (written in Groovy)
• Eureka aware using Ribbon and Archaius, as shown in the runtime section
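Netflix writes and hot-deploys its filters in Groovy; the same filter contract is sketched here in Java for consistency with the other examples (the header name and ordering are illustrative):

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

public class DebugHeaderFilter extends ZuulFilter {
    @Override
    public String filterType() { return "pre"; }  // run before routing

    @Override
    public int filterOrder() { return 1; }        // ordering among "pre" filters

    @Override
    public boolean shouldFilter() {
        // Could key off an Archaius dynamic property to toggle at runtime
        return true;
    }

    @Override
    public Object run() {
        // Tag the outbound request so downstream services can enable debug paths
        RequestContext ctx = RequestContext.getCurrentContext();
        ctx.addZuulRequestHeader("x-acmeair-debug", "true"); // illustrative header
        return null;
    }
}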
36. Monitoring - Servo
• Annotation-based publishing of
application metrics through JMX
• Filters, Observers, and Pollers to publish metrics
– Can export metrics to CloudWatch and other monitors
• The entire Netflix monitoring infrastructure
hasn’t been open sourced due to complexity and
priority
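A small sketch of Servo's annotation-based publishing (the metric name and registration id are illustrative):

import java.util.concurrent.atomic.AtomicInteger;
import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;
import com.netflix.servo.monitor.Monitors;

public class AuthServiceMetrics {
    @Monitor(name = "loginCount", type = DataSourceType.COUNTER)
    private final AtomicInteger loginCount = new AtomicInteger(0);

    public AuthServiceMetrics() {
        // Publishes the annotated fields through JMX (and any configured observers)
        Monitors.registerObject("authService", this);
    }

    public void onLogin() {
        loginCount.incrementAndGet();
    }
}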
37. A note on the next three projects
• I haven’t personally worked with these projects
• Given the audience, I included them as I believe
they will be of interest
38. Edda
• Polls Amazon config and stores the data in a
queryable database
• Provides a searchable view of Amazon
deployments
– Searchable in ways not possible with the Amazon APIs
• Provides a historical view
– For correlation of problems to changes
– Likely less of an issue in clouds that expose all changes
39. Ice
• Cloud spend and usage analytics
• Communicates with the billing API to give a
bird’s-eye view of cloud spend with drill
down to region, availability zone, and
service team through application groups
• Watches on-demand, used, and unused
reserved instances and instance sizes to
help optimize
• Not point in time
– Shows trends to help predict future
optimizations
40. Denominator
• Java library and CLI for cross-provider DNS configuration
• Allows for common, quicker (than using the various
DNS providers’ UIs), and automated DNS updates
• Plugins have been developed by various DNS
providers
41. Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
42. Get baked!
• Caution: Flame/troll bait ahead!!
• Netflix takes the approach of baking images as part of build such that
– Instance boot-up doesn’t depend on outside servers
– Instance boot-up only starts servers already set to run
– New code = new instances (never update instances in place)
• Why?
– Critical when launching hundreds of servers at a time
– Goal to reduce the failure points in places where dynamic system
configuration doesn’t provide value
– Speed of elastic scaling, boot and go
– Discourages ad hoc changes to server instances
• Criticism – “Netflix is ruining the cloud”
– Overhead of AMIs for every code version
– Ties to Amazon AMIs (would this work for containers? I think yes)
43. Aminator
• Starting image/volume
– Foundational image created (maybe via loopback),
base AMI with common software created/tested
independently
• Aminator running – Bakery
– Bakery obtains a known EBS volume of the base
image from a pool
– Bakery mounts volume and provisions the
application (apt/deb or yum/rpm)
– Bakery snapshots and registers snapshot
• Recent work to add other provisioners, such as Chef,
as plugins
• I have used hand-built AMIs thus far, but the blog
states developers can go through CI builds and
have running test instances within 15 minutes of
code being checked in
44. Agenda
• How did I get here?
• Netflix and Netflix OSS platform overview
• Runtime components
• Management components
• Build components
• Automated test and cleanliness components
45. The Simian Army
• A bunch of automated “monkeys” that
perform routine system administration
tasks
• Anything that is done by a human more than
once can and should be automated
• Absolutely necessary at web scale
46. Good Monkeys
• Janitor Monkey
– Somewhat a mitigation for the baking approach
– Will mark and sweep unused resources
(instances, volumes, snapshots, ASGs,
launch configs, images, etc.)
– Owners notified, then removed
• Conformity Monkey
– Checks instances are conforming to rules
around security, ASG/ELB, age, status/health
check, etc.
http://www.flickr.com/photos/sonofgroucho/5852049290
47. Back to high availability
• Failure is inevitable. Don’t try to avoid it!
• How do you know if your backup is good?
– Try to restore from your backup every so often
– Better to ensure backup works before you have a crashed
system and find out your backup is broken
• How do you know if your system is HA?
– Try to force failures every so often
– Better to force those failures during office hours
– Better to ensure HA before you have a down system and
angry users
– Best to learn from failures and add automated tests
48. Bad Monkeys
• Open sourced – Chaos Monkey
– Used to randomly terminate instances
– Now can also block the network, burn CPU, kill
processes, fail the Amazon API, fail DNS, fail
Dynamo, fail S3, introduce network
errors/latency, detach volumes, fill disks, and
burn I/O
http://www.flickr.com/photos/27261720@N00/132750805
• Not yet open sourced
– Chaos Gorilla
• Kill all instances in an availability zone
– Chaos Kong
• Kill all instances in an entire region
– Latency Monkey
• Introduce latency into service calls directly
(ribbon server side)
49. Agenda
• Blah, blah, blah
• How can I learn more?
• How do I play with this?
• Let’s write some code!
50. Want to play?
• NetflixOSS blog and github
– http://techblog.netflix.com
– http://github.com/Netflix
• Acme Air, NetflixOSS AMIs
– Try Asgard/Eureka with a real application
– http://bit.ly/aa-AMIs
• See what we ported to IBM Cloud (video)
– http://bit.ly/noss-sl-blog
• Fork and submit pull requests to Acme Air
– http://github.com/aspyker/acmeair-netflix