Slides from the 2016/01/28 Advanced Amazon Web Services (AWS) Meetup, where Netflix gave an overview of container usage at Netflix. We covered the technologies we are working on in the runtime (Titus) and the developer experience (Newt), how the Titus container management system differs from others, and our journey with Docker, Mesos, Netflix Fenzo, and eventually Amazon Elastic Container Service (ECS).
2. About Netflix
● 75M+ members
● #NetflixEverywhere (Worldwide)
● 42.5B hours watched in 2015
● > ⅓ of North American internet download traffic
● 1000’s of microservices
● Many 10’s of thousands of VMs
● 3 regions across the world
● 2000+ employees
3. About me
● Cloud platform technologies
○ Distributed configuration, service discovery, RPC, application
frameworks, non-Java sidecar
● Container cloud
○ Resource management and scheduling, making Docker containers
operational in Amazon EC2/ECS
● Open Source
○ Organize @NetflixOSS meetups & internal group
● Performance
○ Assist across Netflix, but focused mainly on cloud platform perf
With Netflix for ~1 year. Previously at IBM.
@aspyker
ispyker.blogspot.com
5. Agenda
● Why Containers for Netflix?
● Container runtime platform
● Container development experience
6. Why containers operationally?
Case 1:
I have a job I want to run reliably and efficiently, but I
don’t want to manage clusters myself
Case 2:
I have lots of services and I want to reduce the number
of VMs I need to manage, with isolation between
process instances
7. History - Project Titan
● Container management system
○ Predominantly batch processing system
● Higher level frameworks drive tasks
○ General workflow engine
○ DAG-based data processing
○ Misc reports, big data processing stages, interactive notebooks
● Tech
○ Rudimentary scheduling with Dynamo storage
○ Proven Docker execution environment
○ Using Mesos and Fenzo
8. History - Project Mantis
● Real time operational intelligence for
streaming experience
○ Ad hoc and perpetual stream processing
● Tech
○ Proven scheduling with C* storage
○ Mantis fatjars deployed in cgroups
○ Using Mesos and Fenzo
9. Fenzo overview
● A generic, plugin-based scheduling library
for Apache Mesos frameworks
● Features
○ Matching heterogeneous resources with varied tasks
○ Autoscaling of underlying cluster
○ Plugins for constraints and fitness
○ Support for fast (ms) scheduling rate
○ Visibility of scheduling actions
github.com/Netflix/Fenzo
10. Fenzo: fitness, constraints plugins
Fitness value (0.0 - 1.0)
● Degree of fitness - first fit, best fit, worst fit
○ Real world tradeoff between perfection and speed
● Composable evaluators
● e.g., bin packing
Constraints
● Hard constraints filter appropriate resources
● Soft constraints specify preferences
● e.g., zone balancing, instance type preferences
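The plugin model above can be sketched as follows. Fenzo itself is a Java library; the function names, weighting, and values below are illustrative only, not Fenzo’s real API:

```python
# Illustrative sketch of Fenzo-style scheduling plugins (NOT the real
# Fenzo Java API): a fitness evaluator scores a host 0.0-1.0, a hard
# constraint filters hosts outright, a soft constraint only biases the
# score.

def cpu_bin_packing_fitness(task_cpus, host_used, host_capacity):
    """Best-fit bin packing: hosts that are fuller after placement score higher."""
    used_after = host_used + task_cpus
    if used_after > host_capacity:
        return 0.0          # hard constraint: task must actually fit
    return used_after / host_capacity

def zone_balancing_soft_constraint(zone_counts, host_zone):
    """Prefer the zone currently running the fewest tasks."""
    least = min(zone_counts.values())
    return 1.0 if zone_counts[host_zone] == least else 0.5

def score_host(task_cpus, host, zone_counts):
    fitness = cpu_bin_packing_fitness(task_cpus, host["used"], host["capacity"])
    if fitness == 0.0:
        return 0.0
    # compose evaluators (here: a simple average), as Fenzo composes plugins
    return (fitness + zone_balancing_soft_constraint(zone_counts, host["zone"])) / 2

hosts = [
    {"name": "a", "used": 24, "capacity": 32, "zone": "us-east-1a"},
    {"name": "b", "used": 4,  "capacity": 32, "zone": "us-east-1b"},
]
zone_counts = {"us-east-1a": 10, "us-east-1b": 2}
best = max(hosts, key=lambda h: score_host(4, h, zone_counts))
print(best["name"])  # bin packing outweighs zone balance here: "a"
```

This is the real-world tradeoff the slide mentions: a cheap per-host score evaluated quickly beats a perfect global placement computed slowly.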
11. Project Titus
● Mantis (Scheduling, Job Mgmt)
+ Titan (Docker execution)
------------------------------------------
Titus (Andromedon)
● Titan API -> Mantis job mgmt/scheduler -> Titan executor
● Rolled out Q4 2015, took over all jobs in Jan 2016
12. Why Titus?
● Many other container management &
scheduling systems, why build another?
● Key unique values
○ Deeply support Amazon (not trying to abstract IaaS)
○ Narrow focus (just container management)
○ Deep integration with existing Netflix systems
○ Complex job scheduling requirements and scale/reliability
13. Current Titus Numbers
● Autoscaling 100’s of r3.8xls
(32 vCPU, 244 GB)
● Peak
○ thousands of cores, tens of TBs of memory
● Thousands of containers/day
● < 100 different images
14. Also in containers
● Already
○ Long running data pipeline service style routing tier
■ 850 c3.4xl instances with ~10K long running containers
○ Mantis cgroups
■ 1000’s cores running varied stream processing jobs
● Soon
○ Media encoding (10’s of thousands of cores)
○ Service style (potentially VERY large)
15. Titus high level architecture
[Architecture diagram. Components: Titus UI and CI/CD tooling calling the Titus API (Rhea); Titus Master (Job Management & Scheduler, built on Fenzo) backed by Cassandra and Zookeeper, coordinating with the Mesos Master and the EC2 Autoscaling API; Mesos Agents on which the Docker engine and Docker executor run containers alongside a logging agent, a metrics agent, and ZFS volumes; container images pulled from a Docker Registry; logs archived to S3.]
19. Titus API (coming)
● Disparate use cases in a single API
○ Going beyond batch to service, stream and cron
● SLA based on job attributes
○ For batch, completion time
○ For service, user-focused SLA (autoscaling, etc.)
● Ownership and cost accounting/metering
○ Group costs to owners and teams
● Aligned with existing continuous deployment system
○ Apps, clusters, ASGs in Spinnaker
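One way to picture “disparate use cases in a single API” is job payloads that share a shape but carry type-specific SLA attributes. The field names below are invented for illustration and are not the actual Titus API schema:

```python
# Hypothetical job payloads showing one API spanning batch and service
# jobs; every field name here is an assumption, not the Titus schema.

batch_job = {
    "owner": "bigdata-team@example.com",   # cost accounting / metering
    "type": "batch",
    "image": "registry.example.com/report-gen:1.4",
    "resources": {"cpus": 4, "memoryMB": 8192},
    "sla": {"runtimeLimitSecs": 3600},     # batch SLA: completion time
}

service_job = {
    "owner": "edge-team@example.com",
    "type": "service",
    "image": "registry.example.com/api-frontend:2.0",
    "resources": {"cpus": 2, "memoryMB": 4096},
    "sla": {"minInstances": 3, "maxInstances": 50,  # service SLA:
            "targetCpuUtilization": 0.6},           # autoscaling bounds
}

def sla_kind(job):
    """The scheduler applies different SLA logic per job type."""
    return "completion-time" if job["type"] == "batch" else "autoscaling"
```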
20. Titus Operational Views
Also APIs for
● cluster state
● cluster rolling updates
● leadership
● Titus app managed
through Spinnaker
21. Dependency Versions (as of 1/16)
Docker
● Registry - 2.0.1
● Engine - 1.9.1
○ Plus Netflix logging driver
Mesos
● 0.24.1
Using shared Netflix C* and Zookeeper services
22. Container Agent Features (existing)
● Volumes with quota
○ Using ZFS with snapshots and S3 archival
● Logging
○ Streaming live stdout/err logs
○ Rotation & shipping stdout/err & app logs to S3
● Networking
○ IP per container integration with VPC
● Metrics
○ cgroup metrics tagged by job/task id and image
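The metrics bullet above, tagging cgroup metrics by job/task id and image, might look like the following sketch. The file format matches a cgroup v1 `memory.stat` (one `key value` per line, e.g. under `/sys/fs/cgroup/memory/<container>/`), but the tag names are assumptions:

```python
# Sketch of one metrics-agent step: parse a cgroup memory.stat payload
# and tag each counter with the Titus job/task id and image so
# dashboards can group by job or image. Tag names are illustrative.

def parse_memory_stat(text):
    """Parse 'key value' lines as found in cgroup v1 memory.stat."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def tag_metrics(stats, job_id, task_id, image):
    """Attach identifying tags to every counter."""
    tags = {"titus.job": job_id, "titus.task": task_id, "image": image}
    return [{"name": "cgroup.memory." + k, "value": v, "tags": tags}
            for k, v in stats.items()]

sample = "cache 1048576\nrss 4194304\nmapped_file 0\n"
metrics = tag_metrics(parse_memory_stat(sample), "job-123", "task-0",
                      "example/app:1.0")
```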
23. Container Agent Features (planned)
● Networking/Security
○ Extend driver to support security groups & IAM Roles
● Volume Drivers
○ Persistent volumes as required by EBS/EFS
● Isolation
○ Beyond CPU, Memory, Disk - Networking I/O Bandwidth
● Security
○ Host and container security hardening (AppArmor/SELinux)
● Insight
○ Performance (Vector) and ad hoc debugging (SSH)
24. Unique Titus Scheduler Technology
● Job managers are separate from resource allocation
○ Less monolithic, more extensible
● Fenzo benefits
○ Bin packing, autoscaling, fitness/constraint configurability
○ Visibility into current state of the cluster
● Mesos reconciliation and task heartbeats
● Rate limiting of failing jobs and agents
● Thresholds and alerts for key aspects
○ Queue depth, idle hosts, etc.
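Rate limiting of failing jobs is commonly implemented as exponential backoff between restart attempts; the slide doesn’t describe the actual Titus policy, so the base delay and cap below are illustrative:

```python
# Sketch of rate limiting restarts of a repeatedly failing task with
# exponential backoff. Base delay and cap are made-up values, not the
# actual Titus policy.

def restart_delay_secs(consecutive_failures, base=10, cap=600):
    """Seconds to wait before the next restart, doubling per failure."""
    if consecutive_failures <= 0:
        return 0
    return min(base * 2 ** (consecutive_failures - 1), cap)
```

The cap keeps a permanently broken job from being retried slower and slower forever, while the doubling keeps a crash-looping container from monopolizing scheduler and agent capacity.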
25. Integration with Netflix Infrastructure
● Goal: Make containers work with existing cloud
systems (designed for virtual machines) vs. replace
● Areas
○ Service registration and discovery (Eureka)
○ IPC (Ribbon)
○ Continuous Delivery (Spinnaker)
○ Telemetry (Atlas)
○ Reliability (Chaos, Performance Insight)
26. Path to ECS
● Why we are considering ECS
○ Resource/cluster mgmt is undifferentiated heavy lifting
○ Expect ECS to have strong integration w/ EC2/AWS
● Have prototyped a Titus/Fenzo ECS port
○ Using our job mgmt/scheduling on top of ECS
● Working with the ECS team to add in
○ Simpler start task API (w/o defining a task first)
○ Event stream to power real time scheduling info
○ Extensibility in ECS events, resource types
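The “simpler start task API” ask refers to ECS’s two-step flow today: register a task definition, then start a task referencing it. The sketch below builds the two request payloads in the shape boto3’s ECS client expects for `register_task_definition` and `start_task`; the cluster, family, image, and ARN values are made up:

```python
# Build the two ECS request payloads currently needed to run a
# container: first register a task definition, then start a task that
# references it by family. Values are illustrative.

def task_definition_request(family, image, cpu, memory_mb):
    return {
        "family": family,
        "containerDefinitions": [{
            "name": family,
            "image": image,
            "cpu": cpu,
            "memory": memory_mb,
        }],
    }

def start_task_request(cluster, family, container_instance_arn):
    return {
        "cluster": cluster,
        "taskDefinition": family,          # must already be registered
        "containerInstances": [container_instance_arn],
    }

reg = task_definition_request("titus-batch", "example/report:1.0", 1024, 2048)
start = start_task_request(
    "titus", "titus-batch",
    "arn:aws:ecs:us-east-1:123456789012:container-instance/abc")
# With the simpler API being requested, the image/resources in `reg`
# could be passed directly in a single start-task call.
```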
27. Why containers for developers?
Case 1:
I want a consistent local development and cloud
deployment experience (in both directions)
Case 2:
I want to specify what it means to run my process, not
integrate into a one-size-fits-most VM image
29. Developer experience
NEWT
● One stop shop for creation, development, deployment of containers
Netflix Docker base layers
● Already integrated with runtime expectations
● Continuously rebuilt with small and controlled common support
Netflix Docker build tools
● Extend our bakery to produce Docker images and run locally
● More advanced image creation tools
○ Multi-inheritance, guaranteed metadata, metrics
30. We’re hiring
Come advance containers at Netflix!
Senior Software Engineer, Container Platform -
https://jobs.netflix.com/jobs/860487