3. Motivating Factors For Containers
● From Late 2015 Technical Strategy ...
● Simpler management of compute resources
● Simpler deployment packaging artifacts for compute jobs
● Need for a consistent local developer environment
4. Provided Innovation Velocity
Media Encoding - encoding research development time
● Using custom VMs - 1 month
● Using customizable containers - 1 week
Niagara
● Build all Netflix codebases in hours
● Saves developers hundreds of hours of debugging
NodeQuark
● Focus returns to app development
● NeWT & Titus simplify and speed up testing and deployment
5. Consistent Developer Experience
● NeWT - Common local developer experience, including support for container development
○ Container image used for local laptop development
○ Same container image re-used when deployed
● Has benefits in both directions
○ Cloud like local development environment
○ Easier operational debugging of cloud workloads
6. What is Titus?
● Cloud runtime platform for container based jobs
● Scheduling
○ Service & batch job management
○ Advanced resource management across an elastic shared resource pool
● Container Execution
○ Advanced Isolation
○ Docker and AWS Integration
○ Container integration with Netflix infrastructure
[Diagram: Titus layers - Job Management (Service, Batch); Resource Management & Optimization; Container Execution; Integration]
7. Titus Evolution Timeframe
● Titus created
● 4Q 2015 - Batch GA
● 1Q 2016 - Service support added
● 2Q 2016 - Netflix infra & AWS integration
● 4Q 2016 - First scale production service
● 2Q 2017 - First user path service
8. Containers Scale Over Time
● From thousand daily
● To 100K daily
● Spike to 450K
[Chart: containers launched (y-axis) by day (x-axis)]
9. Titus Current Scale
● Deployed across multiple AWS accounts & three regions
● Over 5,000 instances (mostly m4.4xlarge & r4.8xlarge)
● Launched over 1,000,000 containers in a one-week period
● Around 10,000 long-running containers
10. Current Titus Users (Sampling)
● Service
○ Stream Processing (Flink)
○ UI Services (NodeJS single core)
○ Internal dashboards
● Batch
○ Algorithm model training, personalization & recommendations (with GPUs)
○ Content value analysis
○ Digital watermarking
○ Ad hoc reporting (e.g., Open Connect CDN analysis and planning)
○ Continuous integration builds
● Queued worker model
○ Media encoding experimentation (Archer)
11. Titus Overview
[Architecture diagram. Labeled components:]
● Titus UI, Rhea, Titus API
● Titus Master (Job Management & Scheduler), Fenzo
● Cassandra, Zookeeper
● Mesos Master, EC2 Auto-scaling API
● Titus Agent (AWS VMs): Mesos agent, Titus executor, Docker, containers, metrics agents, logging agent, btrfs, Pod & VPC network drivers, AWS metadata proxy
● Docker Registry, S3
● Integration
12. AWS Integration
● Making Docker containers integrate with AWS the way VMs do
● Titus adds
○ VPC Connectivity (IP per container)
○ Security Groups
○ EC2 Metadata service
○ IAM Roles
○ Multi-tenant isolation (cpu, memory, disk quota, network)
○ Live and S3-persisted logs, with rotation & management
○ Remote storage (EFS)
○ Autoscaling service jobs
○ GPU Support
○ Environmental context similar to user data
13. Multi-tenant networking is hard
● Decided early on we wanted full IP stacks per container
● But what about?
○ Security group support
○ IAM role support
○ Network bandwidth isolation
○ Integration with VPC
20. Isolation
● CPU
○ Fixed shares today (pinning coming)
● Memory
○ Including page cache
● Disk
○ Quotas
● Networking
○ Bandwidth, ENIs and IPs
● Security
○ User namespaces, hosts locked down, secret management
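As a rough illustration of the CPU, memory, and disk bullets above, the sketch below maps a per-container resource request onto standard `docker run` flags. The `ResourceRequest` type and `docker_flags` helper are hypothetical names used only for illustration; this is not Titus's executor code. The 1024-shares-per-CPU convention is an assumption, and `--storage-opt size=` is only honored by some Docker storage drivers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical per-container request; not a Titus type."""
    cpus: int        # whole CPUs requested
    memory_mb: int   # memory limit in MiB
    disk_gb: int     # disk quota in GiB


def docker_flags(req: ResourceRequest) -> list[str]:
    """Translate a request into `docker run` resource flags."""
    return [
        # Fixed relative CPU weight; assumes 1024 shares per requested CPU.
        f"--cpu-shares={req.cpus * 1024}",
        # Hard memory limit enforced through the memory cgroup.
        f"--memory={req.memory_mb}m",
        # Per-container disk quota; depends on the storage driver in use.
        f"--storage-opt=size={req.disk_gb}G",
    ]


flags = docker_flags(ResourceRequest(cpus=2, memory_mb=4096, disk_gb=10))
print(" ".join(flags))
```

CPU shares are a relative weight rather than a hard cap, which matches the "fixed shares today" wording; pinning would need a different mechanism such as cpusets.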
21. Netflix Infrastructure Integration
● Provide a single cloud platform (VMs and containers treated the same)
● Titus adds integration with
○ Spinnaker CI/CD and canaries
○ Atlas telemetry and outlier detection
○ Discovery/IPC
○ Edda (and dependent systems)
○ Instance pollers (healthcheck, system metrics)
○ Chaos monkey
○ Traffic control & Kong
○ Netflix secure secret management
○ Interactive access (à la ssh)
● Supports both reserved critical and elastically scaled flex workloads
● Manages containers under both service and batch systems
29. Container Level Introspection
● Interactive “ssh” and file “scp” access managed by Titus hosts
● Locked down: hosts are secure and accessible only by Titus operators
31. Fenzo - The heart of Titus scheduling
● Extensible Library for Scheduling Frameworks
● Plugin-based scheduling objectives
○ Bin packing, etc.
● Heterogeneous resources & tasks
● Cluster autoscaling
○ Multiple instance types
● Plugin-based constraint evaluators
○ Resource affinity, task locality, etc.
● Single offer mode added in support of ECS
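The plugin idea above can be sketched in a few lines. Fenzo itself is a Java library; the Python below is only a minimal illustration of "constraints veto, fitness functions score", and every name in it (`Host`, `Task`, `has_capacity`, `cpu_bin_packing`, `schedule`) is hypothetical rather than part of Fenzo's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Host:
    name: str
    cpus_total: int
    cpus_used: int = 0


@dataclass(frozen=True)
class Task:
    name: str
    cpus: int


# Hypothetical plugin signatures: a constraint vetoes a host,
# a fitness function scores a host between 0.0 and 1.0.
Constraint = Callable[[Task, Host], bool]
Fitness = Callable[[Task, Host], float]


def has_capacity(task: Task, host: Host) -> bool:
    """Constraint plugin: the task must fit in the remaining CPUs."""
    return host.cpus_used + task.cpus <= host.cpus_total


def cpu_bin_packing(task: Task, host: Host) -> float:
    """Fitness plugin: higher score for hosts that would end up fuller."""
    return (host.cpus_used + task.cpus) / host.cpus_total


def schedule(task: Task, hosts: list[Host],
             constraints: list[Constraint], fitness: Fitness) -> Optional[Host]:
    """Pick the best-scoring host that passes every constraint."""
    candidates = [h for h in hosts if all(c(task, h) for c in constraints)]
    if not candidates:
        return None
    best = max(candidates, key=lambda h: fitness(task, h))
    best.cpus_used += task.cpus
    return best


hosts = [Host("a", 16, cpus_used=8), Host("b", 16)]
placed = schedule(Task("t1", 4), hosts, [has_capacity], cpu_bin_packing)
print(placed.name if placed else "unplaced")
```

Swapping `cpu_bin_packing` for a different scoring function changes the scheduling objective without touching `schedule`, which is the point of keeping objectives and constraints pluggable.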
33. Scheduling - Bin Packing, Elastic Scaling
User adds work tasks
● Titus does bin packing to ensure that we can downscale entire hosts efficiently
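To see why bin packing helps downscaling, the toy comparison below places the same tasks with a packing policy and with a spreading policy, then lists the hosts left completely idle. It is a simplified sketch that assumes a single resource (CPUs) and identical hosts; `place` and `idle_hosts` are hypothetical helpers, not Titus code.

```python
def place(tasks, host_count, capacity, pack):
    """Place CPU requests on hosts and return per-host usage.

    pack=True  -> choose the fullest host that still fits (bin packing)
    pack=False -> choose the emptiest host (spreading), for comparison
    """
    used = [0] * host_count
    for cpus in tasks:
        fits = [i for i in range(host_count) if used[i] + cpus <= capacity]
        if not fits:
            raise ValueError(f"no host can fit a {cpus}-CPU task")
        choose = max if pack else min
        target = choose(fits, key=lambda i: used[i])
        used[target] += cpus
    return used


def idle_hosts(used):
    """Hosts with no tasks are candidates for scale-down."""
    return [i for i, u in enumerate(used) if u == 0]


tasks = [4, 4, 4, 4]
packed = place(tasks, host_count=4, capacity=16, pack=True)
spread = place(tasks, host_count=4, capacity=16, pack=False)
print("packed:", packed, "idle hosts:", idle_hosts(packed))
print("spread:", spread, "idle hosts:", idle_hosts(spread))
```

With packing, all four tasks land on one host and three hosts can be terminated; with spreading, every host holds one task and none can be removed.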
34. Scheduling - Constraints including AZ Balancing
User specifies constraints
● AZ Balancing
● Resource and task affinity
● Hard and soft constraints
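The difference between a hard and a soft AZ-balancing constraint can be illustrated as follows. This is a sketch under assumed semantics: a soft constraint prefers the least-loaded zone but falls back to any zone with room, while a hard constraint leaves the task unplaced rather than unbalance the job. `pick_zone` and its parameters are hypothetical.

```python
from collections import Counter


def pick_zone(zones, running, free_slots, hard=False):
    """Choose a zone for the next task of a job.

    zones:      candidate availability zones
    running:    zones of the job's tasks that are already running
    free_slots: zone -> number of tasks that still fit there
    hard:       if True, only the least-loaded zone(s) are acceptable
    """
    counts = Counter(running)
    fewest = min(counts[z] for z in zones)
    # Prefer zones that currently hold the fewest tasks of this job.
    for zone in sorted(zones, key=lambda z: counts[z]):
        if free_slots.get(zone, 0) <= 0:
            continue  # no capacity here, try the next-best zone
        if hard and counts[zone] > fewest:
            return None  # hard constraint: refuse to unbalance the job
        return zone
    return None


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
running = ["us-east-1a", "us-east-1a", "us-east-1b"]
free_slots = {"us-east-1a": 5, "us-east-1b": 5, "us-east-1c": 0}

print(pick_zone(zones, running, free_slots, hard=False))
print(pick_zone(zones, running, free_slots, hard=True))
```

Here the emptiest zone has no capacity, so the soft variant settles for the next-best zone while the hard variant returns `None` and the task waits.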
35. Scheduling - Agent upgrades
Operator updates Titus agent codebase
● New scheduling on new cluster
● Batch jobs drain
● Service tasks are migrated via Spinnaker pipelines
● Old cluster autoscales down
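The four upgrade steps above can be walked through with a toy model. Everything here is a hypothetical simplification: `Cluster` and the helper functions are illustrative names, and the Spinnaker-driven migration is reduced to moving a task name between two sets.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    accepts_new_tasks: bool
    batch: set = field(default_factory=set)
    service: set = field(default_factory=set)

    def is_empty(self) -> bool:
        return not self.batch and not self.service


def start_upgrade(old: Cluster, new: Cluster) -> None:
    """Step 1: route all new scheduling to the new cluster."""
    old.accepts_new_tasks = False
    new.accepts_new_tasks = True


def finish_batch(old: Cluster, task: str) -> None:
    """Step 2: batch tasks run to completion on the old cluster (drain)."""
    old.batch.discard(task)


def migrate_service(old: Cluster, new: Cluster, task: str) -> None:
    """Step 3: a service task is relaunched on the new cluster."""
    old.service.remove(task)
    new.service.add(task)


old = Cluster("agents-v1", True, batch={"encode-1"}, service={"api-1"})
new = Cluster("agents-v2", False)

start_upgrade(old, new)
finish_batch(old, "encode-1")
migrate_service(old, new, "api-1")

# Step 4: once empty, the old cluster can autoscale down.
print(old.is_empty(), sorted(new.service))
```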
37. Some Titus Futures
● Perf/Scalability, Ops Enablement, Reliability
○ Better resiliency driven by directed chaos testing
○ More scale (2 orders of magnitude by 2019)
○ Hands off canaried automation of all operational tasks
● Scheduling
○ Advanced job and AWS rate limiting
○ Easier and more scalable fleet management
○ “Trough” management and improved batch SLA
38. Some Titus Futures
● Container Execution
○ Improved isolation
○ Deeper and automated layers of security
○ Pods (system services, then application sidecars)
● Netflix Infrastructure and AWS Integration
○ Chargeback visibility and automated improvements
○ ALB support