Containers at Netflix
WASP 10/19/17
Andrew Leung
The Whole Titus Team
2
Motivating Factors For Containers
● From Late 2015 Technical Strategy ...
● Simpler management of compute resources
● Simpler deployment packaging artifacts for compute jobs
● Need for a consistent local developer environment
3
Provided Innovation Velocity
Media Encoding - encoding research development time
● Using custom VMs - 1 month
● Using customizable containers - 1 week
Niagara
● Build all Netflix codebases in hours
● Saves development hundreds of hours of debugging
NodeQuark
● Focus returns to app development
● NeWT & Titus simplify and speed up testing and deployment
4
Consistent Developer Experience
● NeWT - Common local developer experience, including support for container development
○ Container image used for local laptop development
○ Same container image re-used when deployed
● Has benefits in both directions
○ Cloud-like local development environment
○ Easier operational debugging of cloud workloads
5
What is Titus?
● Cloud runtime platform for container-based jobs
● Scheduling
○ Service & batch job management
○ Advanced resource management across an elastic shared resource pool
● Container Execution
○ Advanced isolation
○ Docker and AWS integration
○ Container integration with Netflix infrastructure
6
[Diagram of Titus layers: Service, Batch, Job Management, Resource Management & Optimization, Container Execution, Integration]
Titus Evolution Timeframe
7
● Titus created
● Batch GA: 4Q 2015
● Service support added: 1Q 2016
● Netflix infra & AWS integration: 2Q 2016
● First scale production service: 4Q 2016
● First user path service: 2Q 2017
Containers Scale Over Time
8
● From thousand daily
● To 100K daily
● Spike to 450K
[Chart: containers launched, plotted by day]
9
Titus Current Scale
● Deployed across multiple AWS accounts & three regions
● Over 5,000 instances (mostly m4.4xlarge and r4.8xlarge)
● Launched over 1,000,000 containers in a one-week period
● Around 10,000 long-running containers
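To put the scale figures above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the slide gives lower bounds ("over"), so the derived rates are approximate, and the per-instance figure assumes launches are spread evenly across the fleet, which the slides do not claim.

```python
# Rough launch rates implied by the figures on the slide.
# Both inputs are lower bounds quoted as "over ...", so treat results as approximate.
CONTAINERS_PER_WEEK = 1_000_000   # containers launched in one week
INSTANCES = 5_000                 # EC2 instances in the fleet

per_day = CONTAINERS_PER_WEEK / 7
per_second = per_day / 86_400     # seconds in a day
# Assumption (not from the slides): launches are spread evenly over instances.
per_instance_per_day = per_day / INSTANCES

print(f"{per_day:,.0f} launches/day")
print(f"{per_second:.2f} launches/second")
print(f"{per_instance_per_day:.1f} launches/instance/day")
```

The daily average of roughly 143K sits between the "100K daily" level and the 450K spike quoted on the previous slide.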
Current Titus Users (Sampling)
● Service
○ Stream Processing (Flink)
○ UI Services (NodeJS single core)
○ Internal dashboards
● Batch
○ Algorithm model training, personalization & recommendations (with GPUs)
○ Content value analysis
○ Digital watermarking
○ Ad hoc reporting (e.g., Open Connect CDN analysis and planning)
○ Continuous integration builds
● Queued worker model
○ Media encoding experimentation
10
Archer
11
Titus Overview
[Architecture diagram. Control plane: Titus UI, Rhea, Titus API, Titus Master (job management & scheduler, Fenzo), Cassandra, Zookeeper, Mesos Master, EC2 Auto-scaling API. Titus Agent on AWS VMs: Mesos agent, Titus executor, Docker, containers, metrics agents, logging agent, btrfs, pod & VPC network drivers, AWS metadata proxy. External: Docker Registry, S3. Layer label: Integration.]
12
AWS Integration
● Making Docker containers integrate with AWS the way VMs do
● Titus adds
○ VPC Connectivity (IP per container)
○ Security Groups
○ EC2 Metadata service
○ IAM Roles
○ Multi-tenant isolation (CPU, memory, disk quota, network)
○ Live and S3-persisted logs, with rotation & management
○ Remote storage (EFS)
○ Autoscaling service jobs
○ GPU Support
○ Environmental context similar to user data
Multi-tenant networking is hard
● Decided early on we wanted full IP stacks per container
● But what about?
○ Security group support
○ IAM role support
○ Network bandwidth isolation
○ Integration with VPC
13
Networking - VPC Driver
14
Networking - VPC Driver
15
Networking - VPC Driver
16
Networking - VPC Driver
17
Networking - Metadata Proxy
18
Networking - Putting it all together
19
Isolation
● CPU
○ Fixed shares today (pinning coming)
● Memory
○ Including page cache
● Disk
○ Quotas
● Networking
○ Bandwidth, ENIs and IPs
● Security
○ User namespaces, hosts locked down, secret mgmt
20
21
Netflix Infrastructure Integration
● Provide single cloud platform (VMs and containers treated the same)
● Titus adds integration with
○ Spinnaker CI/CD and canaries
○ Atlas telemetry and outlier detection
○ Discovery/IPC
○ Edda (and dependent systems)
○ Instance pollers (healthcheck, system metrics)
○ Chaos Monkey
○ Traffic control & Kong
○ Netflix secure secret management
○ Interactive access (à la ssh)
● Supports both reserved critical and elastically scaled flex workloads
● Manages containers under both service and batch systems
22
Netflix Cloud Infrastructure (VMs + Containers)
Why? Single Consistent Cloud Platform
Spinnaker Setup
23
24
Deploy based on new image tags
25
Basic resource requirements
IAM Roles & Sec Groups per container
Deploy Strategies
Same as VMs
26
Easily see health & discovery
27
28
Container Level Introspection
29
● Interactive “ssh” and file “scp” access managed by Titus hosts
● Locked down: hosts are secure and only accessible by Titus operators
Scheduling
30
Fenzo - The heart of Titus scheduling
● Extensible Library for Scheduling Frameworks
● Plugin-based scheduling objectives
○ Bin packing, etc.
● Heterogeneous resources & tasks
● Cluster autoscaling
○ Multiple instance types
● Plugin-based constraints evaluator
○ Resource affinity, task locality, etc.
● Single offer mode added in support of ECS
31
Scheduling - Capacity Guarantees
● Titus maintains …
● Critical tier
○ guaranteed capacity & start latencies
● Flex tier
○ more dynamic capacity & variable start latency
32
Scheduling - Bin Packing, Elastic Scaling
User adds work tasks
● Titus does bin packing to ensure that we can downscale entire hosts efficiently
33
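The bin-packing idea on this slide can be illustrated with a minimal sketch. This is not Fenzo's API or Titus code: the `Host` class, `bin_packing_fitness`, and `place` are hypothetical names, and only CPU is modeled. Each task goes to the fullest host that still has room, so lightly used hosts stay empty and can be scaled down.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host model; only CPU capacity is tracked."""
    name: str
    cpus: float
    used: float = 0.0
    tasks: list = field(default_factory=list)

    def fits(self, cpus: float) -> bool:
        return self.used + cpus <= self.cpus


def bin_packing_fitness(host: Host, cpus: float) -> float:
    """Score in [0, 1]; higher means the host would be fuller after placement."""
    return (host.used + cpus) / host.cpus


def place(task_name: str, cpus: float, hosts: list):
    """Put the task on the eligible host with the highest fitness score."""
    candidates = [h for h in hosts if h.fits(cpus)]
    if not candidates:
        return None  # a real scheduler would queue the task or trigger scale-up
    best = max(candidates, key=lambda h: bin_packing_fitness(h, cpus))
    best.used += cpus
    best.tasks.append(task_name)
    return best


hosts = [Host("host-a", 16), Host("host-b", 16), Host("host-c", 16)]
for i, cpus in enumerate([4, 4, 2, 4, 2]):
    place(f"task-{i}", cpus, hosts)

# Hosts with no tasks are candidates for scale-down.
idle = [h.name for h in hosts if not h.tasks]
print(idle)
```

With a spread-style objective the same five tasks would land on all three hosts, and none of them could be terminated without moving work first.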
Scheduling - Constraints including AZ Balancing
User specifies constraints
● AZ Balancing
● Resource and Task affinity
● Hard and soft
34
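The difference between hard and soft constraints can be sketched as follows. This is an illustration under stated assumptions, not Fenzo's constraint interface: `hard_ok`, `az_balance_score`, `pick_host`, the host dictionaries, and the zone names are all hypothetical, and host capacity is ignored. A hard constraint filters hosts out entirely; a soft constraint only ranks the hosts that remain.

```python
from collections import Counter


def hard_ok(host: dict, task: dict) -> bool:
    """Hard constraint: the host must offer every attribute the task requires."""
    return task.get("requires", set()) <= host["attrs"]


def az_balance_score(host: dict, placed_zones: list) -> float:
    """Soft constraint: score in [0, 1], higher for zones running fewer of the job's tasks."""
    if not placed_zones:
        return 1.0
    counts = Counter(placed_zones)
    return 1.0 - counts[host["zone"]] / len(placed_zones)


def pick_host(task: dict, hosts: list, placed_zones: list):
    """Filter by the hard constraint, then rank by the soft one."""
    eligible = [h for h in hosts if hard_ok(h, task)]
    if not eligible:
        return None
    return max(eligible, key=lambda h: az_balance_score(h, placed_zones))


hosts = [
    {"name": "h1", "zone": "zone-a", "attrs": {"gpu"}},
    {"name": "h2", "zone": "zone-b", "attrs": {"gpu"}},
    {"name": "h3", "zone": "zone-c", "attrs": set()},  # no GPU: fails the hard constraint
]
task = {"requires": {"gpu"}}

zones = []
for _ in range(4):
    chosen = pick_host(task, hosts, zones)
    zones.append(chosen["zone"])

print(zones)
```

The four tasks alternate between the two GPU zones, while `zone-c` is never used because its host fails the hard constraint regardless of how well it would balance the job.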
Scheduling - Agent upgrades
Operator updates Titus agent codebase
● New scheduling on new cluster
● Batch jobs drain
● Service tasks are migrated via Spinnaker pipelines
● Old cluster autoscales down
35
Future
36
Some Titus Futures
● Perf/Scalability, Ops Enablement, Reliability
○ Better resiliency driven by directed chaos testing
○ More scale (2 orders of magnitude by 2019)
○ Hands-off, canaried automation of all operational tasks
● Scheduling
○ Advanced job and AWS rate limiting
○ Easier and more scalable fleet management
○ “Trough” management and improved batch SLA
37
Some Titus Futures
● Container Execution
○ Improved isolation
○ Deeper and automated layers of security
○ Pods (system services, then application sidecars)
● Netflix Infrastructure and AWS Integration
○ Chargeback visibility and automated improvements
○ ALB support
38
Questions?
39
