Netflix and Containers
Titus Overview, January 2016
Andrew Spyker
Cloud Platform Engineer
About Netflix
● 75M+ members
● #NetflixEverywhere (Worldwide)
● 42.5B hours watched in 2015
● > ⅓ of North American internet download traffic
● 1000s of microservices
● Many tens of thousands of VMs
● 3 regions across the world
● 2000+ employees
About me
● Cloud platform technologies
○ Distributed configuration, service discovery, RPC, application
frameworks, non-Java sidecar
● Container cloud
○ Resource management and scheduling, making Docker containers
operational in Amazon EC2/ECS
● Open Source
○ Organize @NetflixOSS meetups & internal group
● Performance
○ Assist across Netflix, but focused mainly on cloud platform perf
With Netflix for ~ 1 year. Previously at IBM.
@aspyker, ispyker.blogspot.com
Team members
@aspyker, @amit_joshee, Andrew Leung, @podila, Andrei Ushakov, @williamthurston, @timbozarth, @dzapata
Agenda
● Why Containers for Netflix?
● Container runtime platform
● Container development experience
Why containers operationally?
Case 1:
I have a job I want to run reliably and efficiently, but I don’t
want to manage clusters myself
Case 2:
I have lots of services and I want to reduce the number
of VMs I need to manage, with isolation between
process instances
History - Project Titan
● Container management system
○ Predominantly batch processing system
● Higher level frameworks drive tasks
○ General workflow engine
○ DAG-based data processing
○ Misc reports, big data processing stages, interactive notebooks
● Tech
○ Rudimentary scheduling with Dynamo storage
○ Proven Docker execution environment
○ Using Mesos and Fenzo
History - Project Mantis
● Real time operational intelligence for
streaming experience
○ Ad hoc and perpetual stream processing
● Tech
○ Proven scheduling with C* storage
○ Mantis fatjars deployed in cgroups
○ Using Mesos and Fenzo
Fenzo overview
● A generic, plug-ins based scheduling library
for Apache Mesos frameworks
● Features
○ Heterogeneous resource matching with varied tasks
○ Autoscaling of underlying cluster
○ Plugins for constraints and fitness
○ Support for fast (ms) scheduling rate
○ Visibility of scheduling actions
github.com/Netflix/Fenzo
Fenzo: fitness, constraints plugins
Fitness value (0.0 - 1.0)
● Degree of fitness - first fit, best fit, worst fit
○ Real world tradeoff between perfection and speed
● Composable evaluators
● e.g., bin packing
Constraints
● Hard constraints filter appropriate resources
● Soft constraints specify preferences
● e.g., zone balancing, instance type preferences
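The fitness and constraint ideas above can be illustrated with a small sketch. This is not Fenzo's API (Fenzo is a Java library); the `Host`, `Task`, and `pick_host` names, the CPU-only bin packing formula, and the 0.3 soft-constraint weight are all assumptions made for illustration. Hard constraints filter hosts, then a fitness value in the 0.0 to 1.0 range is blended with a soft preference to rank what remains.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Host:
    name: str
    zone: str
    total_cpus: float
    used_cpus: float


@dataclass
class Task:
    name: str
    cpus: float


def cpu_bin_packing_fitness(task: Task, host: Host) -> float:
    """Fitness in [0.0, 1.0]: hosts that end up fuller score higher."""
    return (host.used_cpus + task.cpus) / host.total_cpus


def has_capacity(task: Task, host: Host) -> bool:
    """Hard constraint: hosts without enough free CPU are filtered out."""
    return host.total_cpus - host.used_cpus >= task.cpus


def prefer_zone(zone: str) -> Callable[[Task, Host], float]:
    """Soft constraint: 1.0 for hosts in the preferred zone, else 0.0."""
    return lambda task, host: 1.0 if host.zone == zone else 0.0


def pick_host(
    task: Task,
    hosts: List[Host],
    hard: List[Callable[[Task, Host], bool]],
    fitness: Callable[[Task, Host], float],
    soft: Callable[[Task, Host], float],
    soft_weight: float = 0.3,  # hypothetical weighting, not a Fenzo default
) -> Optional[Host]:
    # Step 1: hard constraints remove hosts that cannot run the task at all.
    candidates = [h for h in hosts if all(c(task, h) for c in hard)]
    if not candidates:
        return None

    # Step 2: rank the remaining hosts by fitness blended with the preference.
    def score(h: Host) -> float:
        return (1 - soft_weight) * fitness(task, h) + soft_weight * soft(task, h)

    return max(candidates, key=score)


hosts = [
    Host("a", "zone-a", 32, 30),
    Host("b", "zone-a", 32, 16),
    Host("c", "zone-b", 32, 24),
]
task = Task("encode", 4)
chosen = pick_host(task, hosts, [has_capacity], cpu_bin_packing_fitness,
                   prefer_zone("zone-b"))
print(chosen.name)
```

Evaluating every candidate corresponds to "best fit"; stopping at the first host whose score clears a threshold would trade some packing quality for speed, which is the perfection versus speed tradeoff the slide mentions.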
Project Titus
● Mantis (Scheduling, Job Mgmt)
+ Titan (Docker execution)
------------------------------------------
Titus (Andromedon)
● Titan API -> Mantis job mgmt/scheduler -> Titan executor
● Rolled out Q4 2015, took over all jobs in Jan 2016
Why Titus?
● Many other container management &
scheduling systems, why build another?
● Key unique values
○ Deeply support Amazon (not trying to abstract IaaS)
○ Narrow focus (just container management)
○ Deep integration with existing Netflix systems
○ Complex job scheduling reqs and scale/reliability
Current Titus Numbers
● Autoscaling 100s of r3.8xls
(32 vCPU, 244G each)
● Peak
○ thousands of cores, tens
of TBs of memory
● thousands of containers/day
● < 100 different images
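The peak figures follow from the per-instance size. A rough check is sketched below; the instance counts are hypothetical (the slide only says "100s"), and the slide's "244G" is treated as decimal gigabytes for simplicity.

```python
R3_8XL_VCPUS = 32        # per instance, from the slide
R3_8XL_MEMORY_GB = 244   # per instance, from the slide ("244G")


def cluster_capacity(instance_count: int) -> tuple:
    """Return (total vCPUs, total memory in TB) for a fleet of r3.8xl hosts."""
    cores = instance_count * R3_8XL_VCPUS
    memory_tb = instance_count * R3_8XL_MEMORY_GB / 1000
    return cores, memory_tb


# Hypothetical fleet sizes within "100s of instances".
for count in (100, 300):
    cores, memory_tb = cluster_capacity(count)
    print(f"{count} instances: {cores} vCPUs, {memory_tb} TB memory")
```

At 100 instances this gives 3,200 vCPUs and 24.4 TB, which is consistent with "thousands of cores, tens of TBs of memory".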
Also in containers
● Already
○ Long running data pipeline service style routing tier
■ 850 c3.4xl instances with ~10K long running containers
○ Mantis cgroups
■ 1000s of cores running varied stream processing jobs
● Soon
○ Media encoding (tens of thousands of cores)
○ Service style (potentially VERY large)
Titus high level architecture
[Architecture diagram. Labeled components: Titus UI, Rhea, CI/CD, Titus API, Titus Master (Job Management & Scheduler), Fenzo, Cassandra, Zookeeper, Mesos Master, EC2 Autoscaling API, Docker Registry, S3. Agent-side labels: mesos agent, docker executor, docker, containers, metrics agent, logging agent, zfs.]
Titus User Console
Titus Spinnaker Integration
● Spinnaker is
our CI/CD
system
● Titus
integration
coming soon
17
Titus API (today)
POST http://titusapi/v2/jobs
GET http://titusapi/v2/jobs/JOBID
GET http://titusapi/v2/tasks/TASKID
JOB Titus-12345
● Task: Index = 0, Num = 2
● Task: Index = 1, Num = 3
● Task: Index = 2, Num = 4
● Task: Index = 1, Num = 5
titus-12345-worker-1-5
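The job and task identifiers on this slide can be sketched as follows. The URL paths are copied from the slide; the naming pattern `<job id>-worker-<index>-<num>` is inferred from the single example `titus-12345-worker-1-5` and is an assumption, as are the helper names. The slide does not show request or response bodies, so none are sketched here.

```python
TITUS_API = "http://titusapi/v2"  # host name exactly as written on the slide


def job_url(job_id: str) -> str:
    """URL for GET /v2/jobs/JOBID."""
    return f"{TITUS_API}/jobs/{job_id}"


def task_url(task_id: str) -> str:
    """URL for GET /v2/tasks/TASKID."""
    return f"{TITUS_API}/tasks/{task_id}"


def task_name(job_id: str, index: int, num: int) -> str:
    """Build a task name in the pattern inferred from the slide's example."""
    return f"{job_id.lower()}-worker-{index}-{num}"


# The four tasks shown for job Titus-12345, as (Index, Num) pairs.
tasks = [(0, 2), (1, 3), (2, 4), (1, 5)]
names = [task_name("Titus-12345", index, num) for index, num in tasks]
print(names[-1])
print(task_url(names[-1]))
```

Index 1 appears twice with different Num values, which suggests (but the slide does not state) that Index identifies a worker slot while Num increases each time a task is launched, for example when a worker is replaced.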
Titus API (coming)
● Disparate use cases in a single API
○ Going beyond batch to service, stream and cron
● SLA based on job attributes
○ For batch, completion time
○ For service, user focused SLA (autoscaling, etc.)
● Ownership and cost accounting/metering
○ Group costs to owner and teams
● Aligned with existing continuous deployment system
○ Apps, clusters, ASGs in Spinnaker
Titus Operational Views
Also APIs for
● cluster state
● cluster rolling updates
● leadership
● Titus app managed
through Spinnaker
Dependency Versions (as of 1/16)
Docker
● Registry - 2.0.1
● Engine - 1.9.1
○ Plus Netflix logging driver
Mesos
● 0.24.1
Using Netflix C*, Zookeeper shared services
Container Agent Features (existing)
● Volumes with quota
○ Using ZFS with snapshots and S3 archival
● Logging
○ Streaming live stdout/err logs
○ Rotation & shipping stdout/err & app logs to S3
● Networking
○ IP per container integration with VPC
● Metrics
○ cgroup metrics tagged by job/task id and image
Container Agent Features (planned)
● Networking/Security
○ Extend driver to support security groups & IAM Roles
● Volume Drivers
○ Persistent volumes as required by EBS/EFS
● Isolation
○ Beyond CPU, Memory, Disk - Networking I/O Bandwidth
● Security
○ Host and container security hardening (AppArmor/SELinux)
● Insight
○ Performance (Vector) and ad hoc debugging (ssh)
Unique Titus Scheduler Technology
● Job managers are separate from resource allocation
○ Less monolithic, more extensible
● Fenzo benefits
○ Bin packing, autoscaling, fitness/constraint configurability
○ Visibility into current state of the cluster
● Mesos reconciliation and task heartbeats
● Rate limiting of failing jobs and agents
● Thresholds and alerts for key aspects
○ Queue depth, idle hosts, etc.
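One way to picture "rate limiting of failing jobs and agents" is a sliding-window failure counter, sketched below. The slides do not describe how Titus implements this; the class name, the window approach, and the thresholds are assumptions for illustration only.

```python
from collections import defaultdict, deque


class FailureRateLimiter:
    """Blocks launches for a key (job or agent) that failed too often recently."""

    def __init__(self, max_failures: int, window_seconds: float):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # key -> failure timestamps

    def record_failure(self, key: str, now: float) -> None:
        self._failures[key].append(now)

    def may_launch(self, key: str, now: float) -> bool:
        recent = self._failures[key]
        # Drop failures that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        return len(recent) < self.max_failures


# Hypothetical thresholds: at most 3 failures per 60 seconds.
limiter = FailureRateLimiter(max_failures=3, window_seconds=60)
for t in (0, 10, 20):
    limiter.record_failure("job-a", now=t)

print(limiter.may_launch("job-a", now=30))  # inside the window, 3 failures
print(limiter.may_launch("job-a", now=61))  # failure at t=0 has expired
```

Keying the same limiter by agent host name instead of job id would cover the "failing agents" half of the bullet, so a bad host stops receiving tasks until its failures age out.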
Integration with Netflix Infrastructure
● Goal: Make containers work with existing cloud
systems (designed for virtual machines) vs. replace
● Areas
○ Service registration and discovery (Eureka)
○ IPC (Ribbon)
○ Continuous Delivery (Spinnaker)
○ Telemetry (Atlas)
○ Reliability (Chaos, Performance Insight)
Path to ECS
● Why we are considering ECS
○ Resource/cluster mgmt is undifferentiated heavy lifting
○ Expect ECS to have strong integration w/ EC2/AWS
● Have prototyped a Titus/Fenzo ECS port
○ Using our job mgmt/scheduling on top of ECS
● Working with the ECS team to add in
○ Simpler start task API (w/o defining a task first)
○ Event stream to power real time scheduling info
○ Extensibility in ECS events, resource types
Why containers for developers?
Case 1:
I want a consistent local development and cloud
deployment experience (in both directions)
Case 2:
I want to specify what it means to run my process, not
integrate into a one-size-fits-most VM image
Developer Experience (coming)
Titus
Developer experience
NEWT
● One-stop shop for creation, development, deployment of containers
Netflix Docker base layers
● Already integrated with runtime expectations
● Continuously rebuilt with small and controlled common support
Netflix Docker build tools
● Extend our bakery to produce Docker images and run locally
● More advanced image creation tools
○ Multi-inheritance, guaranteed metadata, metrics
We’re hiring
Come advance containers at Netflix!
Senior Software Engineer Container Platform -
https://jobs.netflix.com/jobs/860487
Questions?
