SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Downloaden Sie, um offline zu lesen
Scheduling a Fuller House:
Container Management
Sharma Podila, Andrew Spyker - Senior Software Engineers
About Netflix
● 81.5M members
● 2000+ employees (1400 tech)
● 190+ countries
● > 100M hours watch per day
● > ⅓ NA internet download traffic
● 500+ Microservices
● Many 10’s of thousands VM’s
● 3 regions across the world
2
Agenda
● Why containers at Netflix?
● What did we build and what did we learn?
● What are our current and future workloads?
3
⇨
Why a 2nd edition of virtualization?
● Given our resilient cloud native, CI/CD devops enabled,
elastically scalable virtual machine based architecture,
did we really need containers? 4
Motivating factors for containers
● Simpler management of compute resources
● Simpler deployment packaging artifacts for compute jobs
● Need for a consistent local developer environment
5
Simpler compute, Management & Packaging
Batch/stream processing jobs
● Here are the files to run my process
● I need m cores, n disk, and o memory
● Please just run it for me!
6
Service style jobs (VM’s)
● Use tested/secure base AMI
● Bake an AMI
● Define launch config
● Choose t-shirt sized instance
● Canary & red/black ASG’s
Consistent developer experience
● Many years focused on
○ Build, bake / cloud deploy / operational experience
○ Not as much time focused on developer experience
● New Netflix local developer experience based on Docker
● Has had a benefit in both directions
○ Cloud like local development environment
○ Easier operational debugging of cloud workloads
7
What about resource optimization?
● Not absolutely required and easier to get wins at larger
scale across larger virtual machine fleet
● However, potential benefits to
○ Elastic resource pool for scaling batch & adhoc jobs
○ Reliable smaller instance sizes for NodeJS
○ Cross Netflix resource optimizations
■ Trough usage, instance type migration
8
Agenda
● Why containers at Netflix?
● What did we build and what did we learn?
● What are our current and future workloads?
9
⇨
VMVM
Lesson: Support containers by leveraging
existing Netflix IaaS focused cloud platform
10
Atlas
EC2
AWSAutoScaler
VMs
App
Cloud Platform
(metrics, IPC, health)
Eureka
VPC
Edda
Existing - VM’s
VMVM
Atlas
EC2
TitusJobControl
Containers
App
Cloud Platform
(metrics, IPC, health)
Eureka
VPC
Edda
Titus - Containers
VMVM
Batch
Containers
VMVM
11
EC2
AWSAutoScaler
VMs
App
Cloud Platform
(metrics, IPC, health)
VPC
Netflix Cloud Infrastructure (VM’s + Containers)
VMVM
Atlas
TitusJobControl
Containers
App
Cloud Platform
(metrics, IPC, health)
Eureka Edda
VMVM
Batch
Containers
Why - Single consistent cloud platform
Lesson: Buy vs. Build, Why build our own?
● Looking across other container management solutions
○ Mesos, Kubernetes, and Swarm
● Proven solutions are focused on the datacenter
● Newer solutions are
○ Working to abstract datacenter and cloud
○ Delivering more than cluster manager
■ PaaS, Service discovery, IPC
■ Continuous deployment
■ Metrics
○ Not yet at our level of scale
● Not appropriate for Netflix 12
“Project Titus” (Firehose peek)
13
Titus UITitus UI
Docker
Registry
Docker
Registry
Rhea
container
container
container
docker
Titus Agent
metrics agent
Titus executor
logging agent
zfs
mesos agent
docker
RheaTitus API
Cassandra
Titus Master
Job Management &
Scheduler
S3
Zookeeper
Docker
Registry
EC2 Autocaling
API
Mesos Master
Titus UI
Fenzo
container
Pod & VPC net
drivers
container
container
AWS container
metadata proxy
Integration
CI/CD Amazon VM’s
Is that all?
14
Container Execution
15
Titus UITitus UI
Docker
Registry
Docker
Registry
Rhea
container
container
container
docker
Titus Agent
metrics agent
Titus executor
logging agent
zfs
mesos agent
docker
RheaTitus API
Cassandra
Titus Master
Job Management &
Scheduler
S3
Zookeeper
Docker
Registry
EC2 Autocaling
API
Mesos Master
Titus UI
Fenzo
container
Pod & VPC net
drivers
container
container
AWS container
metadata proxy
CI/CD Amazon VM’s
Lesson: What you lose with Docker on EC2
16
+ <
● Networking: VPC
● Security: Security Groups, IAM Roles
● Context: Instance Metadata, User Data / Env Context
● Operational Visibility: Metrics, Health checking
● Resource Isolation: Networking, Local Storage
MULTI-TENANT
Lesson: Making Containers Act Like VM’s
17
● Built: EC2 Metadata Proxy
○ Provide overridden scheduled IAM role, instance id
○ Proxy other values
● Provided: Provide Environmental Context
○ Titus specific job and task info
○ ASG app, stack, sequence, other EC2 standard
● Why? Now:
○ Service discovery registration works
○ Amazon service SDK based applications work
Lesson: Networking will continue to evolve
18
● Started with batch
○ Started with “bridge” with port mapping
○ Added “host” with port resource mapping (for performance?)
○ Continue to use “bridge” without port mapping
● Service style apps added
○ Added “nfvpc” VPC IP/container with libnetwork plugin
○ Removed Host (no value over VPC IP/container)
○ Changed “nfvpc” VPC IP/container
■ Pod based with customer executor (no plugin)
○ Added security groups to “nfvpc”
Plumbing VPC Networking into Docker
19
No IP Needed
Task 0
SecGrp Y
Task 1 Task 2 Task 3
docker0 (*)
EC2 VMeth0
eni0
SG=Titus Agent
eth1
eni1
SecGrp=X
eth2
eni2
SG=Y
IP 1
IP 2
IP 3
pod root
veth<id>
app
SecGrp X
pod root
veth<id>
app
SecGrp X
pod root
veth<id>
appapp
veth<id>
Linux Policy
Based Routing
EC2
Metadata
Proxy
169.254.169.254
IPTables NAT (*)
* **
169.254.169.254
Lesson: Secure Multi-tenancy is Hard
20
Common to VM’s and tiered security needed
● Protect the reduced host IAM role, Allow containers to have specific IAM roles
● Needed to support same security groups in container networking as VM’s
User namespacing
● Docker 1.10 - Introduced User Namespaces
● Didn’t work /w shared networking NS
● Docker 1.11 - Fixed shared networking NS’s
● But, namespacing is per daemon
● Not per container, as hoped
● Waiting on Linux
● Considering mass chmod / ZFS clones
Operational Visibility Evolution
21
● What is “node” - containers on VM’s
● Soft limits / bursting a good thing?
○ Until percent util and outliers are considered
● System level metrics
○ Currently - hand coded cgroup scraping
○ Considering Intel Snap replacement
● Pollers - Metrics, Health, Discovery
○ Created Edda common “server group” view
Future Execution Focus
22
● Better Isolation (agents, networking, block I/O, etc.)
● Exposing our implementation of “Pod”’s to users
● Better resiliency (DNS dependencies reduced)
Job Management and Resource Scheduling
23
Titus UITitus UI
Docker
Registry
Docker
Registry
Rhea
container
container
container
docker
Titus Agent
metrics agent
Titus executor
logging agent
zfs
mesos agent
docker
RheaTitus API
Cassandra
Titus Master
Job Management &
Scheduler
S3
Zookeeper
Docker
Registry
EC2 Autocaling
API
Mesos Master
Titus UI
Fenzo
container
Pod & VPC net
drivers
container
container
AWS container
metadata proxy
CI/CD Amazon VM’s
Lesson: Complexity in scheduling
24
● Resilience
○ Balance instances across EC2 zones,
instances within a zone
● Security
○ Two level resource for ENIs
● Placement optimization
○ Resource affinity
○ Task locality
○ Bin packing (Auto Scaling)
Lesson: Keep resource scheduling extensible
25
Fenzo - Extensible Scheduling Library
Features:
● Heterogeneous resources & tasks
● Autoscaling of mesos cluster
○ Multiple instance types
● Plugins based scheduling objectives
○ Bin packing, etc.
● Plugins based constraints evaluator
○ Resource affinity, task locality, etc.
● Scheduling actions visibility
https://github.com/Netflix/Fenzo
Cluster Autoscaling Challenge
26
Host 4Host 3Host 1
vs.
For long running stateful services
Host 1 Host 2
Host 2
Host 3 Host 4
Resources assigned in Titus
27
● CPU, memory, disk capacity
● Per container AWS EC2 Security groups, IP, and
network bandwidth via custom driver
● Abstracting out EC2 instance types
Security groups and their resources
28
A two level resource per EC2 Instance: N ENIs, each with M IPs
ENI 0
Assigned Security Group: SG1 Used IPs Count: 2 of 7
ENI 1
Assigned Security Group: SG1,SG2 Used IPs Count: 1 of 7
ENI 2
Assigned Security Group: SG3 Used IPs Count: 7 of 7
Lesson: Scheduling Vs. Job Management
29
Scheduling resources to tasks is common.
Lifecycle management is not.
Lesson: Scheduling Vs. Job Management
30
Task scheduling concerns
● Assign resources to tasks
● Cluster wide optimizations
○ Bin packing
○ Global constraints, like SLAs
● Task preferences and constraints
○ Locality with other tasks
○ Resource affinity
Job manager concerns
● Managing task/instance counts
● Creating metadata, defining constraints
● Lifecycle management
○ Replace failed task executions
● Handle failures
○ Rate limit requeuing & relaunching
○ Time out tasks in transitionary states
Future Job Management & Scheduling Focus
31
● More resources to track: GPUs
● Automatic resource affinity with heterogenous instances
● SLAs
○ Latencies for services
○ Throughput for batch
○ Task preemptions
Things we didn’t cover in this talk
● Overall integration
○ Chaos, continuous delivery, performance insight
● Container Execution
○ Logging (live log access & S3 log rotation)
○ Liveness and health checking
○ Isolation (disk usage, networking, block I/O)
○ Image registry (metrics, security scanning)
● Scheduling
○ Autoscaling heterogeneous pools
○ Host-task fitness criteria
● API
○ Extensibility, polymorphic, SLA and job/container ownership 32
Agenda
● Why containers at Netflix?
● What did we build and what did we learn?
● What are our current and future workloads?
33
⇨
Current Titus Production Usage
34
● Autoscaling
○ 100’s of r3.8xl’s
○ Each 32 vCPU, 244G
● Peak
○ Thousands of cores
○ Tens of TB’s memory
● Thousands containers/day
○ ~ 100 different images
Workloads, Past
● Most current usage is batch
○ Algorithm training, adhoc reporting jobs
● Sampling:
○ Training of “sims” and A/B test models
○ Open Connect Device/IX reporting
○ Web security scanning and analysis
○ Social media analytics updates
35
Workloads, Now
● Spent last five months adding service style support
● First line of fire customer requests already received
● Larger scale shadow and trickle traffic throughout 2Q
● First service style apps
○ Finer grained instances - NodeJS
○ Docker provided local developer experience
36
Workloads, Coming
● Media Encoding
○ Thousands of VM’s
○ VM based resource scheduling
○ Considering containers to have faster start-up
○ Internal spot-market - trough borrowing
● SPaaS
○ 10’s of thousands of containers
○ Stream Processing as a Service
○ Convert scheduling systems to Titus
37
Questions?
38
Other Netflix QCon Talks
39
Title Time Speaker(s)
The Netflix API Platform for
Server-Side Scripting
Monday 10:35 Katharina Probst
Scheduling A Fuller House:
Container Mgmt @ Netflix
Tuesday 10:35 Andrew Spyker &
Sharma Podila
Chaos Kong - Endowing
Netflix with Antifragility
Tuesday 11:50 Luke Kosewski
The Evolution of the
JavaScript
Wednesday 4:10 Jafar Husain
Async Programming in JS:
The End of the Loop
Friday 9:00 Jafar Husain

Weitere ähnliche Inhalte

Was ist angesagt?

Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
Belmiro Moreira
 

Was ist angesagt? (20)

3 - Delen Private Bank: FOSS adventures in a Cloud Native world
3 - Delen Private Bank: FOSS adventures in a Cloud Native world3 - Delen Private Bank: FOSS adventures in a Cloud Native world
3 - Delen Private Bank: FOSS adventures in a Cloud Native world
 
Container orchestration: the cold war - Giulio De Donato - Codemotion Rome 2017
Container orchestration: the cold war - Giulio De Donato - Codemotion Rome 2017Container orchestration: the cold war - Giulio De Donato - Codemotion Rome 2017
Container orchestration: the cold war - Giulio De Donato - Codemotion Rome 2017
 
CNCF Projects Overview
CNCF Projects OverviewCNCF Projects Overview
CNCF Projects Overview
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
 
Docker+java
Docker+javaDocker+java
Docker+java
 
Kubernetes and OpenStack at Scale
Kubernetes and OpenStack at ScaleKubernetes and OpenStack at Scale
Kubernetes and OpenStack at Scale
 
Getting started with kubernetes
Getting started with kubernetesGetting started with kubernetes
Getting started with kubernetes
 
Kubernetes - how to orchestrate containers
Kubernetes - how to orchestrate containersKubernetes - how to orchestrate containers
Kubernetes - how to orchestrate containers
 
Neutron high availability open stack architecture openstack israel event 2015
Neutron high availability  open stack architecture   openstack israel event 2015Neutron high availability  open stack architecture   openstack israel event 2015
Neutron high availability open stack architecture openstack israel event 2015
 
Container World 2017!
Container World 2017!Container World 2017!
Container World 2017!
 
Nugwc k8s session-16-march-2021
Nugwc k8s session-16-march-2021Nugwc k8s session-16-march-2021
Nugwc k8s session-16-march-2021
 
Cloud spanner architecture and use cases
Cloud spanner architecture and use casesCloud spanner architecture and use cases
Cloud spanner architecture and use cases
 
1 - Welcome OPEN19 & Partners line-up
1 - Welcome OPEN19 & Partners line-up1 - Welcome OPEN19 & Partners line-up
1 - Welcome OPEN19 & Partners line-up
 
Introduction of Kubernetes - Trang Nguyen
Introduction of Kubernetes - Trang NguyenIntroduction of Kubernetes - Trang Nguyen
Introduction of Kubernetes - Trang Nguyen
 
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
 
Docker for HPC in a Nutshell
Docker for HPC in a NutshellDocker for HPC in a Nutshell
Docker for HPC in a Nutshell
 
Promise of DevOps
Promise of DevOpsPromise of DevOps
Promise of DevOps
 
Raspberry pi x kubernetes x tensorflow
Raspberry pi x kubernetes x tensorflowRaspberry pi x kubernetes x tensorflow
Raspberry pi x kubernetes x tensorflow
 
Kubernetes a comprehensive overview
Kubernetes   a comprehensive overviewKubernetes   a comprehensive overview
Kubernetes a comprehensive overview
 
SFScon16 - Michele Baldessari: "OpenStack – An introduction"
SFScon16 - Michele Baldessari: "OpenStack – An introduction"SFScon16 - Michele Baldessari: "OpenStack – An introduction"
SFScon16 - Michele Baldessari: "OpenStack – An introduction"
 

Andere mochten auch

Caitlin Jones Houser
Caitlin Jones HouserCaitlin Jones Houser
Caitlin Jones Houser
Caitlin Jones
 
Generations Workplace Characteristics.pdf
Generations Workplace Characteristics.pdfGenerations Workplace Characteristics.pdf
Generations Workplace Characteristics.pdf
Johnny Schaefer
 

Andere mochten auch (13)

CVTenglish
CVTenglishCVTenglish
CVTenglish
 
S4 tarea4 ropem
S4 tarea4 ropemS4 tarea4 ropem
S4 tarea4 ropem
 
Pat Pezzi Resume
Pat Pezzi ResumePat Pezzi Resume
Pat Pezzi Resume
 
Caitlin Jones Houser
Caitlin Jones HouserCaitlin Jones Houser
Caitlin Jones Houser
 
Momin Resume 1
Momin Resume 1Momin Resume 1
Momin Resume 1
 
Merixstudio: about us
Merixstudio: about usMerixstudio: about us
Merixstudio: about us
 
It's not you, it's us: Winning over people for yourself and the team
It's not you, it's us: Winning over people for yourself and the teamIt's not you, it's us: Winning over people for yourself and the team
It's not you, it's us: Winning over people for yourself and the team
 
Mesos sys adminday
Mesos sys admindayMesos sys adminday
Mesos sys adminday
 
тест
тесттест
тест
 
Maths ppt
Maths pptMaths ppt
Maths ppt
 
Generations Workplace Characteristics.pdf
Generations Workplace Characteristics.pdfGenerations Workplace Characteristics.pdf
Generations Workplace Characteristics.pdf
 
This Makes Me Feel :) : Capturing Emotions in UX
This Makes Me Feel :) : Capturing Emotions in UXThis Makes Me Feel :) : Capturing Emotions in UX
This Makes Me Feel :) : Capturing Emotions in UX
 
A Survey on Software Release Planning Models - Slides for the Presentation @ ...
A Survey on Software Release Planning Models - Slides for the Presentation @ ...A Survey on Software Release Planning Models - Slides for the Presentation @ ...
A Survey on Software Release Planning Models - Slides for the Presentation @ ...
 

Ähnlich wie Scheduling a fuller house - Talk at QCon NY 2016

Kubernetes for Beginners
Kubernetes for BeginnersKubernetes for Beginners
Kubernetes for Beginners
DigitalOcean
 
Scaling Open edX with Kubernetes
Scaling Open edX with KubernetesScaling Open edX with Kubernetes
Scaling Open edX with Kubernetes
Appsembler
 

Ähnlich wie Scheduling a fuller house - Talk at QCon NY 2016 (20)

Netflix Titus WASP October 2017
Netflix Titus WASP October 2017Netflix Titus WASP October 2017
Netflix Titus WASP October 2017
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
 
Automating using Ansible
Automating using AnsibleAutomating using Ansible
Automating using Ansible
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
Kubernetes for Beginners
Kubernetes for BeginnersKubernetes for Beginners
Kubernetes for Beginners
 
Scaling Open edX with Kubernetes
Scaling Open edX with KubernetesScaling Open edX with Kubernetes
Scaling Open edX with Kubernetes
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in Containerization
 
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
 
Introduction to rook
Introduction to rookIntroduction to rook
Introduction to rook
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Container World 2018
Container World 2018Container World 2018
Container World 2018
 
Unleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platformUnleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platform
 
GCCP JSCOE Session 2
GCCP JSCOE Session 2GCCP JSCOE Session 2
GCCP JSCOE Session 2
 

Mehr von Sharma Podila

Prezo tooracleteam (2)
Prezo tooracleteam (2)Prezo tooracleteam (2)
Prezo tooracleteam (2)
Sharma Podila
 

Mehr von Sharma Podila (9)

Podila mesos con-northamerica_sep2017
Podila mesos con-northamerica_sep2017Podila mesos con-northamerica_sep2017
Podila mesos con-northamerica_sep2017
 
Netflix container scheduling talk at stanford final
Netflix container scheduling talk at stanford   finalNetflix container scheduling talk at stanford   final
Netflix container scheduling talk at stanford final
 
Podila QCon SF 2016
Podila QCon SF 2016Podila QCon SF 2016
Podila QCon SF 2016
 
Podila mesos con europe keynote aug sep 2016
Podila mesos con europe keynote aug sep 2016Podila mesos con europe keynote aug sep 2016
Podila mesos con europe keynote aug sep 2016
 
Lessons from Netflix Mesos Clusters
Lessons from Netflix Mesos ClustersLessons from Netflix Mesos Clusters
Lessons from Netflix Mesos Clusters
 
Prezo at-mesos con2015-final
Prezo at-mesos con2015-finalPrezo at-mesos con2015-final
Prezo at-mesos con2015-final
 
Resource Scheduling using Apache Mesos in Cloud Native Environments
Resource Scheduling using Apache Mesos in Cloud Native EnvironmentsResource Scheduling using Apache Mesos in Cloud Native Environments
Resource Scheduling using Apache Mesos in Cloud Native Environments
 
AWS re:Invent 2014 talk: Scheduling using Apache Mesos in the Cloud
AWS re:Invent 2014 talk: Scheduling using Apache Mesos in the CloudAWS re:Invent 2014 talk: Scheduling using Apache Mesos in the Cloud
AWS re:Invent 2014 talk: Scheduling using Apache Mesos in the Cloud
 
Prezo tooracleteam (2)
Prezo tooracleteam (2)Prezo tooracleteam (2)
Prezo tooracleteam (2)
 

Kürzlich hochgeladen

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Kürzlich hochgeladen (20)

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 

Scheduling a fuller house - Talk at QCon NY 2016

  • 1. Scheduling a Fuller House: Container Management Sharma Podila, Andrew Spyker - Senior Software Engineers
  • 2. About Netflix ● 81.5M members ● 2000+ employees (1400 tech) ● 190+ countries ● > 100M hours watch per day ● > ⅓ NA internet download traffic ● 500+ Microservices ● Many 10’s of thousands VM’s ● 3 regions across the world 2
  • 3. Agenda ● Why containers at Netflix? ● What did we build and what did we learn? ● What are our current and future workloads? 3 ⇨
  • 4. Why a 2nd edition of virtualization? ● Given our resilient cloud native, CI/CD devops enabled, elastically scalable virtual machine based architecture, did we really need containers? 4
  • 5. Motivating factors for containers ● Simpler management of compute resources ● Simpler deployment packaging artifacts for compute jobs ● Need for a consistent local developer environment 5
  • 6. Simpler compute, Management & Packaging Batch/stream processing jobs ● Here are the files to run my process ● I need m cores, n disk, and o memory ● Please just run it for me! 6 Service style jobs (VM’s) ● Use tested/secure base AMI ● Bake an AMI ● Define launch config ● Choose t-shirt sized instance ● Canary & red/black ASG’s
  • 7. Consistent developer experience ● Many years focused on ○ Build, bake / cloud deploy / operational experience ○ Not as much time focused on developer experience ● New Netflix local developer experience based on Docker ● Has had a benefit in both directions ○ Cloud like local development environment ○ Easier operational debugging of cloud workloads 7
  • 8. What about resource optimization? ● Not absolutely required and easier to get wins at larger scale across larger virtual machine fleet ● However, potential benefits to ○ Elastic resource pool for scaling batch & adhoc jobs ○ Reliable smaller instance sizes for NodeJS ○ Cross Netflix resource optimizations ■ Trough usage, instance type migration 8
  • 9. Agenda ● Why containers at Netflix? ● What did we build and what did we learn? ● What are our current and future workloads? 9 ⇨
  • 10. VMVM Lesson: Support containers by leveraging existing Netflix IaaS focused cloud platform 10 Atlas EC2 AWSAutoScaler VMs App Cloud Platform (metrics, IPC, health) Eureka VPC Edda Existing - VM’s VMVM Atlas EC2 TitusJobControl Containers App Cloud Platform (metrics, IPC, health) Eureka VPC Edda Titus - Containers VMVM Batch Containers
  • 11. VMVM 11 EC2 AWSAutoScaler VMs App Cloud Platform (metrics, IPC, health) VPC Netflix Cloud Infrastructure (VM’s + Containers) VMVM Atlas TitusJobControl Containers App Cloud Platform (metrics, IPC, health) Eureka Edda VMVM Batch Containers Why - Single consistent cloud platform
  • 12. Lesson: Buy vs. Build, Why build our own? ● Looking across other container management solutions ○ Mesos, Kubernetes, and Swarm ● Proven solutions are focused on the datacenter ● Newer solutions are ○ Working to abstract datacenter and cloud ○ Delivering more than cluster manager ■ PaaS, Service discovery, IPC ■ Continuous deployment ■ Metrics ○ Not yet at our level of scale ● Not appropriate for Netflix 12
  • 13. “Project Titus” (Firehose peek) 13 Titus UITitus UI Docker Registry Docker Registry Rhea container container container docker Titus Agent metrics agent Titus executor logging agent zfs mesos agent docker RheaTitus API Cassandra Titus Master Job Management & Scheduler S3 Zookeeper Docker Registry EC2 Autocaling API Mesos Master Titus UI Fenzo container Pod & VPC net drivers container container AWS container metadata proxy Integration CI/CD Amazon VM’s
  • 15. Container Execution 15 Titus UITitus UI Docker Registry Docker Registry Rhea container container container docker Titus Agent metrics agent Titus executor logging agent zfs mesos agent docker RheaTitus API Cassandra Titus Master Job Management & Scheduler S3 Zookeeper Docker Registry EC2 Autocaling API Mesos Master Titus UI Fenzo container Pod & VPC net drivers container container AWS container metadata proxy CI/CD Amazon VM’s
  • 16. Lesson: What you lose with Docker on EC2 16 + < ● Networking: VPC ● Security: Security Groups, IAM Roles ● Context: Instance Metadata, User Data / Env Context ● Operational Visibility: Metrics, Health checking ● Resource Isolation: Networking, Local Storage MULTI-TENANT
  • 17. Lesson: Making Containers Act Like VM’s 17 ● Built: EC2 Metadata Proxy ○ Provide overridden scheduled IAM role, instance id ○ Proxy other values ● Provided: Provide Environmental Context ○ Titus specific job and task info ○ ASG app, stack, sequence, other EC2 standard ● Why? Now: ○ Service discovery registration works ○ Amazon service SDK based applications work
  • 18. Lesson: Networking will continue to evolve 18 ● Started with batch ○ Started with “bridge” with port mapping ○ Added “host” with port resource mapping (for performance?) ○ Continue to use “bridge” without port mapping ● Service style apps added ○ Added “nfvpc” VPC IP/container with libnetwork plugin ○ Removed Host (no value over VPC IP/container) ○ Changed “nfvpc” VPC IP/container ■ Pod based with customer executor (no plugin) ○ Added security groups to “nfvpc”
  • 19. Plumbing VPC Networking into Docker 19 No IP Needed Task 0 SecGrp Y Task 1 Task 2 Task 3 docker0 (*) EC2 VMeth0 eni0 SG=Titus Agent eth1 eni1 SecGrp=X eth2 eni2 SG=Y IP 1 IP 2 IP 3 pod root veth<id> app SecGrp X pod root veth<id> app SecGrp X pod root veth<id> appapp veth<id> Linux Policy Based Routing EC2 Metadata Proxy 169.254.169.254 IPTables NAT (*) * ** 169.254.169.254
  • 20. Lesson: Secure Multi-tenancy is Hard 20 Common to VM’s and tiered security needed ● Protect the reduced host IAM role, Allow containers to have specific IAM roles ● Needed to support same security groups in container networking as VM’s User namespacing ● Docker 1.10 - Introduced User Namespaces ● Didn’t work /w shared networking NS ● Docker 1.11 - Fixed shared networking NS’s ● But, namespacing is per daemon ● Not per container, as hoped ● Waiting on Linux ● Considering mass chmod / ZFS clones
  • 21. Operational Visibility Evolution 21 ● What is “node” - containers on VM’s ● Soft limits / bursting a good thing? ○ Until percent util and outliers are considered ● System level metrics ○ Currently - hand coded cgroup scraping ○ Considering Intel Snap replacement ● Pollers - Metrics, Health, Discovery ○ Created Edda common “server group” view
  • 22. Future Execution Focus 22 ● Better Isolation (agents, networking, block I/O, etc.) ● Exposing our implementation of “Pod”’s to users ● Better resiliency (DNS dependencies reduced)
  • 23. Job Management and Resource Scheduling 23 Titus UITitus UI Docker Registry Docker Registry Rhea container container container docker Titus Agent metrics agent Titus executor logging agent zfs mesos agent docker RheaTitus API Cassandra Titus Master Job Management & Scheduler S3 Zookeeper Docker Registry EC2 Autocaling API Mesos Master Titus UI Fenzo container Pod & VPC net drivers container container AWS container metadata proxy CI/CD Amazon VM’s
  • 24. Lesson: Complexity in scheduling 24 ● Resilience ○ Balance instances across EC2 zones, instances within a zone ● Security ○ Two level resource for ENIs ● Placement optimization ○ Resource affinity ○ Task locality ○ Bin packing (Auto Scaling)
  • 25. Lesson: Keep resource scheduling extensible 25 Fenzo - Extensible Scheduling Library Features: ● Heterogeneous resources & tasks ● Autoscaling of mesos cluster ○ Multiple instance types ● Plugins based scheduling objectives ○ Bin packing, etc. ● Plugins based constraints evaluator ○ Resource affinity, task locality, etc. ● Scheduling actions visibility https://github.com/Netflix/Fenzo
  • 26. Cluster Autoscaling Challenge 26 Host 4Host 3Host 1 vs. For long running stateful services Host 1 Host 2 Host 2 Host 3 Host 4
  • 27. Resources assigned in Titus 27 ● CPU, memory, disk capacity ● Per container AWS EC2 Security groups, IP, and network bandwidth via custom driver ● Abstracting out EC2 instance types
  • 28. Security groups and their resources 28 A two level resource per EC2 Instance: N ENIs, each with M IPs ENI 0 Assigned Security Group: SG1 Used IPs Count: 2 of 7 ENI 1 Assigned Security Group: SG1,SG2 Used IPs Count: 1 of 7 ENI 2 Assigned Security Group: SG3 Used IPs Count: 7 of 7
  • 29. Lesson: Scheduling Vs. Job Management 29 Scheduling resources to tasks is common. Lifecycle management is not.
  • 30. Lesson: Scheduling Vs. Job Management 30 Task scheduling concerns ● Assign resources to tasks ● Cluster wide optimizations ○ Bin packing ○ Global constraints, like SLAs ● Task preferences and constraints ○ Locality with other tasks ○ Resource affinity Job manager concerns ● Managing task/instance counts ● Creating metadata, defining constraints ● Lifecycle management ○ Replace failed task executions ● Handle failures ○ Rate limit requeuing & relaunching ○ Time out tasks in transitionary states
  • 31. Future Job Management & Scheduling Focus 31 ● More resources to track: GPUs ● Automatic resource affinity with heterogenous instances ● SLAs ○ Latencies for services ○ Throughput for batch ○ Task preemptions
  • 32. Things we didn’t cover in this talk ● Overall integration ○ Chaos, continuous delivery, performance insight ● Container Execution ○ Logging (live log access & S3 log rotation) ○ Liveness and health checking ○ Isolation (disk usage, networking, block I/O) ○ Image registry (metrics, security scanning) ● Scheduling ○ Autoscaling heterogeneous pools ○ Host-task fitness criteria ● API ○ Extensibility, polymorphic, SLA and job/container ownership 32
  • 33. Agenda ● Why containers at Netflix? ● What did we build and what did we learn? ● What are our current and future workloads? 33 ⇨
  • 34. Current Titus Production Usage 34 ● Autoscaling ○ 100’s of r3.8xl’s ○ Each 32 vCPU, 244G ● Peak ○ Thousands of cores ○ Tens of TB’s memory ● Thousands containers/day ○ ~ 100 different images
  • 35. Workloads, Past ● Most current usage is batch ○ Algorithm training, adhoc reporting jobs ● Sampling: ○ Training of “sims” and A/B test models ○ Open Connect Device/IX reporting ○ Web security scanning and analysis ○ Social media analytics updates 35
  • 36. Workloads, Now ● Spent last five months adding service style support ● First line of fire customer requests already received ● Larger scale shadow and trickle traffic throughout 2Q ● First service style apps ○ Finer grained instances - NodeJS ○ Docker provided local developer experience 36
  • 37. Workloads, Coming ● Media Encoding ○ Thousands of VM’s ○ VM based resource scheduling ○ Considering containers to have faster start-up ○ Internal spot-market - trough borrowing ● SPaaS ○ 10’s of thousands of containers ○ Stream Processing as a Service ○ Convert scheduling systems to Titus 37
  • 39. Other Netflix QCon Talks 39 Title Time Speaker(s) The Netflix API Platform for Server-Side Scripting Monday 10:35 Katharina Probst Scheduling A Fuller House: Container Mgmt @ Netflix Tuesday 10:35 Andrew Spyker & Sharma Podila Chaos Kong - Endowing Netflix with Antifragility Tuesday 11:50 Luke Kosewski The Evolution of the JavaScript Wednesday 4:10 Jafar Husain Async Programming in JS: The End of the Loop Friday 9:00 Jafar Husain