
Implement Advanced Scheduling Techniques in Kubernetes

Is advanced scheduling in Kubernetes achievable? Yes, but how do you accommodate every real-life scenario a Kubernetes user might encounter, and how do you capture each scenario in easy-to-use scheduling rules and configurations?

Oleg Chunikhin addressed those questions and demonstrated techniques for implementing advanced scheduling: for example, running on spot instances and other cost-effective AWS resources while still delivering a minimal set of capabilities that covers the majority of needs without configuration complexity. He also gave a run-down of the pitfalls and things to keep in mind along the way.


  1. 1. Implement Advanced Scheduling Techniques in Kubernetes Oleg Chunikhin | CTO, Kublr | February 2018
  2. 2. Introduction • Oleg Chunikhin • CTO @ Kublr • Chief Software Architect @ EastBanc Technologies • Kublr • Enterprise Kubernetes cluster manager • Application delivery platform
  3. 3. What to Look For • Kubernetes overview • Scheduling algorithm • Scheduling controls • Advanced scheduling techniques • Examples, use cases, and recommendations
  4. 4. Kubernetes | Technology Stack Kubernetes • Orchestration • Network • Configuration • Service discovery • Ingress • Persistence • … Docker • Distribution • Configuration • Isolation
  5. 5. Docker | Architecture (diagram: Docker CLI, Docker daemon, Docker image repository, an instance holding images and app data, overlay network, application containers)
  6. 6. Kubernetes | Architecture (diagram: master node with K8s master components: etcd, scheduler, api, controller, plus K8s metadata; worker node with Docker, kubelet, app data, K8s node components: overlay network, discovery, connectivity, and infrastructure and application containers)
  7. 7. Kubernetes | Nodes and Pods (diagram: Node 1 runs Pod A-1 at 10.0.0.3 with containers Cnt1 and Cnt2, and Pod B-1 at 10.0.0.8 with container Cnt3; Node 2 runs Pod A-2 at 10.0.1.5 with containers Cnt1 and Cnt2)
  8. 8. Kubernetes | Container Orchestration (diagram: User, K8S Master API with K8S Scheduler(s) and K8S Controller(s); each node runs Docker and Kubelet; Pods A and B on Node 1, Pod C on Node 2)
  9. 9. Kubernetes | Container Orchestration: It all starts empty.
  10. 10. Kubernetes | Container Orchestration: Kubelet registers a node object in the master.
  11. 11. Kubernetes | Container Orchestration: Node 1 and Node 2 are now registered in the master.
  12. 12. Kubernetes | Container Orchestration: User creates (unscheduled) Pod object(s) in the master.
  13. 13. Kubernetes | Container Orchestration: Scheduler notices unscheduled Pods...
  14. 14. Kubernetes | Container Orchestration: ...identifies the best node to run them on...
  15. 15. Kubernetes | Container Orchestration: ...and marks the pods as scheduled on the corresponding nodes.
  16. 16. Kubernetes | Container Orchestration: Kubelet notices pods scheduled to its node...
  17. 17. Kubernetes | Container Orchestration: ...and starts the pods' containers.
  18. 18. Kubernetes | Container Orchestration: Scheduler finds the best node to run pods. HOW?
  19. 19. Kubernetes | Scheduling Algorithm For each pod that needs scheduling: 1. Filter nodes 2. Calculate node priorities 3. Schedule the pod if possible
  20. 20. Kubernetes | Scheduling Algorithm (pipeline: Volume filters, Resource filters, Topology filters, Prioritization) Volume filters • Do the zones of the pod's requested volumes fit the node's zone? • Can the node attach to the volumes? • Are there mounted-volume conflicts? • Are there additional volume topology constraints?
  21. 21. Kubernetes | Scheduling Algorithm Resource filters • Do the pod's requested resources (CPU, RAM, GPU, etc.) fit the node's available resources? • Can the pod's requested ports be opened on the node? • Is the node free of memory and disk pressure?
  22. 22. Kubernetes | Scheduling Algorithm Topology filters • Is the pod requested to run on this node? • Are there inter-pod affinity constraints? • Does the node match the pod's node selector? • Can the pod tolerate the node's taints?
  23. 23. Kubernetes | Scheduling Algorithm Prioritize with weights for • Pod replica distribution • Least (or most) node utilization • Balanced resource usage • Inter-pod affinity priority • Node affinity priority • Taint toleration priority
  24. 24. Scheduling | Controlling Pod Destination • Specify resource requirements • Be aware of volumes • Use node constraints • Use affinity and anti-affinity • Scheduler configuration • Custom / multiple schedulers
  25. 25. Scheduling Controlled | Resources • CPU, RAM, other (GPU) • Requests and limits • Reserved resources kind: Node status: allocatable: cpu: "4" memory: 8070796Ki pods: "110" capacity: cpu: "4" memory: 8Gi pods: "110" kind: Pod spec: containers: - name: main resources: requests: cpu: 100m memory: 1Gi
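
     The two manifests on this slide, reflowed into readable YAML. The pod is a minimal sketch: the container name main and the request values come from the slide, while the pod name and image are placeholders added for completeness.

       # Node status as reported by the kubelet: "capacity" is the raw machine
       # size, "allocatable" is what the scheduler may actually hand out.
       kind: Node
       status:
         capacity:
           cpu: "4"
           memory: 8Gi
           pods: "110"
         allocatable:
           cpu: "4"
           memory: 8070796Ki
           pods: "110"
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: resource-demo      # hypothetical name
       spec:
         containers:
         - name: main
           image: nginx           # placeholder image, not on the slide
           resources:
             requests:            # the scheduler filters nodes on these values
               cpu: 100m
               memory: 1Gi
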
  26. 26. Scheduling Controlled | Volumes • Request volumes in the right zones • Make sure the node can attach enough volumes • Avoid volume location conflicts • Use volume topology constraints (alpha in 1.7) (diagram: Pod C requests a volume in zone B and is unschedulable on Node 1 and Node 2 in zone A)
  27. 27. Scheduling Controlled | Volumes (same bullets as slide 26; diagram: Pod C requests another volume on Node 1, which already has Volume 1 and Volume 2 attached for Pods A and B)
  28. 28. Scheduling Controlled | Volumes (same bullets as slide 26; diagram: Pod A uses Volume 1 on Node 1, Pod B uses Volume 2 on Node 2, Pod C is pending)
  29. 29. Scheduling Controlled | Volumes • Request volumes in the right zones • Make sure node can attach enough volumes • Avoid volume location conflicts • Use volume topology constraints (alpha in 1.7) annotations: "volume.alpha.kubernetes.io/node-affinity": '{ "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [{ "matchExpressions": [{ "key": "kubernetes.io/hostname", "operator": "In", "values": ["docker03"] }] }] }}'
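
     Reflowed onto a complete PersistentVolume, the alpha annotation from this slide might be used like this. A sketch: only the annotation itself comes from the slide; the PV name, capacity, and hostPath are illustrative.

       apiVersion: v1
       kind: PersistentVolume
       metadata:
         name: local-pv-demo        # hypothetical
         annotations:
           # Alpha (1.7-era) volume node affinity: only nodes matching the
           # selector may run pods that use this volume.
           volume.alpha.kubernetes.io/node-affinity: |
             { "requiredDuringSchedulingIgnoredDuringExecution": {
                 "nodeSelectorTerms": [{
                   "matchExpressions": [{
                     "key": "kubernetes.io/hostname",
                     "operator": "In",
                     "values": ["docker03"] }] }] } }
       spec:
         capacity:
           storage: 1Gi             # illustrative
         accessModes: [ "ReadWriteOnce" ]
         hostPath:
           path: /mnt/data          # illustrative
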
  30. 30. Scheduling Controlled | Constraints • Host constraints • Labels and node selectors • Taints and tolerations (diagram: Pod A pinned to Node 1) kind: Pod spec: nodeName: node1 kind: Node metadata: name: node1
  31. 31. Scheduling Controlled | Node Constraints • Host constraints • Labels and node selectors • Taints and tolerations (diagram: Pod A lands on the node labeled tier: backend among Nodes 1-3) kind: Node metadata: labels: tier: backend kind: Pod spec: nodeSelector: tier: backend
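
     The label/selector pair above, reflowed. A sketch: node1 and the pod name are illustrative; in practice you would label an existing node, e.g. kubectl label nodes node1 tier=backend.

       apiVersion: v1
       kind: Node
       metadata:
         name: node1
         labels:
           tier: backend          # the label the scheduler matches against
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: backend-pod        # hypothetical
       spec:
         nodeSelector:
           tier: backend          # schedulable only on nodes carrying this label
         containers:
         - name: main
           image: nginx           # placeholder
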
  32. 32. Scheduling Controlled | Node Constraints • Host constraints • Labels and node selectors • Taints and tolerations (diagram: Node 1 is tainted; Pod A tolerates the taint, Pod B does not) kind: Node spec: taints: - effect: NoSchedule key: error value: disk timeAdded: null kind: Pod spec: tolerations: - key: error value: disk operator: Equal effect: NoExecute tolerationSeconds: 60
  33. 33. Scheduling Controlled | Taints Taints communicate node conditions • Key – condition category • Value – specific condition • Operator – value wildcard • Equal • Exists • Effect • NoSchedule – filter at scheduling time • PreferNoSchedule – prioritize at scheduling time • NoExecute – filter at scheduling time, evict if executing • TolerationSeconds – time to tolerate “NoExecute” taint kind: Pod spec: tolerations: - key: <taint key> value: <taint value> operator: <match operator> effect: <taint effect> tolerationSeconds: 60
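
     A consistent taint/toleration pair built from slides 32-33. Note the slide's node taint uses NoSchedule while its toleration uses NoExecute; here both sides use NoExecute so that tolerationSeconds applies. Node and pod names and the image are illustrative.

       # Node side: mark the node as having a disk error.
       # Equivalent CLI: kubectl taint nodes node1 error=disk:NoExecute
       apiVersion: v1
       kind: Node
       metadata:
         name: node1
       spec:
         taints:
         - key: error
           value: disk
           effect: NoExecute      # also evicts running pods that do not tolerate it
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: tolerant-pod       # hypothetical
       spec:
         tolerations:
         - key: error
           operator: Equal
           value: disk
           effect: NoExecute
           tolerationSeconds: 60  # tolerate the taint for 60s, then get evicted
         containers:
         - name: main
           image: nginx           # placeholder
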
  34. 34. Scheduling Controlled | Affinity • Node affinity • Inter-pod affinity • Inter-pod anti-affinity kind: Pod spec: affinity: nodeAffinity: { ... } podAffinity: { ... } podAntiAffinity: { ... }
  35. 35. Scheduling Controlled | Node Affinity Scope • Preferred during scheduling, ignored during execution • Required during scheduling, ignored during execution kind: Pod spec: affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 10 preference: { <node selector term> } - ... requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - { <node selector term> } - ...
  36. 36. Interlude | Node Selector vs Node Selector Term ... nodeSelector: <label 1 key>: <label 1 value> ... ... <node selector term>: matchExpressions: - key: <label key> operator: In | NotIn | Exists | DoesNotExist | Gt | Lt values: - <label value 1> ... ...
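
     Combining the affinity scopes from slide 35 with the selector-term syntax from slide 36, a concrete nodeAffinity block might look like this. A sketch: the zone values and the nodeGroup label are illustrative, not from the slides.

       apiVersion: v1
       kind: Pod
       metadata:
         name: node-affinity-demo       # hypothetical
       spec:
         affinity:
           nodeAffinity:
             # Hard rule: only nodes in one of these zones qualify at all.
             requiredDuringSchedulingIgnoredDuringExecution:
               nodeSelectorTerms:
               - matchExpressions:
                 - key: failure-domain.beta.kubernetes.io/zone
                   operator: In
                   values: [ "us-east-1a", "us-east-1b" ]
             # Soft rule: among qualifying nodes, prefer the "spot" group.
             preferredDuringSchedulingIgnoredDuringExecution:
             - weight: 10
               preference:
                 matchExpressions:
                 - key: nodeGroup
                   operator: In
                   values: [ "spot" ]
         containers:
         - name: main
           image: nginx                 # placeholder
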
  37. 37. Scheduling Controlled | Inter-pod Affinity Scope • Preferred during scheduling, ignored during execution • Required during scheduling, ignored during execution kind: Pod spec: affinity: podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 10 podAffinityTerm: { <pod affinity term> } - ... requiredDuringSchedulingIgnoredDuringExecution: - { <pod affinity term> } - ...
  38. 38. Scheduling Controlled | Inter-pod Anti-affinity Scope • Preferred during scheduling, ignored during execution • Required during scheduling, ignored during execution kind: Pod spec: affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 10 podAffinityTerm: { <pod affinity term> } - ... requiredDuringSchedulingIgnoredDuringExecution: - { <pod affinity term> } - ...
  39. 39. Scheduling Controlled | Pod Affinity Terms • topologyKey – nodes’ label key defining co-location • labelSelector and namespaces – select group of pods <pod affinity term>: topologyKey: <topology label key> namespaces: [ <namespace>, ... ] labelSelector: matchLabels: <label key>: <label value> ... matchExpressions: - key: <label key> operator: In | NotIn | Exists | DoesNotExist values: [ <value 1>, ... ] ...
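
     A filled-in <pod affinity term> using the fields above. A sketch: the zone topology key is a standard node label of that Kubernetes era, while the namespace and app label are illustrative.

       # One topology domain = all nodes sharing the same zone label value.
       topologyKey: failure-domain.beta.kubernetes.io/zone
       namespaces: [ "prod" ]       # illustrative; empty means the pod's own namespace
       labelSelector:
         matchExpressions:
         - key: app
           operator: In
           values: [ "web" ]
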
  40. 40. Scheduling Controlled | Affinity Example affinity: topologyKey: tier labelSelector: matchLabels: group: a (diagram: nodes labeled tier: a and tier: b; Pod B with label group: a is placed, by topologyKey tier, onto the tier: a node that already hosts group: a pods)
  41. 41. Scheduling Controlled | Scheduler Configuration • Algorithm provider • Policy configuration file / ConfigMap • Extender
  42. 42. Default Scheduler | Algorithm Provider kube-scheduler --scheduler-name=default-scheduler --algorithm-provider=DefaultProvider (or --algorithm-provider=ClusterAutoscalerProvider)
  43. 43. Default Scheduler | Custom Policy Config kube-scheduler --config=<file> --policy-config-file=<file> --use-legacy-policy-config=<true|false> --policy-configmap=<config map name> --policy-configmap-namespace=<config map ns>
  44. 44. Default Scheduler | Custom Policy Config { "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ {"name" : "PodFitsHostPorts"}, ... {"name" : "HostName"} ], "priorities" : [ {"name" : "LeastRequestedPriority", "weight" : 1}, ... {"name" : "EqualPriority", "weight" : 1} ], "hardPodAffinitySymmetricWeight" : 10, "alwaysCheckAllPredicates" : false }
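
     The same policy can be shipped in a ConfigMap and referenced via the --policy-configmap flag from slide 43. A minimal sketch of that packaging, assuming the conventional policy.cfg data key and a hypothetical ConfigMap name:

       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: scheduler-policy         # passed via --policy-configmap
         namespace: kube-system         # passed via --policy-configmap-namespace
       data:
         # kube-scheduler reads the JSON policy from the "policy.cfg" key.
         policy.cfg: |
           {
             "kind": "Policy",
             "apiVersion": "v1",
             "predicates": [
               { "name": "PodFitsHostPorts" },
               { "name": "HostName" }
             ],
             "priorities": [
               { "name": "LeastRequestedPriority", "weight": 1 },
               { "name": "EqualPriority", "weight": 1 }
             ],
             "hardPodAffinitySymmetricWeight": 10,
             "alwaysCheckAllPredicates": false
           }
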
  45. 45. Default Scheduler | Scheduler Extender { "kind" : "Policy", "apiVersion" : "v1", "predicates" : [...], "priorities" : [...], "extenders" : [{ "urlPrefix": "http://127.0.0.1:12346/scheduler", "filterVerb": "filter", "bindVerb": "bind", "prioritizeVerb": "prioritize", "weight": 5, "enableHttps": false, "nodeCacheCapable": false }], "hardPodAffinitySymmetricWeight" : 10, "alwaysCheckAllPredicates" : false }
  46. 46. Default Scheduler | Scheduler Extender func filter(pod, nodes) api.NodeList func prioritize(pod, nodes) HostPriorityList func bind(pod, node)
  47. 47. Scheduling Controlled | Multiple Schedulers kind: Pod metadata: name: pod1 spec: ... kind: Pod metadata: name: pod2 spec: schedulerName: my-scheduler
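
     The two pods above, reflowed: pod1 carries no schedulerName and is handled by the default scheduler, while pod2 waits for a scheduler that watches for schedulerName: my-scheduler. Container stubs are placeholders.

       apiVersion: v1
       kind: Pod
       metadata:
         name: pod1
       spec:
         # no schedulerName: picked up by "default-scheduler"
         containers:
         - name: main
           image: nginx                 # placeholder
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: pod2
       spec:
         schedulerName: my-scheduler    # ignored by the default scheduler
         containers:
         - name: main
           image: nginx                 # placeholder
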
  48. 48. Scheduling Controlled | Custom Scheduler Naive implementation • In an infinite loop: • Get list of Nodes: /api/v1/nodes • Get list of Pods: /api/v1/pods • Select Pods with status.phase == Pending and spec.schedulerName == our-name • For each pod: • Calculate target Node • Create a new Binding object: POST /api/v1/bindings apiVersion: v1 kind: Binding metadata: namespace: default name: pod1 target: apiVersion: v1 kind: Node name: node1
  49. 49. Scheduling Controlled | Custom Scheduler Better implementation • Watch Pods: /api/v1/pods • On each Pod event: • Process only Pods with status.phase == Pending and spec.schedulerName == our-name • Get list of Nodes: /api/v1/nodes • Calculate target Node • Create a new Binding object: POST /api/v1/bindings (same Binding as above)
  50. 50. Scheduling Controlled | Custom Scheduler Even better implementation • Watch Nodes: /api/v1/nodes • On each Node event: • Update Node cache • Watch Pods: /api/v1/pods • On each Pod event: • Process only Pods with status.phase == Pending and spec.schedulerName == our-name • Calculate target Node from the cache • Create a new Binding object: POST /api/v1/bindings (same Binding as above; see the reflowed manifest below)
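
     The Binding object repeated on slides 48-50, reflowed. POSTing it (the slides use /api/v1/bindings) is what actually assigns the pending pod to the chosen node:

       apiVersion: v1
       kind: Binding
       metadata:
         namespace: default
         name: pod1             # must match the pod being scheduled
       target:
         apiVersion: v1
         kind: Node
         name: node1            # the node the custom scheduler selected
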
  51. 51. Custom Scheduler | Standard Filters • Reuse kube-scheduler's minimal set of filters • Extend or re-implement them • See GitHub kubernetes/kubernetes: plugin/pkg/scheduler/scheduler.go, plugin/pkg/scheduler/algorithm/predicates/predicates.go
  52. 52. Use Case | Distributed Pods (diagram: db-replica-1, db-replica-2, db-replica-3 spread across Nodes 1-3; see the reflowed manifest below) apiVersion: v1 kind: Pod metadata: name: db-replica-3 labels: component: db spec: affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: kubernetes.io/hostname labelSelector: matchExpressions: - key: component operator: In values: [ "db" ]
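
     The manifest reflowed; everything below comes from the slide except the container stub, a placeholder added for completeness:

       apiVersion: v1
       kind: Pod
       metadata:
         name: db-replica-3
         labels:
           component: db
       spec:
         affinity:
           podAntiAffinity:
             # Hard rule: never share a hostname with another "db" pod,
             # so each replica lands on its own node.
             requiredDuringSchedulingIgnoredDuringExecution:
             - topologyKey: kubernetes.io/hostname
               labelSelector:
                 matchExpressions:
                 - key: component
                   operator: In
                   values: [ "db" ]
         containers:
         - name: db
           image: postgres        # placeholder, not on the slide
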
  53. 53. Use Case | Co-located Pods (diagram: app-replica-1 placed on the same node as a db replica; see the reflowed manifest below) apiVersion: v1 kind: Pod metadata: name: app-replica-1 labels: component: web spec: affinity: podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: kubernetes.io/hostname labelSelector: matchExpressions: - key: component operator: In values: [ "db" ]
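
     And the co-location counterpart, reflowed from the slide (container stub again a placeholder):

       apiVersion: v1
       kind: Pod
       metadata:
         name: app-replica-1
         labels:
           component: web
       spec:
         affinity:
           podAffinity:
             # Hard rule: only schedule onto a node that already runs a "db" pod.
             requiredDuringSchedulingIgnoredDuringExecution:
             - topologyKey: kubernetes.io/hostname
               labelSelector:
                 matchExpressions:
                 - key: component
                   operator: In
                   values: [ "db" ]
         containers:
         - name: web
           image: nginx           # placeholder, not on the slide
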
  54. 54. Use Case | Reliable Service on Spot Nodes • “fixed” node group Expensive, more reliable, fixed number Tagged with label nodeGroup: fixed • “spot” node group Inexpensive, unreliable, auto-scaled Tagged with label nodeGroup: spot • Scheduling rules: • At least two pods on “fixed” nodes • All other pods favor “spot” nodes • Custom scheduler
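
     The slide implements the "at least two pods on fixed nodes" rule with a custom scheduler. Without one, a rough approximation splits the workload into two Deployments: a fixed-size one required to run on the "fixed" group, and an auto-scaled one that merely prefers "spot" nodes. A sketch; every name, replica count, and image is illustrative, and only the nodeGroup labels come from the slide.

       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: svc-fixed                  # hypothetical
       spec:
         replicas: 2                      # the guaranteed minimum
         selector: { matchLabels: { app: svc, pool: fixed } }
         template:
           metadata: { labels: { app: svc, pool: fixed } }
           spec:
             nodeSelector: { nodeGroup: fixed }   # hard placement on reliable nodes
             containers:
             - { name: main, image: nginx }       # placeholder
       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: svc-spot                   # hypothetical
       spec:
         replicas: 8                      # scaled as needed
         selector: { matchLabels: { app: svc, pool: spot } }
         template:
           metadata: { labels: { app: svc, pool: spot } }
           spec:
             affinity:
               nodeAffinity:
                 # Soft placement: favor cheap spot nodes, but fall back
                 # to any node if no spot capacity is available.
                 preferredDuringSchedulingIgnoredDuringExecution:
                 - weight: 10
                   preference:
                     matchExpressions:
                     - { key: nodeGroup, operator: In, values: ["spot"] }
             containers:
             - { name: main, image: nginx }       # placeholder
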
  55. 55. Scheduling | Dos and Don'ts DO • Use resource-based scheduling instead of node-based • Specify resource requests • Keep requests == limits • Especially for non-elastic resources • Memory is non-elastic! • Safeguard against missing resource specs • Namespace default limits (see the LimitRange sketch below) • Admission controllers • Plan the architecture of localized volumes (EBS, local) • Use inter-pod affinity/anti-affinity where possible DON'T • ... assign pods to nodes directly • ... run pods with no resource requests • ... schedule by node rather than by resource requests • ... use node affinity or direct node assignment if avoidable
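
     For the "namespace default limits" safeguard in the list above, a LimitRange injects requests and limits into any container that omits them. A minimal sketch with illustrative values:

       apiVersion: v1
       kind: LimitRange
       metadata:
         name: default-resources        # hypothetical
         namespace: default
       spec:
         limits:
         - type: Container
           defaultRequest:              # injected "requests" when unspecified
             cpu: 100m
             memory: 128Mi
           default:                     # injected "limits" when unspecified
             cpu: 200m
             memory: 256Mi
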
  56. 56. Scheduling | Key Takeaways • Scheduling filters and priorities • Resource requests and availability • Inter-pod affinity/anti-affinity • Volumes localization (AZ) • Node labels and selectors • Node affinity/anti-affinity • Node taints and tolerations • Scheduler(s) tweaking and customization
  57. 57. Oleg Chunikhin Chief Technology Officer oleg@kublr.com kublr.com Thank you!
