The rise in popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications require precise control over their placement by means of complex constraints. Examples of such scenarios are the following:
• Deep learning applications need to run on GPU machines with specific GPU models and driver/kernel versions.
• Hive or Spark applications benefit from being collocated on the same rack to reduce network cost and thus speed up their execution. At the same time, it is desirable to limit the number of allocations per machine to minimize resource interference.
• Low-latency services such as HBase need to be allocated across failure domains to improve their availability.
• A DNS service might need to run on machines with a public IP address.
In this talk, we present the newly added support for expressive placement constraints in YARN. We show how applications can leverage such constraints to achieve complex placements, such as collocating their allocations on the same node or rack (affinity), spreading their allocations across nodes or racks (anti-affinity), or allowing up to a specified number of allocations per node group (cardinality) to strike a balance between the two. We describe real use cases from production clusters and demonstrate the benefits of placement constraints on large clusters running popular applications in both on-prem and cloud settings.
Speakers
Konstantinos Karanasos, Senior Scientist, Microsoft
Wangda Tan, Staff Software Engineer, Hortonworks
12. > Performance: Storm, HBase
> Resilience: HBase
> Cluster objectives
[Diagram: nodes n1–n8 organized into node groups along two dimensions (Rack r1 / Rack r2 and Upgrade Domain A / Upgrade Domain B), each node hosting a mix of MR, HBase, and Storm containers]
It is all about placement with constraints!
13. [Plots: unavailable machines (%) over 4 days, cluster total, on linear and log scales]
> Less than 10% of cluster nodes unavailable
> Cluster is organized in node groups
14. [Plots: unavailable machines (%) over 4 days, cluster total vs. node group A / upgrade domain A]
> Less than 10% of cluster nodes unavailable
> Cluster is organized in node groups
> Machines become unavailable in groups
> With random placement, an LRA might lose all its containers at once
16. > How to refer to container groups and node groups?
> Container tags and node groups (see the request sketch below)
> How to express constraints related to LRA containers?
> Expressive constraints within and across LRAs
> How to achieve high-quality placement without affecting task-based jobs?
> Placement Constraint Processor
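To make the tagging concrete, here is a minimal sketch of requesting containers that carry an allocation tag, using the SchedulingRequest API added in Hadoop 3.1 (the tag name "storm", the sizes, and the request id are illustrative):

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;

public class TaggedRequestSketch {
    // Request 7 containers tagged "storm"; placement constraints can later
    // refer to these containers through that allocation tag.
    static SchedulingRequest stormWorkers() {
        return SchedulingRequest.newBuilder()
            .allocationRequestId(1L)
            .priority(Priority.newInstance(0))
            .executionType(ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED))
            .allocationTags(Collections.singleton("storm"))
            .resourceSizing(ResourceSizing.newInstance(7, Resource.newInstance(1024, 1)))
            .build();
    }
    // An application master submits such requests to the RM, e.g. via
    // AMRMClient#addSchedulingRequests(Collections.singletonList(stormWorkers())).
}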
19. > Affinity: “Place 3 Storm containers in the same rack as an HBase container”
storm=3, IN, RACK, hbase
> Cardinality: “Place 7 Storm containers with no more than 5 containers per node”
storm=7, CARDINALITY, NODE, storm, 0, 5
> Anti-affinity: “Place 5 Storm containers on different nodes than Spark”
storm=5, NOTIN, NODE, spark
> Placement Constraints API: static methods for LRAs to specify constraints (see the sketch below)
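A minimal sketch of the three constraints above, written with the static builder methods of org.apache.hadoop.yarn.api.resource.PlacementConstraints shipped in Hadoop 3.1 (the tags storm, hbase, and spark mirror the examples):

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.RACK;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetIn;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetNotIn;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

public class ConstraintSketch {
    // Affinity: place the containers on racks that host an "hbase" container.
    static final PlacementConstraint AFFINITY =
        targetIn(RACK, allocationTag("hbase")).build();

    // Cardinality: between 0 and 5 containers tagged "storm" per node.
    static final PlacementConstraint CARDINALITY =
        cardinality(NODE, 0, 5, "storm").build();

    // Anti-affinity: avoid nodes that host a "spark" container.
    static final PlacementConstraint ANTI_AFFINITY =
        targetNotIn(NODE, allocationTag("spark")).build();
}

Each constraint is attached to the SchedulingRequest that carries the corresponding allocation tag, e.g. through placementConstraintExpression() on the builder shown earlier.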
20. > zk=3 NOTIN NODE not-self/zk
hbase=5 IN RACK all/zk
> zk=5 IN RACK hbase NOTIN NODE zk
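The last line combines an affinity target and an anti-affinity target into a single conjunctive constraint. A sketch with the and() combinator from the same PlacementConstraints class (the not-self/ and all/ tag namespaces of the slide are omitted here, since their client-side encoding is version-dependent):

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.RACK;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.and;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetIn;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetNotIn;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

public class CompoundConstraintSketch {
    // "zk=5 IN RACK hbase NOTIN NODE zk": ZooKeeper containers on racks with
    // HBase containers, AND on nodes without other ZooKeeper containers.
    static final PlacementConstraint ZK =
        and(targetIn(RACK, allocationTag("hbase")),
            targetNotIn(NODE, allocationTag("zk"))).build();
}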
22. [Architecture diagram: task-based jobs flow directly to the Capacity/Fair Scheduler (fast task-based allocations), while LRAs go through the LRA Interface and the Placement Constraint Processor (high-quality placement); a Constraint Manager tracks constraints, container tags, and node groups; both paths read the cluster state and emit allocations]
> LRA scheduling algorithm in the Placement Constraint Processor
Invoked when an LRA is submitted, considers multiple containers
23. [Same architecture diagram, highlighting that the Placement Constraint Processor satisfies LRA constraints without affecting task-based jobs]
> LRA scheduling algorithm in the Placement Constraint Processor
Invoked when an LRA is submitted, considers multiple containers
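As a toy illustration of the per-node feasibility test such a processor performs (this is not the actual YARN/MEDEA implementation; satisfies and nodeTagCounts are hypothetical names): affinity and anti-affinity reduce to minimum and maximum cardinality bounds on allocation tags, so placing a container amounts to checking tag counts at the constraint's scope.

import java.util.Map;

public class CardinalityCheckSketch {
    // Hypothetical helper: true if the node currently holds between min and
    // max containers carrying the target tag (affinity ~ min >= 1 at some
    // scope, anti-affinity ~ max == 0, cardinality ~ an arbitrary [min, max]).
    static boolean satisfies(Map<String, Integer> nodeTagCounts,
                             String targetTag, int min, int max) {
        int count = nodeTagCounts.getOrDefault(targetTag, 0);
        return count >= min && count <= max;
    }
}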
26. TensorFlow ML workflow with 1M iterations using 32 workers, with varying workers per node
[Plot: runtime (min) vs. max cardinality per node (1, 2, 4, 8, 16, 32), on a low-utilized and a high-utilized cluster; cardinality 1 corresponds to anti-affinity, 32 to affinity]
27. TensorFlow ML workflow with 1M iterations using 32 workers, with varying workers per node
[Same plot, with utilization quantified: a 5%-utilized and a 70%-utilized cluster]
Cardinality constraints are important
Affinity and anti-affinity are not enough
28. [Same plot, annotated with 34% and 42% runtime gaps: anti-affinity pays a network overhead, affinity over-utilizes nodes]
Cardinality constraints are important
Affinity and anti-affinity are not enough
32. > Important additions for long-running applications/services in Hadoop 3.1
> Deployment, packaging, upgrade, discovery of LRAs
> Scheduling of LRAs
Expressive constraints (affinity, anti-affinity, cardinality)
High-quality placement via constraint processor
> Many more things to be done: come help us!
> Demo time!
34. Scheduling scalability
> Latency for placing all containers of an LRA
35. Impact of MEDEA on task performance
[Bar chart: short-task runtime (sec) under YARN (no constraints), KUBE (affinity), KUBE++ (+ cardinality), and MEDEA (+ multiple containers)]
Task-based job runtimes are not affected
36. LRA performance: inter-app
[CDF of request latency (ms, 0–800) with no constraints, intra-app only, and intra- plus inter-app constraints; 4.6x improvement annotated; further left is better]
> Memcached lookups mean latency:
No constraints: 372ms
Intra-app: 361.7ms
Intra-inter: 78ms
> Total mean latency:
7.6x better over default
5x over intra-only
Both intra- and inter-application constraints are crucial to application performance