1. The document discusses multi-resource packing of tasks with dependencies to improve cluster scheduler performance.
2. It proposes packing tasks along multiple resources like CPU, memory, disk, and network to reduce fragmentation and increase utilization. This can improve makespan and average job completion time by up to 50%.
3. It also suggests prioritizing jobs with less "remaining work" using a shortest-remaining-processing-time heuristic to reduce average job completion times. Additionally, it incorporates a fairness knob that trades a small amount of fairness for large gains in performance.
2. Cluster Scheduling for Jobs
Jobs = tasks + dependencies (e.g., BigData: Hive, SCOPE, Spark; CloudBuild)
Machines, file-system, network
Cluster Scheduler: matches tasks to resources
Goals:
• High cluster utilization
• Fast job completion time
• Predictable perf. / fairness
Job scheduling differs from other placement problems:
• Need not keep resource “buffers”
• More dynamic than VM placement (tasks last seconds)
• Aggregate properties are important (e.g., all tasks in a job should finish)
3. Need careful multi-resource planning
Problem: current schedulers suffer from two issues.
• Fragmentation: current schedulers fit 2 tasks/T; a packer scheduler fits 3 tasks/T (+50%)
• Over-allocation of net/disk: current schedulers finish 2 tasks/2T; a packer scheduler finishes 2 tasks/T (+100%)
4. … worse with dependencies
Problem 2
[Figure: a DAG whose labels are {duration, resource demand}. Long chains of tasks labeled (T·t, r/n), ((T−2)·t, r/n), ((T−4)·t, r/n), … run alongside short tasks labeled (t, r) and (t, 1−r). Scheduling by critical path yields a resource-time schedule of length ~nT·t; the best schedule has length ~T·t.]
Critical path scheduling is n times off optimal, since it ignores resource demands.
Packers can be d times off optimal, since they ignore future work (d = number of resources).
5. Typical job scheduler infrastructure
[Figure: per-job DAGs feed Application Masters (AMs), each with a Schedule Constructor; the Resource Manager (RM) assigns tasks to Node Managers (NMs) on node heartbeats.]
Proposed additions: + packing, + bounded unfairness, + merge schedules, + overbook
6. Main ideas in multi-resource packing
Task packing ~ multi-dimensional bin packing, but:
• a very hard problem (“APX-hard”)
• available heuristics do not directly apply (task demands change with placement)
A packing heuristic: alignment score A = D · R, the dot product of the task’s resource demand vector D and the machine’s available resource vector R, computed only where the task fits (D ≤ R).
A job completion time heuristic: shortest remaining work,
P = (remaining # tasks) × (tasks’ avg. duration) × (tasks’ avg. resource demand)
Trade-offs:
• Packing efficiency vs. job completion time: favoring packing alone delays job completion; favoring completion time alone loses packing efficiency; optimizing neither loses both.
• Performance vs. fairness: we show that {best “perf” | bounded unfairness} ~ best “perf”.
7. Main ideas in packing dependent tasks
1. Identify troublesome tasks (the “meat”) and place them first
2. Systematically place the other tasks without deadlocks
3. At runtime, use a precedence order from the computed schedule + heuristics to (a) overbook, (b) pack as on the previous slide
4. Better lower bounds for DAG completion time
[Figure: a resource-time schedule placing the meat (M) first, parents (P) before “meat begin”, children (C) after “meat end”, and other tasks (O) filling the remaining holes.]
12. Fair share among two identical jobs (disk-bound Map, network-bound Reduce)
[Figure: instantaneous fairness gives each job 50% of each resource throughout, finishing both jobs at 4T; letting each job use 100% of one resource at a time finishes them at 2T and 3T.]
Problem: instantaneous fairness can be up to d times worse on makespan (d = number of resources).
1) Temporal relaxation of fairness: a job will finish within (1 + f)x the time it takes given its strict share.
2) Optimal trade-off with performance: (1 + f)x fairness costs (2 + 2f − 2√(f + f²))x on makespan (the radical is a reconstructed reading of the garbled original; see the check below).
3) A simple (offline) algorithm achieves this trade-off.

Fairness slack f      Perf. loss
0 (perfectly fair)    2x
1 (<2x longer)        1.1x
2 (<3x longer)        1.07x
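To sanity-check the reconstructed trade-off curve, here is a short endpoint derivation (the exact constants for f > 0 depend on the original formula, which the extraction garbled):

```latex
% cost(f) = 2 + 2f - 2*sqrt(f + f^2), a hedged reading of the slide.
% Endpoints: cost(0) = 2, matching the "perfectly fair" row, and
%   cost(f) = 2 + 2f - 2f*sqrt(1 + 1/f) -> 1  as f -> infinity,
% consistent with the perf loss in the table tending toward 1x.
\[
  \mathrm{cost}(f) = 2 + 2f - 2\sqrt{f + f^{2}}, \qquad
  \mathrm{cost}(0) = 2, \qquad
  \lim_{f\to\infty} \mathrm{cost}(f) = 1.
\]
```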
13. Bare metal → VM Allocation (e.g., HDInsight, AzureBatch) → Data-parallel Jobs (e.g., BigData: Yarn, Cosmos, Spark; CloudBuild). A job = tasks + dependencies.
Scale examples: CloudBuild runs 3,500 servers for 3,500 users and >6K devs, building >20M targets/day; BigData clusters span ~100K servers (40K at Yahoo), >50K servers, >2 EB stored.
14. Job scheduling has specific aspects:
• Tasks are short-lived (10s of seconds)
• Tasks have peculiarly shaped demands
• Composites are important (a job needs all of its tasks to finish)
• OK to kill and restart tasks
• Locality
Better scheduling will speed up the average job (and reduce resource cost): a problem for research + practice.
16. Cluster Scheduling for Jobs
Jobs = tasks + dependencies (e.g., HDInsight, AzureBatch; BigData: Hive, SCOPE, Spark; CloudBuild)
Machines, file-system, network
Cluster Scheduler: matches tasks to resources
Goals:
• High cluster utilization
• Fast job completion time
• Predictable perf. / fairness
• Efficient (milliseconds…)
20. Main ideas in packing dependent tasks
1. Identify troublesome tasks (T) and place them first
2. Systematically place the other tasks without dead-ends
3. At runtime, enforce the computed schedule + heuristics to (a) overbook, (b) pack as described earlier
4. Better lower bounds for DAG completion time
[Figure: a resource-time schedule placing the troublesome tasks (T) first, parents (P) before “trouble begin”, children (C) after “trouble end”, and other tasks (O) filling the remaining holes.]
24. Performance of cluster schedulers
We observe that:
• Resources are fragmented, i.e., machines run below capacity
• Even at 100% usage, goodput is much smaller due to over-allocation
• Even Pareto-efficient multi-resource fair schemes result in much lower performance
Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness.
¹ Time to finish a set of jobs
25. Findings from Bing and Facebook trace analysis
Diversity in multi-resource requirements:
• Tasks need varying amounts of each resource
• Demands for the different resources are weakly correlated
This matters because there is no single bottleneck resource:
• Multiple resources become tight
• There is enough cross-rack network bandwidth to use all CPU cores
Upper-bounding the potential gains: up to 49% reduction in makespan¹ and up to 46% reduction in avg. job completion time.
¹ Time to finish a set of jobs
26. Why so bad? #1
Production schedulers neither pack tasks nor consider all their relevant resource demands. This causes:
#1 Resource Fragmentation
#2 Over-allocation
27. Resource Fragmentation (RF)
Current schedulers allocate resources in terms of slots, so free resources can be left unable to be assigned to tasks.
Example: machines A and B each have 4 GB of memory; tasks T1 and T2 need 2 GB each, and T3 needs 4 GB (each task runs for time t).
• Current schedulers: T1 goes to A and T2 to B; T3 must wait for a whole machine to free up. Avg. task completion time = 1.33t.
• “Packer” scheduler: T1 and T2 share machine A, leaving B free for T3. Avg. task completion time = 1t.
RF increases with the number of resources being allocated! (See the arithmetic below.)
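The two averages follow directly from the completion times in each schedule, assuming every task runs for time t once started:

```latex
\[
  \text{slot-based: } \frac{t + t + 2t}{3} \approx 1.33\,t,
  \qquad
  \text{packed: } \frac{t + t + t}{3} = 1\,t .
\]
```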
28. Over-Allocation
Not all task resource demands are explicitly allocated: disk and network are over-allocated.
Example: machine A has 4 GB of memory and 20 MB/s of network; T1 and T2 each need 2 GB of memory and 20 MB/s of network, while T3 needs only 2 GB of memory (each task runs for time t when uncontended).
• Current schedulers: T1 and T2 start together, over-allocating the network (40 MB/s demanded of 20 MB/s available), so both slow down; T3 follows. Avg. task completion time = 2.33t.
• “Packer” scheduler: T1 runs alongside T3, then T2 runs; the network is never over-subscribed. Avg. task completion time = 1.33t.
29. Why so bad? #2
Work conserving != no fragmentation, no over-allocation:
• treating the cluster as one big bag of resources hides the impact of resource fragmentation
• assuming a job has a fixed resource profile fails, since different tasks in the same job have different demands
Multi-resource fairness schemes do not help either:
• the schedule itself impacts jobs’ current resource profiles, and one can schedule to create complementary profiles
• Pareto¹ efficient != performant
Packer scheduler vs. DRF: avg. job completion time improves by 50%, makespan by 33%.
¹ no job can increase its share without decreasing the share of another
30. Competing objectives
Job completion time vs. fairness vs. cluster efficiency.
Current schedulers:
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance
31. #1: Pack tasks along multiple resources to improve cluster efficiency and reduce makespan
32. Theory vs. practice
Multi-resource packing of tasks is similar to multi-dimensional bin packing: balls are tasks, and a bin is a (machine, time) slot. The problem is APX-hard¹.
Existing heuristics do not directly apply here:
• they assume balls of a fixed size, but task demands vary with time and machine placement, and are elastic
• they assume balls are known a priori, but a scheduler must cope with the online arrival of jobs, dependencies, and cluster activity
Avoiding fragmentation looks like tight bin packing: reducing the number of bins used reduces makespan.
¹ APX-hard is a strict subset of NP-hard
33. #1: A packing heuristic
Alignment score A: the dot product of the task’s resource demand vector and the machine’s available resource vector. (A sketch follows.)
1. Check for fit (demand vector ≤ machine’s free resource vector), ensuring no over-allocation
“A” works because:
2. Bigger balls get bigger scores
3. Abundant resources get used first, reducing resource fragmentation
4. Load can be spread across machines
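A minimal sketch of the fit check and alignment score, assuming task demands and machine availability are dicts keyed by resource name; this is illustrative, not the Tetris implementation:

```python
def alignment_score(demand, available):
    """Dot product D . R of a task's demand vector and a machine's
    available-resource vector; None if the task does not fit."""
    # 1. Fit check: never over-allocate any resource.
    if any(demand[r] > available.get(r, 0.0) for r in demand):
        return None
    # 2./3. Bigger tasks and more-abundant resources score higher.
    return sum(demand[r] * available[r] for r in demand)

# Example: a network-heavy task aligns with the machine that has
# more network headroom.
task = {"cpu": 2, "mem_gb": 4, "net_mbps": 20}
machine_a = {"cpu": 8, "mem_gb": 16, "net_mbps": 5}   # little network left
machine_b = {"cpu": 4, "mem_gb": 8, "net_mbps": 40}   # ample network
print(alignment_score(task, machine_a))  # None: 20 MB/s > 5 MB/s free
print(alignment_score(task, machine_b))  # 2*4 + 4*8 + 20*40 = 840
```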
35. CHALLENGE #2: A job completion time heuristic
Shortest Remaining Time First¹ (SRTF) schedules jobs in ascending order of their remaining time.
Q: what is the shortest “remaining time”? “Remaining work”, which depends on the remaining # of tasks, the tasks’ durations, and the tasks’ resource demands:
P = (remaining # tasks) × (tasks’ avg. duration) × (tasks’ avg. resource demand)
The heuristic gives a score P to every job: SRTF extended to incorporate multiple resources. (Sketch below.)
¹ SRTF: M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS ’99]
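A hedged sketch of the P score using the slide's product form; task objects with .finished, .duration, and .demand attributes are assumptions of this sketch:

```python
def remaining_work(tasks):
    """P = remaining #tasks x avg. duration x avg. resource demand.
    Smaller P means less remaining work, i.e., schedule sooner (SRTF)."""
    pending = [t for t in tasks if not t.finished]
    if not pending:
        return 0.0
    avg_duration = sum(t.duration for t in pending) / len(pending)
    avg_demand = sum(sum(t.demand.values()) for t in pending) / len(pending)
    return len(pending) * avg_duration * avg_demand
```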
36. CHALLENGE #2: Combine the A and P scores!
Packing efficiency (A alone) can delay job completion; completion time (P alone) can lose packing efficiency. Combined matching logic:
1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3:   where t is the highest-scoring task in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)
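A sketch of this loop, reusing the two earlier sketches. The slide writes score(j) = A + P; the weight eps and the sign flip (less remaining work must raise the score) are assumptions of this sketch:

```python
def pick_job_and_task(jobs, available, eps=1.0):
    best, best_score = None, float("-inf")
    for job in jobs:                                 # 1: among J runnable jobs
        p = -eps * remaining_work(job.tasks)         # less work => higher score
        for task in job.tasks:
            if task.finished or task.running:
                continue
            a = alignment_score(task.demand, available)
            if a is None:                            # 3: demand(t) <= R only
                continue
            if a + p > best_score:                   # 2: score(j) = A(t,R) + P(j)
                best_score, best = a + p, (job, task)
    return best                                      # 4: (j*, t*), or None if nothing fits
```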
38. #3: Fairness heuristic
A says: “task i should go here to improve packing efficiency.”
P says: “schedule job j next to improve job completion time.”
Fairness says: “this set of jobs should be scheduled next.”
Typically there is a feasible solution that can satisfy all of them. Performance and fairness do not mix well in general, but we can get “perfect fairness” and much better performance.
39. #3: Fairness knob, F ∈ [0, 1)
• F = 0: most efficient scheduling
• F → 1: close to perfect fairness
Pick the best-for-performance task from among the (1 − F) fraction of jobs furthest from their fair share. (Sketch below.)
The intuition: fairness is not a tight constraint; aim for long-term rather than short-term fairness, and lose a bit of fairness for a lot of gain in performance.
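A hedged sketch of the knob; fair_share_gap (how far a job sits below its fair share) is an assumed attribute, not part of the paper's interface:

```python
def eligible_jobs(jobs, F):
    """Keep only the (1 - F) fraction of jobs furthest from fair share;
    the packing/SRTF heuristics then pick among these."""
    ranked = sorted(jobs, key=lambda j: j.fair_share_gap, reverse=True)
    k = max(1, round((1 - F) * len(ranked)))
    return ranked[:k]  # F = 0: all jobs (pure perf); F -> 1: most-starved only

# Usage with the earlier sketch:
#   choice = pick_job_and_task(eligible_jobs(jobs, F), available)
```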
40. Putting it all together
We saw: packing efficiency, preferring small remaining work, and a fairness knob.
Other things in the paper: estimating task demands; dealing with inaccuracies and barriers; ingestion / evacuation.
[Figure: Yarn architecture with the Tetris changes shown in orange. Job Managers send multi-resource asks and barrier hints; Node Managers track resource usage, enforce allocations, and send resource availability reports; the cluster-wide Resource Manager gains new logic to match tasks to machines (+packing, +SRTF, +fairness) and returns allocations in response to asks/offers.]
42. Efficiency
Tetris vs.            Makespan   Avg. job compl. time
Capacity Scheduler    29%        30%
DRF                   28%        35%
Gains come from avoiding fragmentation and avoiding over-allocation.
[Figure: utilization (%) of CPU, memory, network-in, and storage over time under Tetris and under the Capacity Scheduler. Values above 100% indicate over-allocation; lower values indicate higher resource fragmentation.]
43. Fairness
The fairness knob F quantifies the extent to which Tetris adheres to fair allocation:

                                 No fairness   F = 0.25   Full fairness
                                 (F = 0)                  (F → 1)
Makespan improvement             50%           25%        10%
Job compl. time improvement      40%           35%        23%
Avg. slowdown [impacted jobs]    25%           5%         2%
44. Conclusion
Pack efficiently along multiple resources; prefer jobs with less “remaining work”; incorporate fairness.
• We combine heuristics that improve packing efficiency with those that lower average job completion time.
• Achieving desired amounts of fairness can coexist with improving cluster performance.
• Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results.
We are working towards a Yarn check-in.
http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
46. Estimating resource demands
Peak usage demand estimates come from:
• finished tasks in the same phase
• statistics collected from recurring jobs
• the input size/location of tasks (placement impacts network/disk requirements)
The Resource Tracker also:
• reports unused resources (under-utilization)
• is aware of other cluster activities, such as ingestion and evacuation
[Figure: network-in bandwidth on one machine over time (MB/s), split into used and free portions.]
47. Packer scheduler vs. DRF
Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users.
Cluster: [18 cores, 36 GB memory]. Jobs, as [task profile], # tasks:
• A [1 core, 2 GB], 18 tasks
• B [3 cores, 1 GB], 6 tasks
• C [3 cores, 1 GB], 6 tasks
With q_j concurrent tasks of job j, DRF solves:
maximize min(DS) (maximize allocations), subject to
  q_A + 3·q_B + 3·q_C ≤ 18 (CPU constraint)
  2·q_A + q_B + q_C ≤ 36 (memory constraint)
  q_A/18 = q_B/6 = q_C/6 (equalize DS)
which gives DS = 1/3.
• DRF schedule: 6 tasks of A and 2 each of B and C per time step (18 cores, 16 GB used); durations A: 3t, B: 3t, C: 3t.
• Packer schedule: all 18 tasks of A first (18 cores, 36 GB), then B’s 6 tasks, then C’s 6 tasks (18 cores, 6 GB each); durations A: t, B: 2t, C: 3t.
33% improvement in average job completion time.
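A worked solution of the program above (a derivation step added here, not on the slide):

```latex
% Equalizing dominant shares gives q_A = 3 q_B and q_C = q_B, so the CPU
% constraint becomes 3q_B + 3q_B + 3q_B = 9 q_B <= 18, i.e. q_B <= 2.
\[
  q_A = 6, \quad q_B = 2, \quad q_C = 2
  \qquad\Longrightarrow\qquad
  \mathrm{DS} = \frac{q_A}{18} = \frac{q_B}{6} = \frac{q_C}{6} = \frac{1}{3}.
\]
% Memory stays slack: 2(6) + 2 + 2 = 16 <= 36. DRF therefore runs 6 tasks
% of A and 2 each of B and C per time step, finishing all jobs at 3t.
```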
48. Packing efficiency does not achieve everything
Achieving packing efficiency does not necessarily improve job completion time.
Machines 1 and 2: [2 cores, 4 GB] each. Jobs, as [task profile], # tasks: A [2 cores, 3 GB], 6 tasks; B [1 core, 2 GB], 2 tasks.
• “Pack”: run 2 of A’s tasks per time step (4 cores, 6 GB) for three steps, then B’s 2 tasks (2 cores, 4 GB). Durations: A: 3t, B: 4t.
• “No pack”: run B’s 2 tasks first, one per machine (2 cores, 4 GB, leaving no room for an A task), then 2 of A’s tasks per step. Durations: A: 4t, B: t.
“No pack” improves the average job completion time by 29%.
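The 29% figure follows from the two schedules' average job completion times:

```latex
\[
  \text{pack: } \frac{3t + 4t}{2} = 3.5\,t,
  \qquad
  \text{no pack: } \frac{4t + t}{2} = 2.5\,t,
  \qquad
  1 - \frac{2.5}{3.5} \approx 29\%.
\]
```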
49. Ingestion / evacuation
Other cluster activities produce background traffic:
• ingestion = storing incoming data for later analytics; e.g., some clusters report volumes of up to 10 TB per hour
• evacuation = data evacuated and re-replicated before maintenance operations; e.g., rack decommissioning for machine re-imaging
Tetris uses Resource Tracker reports to avoid contention between its tasks and these activities.
54. Virtual machine packing != Tetris
VM packing consolidates VMs with multi-dimensional resource requirements onto the fewest number of servers, but it focuses on different challenges, not task packing:
• balancing load across servers
• ensuring VM availability in spite of failures
• allowing for quick software and hardware updates
There is no entity corresponding to a job, so job completion time is inexpressible; and explicit resource requirements (e.g., a “small” VM) make VM packing simpler.
55. Barrier knob, b ∈ [0, 1]
Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a fraction b of that stage’s tasks have finished.
b = 1 means no tasks are preferentially treated.
56. Starvation prevention
Could it take a long time to accommodate large tasks? Not really, because:
1. most tasks have demands within one order of magnitude of one another
2. machines report resource availability to the scheduler periodically, so the scheduler learns about all of the resources freed by tasks finishing in the preceding period at once, and can make reservations for large tasks
59. Performance of cluster schedulers
We observe that cluster schedulers typically do dependency-aware scheduling OR multi-resource packing, and none of the existing solutions are close to optimal for more than 50% of the production jobs.
Graphene: >30% improvement in makespan¹ and job completion time for more than 50% of the jobs.
¹ Time to finish a set of jobs
60. Findings from Bing trace analysis
Job structures have evolved into complex DAGs of tasks: the median job DAG has depth 7 and 103 tasks.
A good cluster scheduler should be aware of dependencies.
61. Findings from Bing trace analysis
Applications have (very) diverse needs across CPU, memory, network, and disk:
• high coefficient of variation (~1) for many resources
• demands for the different resources are weakly correlated
This matters because there is no single bottleneck resource:
• multiple resources become tight
• there is enough cross-rack network bandwidth to use all CPU cores
A good cluster scheduler should pack resources.
63. Dependency-aware packing
Existing schedulers either consider the DAG structure during scheduling or pack, but not both:
• Critical Path Scheduling (CPSched) and Breadth First Search (BFS) consider the DAG structure but do not account for tasks’ resource demands, or assume tasks have homogeneous demands. Any scheduler that does not pack can be up to n x OPTIMAL (n = number of tasks).
• Tetris handles tasks with multiple resource requirements but ignores dependencies and takes local greedy choices. Any scheduler that ignores dependencies can be up to d x OPTIMAL (d = number of resource dimensions).
64. Where does the “work” lie in a DAG?
“Work” = the stages in a DAG where the most resources × time is spent. Production DAGs are large, and are neither a bunch of unrelated stages nor a chain of stages:
• >40% of the DAGs have most of the “work” on the critical path, where CPSched performs well
• >30% of the DAGs have the “work” distributed such that packers perform well
• for ~50% of the DAGs, neither packers nor criticality-based schedulers may perform well
65. Pack tasks along multiple resources while considering task dependencies
Outline: state-of-the-art techniques are suboptimal; key ideas in Graphene; conclusion.
66. State-of-the-art scheduling techniques are suboptimal
An example DAG on which both CPSched and Tetris are ~3x optimal (total capacity in each dimension = 1; each task is labeled with its duration and demands {rsrc.1, rsrc.2}):

task   duration   {rsrc.1, rsrc.2}
t0     1          {.7, .31}
t1     .01        {.95, .01}
t2     .01        {.1, .7}
t3     .96        {.2, .68}
t4     .98        {.1, .01}
t5     .01        {.01, .01}

• CPSched: t0, then t4, t5, then t1, t3, t2 (time ~3T)
• Tetris: t0, t1, t2, then t4, t3, t5 (time ~3T)
• Optimal: t1 with t0, then t4, t3, t2, t5 overlapped (time ~T)
Key insight: t0, t2, and t5 are troublesome tasks; schedule them as soon as possible.
68. Schedule construction
1. Identify tasks that can lead to a poor schedule, the troublesome tasks T: tasks that are more likely to be on the critical path or more difficult to pack.
2. Break the other tasks into sets P, C, and O based on their relationship with the tasks in T (parents, children, and the rest).
3. Place the tasks in T on a virtual time space first; overlay the others to fill any resultant holes in this space.
[Figure: resource-time schedules showing T placed first, with P, C, and O overlaid around it.]
This construction is nearly optimal for over three quarters of our analyzed production DAGs. (A sketch of the partition step follows.)
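A simplified sketch of step 2, deriving T's ancestors (P), descendants (C), and the rest (O); `parents` maps each task to its set of direct parents. The tie-break for tasks that are both upstream and downstream of T is an assumption of this sketch, not necessarily the paper's rule:

```python
def partition(parents, T):
    """Split a DAG's tasks into P, C, O around the troublesome set T."""
    tasks = set(parents)
    children = {t: set() for t in tasks}
    for t, ps in parents.items():
        for p in ps:
            children[p].add(t)

    def reachable(start, nbrs):
        seen, stack = set(), list(start)
        while stack:
            for v in nbrs[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    P = reachable(T, parents) - T        # ancestors of troublesome tasks
    C = reachable(T, children) - T - P   # descendants (tie-break: P wins)
    O = tasks - T - P - C                # unrelated tasks, used to fill holes
    return P, C, O
```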
70. Online scheduling: runtime component
Per-DAG Schedule Constructors produce a preference order; the runtime component in the Resource Manager assigns tasks on node heartbeats, merging the schedules. It addresses:
• Job completion time: prefer jobs with less remaining work
• Makespan: multi-resource packing + judicious overbooking of malleable resources
• Being fair: deficit counters bound unfairness (see the sketch below) and enable implementation of different fairness schemes
It also enforces the priority ordering of the constructed schedules and handles local placement.
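A hedged sketch of a deficit counter; the attribute names and update rule are assumptions of this sketch, not Graphene's code:

```python
def update_deficit(job, fair_share, allocated):
    """Accrue the gap between what fairness owes the job and what it got."""
    job.deficit += fair_share - allocated

def next_job(jobs, bound):
    """Force-serve any job whose unfairness debt exceeds the bound;
    otherwise the scheduler is free to pick the best-for-perf. job."""
    starved = [j for j in jobs if j.deficit > bound]
    return max(starved, key=lambda j: j.deficit) if starved else None
```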
71. Evaluation
Implemented in Yarn and Tez; deployed on a 250-machine cluster; replayed Bing traces and TPC-DS / TPC-H workloads.
72. Efficiency
Graphene vs.     Makespan   Avg. job compl. time
Tetris           29%        27%
Critical Path    31%        33%
BFS              23%        24%
Gains come from a view of the entire DAG and placing the troublesome tasks first (a more compact schedule), better packing, and overbooking.
73. Conclusion
• Graphene combines various mechanisms to improve packing efficiency and considers task dependencies.
• It constructs a good schedule by placing tasks on a virtual resource-time space, and online heuristics softly enforce the desired schedules.
• Implemented inside YARN and Tez; trace-driven simulations and deployment show encouraging initial results.