5. Static partitioning
Each cluster (web cluster, DB cluster, Hadoop cluster, etc.) has its
own dedicated set of servers and does not share them
● hard to utilize machines
● hard to scale elastically
● hard to deal with failures
Illustrated on slides p30-p40 of:
https://speakerdeck.com/benh/apache-mesos-nyc-meetup
8. Dynamic sharing
Running multiple frameworks in a single cluster can
● maximize utilization
● share data between frameworks
● simplify the infrastructure
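The utilization argument above can be made concrete with a small calculation. This is only a sketch: the framework names and the demand figures are made up for illustration, and machines are assumed to be interchangeable.

```python
# Hypothetical demand (in machines) of three frameworks over four time steps.
demand = {
    "web":    [10, 40, 20, 10],
    "db":     [20, 20, 20, 20],
    "hadoop": [50, 10, 40, 60],
}

def static_partition_size(demand):
    """Each cluster is sized for its own peak and never shares machines."""
    return sum(max(series) for series in demand.values())

def shared_cluster_size(demand):
    """One shared cluster is sized for the peak of the combined demand."""
    return max(sum(step) for step in zip(*demand.values()))

def average_utilization(demand, capacity):
    """Fraction of machine-time actually used, given a total capacity."""
    steps = list(zip(*demand.values()))
    used = sum(sum(step) for step in steps)
    return used / (capacity * len(steps))

static_size = static_partition_size(demand)   # 40 + 20 + 60 = 120 machines
shared_size = shared_cluster_size(demand)     # max(80, 70, 80, 90) = 90 machines

print(static_size, round(average_utilization(demand, static_size), 2))
print(shared_size, round(average_utilization(demand, shared_size), 2))
```

Because the peaks of the individual frameworks do not coincide in this example, the shared cluster needs fewer machines and keeps them busier.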
9. Challenges of dynamic sharing
Dynamic sharing brings large benefits, but it also makes cluster
scheduling more complex:
● a wide range of requirements and policies have to be
taken into account
● clusters and their workloads keep growing and since the
scheduler's workload is roughly proportional to the
cluster size, the scheduler is at risk of becoming a
scalability bottleneck.
13. Monolithic scheduler
Uses a single, centralized scheduling algorithm for all jobs.
Google's current (2013) cluster scheduler is effectively
monolithic, although it has acquired many optimizations over
the years that provide internal parallelism and multi-threading
to address head-of-line blocking and scalability.
17. Two-level scheduler(Mesos)
An obvious fix to the issues of static partitioning is to adjust the
allocation of resources to each scheduler dynamically, using
a central coordinator to decide how many resources each
sub-cluster can have.
Mesos works best when
1) tasks are short-lived
2) tasks relinquish resources frequently
3) job sizes are small compared to the size of the cluster
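The offer-based coordination described above can be sketched as follows. This is a deliberately simplified model, not the Mesos API: the `Coordinator` class, its method names, and the single CPU counter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Coordinator:
    """Hypothetical central coordinator that owns the free CPUs."""
    free_cpus: int
    offered_to: Optional[str] = None  # framework currently holding the offer

    def make_offer(self, framework: str) -> int:
        """Offer all free CPUs to one framework; they stay locked meanwhile."""
        if self.offered_to is not None:
            return 0  # resources are held by another framework's offer
        self.offered_to = framework
        return self.free_cpus

    def respond(self, framework: str, cpus_used: int) -> None:
        """The framework takes part of the offer; the rest is released."""
        if framework != self.offered_to:
            raise ValueError("framework does not hold the current offer")
        if not 0 <= cpus_used <= self.free_cpus:
            raise ValueError("cannot use more CPUs than were offered")
        self.free_cpus -= cpus_used
        self.offered_to = None

coordinator = Coordinator(free_cpus=100)
batch_offer = coordinator.make_offer("batch")            # 100
service_offer = coordinator.make_offer("service")        # 0: still locked
coordinator.respond("batch", cpus_used=30)
service_offer_after = coordinator.make_offer("service")  # 70
```

While one framework holds an offer, no other framework can see those resources, which is why this model favors frameworks that decide quickly and run short tasks.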
19. Cluster workloads
simple two-way split:
● batch jobs: perform a computation and then finish. For
simplicity we put all low priority jobs and those marked
as "best effort" or "batch" into the batch category
● service jobs: long-running service jobs that provide end
user operations(e.g., web services) and internal
infrastructure services(e.g. storage service, naming
service, locking service)
20. Cluster traces from Google
● most (>80%) jobs are batch jobs
● the majority of resources (55-80%) are
allocated to service jobs
● service jobs typically run for much longer
(20-40% of them run for over a month) and have
fewer tasks than batch jobs
※ The workloads at Yahoo and Facebook are similar
21. Google's needs
● Many batch jobs are short, and fast turnaround is important, so a lightweight, low-quality
approach to placement works just fine.
● Long-running, high-priority service jobs must meet stringent availability and performance targets,
so careful placement of their tasks is needed to maximize resistance to failures and provide
good performance.
● "head of line blocking" problem: while it is very reasonable to spend a few seconds making a
decision whose effects last for several weeks, it can be problematic if an interactive batch job
has to wait for such a calculation. This problem can be avoided by introducing parallelism.
In short, Google requires a scheduler architecture that
● can accommodate both types of jobs
● flexibly support job-specific policies
● and also scale to an ever-growing amount of scheduling work.
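The head-of-line blocking problem described above can be illustrated with a toy queue model. The decision times are invented numbers, and the model assumes each scheduler handles its jobs strictly one after another.

```python
def wait_times(decision_times):
    """Queueing delay of each job when one scheduler handles jobs in order."""
    waits, clock = [], 0.0
    for decision_time in decision_times:
        waits.append(clock)      # the job waits for everything queued before it
        clock += decision_time
    return waits

# Hypothetical decision times in seconds: one careful service placement,
# followed by two quick batch placements.
service = [5.0]
batch = [0.01, 0.01]

# Single queue: the batch jobs sit behind the slow service decision.
single_queue = wait_times(service + batch)

# Separate, parallel schedulers: the batch jobs only wait for each other.
batch_only = wait_times(batch)

print(single_queue[1])  # first batch job waits 5.0 s in the single queue
print(batch_only[0])    # first batch job waits 0.0 s with its own scheduler
```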
22. Why did Google not adopt them?
Neither the monolithic scheduler nor the two-level scheduler can meet Google's needs:
1) Monolithic scheduler:
● It complicates an already difficult job: the scheduler has to minimize the
time a job spends waiting before it starts running.
● It is surprisingly difficult to support a wide range of policies in a sustainable
manner using a single-algorithm implementation.
This kind of software engineering consideration, rather than performance
scalability, was our primary motivation to move to an
architecture that supported concurrent, independent scheduling components.
So the motivation was software engineering rather than performance scalability!
23. Why did Google not adopt them?
Neither the monolithic scheduler nor the two-level scheduler can meet Google's needs:
2) Two-level scheduler:
● No global view of the overall cluster state
● Lock issue: pessimistic concurrency control
● It assumes that resources become available frequently and that scheduler
decisions are quick, so it works best when tasks are short, relinquish resources
frequently, and jobs are small compared to the size of the cluster. Google's
cluster workloads do not have these properties, especially in the case of
service jobs
24. Shared-state scheduler(Omega)
● each scheduler has full access to the entire cluster
● use optimistic concurrency control
This immediately eliminates two of the issues of the two-
level scheduler approach:
➔ limited parallelism due to pessimistic concurrency
control
➔ restricted visibility of resources in a scheduler
framework
25. Shared-state scheduler(Omega)
● There is no central resource allocator in Omega (it is reduced to a persistent data store)
● All resource-allocation decisions take place in the schedulers.
● "cell state": a resilient master copy of the resource allocation maintained in the cluster. Each
scheduler is given a private, local, frequently-updated copy of cell state for making scheduling
decisions. The scheduler can see the entire state of the cell.
● Omega schedulers operate completely in parallel and do not have to wait for jobs in other
schedulers, so there is no inter-scheduler head-of-line blocking.
The performance viability of the shared-state approach is ultimately determined
by the frequency at which transactions fail and the costs of such failures.
The batch scheduler is the main scalability bottleneck, but the Omega model can
scale to a high workload while still providing good behavior for service jobs.
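The shared-state idea above (private copies of cell state, committed with optimistic concurrency control) can be sketched like this. It is a minimal model under stated assumptions, not Omega's implementation: the `CellState` class, the per-machine version counters used for conflict detection, and the all-or-nothing commit are illustrative choices.

```python
import copy

class CellState:
    """Hypothetical master copy of the resource allocation."""

    def __init__(self, free_cpus):
        self.free = dict(free_cpus)               # machine -> free CPUs
        self.version = {m: 0 for m in free_cpus}  # bumped on every change

    def snapshot(self):
        """Private local copy handed to a scheduler for its decisions."""
        return copy.deepcopy(self)

    def commit(self, snapshot, claims):
        """Apply claims (machine -> CPUs) made against a snapshot, all or nothing.

        Returns False if any claimed machine changed since the snapshot was
        taken, or if a claim no longer fits; nothing is modified in that case.
        """
        for machine, cpus in claims.items():
            if self.version[machine] != snapshot.version[machine]:
                return False  # conflict: the scheduler must resync and retry
            if cpus > self.free[machine]:
                return False
        for machine, cpus in claims.items():
            self.free[machine] -= cpus
            self.version[machine] += 1
        return True

cell = CellState({"m1": 8, "m2": 8})
batch_view = cell.snapshot()      # both schedulers see the whole cell
service_view = cell.snapshot()

batch_ok = cell.commit(batch_view, {"m1": 2})       # True
service_ok = cell.commit(service_view, {"m1": 4})   # False: m1 changed meanwhile
retry_ok = cell.commit(cell.snapshot(), {"m1": 4})  # True after resyncing
```

The failed commit followed by a retry is the cost the slide refers to: the approach pays off when such conflicts are rare and cheap to redo.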
26. Comparison of cluster schedulers
Approach               | Resource choice | Interference       | Alloc. granularity   | Cluster-wide policies
Monolithic             | all available   | none (serialized)  | global policy        | strict priority (preemption)
Statically partitioned | fixed subset    | none (partitioned) | per-partition policy | scheduler-dependent
Two-level (Mesos)      | dynamic subset  | pessimistic        | hoarding             | strict fairness
Shared-state (Omega)   | all available   | optimistic         | per-scheduler policy | free-for-all, priority preemption
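27. Mesos and PaaS
Background of the PaaS evaluation (p3): the problem of sharing resources across multiple
clusters on a PaaS with multiple workloads and multiple tenants
(a cluster scheduler for dynamic sharing)
Workloads on a PaaS: long-running processes, one-off tasks, scheduled jobs
The share of service jobs is higher, so scheduling service jobs matters even more
Mesos frameworks for long-running services:
Aurora, Marathon, Singularity, and others exist, but the Omega paper (2013) pointed out
problems with Mesos (especially for service jobs)
What is the current state of Mesos, and how do the individual frameworks address these problems?
28. Mesos and PaaS
About Kubernetes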
Run Kubernetes on Mesos:
https://github.com/mesosphere/kubernetes-mesos
Run Kubernetes on Hadoop YARN:
http://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/