From Dive Bars to World Tour: A Data Scientist's Perspective

From the dive bars of silicon valley to the World Tour
Carlo Curino

PhD + PostDoc in databases
“we did all of this 30 years ago!”
Yahoo! Research
“webscale, webscale, webscale!”
Microsoft – CISL
“enterprise + cloud + search engine + big-data”
My perspective

Cluster as an Embedded systems (map-reduce)
single-purpose clusters
General purpose cluster OS (YARN, Mesos, Omega, Corona)
standardizing access to computational resources
Real-time OS for the cluster !? (Rayon)
predictable resource allocation
Cluster stdlib (REEF)
factoring out common functionalities
Agenda

Cluster as an Embedded Systems
“the era of map-reduce only clusters”

Purpose-built technology
Within large web companies
Well targeted mission (process webcrawl)
à scale and fault tolerance
The origin
Google leading the pack
Google File System + MapReduce (2003/2004)
Open-source and parallel efforts
Yahoo! Hadoop ecosystem HDFS + MR (2006/2007)
Microsoft Scope/Cosmos (2008) (more than MR)

In-house growth
What was the key to success for Hadoop?

In-house growth (following Hadoop story)
Access, access, access…
All the data sit in the DFS
Trivial to use massive compute power
à lots of new applications
But… everything has to be MR
Cast any computation as map-only job
MPI, graph processing, streaming, launching web-servers!?!

Popularization
Everybody wants Big Data
Insight from raw data is cool
Outside MS and Google, Big-Data == Hadoop
Hadoop as catch-all big-data solution (and cluster manager)

Not just massive in-house clusters
New challenges?
New deployment environments
Small clusters (10s of machines)
Public Cloud

New deployment challenges
Small clusters
Efficiency matter more than scalability
Admin/tuning done by mere mortals
Cloud
Untrusted users (security)
Users are paying (availability, predictability)
Users are unrelated to each other (performance isolation)

Classic Hadoop (1.0) Architecture
Client
Job1
JobTracker
Scheduler
TaskTracker TaskTracker TaskTracker
Map Reduce Map Reduce
Map
Map
Map
Map
Map
Map
Map
Red.
Red.
Red.
Red.
Red.
Red.
Reduce
Handles resource management
Global invariants
fairness/capacity
Determines
who runs / resources / where
Manages MapReduce application flow
maps before reducers, re-run upon failure, etc..

What are the key shortcomings
of (old) Hadoop?
Hadoop 1.0 Shortcomings

Programming model rigidity
JobTracker manages resources
JobTracker manages application workflow (data dependencies)
Performance and Availability
Map vs Reduce slots lead to low cluster utilization (~70%)
JobTracker had too much to do: scalability concern
JobTracker is a single point of failure
Hadoop 1.0 Shortcomings (similar to original MR)

General purpose cluster OS
“Cluster OS (YARN)”

YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)*
Request-based central scheduler
Mesos (2011, UCB, open-sourced, tested at Twitter)*
Offer-based two level scheduler
Omega (2013, Google, simulation?)*
Shared-state-based scheduling
Corona (2013, Facebook, production)
YARN-like but offered-based
Four proposals
* all three were best-papers or best-student-paper

YARN
Ad-hoc
app
Ad-hoc
app
Ad-hoc
app
Ad-hoc
Apps
YARN
MR
v2
Tez Giraph Storm Dryad
REEF
...
Hive / Pig
Hadoop 1.x
(MapReduce)
MR v1
Hive / Pig
Users
Application
Frameworks
Programming
Model(s)
Cluster OS
(Resource
Management)
Hadoop 1 World Hadoop 2 World
File System
HDFS 1 HDFS 2
Hardware
Ad-hoc
app
Ad-hoc
app

A new architecture for Hadoop
Decouples resource management from programming model
(MapReduce is an “application” running on YARN)
YARN (or Hadoop 2.x)

YARN (Hadoop 2) Architecture
Client
Job1
Resource
Manager
Scheduler
NodeManager NodeManager NodeManager
Task
Task
Task Task
TaskApp
Master
Task
Task
Negotiate access to
more resources
(ResourceRequest)

Flexibility, Performance and Availability
Multiple Programming Models
Central components do less à scale better
Easier High-Availability (e.g., RM vs AM)
Why does this matter?
System Jobs/Day Tasks/Day Cores pegged
Hadoop 1.0 77k 4M 3.2
YARN 125k (150k) 12M (15M) 6 (10)

Maintenance, Upgrade, and Experimentation
Run with multiple framework versions (at one time)
Trying out a new ideas is as is as launching a job
Anything else you can think?

Real-time OS for the cluster (?)
“predictable resource allocation”

YARN (Cosmos, Mesos, and Corona)
support instantaneous scheduling invariants (fairness/capacity)
maximize cluster throughput (eye to locality)
Current trends
New applications (require “gang” and dependencies)
Consolidation of production/test clusters + Cloud
(SLA jobs mixed with best-effort jobs)
Motivation

Job/Pipeline with SLAs: 200 CPU hours by 6am (e.g., Oozie)
Service: daily ebb/flows, reserve capacity accordingly (e.g., Samza)
Gang: I need 50 concurrent containers for 3 hours (e.g., Giraph)
Example Use Cases

In a consolidated cluster:
Time-based SLAs for production jobs (completion deadline)
Good latency for best-effort jobs
High cluster utilization/throughput
(Support rich applications: gang and skylines)
High-Level Goals

Decompose time-based SLAs in
resource definition: via RDL
predictable resource allocation: planning + scheduling
Divide and Conquer time-based SLAs

Expose to planner application needs
time: start (s), finish (f)
resources: capacity (w), total parallelism (h),
minimum parallelism (l), min lease duration (t)
Resource Definition Language (RDL) 1/2

Skylines / pipelines:
dependencies: among atomic allocations
(ALL, ANY, ORDER)
Resource Definition Language (RDL) 2/2

Important classes
Framework semantics: Perforator modeling of Scope/Hive
Machine Learning: gang + bounded iterations (PREDict)
Periodic jobs: history-based resource definition
Coming up with RDL specs
prediction

Root
100%
Staging
15%
Production
60%
J1
10%
J2
40%
J3
10%
Post
5%
Best Effort
20%
Planning vs Scheduling
Plan
Follower
J3
J1
J3
J1
J3
J1
J3
J1 J3
??
Scheduling
(fine-grained but
time-oblivious)
Resource
Definition
Planning
(coarse but
time-aware)
Preemption
Plan
Sharing
Policy
ReservationService
Resource
Manager
Sys Model /
Feedback

Some example run:
lots of queues for gridmix
Microsoft pipelines
Dynamic queues

Improves
production job SLAs
best-effort jobs latency
cluster utilization and throughput
Comparing against Hadoop CapacityScheduler

Under promise, over deliver
Plan for late execution, and run as early as you can
Greedy Agent
GB

Coping with imperfections (system)
compensate RDL based on black-box models of overheads
Coping with Failures (system)
re-plan (move/kill allocations) in response of system-
observable resource issues
Coping with Failures/Misprediction (user)
continue in best-effort mode when reservation expires
re-negotiate existing reservations
Dealing with “Reality”

Sharing Policy: CapacityOverTimePolicy
constrains: instantaneous max, and running avg
e.g., no user can exceed an instantaneous 30% allocation, and
an average of 10% in any 24h period of time
single partial scan of plan: O(|alloc| + |window|)
User Quotas (trade-off flexibility to fairness)

Introduce Admission Control and Time-based SLAs (YARN-1051)
New ReservationService API (to reserve resources)
Agents + Plan + SharingPolicy to organize future allocations
Leverage underlying scheduler
Future Directions
Work with MSR-India on RDL estimates for Hive and MR
Advanced agents for placement ($$-based and optimal algos)
Enforcing decisions (Linux Containers, Drawbridge, Pacer)
Conclusion

Cluster stdlib: REEF
“factoring out recurring components”

Dryad DAG computations
Tez DAG computations (focus on interactive and Hive support)
Storm stream processing
Spark interactive / in-memory / iterative
Giraph graph-processing Bulk Synchronous Parallel (a la’ Pregel)
Impala scalable, interactive, SQL-like query
HoYA Hbase on yarn
Stratoshpere parallel iterative computations
REEF, Weave, Spring-Hadoop meta-frameworks to help build apps
Focusing on YARN: many applications

Lots of repeated work
Communication
Configuration
Data and Control Flow
Error handling / fault-tolerance
Common “better than hadoop” tricks:
Avoid Scheduling overheads
Control Excessive disk IO
Are YARN/Mesos/Omega enough?

The Challenge
YARN / HDFS
SQL / Hive … …
Machine
Learning
u  Fault Tolerance
u  Row/Column Storage
u  High Bandwidth Networking

The Challenge
YARN / HDFS
SQL / Hive … …
Machine
Learning
u  Fault Awareness
u  Local data caching
u  Low Latency Networking

SQL / Hive
The Challenge
YARN / HDFS
… …
Machine
Learning

SQL / Hive
REEF in the Stack
YARN / HDFS
… …
Machine
Learning
REEF

REEF in the Stack (Future)
YARN / HDFS
SQL / Hive … …
Machine
Learning
REEF
Operator API and Library
Logical Abstraction

REEF
Client
Job1
Resource
Manager
Scheduler
Task
Task
Task Task
TaskApp
Master
Task
Task
Negotiate access to
more resources
(ResourceRequest)

REEF
Client
Job1
Resource
Manager
Scheduler
Evaluator
Task
services
Evaluator
Task
services
REEF RT
Driver
Name-
based
User control
flow logic
Retains
State!
User data
crunching
logic
Fault-
detection
Injection-based
checkable
configuration
Event-based
Control flow

REEF: Computation and Data Management
Extensible Control Flow Data Management Services
Storage
Network
State Management
Job Driver
Control plane implementation. User code
executed on YARN’s Application Master
Activity User code executed within an Evaluator.
Evaluator
Execution Environment for Activities. One
Evaluator is bound to one YARN Container.

Control Flow is centralized in the Driver
Evaluator, Tasks configuration and launch
Error Handling is centralized in the Driver
All exceptions are forwarded to the Driver
All APIs are asynchronous
Support for:
Caching / Checkpointing / Group communication
Example Apps Running on REEF
MR, Asynch Page Rank, ML regressions, PCA, distributed shell,….
REEF Summary
(Open-sourced with Apache License)

Big-Data Systems
Ongoing focus
Future work
Leverage high-level app semantics
Coordinate tiered-storage and scheduling
Conclusions

Adding Preemption to YARN,
and open-sourcing it toApache

Limited mechanisms to “revise current schedule”
Patience
Container killing
To enforce global properties
Leave resources fallow (e.g., CapacityScheduler) à low utilization
Kill containers (e.g., FairScheduler) à wasted work
(Old) new trick
Support work-preserving preemption
(via) checkpointing à more than preemption
State of the Art

Changes throughout YARN
Client
Job1
RM
Scheduler
App
Master Task
Task
Task
Task
Task
Task
Task
PreemptionMessage {
Strict { Set<ContainerID> }
Flexible { Set<ResourceRequest>,
Set<ContainerID> }
}
Collaborative application
Policy-based binding for Flexible
preemption requests
Use of Preemption
Context:
Outdated information
Delayed effects of actions
Multi-actor orchestration
Interesting type of preemption:
RM declarative request
AM bounds it to containers

Changes throughout YARN
Client
Job1
RM
Scheduler
MR
AM Task
Task
Task
Task
Task
Task
Task
When can I preempt?
tag safe UDFs or
user-saved state
@Preemptable
public class MyReducer{
…
}
Common Checkpoint Service
WriteChannel cwc = cs.create();
cwc.write(…state…);
CheckpointID cid = cs.commit(cwc);
ReadChannel crc = cs.open(cid);

57
CapacityScheduler + Unreservation + Preemption: memory utilization
CapacityScheduler (allow overcapacity)
CapacityScheduler (no overcapacity)

Client
Job1
RM
Scheduler
App
Master Task
Task
Task
Task
Task
Task
Task
MR-5192
MR-5194
MR-5197
MR-5189
MR-5189
MR-5176
YARN-569
MR-5196
(Metapoint) Experience contributing to Apache
Engaging with OSS
talk with active developers
show early/partial work
small patches
ok to leave things unfinished

With @Preemptable
tag imperative code with semantic property
Generalize this trick
expose semantic properties to platform (@PreserveSortOrder)
allow platforms to optimize execution (map-reduce pipelining)
REEF seems the logical place where to do this.
Tagging UDFs

(Basic) Building block for:
Enables efficient preemption
Dynamic Optimizations (task splitting, efficiency improvements)
Fault Tolerance
Other uses for Checkpointing

From Dive Bars to World Tour: A Data Scientist's Perspective

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to From Dive Bars to World Tour: A Data Scientist's Perspective

Similar to From Dive Bars to World Tour: A Data Scientist's Perspective (20)

More from eXascale Infolab

More from eXascale Infolab (20)

Recently uploaded

Recently uploaded (20)

From Dive Bars to World Tour: A Data Scientist's Perspective