Drizzle is a low latency execution engine for Apache Spark
that is targeted at stream processing and iterative workloads.
Currently, Spark uses a BSP computation model, and notifies the
scheduler at the end of each task. Invoking the scheduler at the end of each task adds overheads and results in decreased throughput and increased latency. In Drizzle, we introduce group scheduling, where multiple batches (or a group) of computation are scheduled at once.
This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch. Our experiments on a 128 node EC2 cluster show that Drizzle can achieve end-to-end streaming latencies of less than 100ms and can get up to 3.5x lower latency than Spark Streaming. Compared to Apache Flink, a record-at-a-time streaming system, we show that Drizzle can recover around 4x faster from failures and that Drizzle has up to 13x lower latency during recovery.
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shivaram Venkataraman
1. DRIZZLE: Low latency execution for Apache Spark
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
2. Who am I?
PhD candidate, AMPLab UC Berkeley
Dissertation: System design for large scale machine learning
Apache Spark PMC Member. Contributions to Spark core, MLlib, SparkR
3. Low latency: SPARK STREAMING
“Delivering low latency, high throughput, and stability simultaneously: Right now, our own tests indicate you can get at most two of these characteristics out of Spark Streaming at the same time.”
From https://goo.gl/wGCrtE
“How to choose right DStream batch interval”
From https://goo.gl/6UX0FW
“Getting the best performance out of a Spark Streaming application on a cluster requires a bit of tuning… Reducing the processing time of each batch of data by efficiently using cluster resources. Setting the right batch size such that the batches of data can be processed as fast as they are received….” From spark.apache.org/docs/latest/streaming-programming-guide
15. DAG scheduling
Assign tasks to hosts using
(a) locality preferences
(b) straggler mitigation
(c) fair sharing etc.
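The assignment criteria above can be sketched as a toy scheduler. This is a minimal Python sketch, assuming a hypothetical `assign_tasks` helper (not Spark's actual scheduler API), showing locality preferences with a least-loaded fallback as a crude form of fair sharing:

```python
def assign_tasks(tasks, hosts):
    """Assign each task to its preferred host when possible,
    falling back to the least-loaded host otherwise."""
    load = {h: 0 for h in hosts}
    assignment = {}
    for task_id, preferred in tasks:
        # (a) locality preference: use the preferred host if it exists
        if preferred in load:
            host = preferred
        else:
            # fallback: pick the least-loaded host (crude fair sharing)
            host = min(load, key=load.get)
        assignment[task_id] = host
        load[host] += 1
    return assignment

tasks = [("t1", "Host1"), ("t2", "Host2"), ("t3", None), ("t4", "Host1")]
print(assign_tasks(tasks, ["Host1", "Host2"]))
```

A real scheduler would also weigh straggler mitigation (speculative copies of slow tasks), which is omitted here for brevity.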
[Diagram: the Driver's Scheduler uses host metadata to serialize tasks and launch them on Host1 and Host2]
16. SCALING BATCH COMPUTATION
Cluster: 4-core r3.xlarge machines. Workload: sum of 10k numbers per core
Median-task time breakdown
[Chart: median task-time breakdown (Compute + Data Transfer, Task Fetch, Scheduler Delay) vs. number of machines (4 to 128); time in ms, 0 to 250]
17. DAG scheduling
Assign tasks to hosts using
(a) locality preferences
(b) straggler mitigation
(c) fair sharing etc.
[Same Driver/Scheduler task-launch diagram as slide 15]
Same DAG structure
for many iterations
Can reuse scheduling decisions
18. GROUP scheduling
Schedule a group of iterations at once
Fault tolerance and scheduling decisions at group boundaries
[Diagram: 1 stage in each iteration, group = 2]
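A back-of-the-envelope model (all numbers assumed, not from the talk) of why scheduling a group of iterations at once helps: the scheduler invocation cost is paid once per group rather than once per iteration, so per-iteration overhead shrinks as the group grows:

```python
SCHED_OVERHEAD_MS = 5.0  # assumed cost of one scheduler invocation
EXEC_MS = 2.0            # assumed per-iteration execution time

def time_per_iteration(group_size, iterations=100):
    """Average per-iteration time when scheduling `group_size`
    iterations per scheduler invocation."""
    groups = -(-iterations // group_size)  # ceiling division
    total = groups * SCHED_OVERHEAD_MS + iterations * EXEC_MS
    return total / iterations

for g in (1, 10, 100):
    print(g, time_per_iteration(g))  # overhead amortizes: 7.0, 2.5, 2.05
```

With these assumed numbers, group = 1 pays the full 5 ms per iteration, while group = 100 amortizes it to 0.05 ms, matching the shape of the results on the next slide.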
19. How much does this help ?
[Chart: time per iteration (ms, log scale 1 to 1000) vs. number of machines (4 to 128) for Apache Spark, Drizzle-10, Drizzle-50, and Drizzle-100]
Workload: Sum of 10k numbers per-core
Single-stage job, 100 iterations, varying Drizzle group size
26. GROUP scheduling trade-offs
group = 1 → batch processing: higher overhead, smaller window for fault tolerance
group = N → parallel operators: lower overhead, larger window for fault tolerance
27. GROUP scheduling – AUTO TUNING
Goal: smallest group size such that overhead stays below a fixed threshold
Tuning algorithm
- Measure scheduler delay and execution time per group
- If overhead > threshold, multiplicatively increase group size
- If overhead < threshold, additively decrease group size
Similar to AIMD schemes used in TCP congestion control
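The tuning loop above can be sketched as follows; the threshold value, doubling factor, and step size are assumptions for illustration, not values from the talk:

```python
THRESHOLD = 0.1  # assumed target: max fraction of time spent in the scheduler

def next_group_size(group, sched_delay_ms, exec_ms):
    """One step of the tuning loop: grow the group multiplicatively while
    scheduling overhead dominates, shrink it additively otherwise."""
    overhead = sched_delay_ms / (sched_delay_ms + exec_ms)
    if overhead > THRESHOLD:
        return group * 2          # multiplicative increase
    return max(1, group - 1)      # additive decrease, never below 1

g = 1
g = next_group_size(g, 5.0, 20.0)  # overhead 0.2 > 0.1 -> group doubles to 2
g = next_group_size(g, 5.0, 60.0)  # overhead ~0.077 < 0.1 -> shrinks to 1
print(g)
```

Note the asymmetry matches the slide: the increase is multiplicative and the decrease additive, so the group size converges from below toward the smallest size that keeps overhead under the threshold.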
29. MLLIB ALGORITHMS
Iterative patterns → Gradient Descent, PCA, …
Similar structure to streaming!
Model stored and updated as shared state
Parameter server integration
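A toy sketch of the iterative pattern above (not Drizzle's or MLlib's actual API): the model is kept as shared state, and each iteration applies the averaged gradients computed by parallel tasks, parameter-server style:

```python
def sgd_step(weights, gradients, lr=0.1):
    """Apply the average of the per-task gradients to the shared model."""
    avg = [sum(dim) / len(gradients) for dim in zip(*gradients)]
    return [w - lr * a for w, a in zip(weights, avg)]

weights = [0.0, 0.0]               # shared model state
grads = [[1.0, 3.0], [3.0, 5.0]]   # gradients from two parallel tasks
weights = sgd_step(weights, grads)
print(weights)  # -> [-0.2, -0.4]
```

Each iteration has the same shape as one streaming micro-batch (parallel tasks feeding a small state update), which is why group scheduling applies to both.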
35. OPEN SOURCE UPDATE
Spark Scheduler Improvements
- SPARK-18890, SPARK-18836, SPARK-19485
- Addresses serialization, RPC bottlenecks etc.
Design discussion to integrate Drizzle: SPARK-19487
Open source code at: https://github.com/amplab/drizzle-spark
36. Conclusion
Low latency during execution and while adapting
Drizzle: Decouple execution from centralized scheduling
Amortize overheads using group scheduling, pre-scheduling
Shivaram Venkataraman
shivaram@cs.berkeley.edu
Source Code: https://github.com/amplab/drizzle-spark