3. What is Flink

[Figure: the Flink software stack. Libraries and APIs on top: Gelly, Table, ML, SAMOA, Dataflow (WiP), MRQL, Cascading (WiP), Zeppelin, Hadoop M/R compatibility. Core APIs: DataSet (Java/Scala) and DataStream. Underneath: the streaming dataflow runtime. Deployment modes: Local, Remote, YARN, Tez, Embedded.]
4. Program compilation

Program:

  case class Path(from: Long, to: Long)

  val tc = edges.iterate(10) { paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") { (path, edge) =>
        Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
  }

Pre-flight (client): the type extraction stack and the optimizer turn the program into a dataflow graph.

[Figure: optimized dataflow graph. DataSource orders.tbl → Filter → Map and DataSource lineitem.tbl are both hash-partitioned on [0] into a Hybrid Hash Join (build HT / probe), followed by GroupRed (sort), shipped forward.]

Master: task scheduling, dataflow metadata. Workers: deploy operators, track intermediate results.

The dataflow graph is independent of whether the job is batch or streaming. The layered architecture allows plugging of components.
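To see what the transitive-closure program above computes, here is a minimal Python sketch of the same join/union/distinct step over plain sets (an illustration only, not Flink code; the `edges` set and iteration count are chosen for the example):

```python
def transitive_closure(edges, max_iterations=10):
    """Repeatedly join known paths with edges, union with the old paths,
    and deduplicate, mirroring the Flink program's iteration step."""
    paths = set(edges)
    for _ in range(max_iterations):
        # join on paths.to == edges.from, producing Path(path.from, edge.to)
        joined = {(pf, et) for (pf, pt) in paths for (ef, et) in edges if pt == ef}
        next_paths = paths | joined        # union(paths) + distinct()
        if next_paths == paths:            # fixed point reached early
            break
        paths = next_paths
    return paths

# 1 -> 2 -> 3 also yields the transitive edge 1 -> 3
print(sorted(transitive_closure({(1, 2), (2, 3)})))
```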
5. Native workload support

Flink targets four workloads:
• Streaming topologies: low latency
• Long batch pipelines: resource utilization
• Machine learning at scale: iterative algorithms
• Graph analysis: mutable state

How can an engine natively support all these workloads? And what does "native" mean?
6. E.g.: Non-native iterations

The client drives the loop: Step → Step → Step → Step → Step

  for (int i = 0; i < maxIterations; i++) {
    // Execute MapReduce job
  }

"Teaching an old elephant new tricks": treat the system as a black box.
8. Native workload support

(Slide 5 re-shown: streaming topologies with low latency, long batch pipelines with high resource utilization, machine learning at scale with iterative algorithms, graph analysis with mutable state. How can an engine natively support all these workloads, and what does "native" mean?)
9. Ingredients for “native” support

1. Execute everything as streams: pipelined execution, push model
2. Special code paths for batch: automatic job optimization, fault tolerance
3. Allow some iterative (cyclic) dataflows
4. Allow some mutable state
5. Operate on managed memory: make data processing on the JVM robust
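The first ingredient, "execute everything as streams", can be pictured with generator pipelines: each record flows through the whole operator chain as it is produced, and nothing is materialized between operators (a conceptual sketch, not Flink's runtime):

```python
def source(records):
    for r in records:
        yield r                      # emit records downstream one at a time

def filter_op(stream, predicate):
    for r in stream:
        if predicate(r):
            yield r                  # no barrier: matches are passed on immediately

def map_op(stream, fn):
    for r in stream:
        yield fn(r)

# Fully pipelined: no intermediate result is ever materialized.
pipeline = map_op(filter_op(source(range(10)), lambda x: x % 2 == 0),
                  lambda x: x * x)
print(list(pipeline))                # squares of the even numbers
```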
13. Expressive APIs

  case class Word(word: String, frequency: Int)

DataStream API (streaming):

  val lines: DataStream[String] = env.fromSocketStream(...)
  lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
    .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
    .groupBy("word").sum("frequency")
    .print()

DataSet API (batch):

  val lines: DataSet[String] = env.readTextFile(...)
  lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
    .groupBy("word").sum("frequency")
    .print()
14. Checkpointing / Recovery

Chandy-Lamport algorithm for consistent asynchronous distributed snapshots. Flink pushes checkpoint barriers through the dataflow.

[Figure: a data stream with a barrier marker. Records before the barrier are part of the snapshot; records after it are not (they are backed up until the next snapshot).]

Guarantees exactly-once processing.
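A toy illustration of the barrier idea: a counting operator processes a stream interleaved with barrier markers and records its state each time a barrier arrives, so every snapshot reflects exactly the records seen before that barrier (a deliberately simplified single-operator sketch; real Chandy-Lamport snapshots also align barriers across multiple inputs):

```python
BARRIER = object()   # hypothetical marker injected into the stream

def process_with_snapshots(stream):
    """Count records; snapshot the count whenever a barrier passes."""
    count = 0
    snapshots = []
    for item in stream:
        if item is BARRIER:
            snapshots.append(count)   # state covers everything before this barrier
        else:
            count += 1                # records after the barrier belong to the next snapshot
    return snapshots

# 3 records, barrier, 2 more records, barrier
print(process_with_snapshots(["a", "b", "c", BARRIER, "d", "e", BARRIER]))
```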
16. Batch on a streaming engine

[Figure: a file in HDFS flows through Filter → Map → Result 1 and Map → Result 2.]

A batch program, completely pipelined: data is never materialized anywhere (in this example).
17. Batch on a streaming engine

[Figure: a small data source streams through map operators into data sinks, all running in parallel. A large data source feeds the probe side of a join operator, which runs in parallel with a downstream map once the build side has finished.]
18. Batch processing requirements

Get the data processed as fast as possible:
• Automatic job optimizer
• Efficient memory management

Robust processing:
• Provide fault tolerance
• Again, memory management
19. Optimizer

Cost-based optimizer:
• Selects the data shipping strategy (forward, partition, broadcast)
• Selects the local execution strategy (sort-merge join / hash join)
• Caches loop-invariant data (iterations)

(Re-shows the transitive-closure program and the pre-flight compilation figure from slide 4: the type extraction stack and the optimizer produce the dataflow graph with DataSource orders.tbl → Filter → Map, DataSource lineitem.tbl, a Hybrid Hash Join with build HT / probe on hash-part [0], and GroupRed with sort, shipped forward.)
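The "Hybrid Hash" join in the plan builds a hash table on one input and streams the other through it. A minimal in-memory sketch of the build/probe phases (the real operator degrades gracefully and spills to disk under memory pressure, which this sketch omits; the sample tuples are invented for illustration):

```python
def hash_join(build_side, probe_side, build_key, probe_key):
    """Build a hash table from one input, then probe it with the other."""
    table = {}
    for row in build_side:                     # build phase: consume build side fully
        table.setdefault(build_key(row), []).append(row)
    results = []
    for row in probe_side:                     # probe phase: can run pipelined
        for match in table.get(probe_key(row), []):
            results.append((match, row))
    return results

orders = [(1, "o1"), (2, "o2")]
lineitems = [(1, "l1"), (1, "l2"), (3, "l3")]
print(hash_join(orders, lineitems, lambda o: o[0], lambda l: l[0]))
```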
20. Two execution plans

[Figure: two plans for the same program. Plan A: broadcast the filtered/mapped orders.tbl input to the Hybrid Hash Join (build HT), ship lineitem.tbl forward (probe), and pre-aggregate with a Combine before GroupRed (sort). Plan B: hash-partition both inputs on [0] into the join, then hash-partition on [0,1] before GroupRed (sort), shipped forward.]

The best plan depends on the relative sizes of the input files.
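The optimizer's choice between the two plans can be sketched as a cost comparison: broadcasting copies the smaller input to every worker, while hash-partitioning ships both inputs over the network once (a hypothetical, simplified cost model for illustration, not Flink's actual cost functions):

```python
def pick_shipping_strategy(small_size, large_size, parallelism):
    """Compare estimated network volume of broadcast vs. repartition."""
    broadcast_cost = small_size * parallelism   # small input copied to every worker
    partition_cost = small_size + large_size    # both inputs shuffled once
    return "broadcast" if broadcast_cost < partition_cost else "hash-partition"

# A tiny build side favors broadcasting; comparable sizes favor partitioning.
print(pick_shipping_strategy(small_size=10, large_size=10_000, parallelism=40))
print(pick_shipping_strategy(small_size=5_000, large_size=10_000, parallelism=40))
```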
25. Iterate in the Dataflow

API and runtime support, with automatic caching of loop-invariant data:

  IterationState state = getInitialState();
  while (!terminationCriterion()) {
    state = step(state);
  }
  setFinalState(state);
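As a concrete instance of the loop above, a simple fixed-point computation (here, Newton's method for a square root, chosen only as an example) shows the step/termination structure that native iterations keep inside the dataflow rather than in the client:

```python
def iterate(initial_state, step, converged, max_iterations=100):
    """Generic iteration skeleton: state stays inside the loop,
    and a termination criterion decides when to stop."""
    state = initial_state
    for _ in range(max_iterations):
        next_state = step(state)
        if converged(state, next_state):
            break
        state = next_state
    return state

# Newton's method for sqrt(2) as the step function
result = iterate(1.0,
                 step=lambda x: 0.5 * (x + 2.0 / x),
                 converged=lambda a, b: abs(a - b) < 1e-12)
print(round(result, 6))
```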
26. Example: Matrix Factorization

Factorizing a matrix with 28 billion ratings for recommendations.
More at: http://data-artisans.com/computing-recommendations-with-flink.html

Setups:
• 40 medium instances ("n1-highmem-8": 8 cores, 52 GB)
• 40 large instances ("n1-highmem-16": 16 cores, 104 GB)
27. Flink ML – Machine Learning

Provides a complete toolchain:
• scikit-learn style pipelining
• Data pre-processing

Various algorithms:
• Recommendations: ALS
• Supervised learning: Support Vector Machines
• …

ML on streams: SAMOA. We are planning to add streaming support to Flink ML.
30. Iterate natively with state/deltas

Keep state in a controlled way by using a partitioned hash map. This relaxes the immutability assumption of batch processing.
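The partitioned hash map can be pictured as per-partition dictionaries keyed by a hash function; each update touches only the partition owning the key, and only actual changes produce deltas for the next superstep (an illustrative sketch, not Flink's delta-iteration API):

```python
class PartitionedState:
    """Mutable state split into partitions, as in a delta iteration."""
    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _owner(self, key):
        # the partitioning function routes each key to exactly one partition
        return self.partitions[hash(key) % len(self.partitions)]

    def update(self, key, value):
        part = self._owner(key)
        if part.get(key) != value:
            part[key] = value        # only changed entries produce work
            return True              # a delta: fed back into the next superstep
        return False                 # unchanged: no work propagates

state = PartitionedState(num_partitions=4)
print(state.update("v1", 3))   # new value, produces a delta
print(state.update("v1", 3))   # unchanged, no delta
```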
31. … fast graph analysis

More at: http://data-artisans.com/data-analysis-with-flink.html
32. Gelly – Graph Processing API

Transformations: map, filter, subgraph, union, reverse, undirected
Mutations: add vertex/edge, remove, …
Pregel-style vertex-centric iterations
Library of algorithms
Utilities: special data types, loading, graph properties
33. Gelly and Flink ML:

Available in Flink 0.9 (so far only a beta release), and still under heavy development.
Both seamlessly integrate with the DataSet abstraction: preprocess data as needed, use the results as needed.
An easy entry point for new contributors.
35. Flink Meetup Groups

SF Spark and Friends
• June 16, San Francisco
Bay Area Flink Meetup
• June 17, Redwood City
Chicago Flink Meetup
• June 30
Also: Stockholm, Sweden and Berlin, Germany
36. Flink Forward registration & call for abstracts is open now

• 12/13 October 2015
• Meet developers and users of Flink!
• With Flink workshops / trainings!

flink.apache.org
40. What is Apache Flink?

[Figure: data sources feeding Flink (master): event logs and real-time data streams (Kafka, RabbitMQ, ...) as well as historic data (HDFS, JDBC, ...). Workloads: ETL, graphs, machine learning, relational, …; plus low latency, windowing, aggregations, ...]
41. Cornerpoints of Flink Design

Robust algorithms on managed memory:
• No OutOfMemory errors
• Scales to very large JVMs
• Efficient and robust processing

Pipelined execution of batch programs:
• Better shuffle performance

Flexible data streaming engine:
• Low-latency stream processing
• Highly flexible windows

Native iterations:
• Very fast graph processing
• Stateful iterations for ML

High-level APIs, beyond key/value pairs:
• Java/Scala/Python (upcoming)
• Relational-style optimizer

Graphs / machine learning:
• Streaming ML (coming)
• Scales to very large groups
• Active library development
42. Defining windows in Flink

Trigger policy
• When to trigger the computation on the current window
Eviction policy
• When data points should leave the window
• Defines window width/size

E.g., a count-based policy:
• Evict when #elements > n
• Start a new window every n-th element

Built-in: Count, Time, Delta policies
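The count-based policy pair from the slide (evict when #elements > n, trigger every n-th element) can be sketched as a small window operator (a conceptual model only; Flink's actual trigger/eviction policy interfaces differ):

```python
def count_windows(stream, trigger_every, max_size):
    """Emit an aggregate of the current window every `trigger_every`
    elements, evicting the oldest elements beyond `max_size`."""
    window, since_trigger, results = [], 0, []
    for item in stream:
        window.append(item)
        if len(window) > max_size:          # eviction policy: bound window width
            window.pop(0)
        since_trigger += 1
        if since_trigger == trigger_every:  # trigger policy: fire the computation
            results.append(sum(window))
            since_trigger = 0
    return results

# window of at most 3 elements, triggered every 2 elements
print(count_windows([1, 2, 3, 4, 5, 6], trigger_every=2, max_size=3))
```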
Flink is an entire software stack.
At the heart: the streaming dataflow engine; think of programs as operators and data flows.
Kappa architecture: run batch programs on a streaming system.
Table API: logical representation, SQL-style.
SAMOA: “on-line learners”.
Toy program: native transitive closure.
Type extraction: the types that go in and out of each operator.
Flink is an analytical system.
Streaming topology: real-time, low latency.
“Native”: built-in support in the system, no working around it, no black box.
Next slide: define native by some “non-native” examples.
Used for machine learning: run the same job over the data multiple times to come up with parameters for an ML model.
This is how you do it when treating the engine as a black box.
If you only have a batch processor, you run a lot of small batch jobs.
LIMITATION: no state across the small jobs (batches).
Corner points / requirements for Flink:
Keep data in motion, avoid materialization.
Even though it is a streaming runtime, have special paths for batch: OPTIMIZER, CHECKPOINTING.
Make the system aware of cyclic data flows, in a controlled way.
Allow operators to have some state, in a controlled way (DELTA-ITERATIONS); relax the “traditional” batch assumption.
Flink runs in the JVM, but we want control over memory, not to rely on GC.
Explain Flink by use case.
Pipelined execution: the logical way to go for low latency.
No synchronization barriers; records keep flowing.
Streaming shuffle (for example with hash codes), push model.
Maintain state inside long-lived operators (!= mini-batch).
Nice, fluent APIs known from the batch world; window definitions.
Low-overhead snapshots using “batched” snapshots.
Exactly-once processing guarantees (without doing mini-batches).
How does it work:
Periodically push barriers through the streams.
When a barrier reaches an operator, snapshot its state.
When a barrier reaches the sinks, a checkpoint is completed (secured).
Multiple parallel checkpoints chop the stream into generations (pre-checkpoint, post-checkpoint).
structure,
different title
Measure the effect of pipeline parallelism.
Blocking happens in operators (join build side).
In Flink, operators run at the same time, so we need to control memory.
Example: hash join. Robust in memory, graceful behavior; needed for machine learning.
Functions encapsulate the transformation (the “Hadoop job”).
Deploy this once and keep it running across iterations (you can also keep state).
Allows feeding data back to the beginning.
1 TB of input data; many terabytes of intermediate data.
40-machine cluster @ Google Compute.
SVM = supervised learning; ALS = recommendation.
Keep mutable state in a controlled way, with a hash map locally on each machine.
Great documentation on the Flink website.
Dev list: 300-400 messages/month; record 1000 messages on
structure,
different title
Visualization of the dataflow graph in Flink.
Explain: sources, maps, binary operators (join, …).
This is an actual example of a Flink job we have seen from an industry user: candidate flights out of all available flights.
You need windows in stream programs: grouping, for example, is impossible on an infinite stream.
Pipelined data exchange: needed for low latency.
In batch: pipelined or blocking, if we are optimizing (for example, we do not want all operators online at the same time).
Overlapping operators: operators start as soon as data is available; join and co-group start early (co-group starts sorting incoming data).