4. Why Stream Processing?
• Most problems have streaming nature
• Stream processing gives lower latency
• Data volumes more easily tamed
4
Event stream
5. Batch and Streaming
Pipelined and
blocking operators Streaming Dataflow Runtime
Batch Parameters
DataSet DataStream
Relational
Optimizer
Window
Optimization
Pipelined and
windowed operators
Schedule lazily
Schedule eagerly
Recompute whole
operators Periodic checkpoints
Streaming data movement
Stateful operations
DAG recovery
Fully buffered streams DAG resource management
Streaming Parameters
6. Flink APIs
6
case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.readFromKafka(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
.keyBy("word”).timeWindow(Time.of(5,SECONDS)).sum("frequency")
.print()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
.groupBy("word").sum("frequency”)
.print()
DataSet API (batch):
DataStream API (streaming):
7. Working with Windows
7
Why windows?
We are often interested in fresh data!15 38 65 88 110 120
#sec
40 80
SUM #2
0
SUM #1
20 60 100 120
15 38 65 88
1) Tumbling windows
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS));
#sec
40 80
SUM #3
SUM #2
0
SUM #1
20 60 100
15 38
38 65
65 88
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS),
Time.of(20, TimeUnit.SECONDS));
2) Sliding windows
window buckets/panes
8. Working with Windows
7
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently
under different notions of time and deal with late events!
15 38 65 88 110 120
#sec
40 80
SUM #2
0
SUM #1
20 60 100 120
15 38 65 88
1) Tumbling windows
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS));
#sec
40 80
SUM #3
SUM #2
0
SUM #1
20 60 100
15 38
38 65
65 88
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS),
Time.of(20, TimeUnit.SECONDS));
2) Sliding windows
window buckets/panes
11. Meet Gelly
• Java & Scala Graph APIs on top of Flink
• graph transformations and utilities
• iterative graph processing
• library of graph algorithms
• Can be seamlessly mixed with the DataSet Flink
API to easily implement applications that use
both record-based and graph-based analysis
10
12. Hello, Gelly!
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
new ConnectedComponents(maxIterations));
val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
Java
Scala
11
15. Iterative Graph Processing
• Gelly offers iterative graph processing abstractions
on top of Flink’s Delta iterations
• Based on the BSP, vertex-centric model
• scatter-gather
• gather-sum-apply
• vertex-centric (pregel)*
• partition-centric*
14
17. Gather-Sum-Apply Iterations
• Gather: compute one
value per edge
• Sum: combine the
partial values of Gather
to a single value
• Apply: update the vertex
value, based on the Sum
and the current value
16
Gather ApplySum
18. Library of Algorithms
• PageRank*
• Single Source Shortest Paths*
• Label Propagation
• Weakly Connected Components*
• Community Detection
• Triangle Count & Enumeration
• Graph Summarization
• val ranks = inputGraph.run(new PageRank(0.85, 20))
• *: both scatter-gather and GSA implementations
17
34. Batch vs. Stream Graph
Processing
22
Batch Stream
Input Graph static dynamic
Analysis on a snapshot continuous
Response
after job
completion
immediately
35. Graph Streaming Challenges
• Maintain the graph structure
• How to apply state updates efficiently?
• Result updates
• Re-run the analysis for each event?
• Design an incremental algorithm?
• Run separate instances on multiple snapshots?
• Computation on most recent events only
23
36. Single-Pass Graph Streaming
• Each event is an edge addition
• Maintain only a graph summary
• Recent events are grouped in graph
windows
24
37. Graph Summaries
• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties
graph summary
algorithm algorithm~R1 R2
25
39. Batch Connected Components
• State: the graph and a component ID per vertex
(initially equal to vertex ID)
• Iterative Computation: For each vertex:
• choose the min of neighbors’ component IDs and own
component ID as new ID
• if component ID changed since last iteration, notify neighbors
27
44. Stream Connected Components
• State: a disjoint set data structure for the
components
• Computation: For each edge
• if seen for the 1st time, create a component with ID the min of
the vertex IDs
• if in different components, merge them and update the
component ID to the min of the component IDs
• if only one of the endpoints belongs to a component, add the
other one to the same component
32
58. Introducing Gelly-Stream
46
Gelly-Stream enriches the DataStream API with two new additional ADTs:
• GraphStream:
• A representation of a data stream of edges.
• Edges can have state (e.g. weights).
• Supports property streams, transformations and aggregations.
• GraphWindow:
• A “time-slice” of a graph stream.
• It enables neighborhood aggregations
60. Graph Stream Aggregations
48
result
aggregate
property streamgraph
stream
(window) fold
combine
fold
reduce
local
summaries
global
summary
edges
agg
global aggregates
can be persistent or transient
graphStream.aggregate(
new MyGraphAggregation(window, fold, combine, transform))
61. Graph Stream Aggregations
49
result
aggregate
property stream
graph
stream
(window) fold
combine transform
fold
reduce map
local
summaries
global
summary
edges
agg
graphStream.aggregate(
new MyGraphAggregation(window, fold, combine, transform))