Batch and Stream Graph Processing with Apache Flink

Batch & Stream
Graph Processing
with Apache Flink
Vasia Kalavri
vasia@apache.org
@vkalavri

Apache Flink
• An open-source, distributed data analysis framework
• True streaming at its core
• Streaming & Batch API
2
Historic data
Kafka, RabbitMQ, ...
HDFS, JDBC, ...
Event logs
ETL, Graphs, 
Machine Learning 
Relational, …
Low latency, 
windowing,
aggregations, ...

Integration (picture not complete)
POSIX Java/Scala 
Collections
POSIX

Why Stream Processing?
• Most problems have streaming nature
• Stream processing gives lower latency
• Data volumes more easily tamed
4
Event stream

Batch and Streaming
Pipelined and 
blocking operators Streaming Dataflow Runtime
Batch Parameters
DataSet DataStream
Relational 
Optimizer
Window 
Optimization
Pipelined and 
windowed operators
Schedule lazily
Schedule eagerly
Recompute whole 
operators Periodic checkpoints
Streaming data movement
Stateful operations
DAG recovery
Fully buffered streams DAG resource management
Streaming Parameters

Flink APIs
6
case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.readFromKafka(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
.keyBy("word”).timeWindow(Time.of(5,SECONDS)).sum("frequency")
.print()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
.groupBy("word").sum("frequency”)
.print()
DataSet API (batch):
DataStream API (streaming):

Working with Windows
7
Why windows?
We are often interested in fresh data!15 38 65 88 110 120
#sec
40 80
SUM #2
0
SUM #1
20 60 100 120
15 38 65 88
1) Tumbling windows
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS));
#sec
40 80
SUM #3
SUM #2
0
SUM #1
20 60 100
15 38
38 65
65 88
Time.of(60, TimeUnit.SECONDS),
2) Sliding windows
window buckets/panes

Working with Windows
7
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently
under different notions of time and deal with late events!
15 38 65 88 110 120
#sec
40 80
SUM #2
0
SUM #1
20 60 100 120
15 38 65 88
1) Tumbling windows
#sec
40 80
SUM #3
SUM #2
0
SUM #1
20 60 100
15 38
38 65
65 88
Time.of(60, TimeUnit.SECONDS),
2) Sliding windows
window buckets/panes

Flink Stack
Gelly
Table
ML
SAMOA
DataSet (Java/Scala) DataStream (Java/Scala)
HadoopM/R
Local Remote Yarn Embedded
Dataflow
Dataflow(WiP)
Table
Cascading
Streaming dataflow runtime
CEP
8

Meet Gelly
• Java & Scala Graph APIs on top of Flink
• graph transformations and utilities
• iterative graph processing
• library of graph algorithms
• Can be seamlessly mixed with the DataSet Flink
API to easily implement applications that use
both record-based and graph-based analysis
10

Hello, Gelly!
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
new ConnectedComponents(maxIterations));
val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
Java
Scala
11

Graph Methods
Graph Properties
getVertexIds
getEdgeIds
numberOfVertices
numberOfEdges
getDegrees
...
12
Transformations
map, filter, join
subgraph, union,
difference
reverse, undirected
getTriplets
Mutations
add vertex/edge
remove vertex/edge

Neighborhood Methods
graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)
13

Iterative Graph Processing
• Gelly offers iterative graph processing abstractions
on top of Flink’s Delta iterations
• Based on the BSP, vertex-centric model
• scatter-gather
• gather-sum-apply
• vertex-centric (pregel)*
• partition-centric*
14

Scatter-Gather Iterations
• MessagingFunction:
generate message for
other vertices
• VertexUpdateFunction:
update vertex value based
on received messages
15
Scatter Gather

Gather-Sum-Apply Iterations
• Gather: compute one
value per edge
• Sum: combine the
partial values of Gather
to a single value
• Apply: update the vertex
value, based on the Sum
and the current value
16
Gather ApplySum

Library of Algorithms
• PageRank*
• Single Source Shortest Paths*
• Label Propagation
• Weakly Connected Components*
• Community Detection
• Triangle Count & Enumeration
• Graph Summarization
• val ranks = inputGraph.run(new PageRank(0.85, 20))
• *: both scatter-gather and GSA implementations
17

Gelly-Stream
single-pass stream graph
processing with Flink

Real Graphs are dynamic
Graphs are created from events happening in real-time
19

Gelly on Streams
21
DataStreamDataSet
Distributed Dataﬂow
Deployment
DataStream

Gelly on Streams
21
DataStreamDataSet
Deployment
Gelly
• Static Graphs
• Multi-Pass Algorithms
• Full Computations
DataStream

Gelly on Streams
21
DataStreamDataSet
Deployment
Gelly Gelly-Stream
• Static Graphs
DataStream

Gelly on Streams
21
DataStreamDataSet
Deployment
Gelly Gelly-Stream
• Static Graphs
• Dynamic Graphs
• Single-Pass Algorithms
• Approximate Computations
DataStream

Batch vs. Stream Graph
Processing
22
Batch Stream
Input Graph static dynamic
Analysis on a snapshot continuous
Response
after job
completion
immediately

Graph Streaming Challenges
• Maintain the graph structure
• How to apply state updates efﬁciently?
• Result updates
• Re-run the analysis for each event?
• Design an incremental algorithm?
• Run separate instances on multiple snapshots?
• Computation on most recent events only
23

Single-Pass Graph Streaming
• Each event is an edge addition
• Maintain only a graph summary
• Recent events are grouped in graph
windows
24

Graph Summaries
• spanners for distance estimation
• sparsiﬁers for cut estimation
• sketches for homomorphic properties
graph summary
algorithm algorithm~R1 R2
25

Batch Connected Components
• State: the graph and a component ID per vertex
(initially equal to vertex ID)
• Iterative Computation: For each vertex:
• choose the min of neighbors’ component IDs and own
component ID as new ID
• if component ID changed since last iteration, notify neighbors
27

1
43
2
5
6
7
8
i=0
28

1
11
2
2
6
6
6
i=1
29

1
11
1
5
6
6
6
i=2
30
1

1
11
1
1
6
6
6
i=3
31

Stream Connected Components
• State: a disjoint set data structure for the
components
• Computation: For each edge
• if seen for the 1st time, create a component with ID the min of
the vertex IDs
• if in different components, merge them and update the
component ID to the min of the component IDs
• if only one of the endpoints belongs to a component, add the
other one to the same component
32

31
52
54
76
86
ComponentID Vertices
1
43
2
5
6
7
8
33

31
52
54
76
86
42
1 1, 3
1
43
2
5
6
7
8
34

31
52
54
76
86
42
43
2 2, 5
1 1, 3
1
43
2
5
6
7
8
35

31
52
54
76
86
42
43
87
2 2, 4, 5
1 1, 3
1
43
2
5
6
7
8
36

31
52
54
76
86
42
43
87
41
2 2, 4, 5
1 1, 3
6 6, 7
1
43
2
5
6
7
8
37

52
54
76
86
42
43
87
41
2 2, 4, 5
1 1, 3
6 6, 7, 8
1
43
2
5
6
7
8
38

54
76
86
42
43
87
41 ComponentID Vertices
2 2, 4, 5
1 1, 3
6 6, 7, 8
1
43
2
5
6
7
8
39

76
86
42
43
87
41
2 2, 4, 5
1 1, 3
6 6, 7, 8
1
43
2
5
6
7
8
40

76
86
42
43
87
41
6 6, 7, 8
1 1, 2, 3, 4, 5
1
43
2
5
6
7
8
41

86
42
43
87
41
6 6, 7, 8
1 1, 2, 3, 4, 5
1
43
2
5
6
7
8
42

42
43
87
41
6 6, 7, 8
1 1, 2, 3, 4, 5
1
43
2
5
6
7
8
43

Distributed Stream Connected
Components
44

API Requirements
• Continuous aggregations on edge
streams
• Global graph aggregations
• Support for windowing
45

Introducing Gelly-Stream
46
Gelly-Stream enriches the DataStream API with two new additional ADTs:
• GraphStream:
• A representation of a data stream of edges.
• Edges can have state (e.g. weights).
• Supports property streams, transformations and aggregations.
• GraphWindow:
• A “time-slice” of a graph stream.
• It enables neighborhood aggregations

GraphStream Operations
47
.getEdges()
.getVertices()
.numberOfVertices()
.numberOfEdges()
.getDegrees()
.inDegrees()
.outDegrees()
GraphStream -> DataStream
.mapEdges();
.distinct();
.filterVertices();
.filterEdges();
.reverse();
.undirected();
.union();
GraphStream -> GraphStream
Property Streams Transformations

Graph Stream Aggregations
48
result
aggregate
property streamgraph
stream
(window) fold
combine
fold
reduce
local
summaries
global
summary
edges
agg
global aggregates
can be persistent or transient
graphStream.aggregate(
new MyGraphAggregation(window, fold, combine, transform))

Graph Stream Aggregations
49
result
aggregate
property stream
graph
stream
(window) fold
combine transform
fold
reduce map
local
summaries
global
summary
edges
agg
graphStream.aggregate(
new MyGraphAggregation(window, fold, combine, transform))

Connected Components
50
graph
stream #components

50
graph
stream
1
43
2
5
6
7
8
#components

50
graph
stream
31
52
1
43
2
5
6
7
8
#components

51
graph
stream
{1,3}
{2,5}
1
43
2
5
6
7
8
#components

52
graph
stream
{1,3}
{2,5}
54
1
43
2
5
6
7
8
#components

53
graph
stream
{1,3}
{2,5}
{4,5}
76
86
1
43
2
5
6
7
8
#components

54
graph
stream
{1,3}
{2,5}
{4,5}
{6,7}
{6,8}
1
43
2
5
6
7
8
#components

54
graph
stream
{1,3}
{2,5}
{4,5}
{6,7}
{6,8}
1
43
2
5
6
7
8
#components
window
triggers

55
graph
stream
{2,5}
{6,8}
{1,3}
{4,5}
{6,7}
1
43
2
5
6
7
8
#components

55
graph
stream
{2,5}
{6,8}
{1,3}
{4,5}
{6,7}
3
1
43
2
5
6
7
8
#components

56
graph
stream
{1,3}
{2,4,5}
{6,7,8}
1
43
2
5
6
7
8
#components

56
graph
stream
{1,3}
{2,4,5}
{6,7,8}
3
1
43
2
5
6
7
8
#components

57
graph
stream
{1,3}
{2,4,5}
{6,7,8}
42
43
1
43
2
5
6
7
8
#components

58
graph
stream
{1,3}
{2,4,5}
{6,7,8}{2,4}
{3,4}
41
87
1
43
2
5
6
7
8
#components

59
graph
stream
{1,3}
{2,4,5}
{6,7,8}{1,2,4}
{3,4}
{7,8}
1
43
2
5
6
7
8
#components

59
graph
stream
{1,3}
{2,4,5}
{6,7,8}{1,2,4}
{3,4}
{7,8}
1
43
2
5
6
7
8
#components
window
triggers

60
graph
stream
{1,2,4,5}
{6,7,8}
{3,4}
{7,8}
1
43
2
5
6
7
8
#components

60
graph
stream
{1,2,4,5}
{6,7,8}
2
{3,4}
{7,8}
1
43
2
5
6
7
8
#components

61
graph
stream
{1,2,3,4,5}
{6,7,8}
1
43
2
5
6
7
8
#components

61
graph
stream
{1,2,3,4,5}
{6,7,8}
2
1
43
2
5
6
7
8
#components

Slicing Graph Streams
62
graphStream.slice(Time.of(1, MINUTE));
11:40 11:41 11:42 11:43

Aggregating Slices
63
graphStream.slice(Time.of(1, MINUTE), direction)
• Slicing collocates edges by vertex
information
• Neighborhood aggregations on sliced
graphs
source
target

Aggregating Slices
63
graphStream.slice(Time.of(1, MINUTE), direction)
.reduceOnEdges();
.foldNeighbors();
.applyOnNeighbors();
• Slicing collocates edges by vertex
information
• Neighborhood aggregations on sliced
graphs
source
target
Aggregations

Finding Matches Nearby
64
graphStream.filterVertices(GraphGeeks())
.slice(Time.of(15, MINUTE), EdgeDirection.IN)
.applyOnNeighbors(FindPairs())
GraphStream :: graph geek check-ins
wendy checked_in soap_bar
steve checked_in soap_bar
tom checked_in joe’s_grill
sandra checked_in soap_bar
rafa checked_in joe’s_grill

64
slice
wendy
steve
sandra
soap
bar
tom
rafa
joe’s
grill
GraphWindow :: user-place

64
slice
wendy
steve
sandra
soap
bar
tom
rafa
joe’s
grill
FindPairs
{wendy, steve}
{steve, sandra}
{wendy, sandra}
{tom, rafa}
GraphWindow :: user-place

What’s next?
• Integration with Neo4j (Input / Output)
• OpenCypher on Flink/Gelly
• Pregel and Partition-Centric Iterations
• Integration with Graphalytics

Feeling Gelly?
• Gelly Guide
https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html
• Gelly-Stream Repository
https://github.com/vasia/gelly-streaming
• Gelly-Stream talk @FOSDEM16
https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/
• An interesting read
http://people.cs.umass.edu/~mcgregor/papers/13-graphsurvey.pdf
• A cool thesis
http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-170425

Batch and Stream Graph Processing with Apache Flink

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie Batch and Stream Graph Processing with Apache Flink

Ähnlich wie Batch and Stream Graph Processing with Apache Flink (20)

Mehr von Vasia Kalavri

Mehr von Vasia Kalavri (9)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Batch and Stream Graph Processing with Apache Flink