Demystifying Distributed Graph Processing

Vasia Kalavri
vasia@apache.org
@vkalavri

WHY DISTRIBUTED GRAPH PROCESSING?

MISCONCEPTION #1
"MY GRAPH IS SO BIG, IT DOESN’T FIT ON A SINGLE MACHINE"
(Big Data Ninja)

YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL
▸ Naive Who(m) to Follow:
    ▸ compute a friends-of-friends list per user
    ▸ exclude existing friends
    ▸ rank by common connections
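
The three steps above can be sketched in a few lines of Python to make the intermediate-data problem visible. This is a minimal single-machine illustration, not the implementation behind the slide; the function name `who_to_follow`, the `top_k` parameter, and the dict-of-sets input format are assumptions made for the example.

```python
from collections import Counter
from itertools import permutations


def who_to_follow(friends, top_k=3):
    """Naive recommendation sketch.

    friends: dict mapping each user to the set of their friends
             (assumed symmetric, i.e. an undirected graph).
    Returns (recommendations, intermediate_size).
    """
    # Step 1: friends-of-friends. Every ordered pair (a, b) of friends of a
    # common user u is a candidate "a could follow b". This is the
    # intermediate data, and it grows with the square of each user's degree.
    candidates = Counter()
    for u, nbrs in friends.items():
        for a, b in permutations(nbrs, 2):
            candidates[(a, b)] += 1
    intermediate_size = sum(candidates.values())

    # Step 2: exclude existing friends.
    per_user = {}
    for (a, b), common in candidates.items():
        if b not in friends[a]:
            per_user.setdefault(a, []).append((common, b))

    # Step 3: rank by number of common connections (ties broken by user id).
    recommendations = {
        a: [b for _, b in sorted(lst, key=lambda t: (-t[0], t[1]))[:top_k]]
        for a, lst in per_user.items()
    }
    return recommendations, intermediate_size
```

For a star graph with one hub and 100 leaves, the input has only 100 edges, but step 1 already produces 100 × 99 candidate pairs. The input size says little about the amount of data the computation has to move.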

MISCONCEPTION #2
"DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE"
(Data Science Rockstar)

GRAPHS DON’T APPEAR OUT OF THIN AIR
[Figure: Expectation…]
[Figure: Reality!]

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

GRAPH APPLICATIONS ARE DIVERSE
▸ Iterative value propagation
    ▸ PageRank, Connected Components, Label Propagation
▸ Traversals and path exploration
    ▸ Shortest paths, centrality measures
▸ Ego-network analysis
    ▸ Personalized recommendations
▸ Pattern mining
    ▸ Finding frequent subgraphs

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
Focus of these systems: iterative value propagation

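PREGEL: THINK LIKE A VERTEX
[Figure: a five-vertex example graph (vertices 1 to 5), stored as one adjacency list per vertex: 1 -> 3, 4; 2 -> 1, 4; 5 -> 3; ...]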

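PREGEL: SUPERSTEPS
(V_{i+1}, outbox) <- compute(V_i, inbox)
[Figure: the per-vertex adjacency lists (1 -> 3, 4; 2 -> 1, 4; 5 -> 3; ...) shown in superstep i and again in superstep i+1; messages sent in superstep i are delivered in superstep i+1]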
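
The superstep loop can be sketched as a small single-process driver. This is a simplified illustration of the model, not Pregel's implementation: the names `run_supersteps` and `max_value`, the `(target, message)` outbox format, and the four-vertex ring used below are assumptions for the example. The per-vertex function propagates the largest value in the graph.

```python
def run_supersteps(values, out_edges, compute, max_supersteps=50):
    """Run compute() on every active vertex, one superstep at a time.

    values:    dict vertex id -> initial value
    out_edges: dict vertex id -> list of target vertex ids
    compute:   function (value, inbox, neighbors, step) -> (new_value, outbox),
               where outbox is a list of (target, message) pairs
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    for step in range(max_supersteps):
        next_inbox = {v: [] for v in values}
        messages_sent = 0
        for v in values:
            # After the first superstep, a vertex without messages is inactive.
            if step > 0 and not inbox[v]:
                continue
            values[v], outbox = compute(values[v], inbox[v],
                                        out_edges.get(v, []), step)
            # Messages only become visible in the next superstep.
            for target, message in outbox:
                next_inbox[target].append(message)
                messages_sent += 1
        inbox = next_inbox
        if messages_sent == 0:  # nothing in flight: the computation halts
            return values, step + 1
    return values, max_supersteps


def max_value(value, inbox, neighbors, step):
    """Keep the largest value seen; notify neighbors only when it changes."""
    new_value = max([value] + inbox)
    if step == 0 or new_value != value:
        return new_value, [(n, new_value) for n in neighbors]
    return new_value, []


# Hypothetical directed ring 1 -> 2 -> 3 -> 4 -> 1 with arbitrary start values.
ring = {1: [2], 2: [3], 3: [4], 4: [1]}
result, supersteps = run_supersteps({1: 3, 2: 6, 3: 2, 4: 1}, ring, max_value)
```

A vertex never reads another vertex's state directly. It sees only its own value and its inbox, which is what lets the vertices of one superstep run in parallel.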

PREGEL EXAMPLE: PAGERANK
void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numOutEdges())
    end for
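
A runnable counterpart of the pseudocode, as a single-process Python sketch. Three things here are assumptions rather than part of the slide: the function name `pregel_pagerank`, the requirement that every vertex has at least one out-edge, and skipping the rank update in the first superstep (when no messages have arrived yet) so that the initial ranks are not overwritten.

```python
def pregel_pagerank(out_edges, num_supersteps=30, damping=0.85):
    """Vertex-centric PageRank over a fixed number of supersteps.

    out_edges: dict vertex id -> non-empty list of target vertex ids;
               every target must also appear as a key.
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    for step in range(num_supersteps):
        next_inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            # compute(messages): sum up received messages, update the rank
            if step > 0:
                rank[v] = (1.0 - damping) / n + damping * sum(inbox[v])
            # distribute the rank evenly over the out-edges
            share = rank[v] / len(targets)
            for t in targets:
                next_inbox[t].append(share)
        inbox = next_inbox
    return rank


# Hypothetical three-vertex graph used only for illustration.
example = {1: [2, 3], 2: [3], 3: [1]}
ranks = pregel_pagerank(example, num_supersteps=50)
```

With no dangling vertices the ranks keep summing to 1 after every superstep, which is a convenient sanity check.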

SIGNAL-COLLECT
Signal:  outbox <- signal(V_i)
Collect: V_{i+1} <- collect(inbox)
[Figure: the per-vertex adjacency lists (1 -> 3, 4; 2 -> 1, 4; 5 -> 3; ...) shown for the Signal phase and the Collect phase, between superstep i and superstep i+1]

SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numOutEdges())
    end for
void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
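
The same computation split into two functions, as a hedged Python sketch of the signal-collect style. The driver `signal_collect_pagerank`, the function signatures, and the example graph are assumptions for illustration; each iteration runs signal on every vertex and then collect on every vertex.

```python
def signal(rank, out_neighbors):
    """Signal phase: reads the vertex value, produces (target, message) pairs."""
    share = rank / len(out_neighbors)
    return [(t, share) for t in out_neighbors]


def collect(messages, num_vertices, damping=0.85):
    """Collect phase: folds the received messages into the new vertex value."""
    return (1.0 - damping) / num_vertices + damping * sum(messages)


def signal_collect_pagerank(out_edges, num_iterations=30):
    """out_edges: dict vertex id -> non-empty list of targets (all are keys)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(num_iterations):
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for target, message in signal(rank[v], targets):
                inbox[target].append(message)
        rank = {v: collect(inbox[v], n) for v in out_edges}
    return rank


# Hypothetical three-vertex graph used only for illustration.
sc_ranks = signal_collect_pagerank({1: [2, 3], 2: [3], 3: [1]}, 50)
```

Compared with a single compute() function, the split makes the data dependencies explicit: signal never sees incoming messages, and collect never sends any.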

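GATHER-SUM-APPLY (POWERGRAPH)
[Figure: in superstep i, a vertex gathers a partial value from each neighbor (Gather), the partial values are combined (Sum), and the combined value updates the vertex (Apply); superstep i+1 starts with the next Gather]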

GSA EXAMPLE: PAGERANK
// compute partial rank: the share that flows from the source over this edge
double gather(source, edge, target):
    return source.value() / source.numOutEdges()
// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2
// update rank
double apply(sum, currentRank):
    return 0.15/numVertices() + 0.85*sum
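
A Python sketch of the three GSA functions and a simple driver. It assumes the partial rank is gathered from the source end of each in-edge and uses the same 0.15/numVertices() term as the Pregel example; the names `gsa_pagerank`, `sum_ranks`, and `apply_rank`, and the example graph, are hypothetical.

```python
from functools import reduce


def gather(source_rank, source_out_degree):
    """Partial rank contributed over one in-edge."""
    return source_rank / source_out_degree


def sum_ranks(rank1, rank2):
    """Associative, commutative combiner for partial ranks."""
    return rank1 + rank2


def apply_rank(total, num_vertices, damping=0.85):
    """New vertex rank from the combined partial ranks."""
    return (1.0 - damping) / num_vertices + damping * total


def gsa_pagerank(out_edges, num_supersteps=30):
    """out_edges: dict vertex id -> non-empty list of targets (all are keys)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    # Build the in-edge view once: gather runs over the in-edges of a vertex.
    in_edges = {v: [] for v in out_edges}
    for source, targets in out_edges.items():
        for t in targets:
            in_edges[t].append(source)
    for _ in range(num_supersteps):
        new_rank = {}
        for v in out_edges:
            partials = [gather(rank[s], len(out_edges[s])) for s in in_edges[v]]
            total = reduce(sum_ranks, partials, 0.0)
            new_rank[v] = apply_rank(total, n)
        rank = new_rank
    return rank


# Hypothetical three-vertex graph used only for illustration.
gsa_ranks = gsa_pagerank({1: [2, 3], 2: [3], 3: [1]}, 50)
```

Because the combiner is associative and commutative, partial sums for a high-degree vertex can be computed on several machines and merged, which is the motivation for separating Sum from Apply.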

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY (CONTINUED)
▸ 2004 to 2012: MapReduce, Pegasus, Pregel, Signal-Collect, PowerGraph
▸ 2013: Giraph++
New focus: graph traversals

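THINK LIKE A (SUB)GRAPH
[Figure: the five-vertex example graph (vertices 1 to 5), shown once as individual vertices and once grouped into partitions]
- compute() on the entire partition
- Information flows freely inside each partition
- Network communication between partitions, not vertices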
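
The effect of the three points above can be illustrated with connected components by minimum-label propagation. The sketch below is an assumption-laden simplification, not the Giraph++ API: `partition_centric_cc` and `partition_of` are hypothetical names, and the whole run happens in one process. Inside a partition, labels propagate until nothing changes; only labels crossing a partition boundary count as messages and wait for the next superstep.

```python
def partition_centric_cc(adj, partition_of, max_supersteps=100):
    """Connected components by min-label propagation.

    adj:          dict vertex -> list of neighbors (undirected, symmetric)
    partition_of: dict vertex -> partition id
    Returns (labels, number of supersteps executed).
    """
    label = {v: v for v in adj}
    members = {}
    for v, p in partition_of.items():
        members.setdefault(p, []).append(v)
    inbox = {}
    for step in range(1, max_supersteps + 1):
        updated = set(adj) if step == 1 else set()
        for p, vertices in members.items():
            # compute() on the entire partition: first apply remote messages
            for v in vertices:
                best = min(inbox.get(v, [label[v]]))
                if best < label[v]:
                    label[v] = best
                    updated.add(v)
            # then let information flow freely inside the partition
            changed = True
            while changed:
                changed = False
                for v in vertices:
                    for u in adj[v]:
                        if partition_of[u] == p and label[u] < label[v]:
                            label[v] = label[u]
                            updated.add(v)
                            changed = True
        # network communication happens only across partition boundaries
        outbox = {}
        for v in updated:
            for u in adj[v]:
                if partition_of[u] != partition_of[v]:
                    outbox.setdefault(u, []).append(label[v])
        if not outbox:
            return label, step
        inbox = outbox
    return label, max_supersteps


# Hypothetical path graph 1 - 2 - 3 - 4 - 5 - 6.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
two_parts = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_vertex_each = {v: v for v in path}  # degenerates to vertex-centric

labels_p, steps_p = partition_centric_cc(path, two_parts)
labels_v, steps_v = partition_centric_cc(path, one_vertex_each)
```

Putting every vertex in its own partition turns the same function into a vertex-centric run, so the two superstep counts can be compared directly. On this path graph the partitioned run needs fewer supersteps, because the label travels across a whole partition within a single superstep.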

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY (CONTINUED)
▸ 2004 to 2013: MapReduce, Pegasus, Pregel, Signal-Collect, PowerGraph, Giraph++
▸ 2014: NScale (ego-network analysis)
▸ 2015: Arabesque (pattern matching)
▸ Tinkerpop

CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions

HELLO, GELLY! THE APACHE FLINK GRAPH API
▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API
▸ Transformations, library of common algorithms
    val graph = Graph.fromDataSet(edges, env)
    val ranks = graph.run(new PageRank(0.85, 20))
▸ Iteration abstractions
    ▸ Pregel
    ▸ Signal-Collect
    ▸ Gather-Sum-Apply
    ▸ Partition-Centric*

WHY FLINK?
▸ efficient streaming runtime
▸ native iteration operators
▸ well-integrated
[Figure labels: POSIX, Java/Scala Collections]

FEELING GELLY?
▸ Paper References
http://www.citeulike.org/user/vasiakalavri/tag/dotscale
▸ Apache Flink:
http://flink.apache.org/
▸ Gelly documentation:
http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html
▸ Gelly-Stream:
https://github.com/vasia/gelly-streaming
