5. Why?
Many machine learning algorithms
require graph computations, and
real-world input graphs are often too
large to fit on a single machine.
6. Real World?
Big Graphs:
● Social Networks
● Biological Networks
● Mobile Call Networks
● Citation Networks
● World Wide Web
● Geographic Pathways
● Customer-merchant graphs (Amazon,
eBay)
11. Why not MapReduce?
● Represent graphs as adjacency lists
● Perform local computations in mapper
● Pass along partial results via
outlinks, keyed by destination node
● Perform aggregation in reducer on
inlinks to a node
● Iterate until convergence: controlled
by external “driver”
● The graph structure must be passed
along between iterations, so every
iteration reshuffles the whole graph
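The steps above can be sketched as one in-memory MapReduce round. This is a minimal single-process sketch, not Hadoop code. The names `Node`, `MapOutput`, `mapNode`, `reduceNode`, `runIteration` and `drive` are hypothetical, and the computation itself (a node splits its value evenly over its outlinks, and the reducer sums what arrives on the inlinks) is assumed purely for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Hypothetical adjacency-list record: a node's value and its outlinks.
struct Node {
    double value = 0.0;
    std::vector<int> outlinks;
};

// Everything the shuffle groups under one destination key.
struct MapOutput {
    std::vector<double> partials;  // partial results arriving over inlinks
    std::vector<int> outlinks;     // graph structure, passed through unchanged
};

typedef std::map<int, Node> Graph;
typedef std::map<int, MapOutput> Shuffle;

// Mapper: local computation on one node, results keyed by destination node.
void mapNode(int id, const Node& node, Shuffle& out) {
    out[id].outlinks = node.outlinks;  // re-emit the structure every round
    if (node.outlinks.empty()) return;
    const double share = node.value / node.outlinks.size();
    for (int dst : node.outlinks) out[dst].partials.push_back(share);
}

// Reducer: aggregate everything that arrived on the inlinks of one node.
Node reduceNode(const MapOutput& in) {
    Node node;
    node.outlinks = in.outlinks;
    for (double p : in.partials) node.value += p;
    return node;
}

// One map -> shuffle -> reduce round over the whole graph.
Graph runIteration(const Graph& graph) {
    Shuffle shuffle;
    for (const auto& kv : graph) mapNode(kv.first, kv.second, shuffle);
    Graph next;
    for (const auto& kv : shuffle) next[kv.first] = reduceNode(kv.second);
    return next;
}

// External "driver": rerun the job until the values stop changing.
Graph drive(Graph graph, int maxIterations, double tolerance) {
    assert(maxIterations >= 0);
    for (int i = 0; i < maxIterations; ++i) {
        Graph next = runIteration(graph);
        double change = 0.0;
        for (const auto& kv : next) {
            auto it = graph.find(kv.first);
            const double old = (it == graph.end()) ? 0.0 : it->second.value;
            change += std::fabs(kv.second.value - old);
        }
        graph = std::move(next);
        if (change < tolerance) break;
    }
    return graph;
}
```

Note that `mapNode` has to re-emit the outlinks on every round even though they never change. That repeated transfer is the overhead the last bullet refers to.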
12. Why not Spark?
● Spark provides the GraphX library for
graph algorithms (and MLlib for
machine learning).
● But Spark itself is a general-purpose
engine, not designed specifically
for graph algorithms.
● So optimizations that apply only to
graphs are not available.
13. PREGEL, Google, 2010
● Basic idea: “think like a vertex”
● Based on the Bulk Synchronous
Parallel (BSP) model
● Provides scalability
● Provides fault tolerance
● Provides flexibility to express
arbitrary graph algorithms
14. How does it work?
● Master/Worker architecture
● Each worker is assigned a subset of
a directed graph’s vertices
● Vertex-centric model. Each vertex
has:
  ● An arbitrary “value” that can be
  read and set
  ● A list of messages sent to it
  ● A list of outgoing edges (edges have
  a value too)
  ● A binary state (active/inactive)
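The per-vertex state listed above could be held in a structure like the following. This is only a sketch: `Edge`, `VertexState` and their members are hypothetical names, not Pregel's actual classes, and vertex ids are assumed to be non-negative integers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical edge record: a target vertex id plus a user-defined edge value.
template <typename EdgeValue>
struct Edge {
    int target;
    EdgeValue value;
};

// Hypothetical per-vertex state mirroring the four items in the list above.
template <typename VertexValue, typename EdgeValue, typename Message>
struct VertexState {
    VertexValue value{};                    // arbitrary value, readable and writable
    std::vector<Message> inbox;             // messages sent to this vertex
    std::vector<Edge<EdgeValue>> outEdges;  // outgoing edges, each with its own value
    bool active = true;                     // binary state: active or inactive

    void addEdge(int target, const EdgeValue& edgeValue) {
        assert(target >= 0);  // assumption of this sketch: ids are non-negative
        outEdges.push_back(Edge<EdgeValue>{target, edgeValue});
    }

    void voteToHalt() { active = false; }

    // Store an incoming message; in this sketch delivery also marks the vertex active.
    void deliver(const Message& message) {
        inbox.push_back(message);
        active = true;
    }
};
```

The three template parameters correspond to the vertex value, edge value and message types that a vertex program declares, as in `Vertex<double, void, double>` on the PageRank slide.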
16. Pregel execution model
The master initiates synchronous iterations (each called a
“superstep”). At every superstep:
● Workers asynchronously execute a user function on all
of their vertices
● Vertices can receive the messages sent to them in the
previous superstep
● Vertices can modify their value, modify values of
edges, change the topology of the graph (add/remove
vertices or edges)
● Vertices can send messages to other vertices to be
received in the next superstep
● Vertices can “vote to halt”
● Execution stops when all vertices have voted to halt
and no messages are in transit.
● A vote to halt is overridden by a non-empty message
queue: an incoming message reactivates the vertex
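The superstep loop and the halting rule can be sketched on a single machine. The example below propagates the maximum value through a graph, a task chosen here only because it is small. `Vertex` and `propagateMax` are hypothetical names, and one sequential loop stands in for the master and all workers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory vertex for this sketch.
struct Vertex {
    int value = 0;
    std::vector<std::size_t> outEdges;  // indices of target vertices
    bool active = true;
};

// Runs supersteps until every vertex has halted and no messages are pending.
// Each vertex keeps the largest value it has seen and forwards it when it changes.
// Returns the number of supersteps executed.
int propagateMax(std::vector<Vertex>& graph) {
    std::vector<std::vector<int>> inbox(graph.size());
    int superstep = 0;
    while (true) {
        // Messages sent now are only visible in the next superstep.
        std::vector<std::vector<int>> outbox(graph.size());
        bool anyRan = false;
        for (std::size_t v = 0; v < graph.size(); ++v) {
            Vertex& vertex = graph[v];
            // A halted vertex is skipped unless a message wakes it up.
            if (!vertex.active && inbox[v].empty()) continue;
            vertex.active = true;
            anyRan = true;

            bool changed = (superstep == 0);  // every vertex announces itself once
            for (int message : inbox[v]) {
                if (message > vertex.value) {
                    vertex.value = message;
                    changed = true;
                }
            }
            if (changed) {
                for (std::size_t target : vertex.outEdges) {
                    assert(target < graph.size());
                    outbox[target].push_back(vertex.value);
                }
            }
            vertex.active = false;  // vote to halt; a later message reactivates
        }
        if (!anyRan) break;  // all vertices halted and no messages in flight
        inbox.swap(outbox);  // barrier between supersteps
        ++superstep;
    }
    return superstep;
}
```

Because the inbox is swapped only at the end of the loop body, a vertex never sees a message sent in the same superstep, which is the synchronous part of the model.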
18. Page Rank
PageRank is a link analysis
algorithm used to determine
the importance of a document based on
the number of references to it and
the importance of the source
documents themselves.
19. Page Rank
A = A given page
T1 .... Tn = Pages that point to page
A (citations)
d = Damping factor between 0 and 1
(usually set to 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A
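With these symbols, the commonly quoted form of the PageRank formula reads as follows. It is given here on the assumption that the slide uses the unnormalized variant, which is what the constants 0.15 and 0.85 in the vertex code correspond to for d = 0.85.

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```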
20. Page Rank
class PageRankVertex
    : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      // Sum the rank shares received from in-neighbors.
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      // (1 - d) + d * sum, with d = 0.85.
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    if (superstep() < 30) {
      // Split this vertex's rank evenly across its outgoing edges.
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      // Stop after a fixed 30 supersteps.
      VoteToHalt();
    }
  }
};
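The vertex program above depends on Google's Pregel library, which is not publicly available, so it cannot be compiled as shown. The following single-machine sketch follows the same superstep structure with plain vectors. `pageRank` is a hypothetical helper, the initial rank of 1.0 is an assumption, and vertices without outgoing edges simply send nothing (the vertex program above would divide by zero for them).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// outEdges[v] lists the targets of v's outgoing edges. Returns the final ranks.
std::vector<double> pageRank(const std::vector<std::vector<std::size_t>>& outEdges,
                             int maxSupersteps = 30, double d = 0.85) {
    const std::size_t n = outEdges.size();
    std::vector<double> rank(n, 1.0);      // assumed initial value for every vertex
    std::vector<double> incoming(n, 0.0);  // per-vertex sum of last superstep's messages

    for (int superstep = 0; superstep <= maxSupersteps; ++superstep) {
        if (superstep >= 1) {
            // Same update as Compute(): (1 - d) + d * sum of received shares.
            for (std::size_t v = 0; v < n; ++v)
                rank[v] = (1.0 - d) + d * incoming[v];
        }
        if (superstep < maxSupersteps) {
            // Each vertex sends rank / out-degree along every outgoing edge.
            std::vector<double> next(n, 0.0);
            for (std::size_t v = 0; v < n; ++v) {
                if (outEdges[v].empty()) continue;  // dangling vertex sends nothing
                const double share = rank[v] / outEdges[v].size();
                for (std::size_t target : outEdges[v]) {
                    assert(target < n);
                    next[target] += share;
                }
            }
            incoming.swap(next);  // delivered at the start of the next superstep
        }
        // At superstep == maxSupersteps every vertex votes to halt and the loop ends.
    }
    return rank;
}
```

A call such as `pageRank({{1}, {0}, {0}})` ranks a three-vertex graph in which vertices 1 and 2 both link to vertex 0 and vertex 0 links back to vertex 1.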
21. Open Source
Pregel was published as a research paper; Google
did not release an open source implementation.
As a result, many open source
implementations appeared, and they keep
improving on the basic Pregel model. The two
most notable are:
a) Apache Giraph, originally developed at
Yahoo! and heavily used and extended by Facebook
b) CMU's GraphLab (now a company in its
own right)
22. One Example: GraphLab
● GraphLab is arguably the strongest of
these at present
● GraphLab modified the partitioning
strategy to reduce the network overhead
of message transfer among workers
● GraphLab has a rich library of
machine learning algorithms, and it is
growing
23. References
● Pregel: A System for Large-Scale Graph
Processing
● PowerGraph: Distributed Graph-Parallel
Computation on Natural Graphs
● GraphX: A Resilient Distributed Graph
System on Spark
● giraph.apache.org
● graphlab.org