1. Mining Quasi-Bicliques with Giraph
2013.07.02
Hsiao-Fei Liu
Sr. Engineer, CoreTech, Trend Micro Inc.
Chung-Tsai Su
Sr. Manager, CoreTech, Trend Micro Inc.
An-Chiang Chu
PostDoc, CSIE, National Taiwan University
2. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
3. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
4. • Distributed graph-processing system
• Originating from Google’s paper “Pregel” in 2010
• Efficient iterative processing of large sparse graphs
• A variation of Bulk Synchronous Parallel (BSP) Model
• Prominent user
• Facebook
• Who is contributing to Giraph?
• Facebook, Yahoo!, Twitter, LinkedIn, and Trend Micro
What’s Apache Giraph?
5. BSP VARIANT (1/3)
• Input: a directed graph where each vertex contains
1. state (active or inactive)
2. a value of a user-defined type
3. a set of out-edges, and each out-edge can also have an
associated value
4. a user program compute(.), which is the same for all vertices
and is allowed to execute the following tasks
• read/write its own vertex and edge values
• do local computation
• send messages
• mutate topology
• vote to halt, i.e., change the state to inactive
6. BSP VARIANT (2/3)
• Execution: a sequence of supersteps.
• In each superstep, each vertex runs its compute(.) function in
parallel with received messages as input
• messages sent to a vertex are to be processed in the next
superstep
• topology mutations become effective in the next superstep, with
conflicts resolved by the following rules:
1. remove > add
2. apply user-defined handler for multiple requests to add the same
vertex/edge with different initial values
• Barrier synchronization
• A mechanism for ensuring all computations are done and all
messages are delivered before starting the next superstep
7. BSP VARIANT (3/3)
• Termination criteria: all vertices become inactive and
no messages are en route.
[State diagram: a vertex moves from Active to Inactive by voting to halt, and back to Active when messages are received]
9. Copyright 2009 - Trend Micro Inc.
How Giraph Works -- Initialization
1. User decides the partition rule for vertices
• default partition rule part_i = { v | hash(v) mod N = i}, where N is the
number of partitions
2. Master computes partition-to-worker assignment and sends it to all
workers
3. Master instructs each worker to load a split of the graph data from
HDFS
• if a worker happens to load the data of a vertex it owns, it keeps
the vertex
• else, it sends a message to the owner of the vertex to create the vertex
at the beginning of the first superstep
4. After loading the split and delivering the messages, a worker
responds to the master.
5. Master starts the first superstep after all workers respond
10. How Giraph Works -- Superstep
1. Master instructs workers to start the superstep
2. Each worker executes compute(.) for all of its vertices
• One thread per partition
3. Each worker responds to the master after finishing all of its
computations and message deliveries, reporting
• the number of active vertices under it and
• the number of messages delivered
11. How Giraph Works -- Synchronization
1. Master waits until all workers respond
2. If all vertices become inactive and no message is delivered then
stop
3. Else, start the next superstep.
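The initialization/superstep/synchronization loop above can be sketched as a minimal single-process simulation. This is only an illustration of the BSP model, not Giraph's actual API; `run_bsp` and `max_compute` are hypothetical names.

```python
def run_bsp(vertices, compute):
    """Minimal sketch of the BSP superstep loop described above.

    vertices: {vertex_id: state dict}
    compute(vid, vertices, messages) -> (voted_to_halt, [(dst, msg), ...])
    """
    inbox = {vid: [] for vid in vertices}      # messages delivered at the next barrier
    active = set(vertices)
    superstep = 0
    # Termination criterion: all vertices inactive and no messages en route
    while active or any(inbox.values()):
        outbox = {vid: [] for vid in vertices}
        for vid in vertices:
            msgs = inbox[vid]
            if vid not in active and not msgs:
                continue                       # inactive with no mail: stays asleep
            active.add(vid)                    # an incoming message re-activates a vertex
            halted, outgoing = compute(vid, vertices, msgs)
            for dst, msg in outgoing:
                outbox[dst].append(msg)
            if halted:
                active.discard(vid)            # vote to halt
        inbox = outbox                         # barrier: deliver messages for the next superstep
        superstep += 1
    return superstep

def max_compute(vid, vertices, msgs):
    """Classic Pregel example: propagate the maximum value through the graph."""
    v = vertices[vid]
    new_val = max([v["value"]] + msgs)
    changed = v.setdefault("seen", False) is False or new_val != v["value"]
    v["value"], v["seen"] = new_val, True
    if changed:
        return False, [(n, new_val) for n in v["edges"]]
    return True, []                            # nothing new to say: vote to halt
```

Running `run_bsp` with `max_compute` on any connected graph converges once every vertex has heard the maximum and voted to halt.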
12. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
13. Giraph vs chained MapReduce
• Pros
• No need to load/shuffle/store the entire graph in each iteration
• Vertex-centric programming model is a more intuitive way to
reason about graphs
• Cons
• Requires the whole input graph to be loaded into memory
• Memory has to be larger than the input
• Messages are stored in memory
• Control communication costs to avoid out-of-memory errors.
14. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
15. • A biclique in a bipartite graph is a set of nodes sharing
the same neighbors
• Informally, a quasi-biclique in a bipartite graph
is a set of nodes sharing similar neighbors
• E.g.
Quasi-Biclique Mining (1/4)
[Figure: a domain-IP bipartite graph in which three eBay domains (shop.ebay.de, video.ebay.au, fahrzeugteile.ebay.ca) resolve to a similar set of IPs (66.135.205.141, 66.135.213.211, 66.135.213.215, 66.211.160.11, 66.135.202.89, 66.211.180.27)]
16. • E.g. C&C detection
• Given a bipartite website-client graph and a website which is
reported to be a command-and-control (C&C) server.
• Hackers often set up multiple C&C servers for high availability,
and these C&C servers usually share the same bots.
• Thus, finding websites that share similar clients with the reported
C&C server can help identify the remaining C&C servers.
Quasi-Biclique Mining (2/4)
17. Quasi-Biclique Mining (3/4)
• Given a bipartite graph and a threshold µ, the quasi-biclique for a
node v is the set of nodes connecting to at least a fraction µ of v’s neighbors
• E.g. Let µ = 2/3. quasi-biclique(D2) = {D1, D2}.
[Figure: a bipartite graph G with domains D1–D4 on the left and IP1–IP5 on the right, illustrating quasi-biclique(D2) = {D1, D2}]
18. Quasi-Biclique Mining (4/4)
• Given a bipartite graph G=(X, Y, E) and a threshold 0 < µ ≤ 1.
• Suppose that the objects we are interested in are represented by
nodes in X, and their associated features are represented by nodes
in Y.
• The problem is to find quasi-bicliques for all vertices in X.
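The definition above translates directly into a few lines of code. A sketch with an illustrative toy graph (the function name and the graph are assumptions, not from the deck):

```python
def quasi_biclique(adj_left, adj_right, v, mu):
    """Left-side nodes connecting to at least a fraction mu of v's neighbors.

    adj_left: left node -> set of right neighbors
    adj_right: right node -> set of left neighbors
    """
    need = mu * len(adj_left[v])
    counts = {}
    for y in adj_left[v]:              # each of v's neighbors on the right
        for x in adj_right[y]:         # each left node touching that neighbor
            counts[x] = counts.get(x, 0) + 1
    return {x for x, c in counts.items() if c >= need}

# Toy domain-IP graph mirroring the slide's example: with mu = 2/3,
# D1 shares 2 of D2's 3 neighbors (included), D3 shares only 1 (excluded).
adj_left = {"D1": {"IP1", "IP2"}, "D2": {"IP1", "IP2", "IP3"}, "D3": {"IP3"}}
adj_right = {"IP1": {"D1", "D2"}, "IP2": {"D1", "D2"}, "IP3": {"D2", "D3"}}
```

On this toy graph, `quasi_biclique(adj_left, adj_right, "D2", 2/3)` yields `{"D1", "D2"}`, matching the slide's example.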
19. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
20. MapReduce Algorithm: Mapper
Map(keyIn, valueIn):
/***
keyIn: a vertex y on the right side
valueIn: neighbors of y
***/
for x in valueIn:
keyOut := x
valueOut := valueIn
output (keyOut, valueOut)
E.g.
[Figure: y1 has neighbors x1, x2, x3; y2 has neighbors x3, x4]
Map(y1, [x1, x2, x3]) emits
(x1, [x1, x2, x3])
(x2, [x1, x2, x3])
(x3, [x1, x2, x3])
Map(y2, [x3, x4]) emits
(x3, [x3, x4])
(x4, [x3, x4])
21. MapReduce Algorithm: Reducer
Reduce(keyIn, valueIn):
/***
keyIn: a vertex x on the left side
valueIn:
[ neighbors(y) | y∈neighbors(keyIn) ]
***/
for neighbors(y) in valueIn:
for x’ in neighbors(y):
COUNTER[x’] += 1
if COUNTER[x’] >= µ*|valueIn|:
add x’ to Q_BICLIQUE[keyIn]
E.g. µ = 2/3
[Figure: y1 has neighbors x1, x2; y2 has neighbors x1, x2; y3 has neighbors x1, x3]
Map(y1, [x1, x2]) emits (x1, [x1, x2]) and (x2, [x1, x2])
Map(y2, [x1, x2]) emits (x1, [x1, x2]) and (x2, [x1, x2])
Map(y3, [x1, x3]) emits (x1, [x1, x3]) and (x3, [x1, x3])
Reduce(x1, [[x1, x2], [x1, x2], [x1, x3]]) → Q_BICLIQUE[x1] = {x1, x2}
Reduce(x2, [[x1, x2], [x1, x2]]) → Q_BICLIQUE[x2] = {x1, x2}
Reduce(x3, [[x1, x3]]) → Q_BICLIQUE[x3] = {x1, x3}
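The mapper and reducer can be wired together in a short local simulation of the shuffle. This is a sketch only, with plain Python standing in for Hadoop:

```python
from collections import defaultdict, Counter

def map_phase(right_adj):
    """Each right vertex y emits its adjacency list once per neighbor."""
    for y, nbrs in right_adj.items():
        for x in nbrs:
            yield x, nbrs                  # this duplication is what blows up the shuffle

def reduce_phase(x, lists, mu):
    """Nodes appearing in at least mu of x's neighbors' adjacency lists."""
    counts = Counter()
    for nbrs in lists:
        counts.update(nbrs)
    return {x2 for x2, c in counts.items() if c >= mu * len(lists)}

def quasi_bicliques(right_adj, mu):
    shuffled = defaultdict(list)           # simulate the shuffle/sort step
    for key, value in map_phase(right_adj):
        shuffled[key].append(value)
    return {x: reduce_phase(x, lists, mu) for x, lists in shuffled.items()}

# The slide's example graph, with mu = 2/3.
right_adj = {"y1": ["x1", "x2"], "y2": ["x1", "x2"], "y3": ["x1", "x3"]}
result = quasi_bicliques(right_adj, 2/3)
```

Running it on the slide's example reproduces Q_BICLIQUE exactly: `{x1, x2}` for x1 and x2, and `{x1, x3}` for x3.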
22. MapReduce Algorithm: Bottleneck
• Experimental results on one hour of web browsing logs
• Input graph size = 180 MB
• Map outputs are too large to shuffle efficiently!!
• Same information is copied and sent multiple times
Map output bytes: 36 GB
23. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
24. 1. Partition the graph into small groups composed of highly
correlated nodes in advance
• Improve data locality
• Reduce unnecessary communication cost and disk I/O
2. Utilize Giraph for efficient graph partitioning
Idea for improvement
25. Giraph Algorithm Overview
• Three phases:
1. Partitioning (Giraph):
• An iterative algorithm dividing the graph into smaller partitions
• The partitioning algorithm is designed to produce good enough partitions
without incurring too much communication effort
2. Augmenting (MapReduce):
• Extend each partition with its adjacent inter-partition edges
3. Computing (MapReduce):
• Compute quasi-bicliques of augmented partitions in parallel
26. • Iteration 1: each vertex on the left sets its group ID to the hash of
its neighbor set
Partitioning
[Figure: domains D1–D5 on the left, IPs IP1–IP7 on the right;
D1 sets value=H1=hash(IP1, IP3, IP4),
D2 sets value=H2=hash(IP1, IP2, IP4),
D3 sets value=H3=hash(IP2, IP3, IP4, IP7),
D4 sets value=H4=hash(IP5, IP6, IP7),
D5 sets value=H5=hash(IP6, IP7);
all IP vertices still have value=NULL]
27. • Iteration2: Each vertex on the right side sets its group ID to
the group ID of its highest-degree neighbor.
Partitioning
[Figure: D1–D5 keep their values H1–H5; each IP vertex now carries the group ID of its highest-degree domain neighbor (one IP takes H1, four take H3, two take H4)]
28. • Iteration3: Each vertex on the left side changes its group ID to
the majority group ID among its neighbors, if any.
Partitioning
[Figure: D1 and D2 change to H3, D5 changes to H4; domain values are now H3, H3, H3, H4, H4; IP values are unchanged (H1, H3, H3, H3, H3, H4, H4)]
29. • Iteration4: Each vertex on the right side changes its group ID
to the majority group ID among its neighbors, if any.
Partitioning
[Figure: the IP vertex with value H1 changes to H3; IP values are now five H3 and two H4, and domain values remain H3, H3, H3, H4, H4]
30. • Iteration 5: no vertex changes its group ID, so the
convergence criterion is met.
Partitioning
[Figure: converged grouping: D1, D2, D3 (value H3) together with their neighboring IPs form Partition1; D4, D5 (value H4) together with their neighboring IPs form Partition2]
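The five iterations can be condensed into a short sequential sketch. In Giraph this runs vertex-centrically via messages; the plain loop below only conveys the three rules (hash initialization, highest-degree adoption, majority voting). Tie-breaking by vertex ID is an assumption added for determinism.

```python
from collections import Counter

def partition(left_adj, right_adj, max_iters=30):
    group = {}
    # Iteration 1: left vertices hash their neighbor sets, so vertices with
    # identical neighbors land in the same group immediately.
    for v, nbrs in left_adj.items():
        group[v] = hash(frozenset(nbrs))
    # Iteration 2: right vertices adopt the group ID of their highest-degree
    # neighbor (ties broken by vertex ID -- an assumption).
    for v, nbrs in right_adj.items():
        group[v] = group[max(nbrs, key=lambda u: (len(left_adj[u]), u))]
    # Iterations 3+: apply the majority rule on both sides until no change:
    # a vertex switches iff more than half of its neighbors share a common
    # group ID different from its current one.
    for _ in range(max_iters):
        changed = False
        for adj in (left_adj, right_adj):
            for v, nbrs in adj.items():
                top, cnt = Counter(group[u] for u in nbrs).most_common(1)[0]
                if cnt * 2 > len(nbrs) and top != group[v]:
                    group[v] = top
                    changed = True
        if not changed:
            break
    return group
```

On a graph made of loosely coupled near-bicliques, the loop converges in a few passes; on graphs violating that assumption it may, as the talk warns, converge slowly or partition poorly.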
33. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
34. Performance testing
• Setting
• Input graph is constructed by parsing one hour of web browsing logs
• 3.4 M vertices (1.3 M domains + 2.1 M IPs) and 2.5 M edges
• 60 MB in size
• Giraph: 15 workers (or mappers)
• MapReduce: 15 mappers and 15 reducers
• Result
• Our approach is able to reduce CPU time by 80% and I/O load by 95%,
compared with the MapReduce algorithm
• The communication cost incurred by graph partitioning is only 720 MB,
and the partitioning finishes in only 1 minute
35. Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
36. Lessons learned
• Giraph is great for implementing iterative algorithms because it avoids
unnecessary I/O between iterations
• Use cases: belief propagation, PageRank, random walks,
connected components, shortest paths, etc.
• Giraph requires the whole input graph to be loaded into
memory
• Proper graph partitioning in advance can significantly improve the
performance of subsequent graph mining tasks
• A general graph partitioning algorithm is hard to design because we
usually don’t know which nodes should belong to the same group
37. Future work
• Incremental Graph Mining
• Observe the communication patterns during past incremental mining
tasks
• Partition the graph such that nodes which communicate often with
each other are in the same group
• Subsequent incremental mining tasks will then have lower communication
costs
• This applies particularly when incremental algorithms are hard to design,
so the easiest way is to periodically re-compute the result from scratch.
38. • Pregel: A System for Large-Scale Graph Processing
http://kowshik.github.com/JPregel/pregel_paper.pdf
• Apache Giraph
http://giraph.apache.org/
• GraphLab: A Distributed Framework for Machine Learning in the Cloud
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
• Kineograph: Taking the Pulse of a Fast-Changing and Connected World
http://research.microsoft.com/apps/pubs/default.aspx?id=163832
References
Good morning ladies and gentlemen.
Thank you for attending my presentation.
My Chinese name is 劉效飛, or you can call me Ken. Today I will share with you my first research project at Trend Micro.
The topic is mining quasi-bicliques with Giraph.
This is joint work with my mentor Dr. Chung-Tsai Su and my friend Dr. An-Chiang Chu.
The result is included in the industry track of IEEE BigData Congress 2013
I know my spoken English is not very fluent,
so please interrupt me if you don’t understand what I am saying.
Let us get back to the main point. My presentation is in five parts.
First, I’ll introduce the programming model of Giraph and discuss the pros and cons of Giraph versus chained MapReduce.
Next, I’ll give you a formal problem definition and algorithms for mining quasi-bicliques.
We shall study the bottleneck of the existing MapReduce algorithm
and propose a new algorithm based on Giraph. The performance improvement is demonstrated through experiments on real data.
Finally, I will conclude with a summary and future work.
Let’s begin with Giraph.
Apache Giraph is a distributed graph processing system following the design of Google’s Pregel paper.
It’s designed to enable efficient iterative processing of large sparse graphs, and it adopts a variation of the BSP model. The most prominent user is Facebook, which recently announced it has used Giraph in some of its production applications. In addition to Facebook, Yahoo!, Twitter, and LinkedIn, some of my colleagues in the US are also important contributors to Giraph.
That’s why I use Giraph rather than other frameworks like GraphLab.
Let’s move on to the programming model. The input to Giraph is a directed graph where each vertex contains four members.
The first is the state of the vertex, either active or inactive.
The second is a modifiable value of a user-defined type
The third is a set of out-edges. Each out-edge can also have an associated value.
The fourth is a user-defined program, which is allowed to do local computation, read/write its own values, send messages, mutate topology and vote to halt.
What it cannot do is modify the values of other vertices and their edges.
The execution is composed of a sequence of supersteps.
In each superstep each active vertex will run the user program in parallel with received messages as input.
A barrier synchronization mechanism is implemented to make sure all computations and message deliveries are done before starting the next superstep. Note that messages sent to a vertex in the present superstep are received in the next superstep.
Similarly, topology mutations do not take effect until the beginning of the next superstep.
So when will a Giraph program stop execution?
Let’s first look at the state diagram of a single vertex. A vertex becomes inactive after voting to halt and is re-activated after receiving new messages.
The Giraph program terminates when all vertices become inactive and no messages are en route.
This diagram shows the basic architecture of Giraph. The input graph is partitioned by a user-defined partition rule; the default is based on the hash values of vertices. A distinguished master node computes the partition assignment and copies the partition rule and the partition assignment to all workers. Each worker then loads a split of the input file from HDFS. If a vertex specified in the split does not belong to the worker, the worker sends messages to the owner of the vertex, telling the owner to create the vertex at the beginning of the first superstep.
Here is a more detailed description of the initialization step.
After loading the split and delivering the messages, each worker must respond to the master.
The master starts the first superstep after receiving responses from all workers.
In the execution of a superstep, each worker will create a thread per partition to run the user program for its vertices.
After finishing all computations and message deliveries, each worker has to respond to the master with the number of remaining active vertices under it
and the number of message deliveries.
Barrier synchronization is achieved by having the master wait until all workers respond.
At the end of each superstep, based on the reported information,
the master checks whether any vertex is still active and whether any messages are en route, to determine whether to stop execution.
We have finished the introduction to Giraph. Now we are ready to compare Giraph with chained MapReduce.
The main benefit of Giraph is avoiding the unnecessary disk I/O and network traffic incurred by loading, shuffling and storing the entire graph in each iteration.
Also, the vertex-centric programming model is a more intuitive way to reason about graphs.
However, Giraph requires a lot of memory.
In MapReduce, each map or reduce task can be executed independently, but Giraph requires the whole input graph to be loaded into memory to start execution.
So the memory requirement is very high. Especially when you have multiple users wanting to run Giraph
applications at the same time.
Even worse, the current implementation of Giraph stores all messages in memory. So the algorithm has to carefully control the communication costs to reduce memory consumption.
Otherwise your job will fail due to out of memory errors. But it’s mainly an implementation issue,
not an architecture flaw.
Now let’s introduce the problem.
Given a bipartite graph, a biclique is a set of nodes with the same neighbors.
However, real data are usually full of missing edges or non-existing edges caused by errors.
So to make the definition more suitable for real data, we have to relax the requirement and define quasi-bicliques.
Informally, a quasi-biclique in a bipartite graph is a set of vertices sharing similar neighbors. In the example, the three domains resolve to a set of similar IPs and form a quasi-biclique.
In practice, quasi-bicliques are useful for identifying nodes serving a similar purpose.
One application of quasi-biclique mining is to identify C&C servers. Consider a bipartite graph consisting of websites and clients, where one of the websites is known to be a C&C server.
Hackers often set up multiple C&C servers for high availability, and these C&C servers share similar bots. So it’s possible for us to identify the remaining C&C servers by finding websites which share similar clients with the known C&C server.
It’s one of our target applications at Trend Micro.
OK. Let’s move on to the formal definition. Given a bipartite graph and a threshold µ,
the quasi-biclique for a node v is the set of nodes connecting to at least a fraction µ of v’s neighbors.
For example, let µ equal 2/3. In the graph, the quasi-biclique for D2 consists of D1 and D2. D1 is included because it connects to two thirds of D2’s neighbors. D3 is not included because it connects to only one third of D2’s neighbors.
Suppose that the objects we are interested in finding quasi-bicliques for are represented by nodes on the left side,
and their associated features are represented by nodes on the right side. The problem is to find the quasi-bicliques for all vertices on the left side.
Next I’d like to show you the mapreduce algorithm for solving this problem.
First look at the Map function. The input to the map function is a key-value pair.
The input key is a vertex y on the right side and the value is its neighbors on the left side
Each map output is also a key-value pair, where the key is a neighbor of y and the value is y’s adjacency list.
After aggregating the map outputs with the same key, the input to the reduce function is a key-value pair, where the key is a vertex x on the left side,
and the value is the adjacency lists of its neighbors. We then compute the quasi-biclique for x by finding the nodes which appear in enough of the adjacency lists of x’s neighbors.
Before Giraph was introduced to Trend Micro, we used the MapReduce algorithm to mine quasi-bicliques. However, the performance was not very satisfying, so we conducted some experiments to find the root cause.
The input graph is a domain-IP graph constructed from customers’ web browsing logs.
The logs contain the domains and corresponding IPs accessed by our customers. The input graph is only 180 MB in size, but the map outputs expand to 36 GB, 200 times larger than the input.
Shuffling the map outputs becomes the performance bottleneck. The huge volume of map output is because the MapReduce algorithm does not exploit the locality of the graph.
As we can see in the example above, each vertex has to copy and send its adjacency list to all of its neighbors, which makes the map outputs explode.
The observation gives us an idea to improve the typical mapreduce algorithm.
Our basic idea is to partition the graph into smaller groups in advance to improve data locality. Each group is composed of highly-correlated nodes and can be processed by a single reducer.
In this way the nodes in the same group can share one copy of the same information to reduce unnecessary communication and disk I/O.
However, graph partitioning usually requires multiple iterations of processing. Chained MapReduce is not very efficient for this kind of task, so we utilized Giraph for graph partitioning.
Here is an overview of the algorithm. First, we use an iterative Giraph algorithm to divide the graph into smaller partitions.
The algorithm is designed to be lightweight enough not to become another bottleneck, while still producing good enough partitions.
Second, we augment each partition with its adjacent inter-partition edges, so each partition is self-contained for mining quasi-bicliques.
It ensures the solution produced by our improved algorithm is exactly the same as the one produced by the typical MapReduce algorithm.
We do not sacrifice solution quality for performance.
Finally, we compute quasi-bicliques for all augmented partitions in parallel without dependency.
I’d rather not go into the details, but instead show you a running example to give you a feel for the partitioning algorithm. In the first iteration of the partitioning algorithm, each vertex on the left side sets its group ID to the hash value of its adjacency list,
because we want vertices with the same neighbors to end up in the same group.
In the second iteration, each vertex on the right side would set its group ID to the group ID of its highest-degree neighbor.
The intention for this step is to increase the probability that correlated vertices on the right side would have the same group ID.
It is based on the assumption that the graph is composed of groups with structures similar to bicliques
so that the highest-degree vertex in a group would have connections to most of its group peers on the other side.
If this assumption does not hold, our partitioning algorithm may not produce good results and may take a long time to converge.
So I have to emphasize that this partitioning algorithm is not designed for general bipartite graphs.
It works only for graphs composed of loosely coupled groups and each group has structure similar to a biclique.
From the third iteration on, we apply the majority rule to adjust the group IDs. The majority rule says that a vertex changes its group ID if and only if more than half of its neighbors have a common group ID different from its current group ID.
The process continues until no vertex changes its group ID.
We also proved convergence of our partitioning algorithm.
After the algorithm converges, vertices on the left with the same group ID and their neighbors together define a partition.
Some partitions may have common right-hand-side vertices.
After partitioning, there would be some missing information due to inter-partition edges.
The goal of augmentation is to extend each partition with the information of its adjacent inter-partition edges.
So in the example, the edges D1–IP2 and D5–IP4 are added to the partition to form an augmented partition.
Finally, we run a MapReduce job to assign each augmented partition to a reducer,
and then each reducer runs a sequential algorithm to compute quasi-bicliques for the augmented partitions assigned to it.
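A sequential sketch of those augment-and-compute phases: each partition keeps, for every adjacent right vertex, its full adjacency list (the inter-partition edges added by augmentation), so each reducer can compute its vertices' quasi-bicliques independently and exactly. The function name and data layout here are illustrative assumptions.

```python
from collections import defaultdict, Counter

def compute_per_partition(group, left_adj, right_adj, mu):
    """Quasi-bicliques computed partition by partition.

    group: left vertex -> partition ID (from the Giraph partitioning phase).
    Augmentation guarantee: every right neighbor's FULL adjacency list is
    available inside the partition, including inter-partition edges, so the
    per-partition result matches the global MapReduce algorithm exactly.
    """
    members = defaultdict(list)
    for v in left_adj:
        members[group[v]].append(v)
    result = {}
    for part in members.values():              # each partition goes to one reducer
        for v in part:
            counts = Counter()
            for y in left_adj[v]:
                counts.update(right_adj[y])    # full list, not a partition-local slice
            need = mu * len(left_adj[v])
            result[v] = {x for x, c in counts.items() if c >= need}
    return result
```

Because the counts are taken over full adjacency lists, the output is identical to the unpartitioned algorithm; partitioning only changes which reducer does the counting.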
Ok next I’ll show you the performance testing result.
The testing data is a domain-IP graph constructed by parsing our customers’ web browsing logs, which consists of 3.4 M vertices and 2.5 M edges. The result shows that our proposed algorithm reduces CPU time by 80% and I/O load by 95%, compared to the MapReduce algorithm.
The communication cost introduced by the graph partitioning algorithm is quite small, and it takes only 1 minute to finish the graph partitioning. Actually, we believe the communication cost can be significantly reduced by some simple tricks,
but for now it is quick enough, so we didn’t do any further optimization.
Let’s summarize the lessons we learned in this project. First, Giraph is great for implementing iterative algorithms because it avoids unnecessary I/O between iterations.
However, Giraph has a very high memory requirement because it has to load the whole input graph into memory. We also learned that proper graph partitioning in advance may significantly improve the performance of subsequent graph mining tasks.
However, a general graph partitioning algorithm is hard to design because we usually don’t know which nodes should belong to the same group.
It’s usually data- and application-dependent.
The potential future work includes incremental Graph Mining.
In incremental mining,
we can first observe the communication patterns of past mining runs,
and partition the graph so that nodes communicating often are in the same group.
So the following incremental mining tasks will have lower communication costs.
This helps particularly when incremental algorithms are hard to design, so we have to periodically re-compute the result from scratch.