Large-scale graphs containing billions of vertices are becoming increasingly
common in various applications. With graphs of such proportions, an efficient
querying infrastructure becomes crucial. In this paper, we propose WOOster, a
hosted querying infrastructure designed specifically for large graphs. We make
two key contributions: a) the design of the WOOster framework; b) scalable
map-reduce algorithms for two popular graph queries: sub-graph match and
reachability. Our experiments show that the proposed map-reduce algorithms
scale well on large synthetic datasets.
WOOster: A Map-Reduce based Platform for Graph Mining
1. WOOster: A Map-Reduce based
Platform for Graph Mining
Aravindan Raghuveer
Yahoo! Inc, Bangalore.
2. Introduction
"If you squint the right way, graphs are everywhere" [1]
@ Yahoo!:
• The WOO Graph: All knowledge assimilated from the web.
- http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
[1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Yahoo! Confidential
3. The What and Why?
What?
• Family of graph query algorithms.
• Framework:
  • For graph storage and invoking the query algorithms
  • Hosted solution on Hadoop
Why?
• Family of graph query algorithms: present-day algorithms do not scale to billion-vertex, billion-edge graphs.
• Framework:
  • Optimizes the storage layout to suit graph query algorithms
  • Improves the throughput of queries
4. Outline of the talk
• MapReduce 101
• Graph Mining Approaches
• Brief overview of WOOster architecture
• Graph query algorithms in WOOster:
  • Sub-Graph Matching
  • Reachability Query
• Experiments
• Conclusion
5. Map Reduce 101
• Switch to slides from Cloud Computing with MapReduce and Hadoop
• www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt
6. MapReduce Programming Model
• Data type: key-value records
• Map function:
  (Kin, Vin) → list(Kinter, Vinter)
• Reduce function:
  (Kinter, list(Vinter)) → list(Kout, Vout)
7. Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
8. Word Count Execution
[Figure: Input → Map → Shuffle & Sort → Reduce → Output. Three mappers process the lines "the quick brown fox", "the fox ate the mouse", and "how now brown cow", each emitting (word, 1) pairs; the shuffle groups pairs by word; the reducers output the final counts: the:3, fox:2, brown:2, quick:1, how:1, now:1, ate:1, mouse:1, cow:1.]
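The word-count dataflow above can be reproduced locally with a minimal simulation of the shuffle & sort phase. This is a plain-Python sketch for illustration (the `simulate_mapreduce` helper is an assumption of this writeup, not part of Hadoop):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum the counts for one word.
    yield (key, sum(values))

def simulate_mapreduce(lines):
    # Shuffle & sort: group all intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key's group independently.
    result = {}
    for key in sorted(groups):
        for k, v in reducer(key, groups[key]):
            result[k] = v
    return result

counts = simulate_mapreduce(
    ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
)
print(counts["the"], counts["fox"], counts["brown"])  # 3 2 2
```

In a real Hadoop job the grouping is done by the framework's shuffle across machines; the simulation only mirrors the contract the mapper and reducer see.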
9. Graph Mining Approaches: Two Schools
• School-1: Invent a new platform:
  - Map-reduce is not best suited for graph mining.
  - BSP, PRAM models: circa 1980s.
  - Pregel (Google) [1], HaLoop.
• School-2: Ride on Map-Reduce:
  - MR has wide adoption, open-source tools, and industry support.
  - No need to invest in one more computing infrastructure.
  - Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
  - Efforts in open source / academia along the same lines:
    • Pegasus, CMU [2]
    • Graph mining in Apache Mahout [3]
    • Raytheon's graph mining [4]
[1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
[2] http://www.cs.cmu.edu/~pegasus/
[3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
[4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
10. WOOster Architecture
[Figure: the WOOster Web UI & WebService APIs feed queries to a Planner, which consults the Graph D/B and Indices and produces M-R Jobs for the Executor, which runs them on the Grid over the WOO Graph (hosted solution).]
• User submits a query.
• Planner periodically scans for newly arrived queries.
• Planner creates an M-R plan that re-uses computation/IO across queries (batching).
• Executor executes the M-R plan.
• Result is notified to the user.
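The planner's batching step described above can be sketched in a few lines. Everything here (`plan_batches`, the batch size) is an illustrative assumption about how newly arrived queries could be grouped so one M-R plan serves several of them; it is not WOOster's actual API:

```python
def plan_batches(pending_queries, batch_size=4):
    """Group newly arrived queries so that a single M-R plan can
    re-use computation/IO across each group (the 'batching' step)."""
    batches = []
    for i in range(0, len(pending_queries), batch_size):
        batches.append(pending_queries[i:i + batch_size])
    return batches

# Illustrative: six queued queries become two M-R batches.
queries = ["q0", "q1", "q2", "q3", "q4", "q5"]
batches = plan_batches(queries)
print(batches)  # [['q0', 'q1', 'q2', 'q3'], ['q4', 'q5']]
```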
11. The Sub-Graph Match Query
[Figure: find all instances in graph G of query Q. Vertices have attributes (e.g., age:31); vertices and edges have constraints (e.g., age<40); edges have relationship labels. Notation distinguishes query vertices, graph vertices, and matched graph vertices.]
Why Sub-Graph Match (Exact Graph Isomorphism)?
A popular and expressive graph query, useful for mining patterns. To our knowledge, no existing algorithm operates at the scale of a billion-vertex graph.
12. Overview of the Solution
Step-0. Data Layout on HDFS
Step-1. Query Graph Partitioning
Step-2. Edge Selection
Step-3. Query Partition Matching
Step-4. Query Partition Merging
13. Data Layout on HDFS
• How to store a large-scale graph?
• Adjacency-list-like solution:
  • Each row/line has information about a vertex:
    • Vertex attributes
    • Vertex neighbors and the labels associated with each edge
Implications:
• Enables early pruning of non-matching edges and vertices.
• Each vertex has information about itself and its immediate neighbors only.
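To make the layout concrete, here is one way such a row could be serialized and parsed: one line per vertex carrying its attributes plus (neighbor, edge-label) pairs. The exact tab/comma format below is an assumption for illustration, not WOOster's actual on-disk format:

```python
def parse_vertex_line(line):
    # Assumed format: vertex_id <TAB> attr=val,... <TAB> neighbor:label,...
    vertex_id, attrs_part, edges_part = line.rstrip("\n").split("\t")
    attrs = dict(a.split("=") for a in attrs_part.split(",") if a)
    edges = [tuple(e.split(":")) for e in edges_part.split(",") if e]
    return vertex_id, attrs, edges

vid, attrs, edges = parse_vertex_line(
    "g1\tage=31,name=alice\tg2:knows,g3:works_with"
)
print(vid, attrs["age"], edges)  # g1 31 [('g2', 'knows'), ('g3', 'works_with')]
```

Because each line carries a vertex together with its immediate neighbors and edge labels, a mapper can discard non-matching vertices and edges from a single input record, which is what enables the early pruning mentioned above.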
14. Step-1: Query Graph Partitioning
Why? Parallelized solving of independent sub-problems.
How? Find the minimum number of partitions such that the diameter of each partition = 2.
[Figure: a query graph split into partitions, with the pivot vertices highlighted.]
Intuition:
• In a spanning tree of diameter 2, there is one vertex that is connected to all other vertices → the pivot vertex.
• We will use this property in steps 2 and 3.
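The pivot property can be checked directly: in a diameter-2 partition there is some vertex adjacent to every other vertex of the partition. A minimal sketch, assuming adjacency is given as a dict of neighbor sets (how the paper actually computes the minimum partitioning may differ):

```python
def find_pivot(partition_vertices, adj):
    """Return a vertex connected to all others in the partition,
    or None if no such vertex exists (diameter > 2)."""
    for v in partition_vertices:
        if all(u == v or u in adj[v] for u in partition_vertices):
            return v
    return None

# Star-shaped partition: q1 touches q2, q3, q4 -> q1 is the pivot.
adj = {"q1": {"q2", "q3", "q4"}, "q2": {"q1"}, "q3": {"q1"}, "q4": {"q1"}}
print(find_pivot({"q1", "q2", "q3", "q4"}, adj))  # q1
```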
15. Step-2: Edge Selection
• What: Select a subset of edges from G that match at least one edge in Q.
• How:
[Figure: Map/Reduce dataflow for edge selection over graph vertices g1-g4.]
1. g1 is the current vertex in the mapper.
2a. For every neighbor g2 of g1, the mapper emits the edge g1-g2 if g1 maps to a query vertex q1 and there exists a corresponding query neighbor for g1.
2b. The same edge g1-g2 is also emitted from g2's mapper.
3. Both copies of g1-g2 arrive at the same reducer.
4. The reducer emits an edge if such a pair is found and the vertex and edge constraints are met.
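Schematically, the edge-selection step is a map/reduce pair: each mapper emits a candidate edge from both endpoints' rows, and the reducer keeps an edge only when both halves arrive. A simplified local sketch, using label equality as a stand-in for the paper's full vertex/edge constraints:

```python
from collections import defaultdict

# Query edges as (query_label_u, query_label_v) pairs -- illustrative only.
QUERY_EDGES = {("person", "company")}

def edge_selection_mapper(vertex_id, label, neighbors):
    """neighbors: list of (neighbor_id, neighbor_label).
    Emit each candidate edge keyed on a canonical (sorted) endpoint pair."""
    for nid, nlabel in neighbors:
        if (label, nlabel) in QUERY_EDGES or (nlabel, label) in QUERY_EDGES:
            key = tuple(sorted((vertex_id, nid)))
            yield key, vertex_id  # this endpoint vouches for the edge

def edge_selection_reducer(key, values):
    # Keep the edge only if both endpoints' mappers emitted it.
    if len(set(values)) == 2:
        yield key

# Two vertices that match the query edge, seen from both sides:
rows = [("g1", "person", [("g2", "company")]),
        ("g2", "company", [("g1", "person")])]
groups = defaultdict(list)
for vid, label, nbrs in rows:
    for key, val in edge_selection_mapper(vid, label, nbrs):
        groups[key].append(val)
selected = [e for k, v in groups.items() for e in edge_selection_reducer(k, v)]
print(selected)  # [('g1', 'g2')]
```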
16. Step-3: Query Partition Matching
Edge selection output:
• Associates a graph vertex with the possible query vertices it could map to.
• Associates the graph vertex with its "pivot" graph vertex.
• A pivot graph vertex is a graph vertex mapped to a pivot query vertex (g1 in this example).
[Figure: Map/Reduce dataflow; the edge-selection output (edges g1-g2, g1-g3, g1-g4) flows through the map and reduce logic.]
1. The mapper emits the pivot graph vertex as key and the edge as value.
2. The reducer receives all edges with the same pivot graph vertex.
3. The reducer forms the partition.
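The key choice in this step — group every selected edge under its pivot graph vertex — can be sketched as follows. The grouping below is a local stand-in for the shuffle; the real reducer additionally verifies the partition's constraints:

```python
from collections import defaultdict

def partition_matching_mapper(edge, pivot_of):
    """Emit (pivot graph vertex, edge). pivot_of maps a graph vertex
    to the pivot graph vertex that edge selection assigned to it."""
    u, v = edge
    yield pivot_of[u], edge

def group_by_pivot(edges, pivot_of):
    # Reducer side: all edges sharing a pivot form one candidate partition.
    partitions = defaultdict(list)
    for edge in edges:
        for pivot, e in partition_matching_mapper(edge, pivot_of):
            partitions[pivot].append(e)
    return dict(partitions)

pivot_of = {"g1": "g1", "g2": "g1", "g3": "g1", "g4": "g1"}
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4")]
print(group_by_pivot(edges, pivot_of))
# {'g1': [('g1', 'g2'), ('g1', 'g3'), ('g1', 'g4')]}
```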
17. Step-4: Query Partition Merging
• Merges partitions one after another to form a query match.
• More details in the paper.
Take-away from Steps 1-4 (and for any scalable Map-Reduce program):
• The mapper/reducer keys are chosen such that:
  • the number of keys is proportional to the number of matches of query Q in the graph.
• Hence the algorithm scales well for large graphs and complex queries.
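One natural way to merge two partition matches is to accept the pair only when they agree on every shared query vertex. This is a toy compatibility test under that assumption; the paper's actual merge algorithm is more involved:

```python
def merge_matches(m1, m2):
    """m1, m2: dicts mapping query vertex -> graph vertex.
    Merge them if they agree on every shared query vertex, else None."""
    for q in m1.keys() & m2.keys():
        if m1[q] != m2[q]:
            return None
    return {**m1, **m2}

a = {"q1": "g1", "q2": "g2"}
b = {"q1": "g1", "q3": "g5"}
print(merge_matches(a, b))  # {'q1': 'g1', 'q2': 'g2', 'q3': 'g5'}
print(merge_matches(a, {"q1": "g9"}))  # None
```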
18. Results
[Figure: running time in seconds (0-160) of Edge Selection, Query Partition Matching, and Query Partition Merging versus the number of reducers (100-250).]
• Graph of 10 million vertices and 50 million edges.
• Complex query of 24 vertices.
• Note that the edge selection time decreases as the number of reducers increases.
19. In the paper…
• Detailed map-reduce algorithms for sub-graph match and reachability
• Theoretical analysis of scalability
• Construction of the synthetic dataset
• Methodology and more experiments
• Reachability query: examples, map-reduce algorithm
• Related work
20. Future Work
• Indexing structures for graphs suited to M-R jobs.
• Compare with a Giraph-based approach.
• Better batching strategies.
• The right interface for plugging in custom graph algorithms while WOOster provides automatic batching.
• More graph mining algorithms implemented.