Slides presenting our work on COSI at the ASONAM conference 2010
Note: The images used in this presentation are copyright by the respective owners as indicated with the picture. Pictures used are either CC or fair use. Please notify the author if you feel that your images are unfairly used in this presentation.
7. [Figure: example graph database of an academic social network. Labels visible in the extraction include people (Jones, Baneri, Calero, Dooley, Smith, Olsen, Lund, Jamie Larsen, Karl Lock, Oede, John Doe), papers ("ABC", "HIJ", "UVW", "XYZ"), organizations and places (UMD, University MD, SDU, Universita Calabria, Roma, Odense, Denmark, Italy, USA, KPLLC), departments (CS, Physics, Social Science), and ASONAM 09/10. Edge labels include author, comment, faculty, friends, member, dean, department, in, attended, presented, submitted, accepted, organized, visited, collaborate(s), colleagues, and student of.]
8. COSI
Example Query
[Figure: query graph over the variables ?p, ?v1, ?v2, ?v3 and the constants University MD and Italy, with edges labeled author, comment, faculty, friends, and in.]
Simple query, yet already
difficult to answer by hand
13. COSI
COSI Graph Partitioning
• How should we partition the graph?
• GOAL: Find a way to partition the graph DB into "blocks" across the k storage nodes so that the expected time to answer queries is small.
14. COSI
Example Query & Naive Approach
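[Figure: the example query graph (variables ?p, ?v1, ?v2, ?v3; constants University MD and Italy; edges author, comment, faculty, friends, in), with the candidate vertices Jones, Dooley, and Smith listed next to it.]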
15. COSI
Co-Retrieval
[Figure: the example query graph with Jones and Paper "ABC" filled in; remaining variables ?p, ?v2, ?v3; constants University MD and Italy; edges author, comment, faculty, friends, in.]
Co-retrieval: Jones → Paper "ABC"
16. COSI
Cost Model
• Query trace: A query trace w.r.t. a query plan x for query Q consists of
  - All vertices in the DB whose neighborhood (nbhd) is retrieved during execution of x
  - All pairs (u,v) of vertices where x retrieves v's nbhd immediately after retrieving u's nbhd.
    • Intuition: Try to put u,v on the same storage node.
    • Assumption: Retrieved nbhds are cached in memory.
17. COSI
Cost Model (continued)
• Assume a fixed but arbitrary probability distribution over the set of all queries.
• This induces a pdf over the set of all feasible query plans qp(Q) for query Q.
  - $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\; qp(Q) = x} \mathbb{P}(Q)$, where $\mathcal{Q}$ is the set of all queries.
  - Prob of query plan "x" is the sum of the probs of queries requiring query plan x.
• Let E(v) be the event that v is retrieved by a query trace of a random query plan for Q.
18. COSI
Cost Model (continued)
• Prob that vertex v occurs in the trace of a randomly chosen query plan is
  $\mathbb{P}(E(v)) = \sum_{x \in qp(Q) \,\wedge\, v \in qt(x, DB)} \mathbb{P}(x)$.
• Prob that (u,v) occurs in the trace of a randomly chosen query plan is
  $\mathbb{P}(E(u,v)) = \sum_{x \in qp(Q) \,\wedge\, (u,v) \in qt(x, DB)} \mathbb{P}(x)$.
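A minimal sketch of the two probabilities above, assuming the distribution over query plans and the trace of each plan are already available as plain dictionaries. The names `plan_probs`, `traces`, and `trace_probabilities`, and the toy plans `x1` to `x3`, are illustrative assumptions and are not taken from the COSI implementation.

```python
from collections import defaultdict


def trace_probabilities(plan_probs, traces):
    """Sum plan probabilities over the plans whose trace contains a vertex or pair.

    plan_probs: plan id -> probability of that plan (assumed to sum to 1)
    traces:     plan id -> (set of retrieved vertices, set of ordered pairs (u, v))
    """
    p_vertex = defaultdict(float)
    p_pair = defaultdict(float)
    for plan, prob in plan_probs.items():
        vertices, pairs = traces[plan]
        for v in set(vertices):
            p_vertex[v] += prob      # contributes to P(E(v))
        for pair in set(pairs):
            p_pair[pair] += prob     # contributes to P(E(u,v))
    return dict(p_vertex), dict(p_pair)


# Toy distribution over three hypothetical query plans.
plan_probs = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
traces = {
    "x1": ({"Jones", "ABC"}, {("Jones", "ABC")}),
    "x2": ({"Jones", "ABC", "Smith"}, {("Jones", "ABC"), ("ABC", "Smith")}),
    "x3": ({"Smith"}, set()),
}
p_vertex, p_pair = trace_probabilities(plan_probs, traces)
```

In this toy input, Jones appears in the traces of x1 and x2, so its probability is the sum of those two plan probabilities.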
19. COSI
Cost Model (continued)
Key Theorem
Suppose vertex retrieval and inter-node communication costs are uniform across storage nodes. The partition of the DB graph that minimizes query execution time coincides with the partition that minimizes edge cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$.
• SO MIN EDGE-CUT IN COMPLETE GRAPHS IS CLOSELY RELATED TO MINIMIZING QUERY EXECUTION TIME.
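To make the weight function concrete, here is a small sketch that scores a candidate partition by its edge cut cost under w(u,v) = P(E(u,v)) + P(E(v,u)). The pair probabilities and the two assignments are made-up toy values, and `edge_weight` and `edge_cut_cost` are hypothetical helper names.

```python
def edge_weight(p_pair, u, v):
    """w(u,v) = P(E(u,v)) + P(E(v,u)); pairs that never co-occur weigh 0."""
    return p_pair.get((u, v), 0.0) + p_pair.get((v, u), 0.0)


def edge_cut_cost(p_pair, assignment):
    """Total weight of unordered vertex pairs placed on different storage nodes."""
    seen = set()
    cost = 0.0
    for (u, v) in p_pair:
        key = frozenset((u, v))
        if u == v or key in seen:
            continue                 # count each unordered pair once
        seen.add(key)
        if assignment[u] != assignment[v]:
            cost += edge_weight(p_pair, u, v)
    return cost


# Toy co-retrieval probabilities (assumed values, not measured data).
p_pair = {("Jones", "ABC"): 0.8, ("ABC", "Jones"): 0.1, ("ABC", "Smith"): 0.3}
together = {"Jones": 0, "ABC": 0, "Smith": 1}   # frequently co-retrieved pair shares a node
apart = {"Jones": 0, "ABC": 1, "Smith": 1}      # frequently co-retrieved pair is split
cost_together = edge_cut_cost(p_pair, together)
cost_apart = edge_cut_cost(p_pair, apart)
```

Keeping the frequently co-retrieved pair on one node gives the lower cut cost, which is the behavior the theorem ties to lower expected query time.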
20. COSI
Partitioning Algorithm
• Challenges
  - Finding MIN EDGE-CUT is NP-complete.
  - We want to process graphs containing 100s of millions of edges.
• So we want an algorithm that is
  - Very fast
  - Produces good edge cuts
    • but maybe not optimal
• To achieve speed, we focus on partition strategies that permanently assign vertices to blocks.
21. COSI
Individual edge insertion
• Suppose we have a partition P = {P1, ..., Pk}.
• We are inserting the edge (v,p,o).
• Vertex force vectors: Measure how strongly each Pi "pulls" a vertex.
  - $|v|[i] = f_P\left(\sum_{y \in nbhd(v) \cap P_i} w(v,y)\right)$
  - f_P maps positive reals to reals and is an "affinity" measure.
  - |v|[i] sums up the weights of edges from v to each neighbor in Pi. Insert v into the block Pi with the highest |v|[i].
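The force-vector rule can be sketched as follows. This is only an illustration under stated assumptions: the affinity function `size_penalized` (connectedness minus a small size penalty) is a stand-in chosen for the example, not the affinity measure used in COSI, and all names and weights are hypothetical.

```python
def force_vector(v, nbhd, w, blocks, f_p):
    """|v|[i] = f_p(sum of w(v, y) over neighbors y of v that are in block i)."""
    forces = []
    for i, block in enumerate(blocks):
        pull = sum(w[frozenset((v, y))] for y in nbhd.get(v, ()) if y in block)
        forces.append(f_p(pull, i))
    return forces


def insert_vertex(v, nbhd, w, blocks, f_p):
    """Permanently assign v to the block with the highest force; return its index."""
    forces = force_vector(v, nbhd, w, blocks, f_p)
    best = max(range(len(blocks)), key=forces.__getitem__)
    blocks[best].add(v)
    return best


blocks = [{"Jones", "ABC"}, {"Smith"}]
nbhd = {"Dooley": ["Jones", "ABC", "Smith"]}
w = {
    frozenset(("Dooley", "Jones")): 0.4,
    frozenset(("Dooley", "ABC")): 0.3,
    frozenset(("Dooley", "Smith")): 0.5,
}


def size_penalized(pull, i):
    # Assumed affinity: reward connectedness, mildly penalize larger blocks.
    return pull - 0.1 * len(blocks[i])


chosen = insert_vertex("Dooley", nbhd, w, blocks, size_penalized)
```

Because the assignment is permanent, each insertion only needs the neighborhood of the inserted vertex, which is what makes this strategy fast.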
22. COSI
Affinity Measures
• Must satisfy 3 properties
  - Connectedness of a vertex to a partition block. This helps minimize edge cut.
  - Imbalance of block sizes.
    • E.g. standard deviation of block sizes, normalized by expected DB size.
  - Excessive size should be punished.
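One way the three properties could be combined into a single score is sketched below. The linear form, the parameters `alpha`, `beta`, and `max_block_size`, and the function name `affinity` are assumptions made for illustration; the slides do not specify the actual formula.

```python
from statistics import pstdev


def affinity(connectedness, block_index, block_sizes, expected_db_size,
             max_block_size, alpha=1.0, beta=10.0):
    """Score for placing one vertex in block `block_index` (higher is better).

    connectedness:    summed edge weight from the vertex into the block
    block_sizes:      current number of vertices per block
    expected_db_size: normalizer for the imbalance term
    max_block_size:   size beyond which a block is punished (assumed threshold)
    """
    sizes = list(block_sizes)
    sizes[block_index] += 1                        # sizes if the vertex joins this block
    imbalance = pstdev(sizes) / expected_db_size   # normalized std. dev. of block sizes
    overflow = max(0, sizes[block_index] - max_block_size)
    return connectedness - alpha * imbalance - beta * overflow


# The better-connected block 0 is already full, so block 1 scores higher.
score_full = affinity(0.7, 0, [5, 1], expected_db_size=10, max_block_size=5)
score_small = affinity(0.5, 1, [5, 1], expected_db_size=10, max_block_size=5)
```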
23. COSI
Batch insertion
• Adding a set of edges at once.
• Idea: Find strongly connected components using modularity maximization and assign those to the partition block with highest affinity.
25. COSI
Graph modularity
• $Mod(P) = \sum_{P_i \in P} \left( \frac{W(P_i, P_i)}{2|E|} - \frac{deg_W(P_i)^2}{(2|E|)^2} \right)$
• Where
  - W(X,Y) is the sum of the weights of edges (x,y) with x in X, y in Y,
  - degW(v) is the sum of the weights of edges (v,-), and
  - degW(Pi) is the sum of the degW(v)'s for v in Pi.
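A short sketch of the modularity formula on a toy graph. Two conventions are assumed here because the slide does not spell them out: the graph is treated as undirected, so each internal edge {x, y} contributes to W(Pi,Pi) in both directions, and |E| is taken to be the total edge weight (the edge count for unit weights).

```python
def modularity(edges, partition):
    """Mod(P) = sum over blocks of W(Pi,Pi)/(2|E|) - (degW(Pi)/(2|E|))^2.

    edges:     {(u, v): weight}, each undirected edge listed once
    partition: list of disjoint vertex sets covering all vertices
    """
    two_e = 2.0 * sum(edges.values())
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    internal = [0.0] * len(partition)   # W(Pi, Pi), counting both directions
    degree = [0.0] * len(partition)     # degW(Pi)
    for (u, v), weight in edges.items():
        degree[block_of[u]] += weight
        degree[block_of[v]] += weight
        if block_of[u] == block_of[v]:
            internal[block_of[u]] += 2.0 * weight
    return sum(internal[i] / two_e - (degree[i] / two_e) ** 2
               for i in range(len(partition)))


# Two unit-weight triangles joined by a single bridge edge.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
by_triangle = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
single_block = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the graph along the bridge yields a higher modularity than keeping everything in one block, which is the signal batch insertion uses to find groups of vertices that belong together.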
27. COSI
Query Answering
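[Figure: query answering workflow between a client and the storage nodes. The client's query is drawn as a graph pattern over A, B, C and the variables ?X, ?Y, ?Z. Labeled steps: load graph data, receive query, dispatch query, forward (partially answered) query, query answer, return results.]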
28. COSI
Example Query
[Figure: the example query graph (variables ?p, ?v1, ?v2, ?v3; constants University MD and Italy; edges author, comment, faculty, friends, in), annotated with partition block P1.]
29. COSI
Example Query
[Figure: the example query graph with candidate vertices and their partition blocks: Jones : P2, Dooley : P2, Smith : P3.]
30. COSI
Example Query
[Figure: the example query graph with Dooley filled in (block P2) and further candidates with their partition blocks: Paper "ABC" : P2, Paper "HIJ" : P3, Calero : P2.]
Where to send query next?
31. COSI
Query answering
• Basic: Next substitution arbitrary
• COSI_Heur is a heuristic version that makes intelligent choices about the next variable to be substituted.
  - Branching Factor → # possible substitutions
  - Communication cost → # messages to be sent
  - Workload distribution → partitions hosting vertices
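The kind of choice COSI_Heur makes could be sketched as a simple scoring function over the open variables. The unweighted sum of the three criteria, the function name `choose_next_variable`, and the assignment of the candidates to ?p and ?v3 (loosely modeled on the preceding example slide) are all assumptions; the slides do not give the actual heuristic.

```python
from collections import Counter


def choose_next_variable(candidates, current_block):
    """Pick the variable whose substitution looks cheapest.

    candidates:    variable -> list of (vertex, block) possible substitutions
    current_block: block where the partially answered query currently sits
    """
    def score(var):
        subs = candidates[var]
        branching = len(subs)                                   # branching factor
        per_block = Counter(block for _, block in subs)
        messages = sum(1 for b in per_block if b != current_block)  # remote blocks to contact
        heaviest = max(per_block.values(), default=0)           # workload concentration
        return (branching + messages + heaviest, var)           # var breaks ties deterministically

    return min(candidates, key=score)


candidates = {
    "?p": [("Paper ABC", "P2"), ("Paper HIJ", "P3")],
    "?v3": [("Calero", "P2")],
}
next_var = choose_next_variable(candidates, current_block="P2")
```

Here ?v3 is preferred: it has a single candidate hosted on the current block, so substituting it needs no message to another storage node.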
33. COSI
COSI implementation
• Implementation is in Java (approx 10,000 loc)
• 778M edges social network DB
  - Flickr, Orkut, Livejournal, Youtube
  - [Mislove '07]
• 16-node compute cluster
  - 8 GB of RAM
  - 30 GB HDs
  - 8 core Intel CPU
34. COSI
Partitioning quality
[Figure: bar chart "Comparison of Partitioning Methods". Y-axis: 0.0% to 40.0%. Series: Edge Cut Improvement, Imbalance. X-axis labels as extracted: Single Greedy, Batch Greedy, Batch Partition.]
COSI_Partition achieves a 36% improvement in
edge-cut with only slightly higher imbalance.
Took 7.5 h to load with individual triple insertion, 10.5 h with batch.
35. COSI
Query answering time
[Figure: chart "Query Times by Cost Model (in ms)", logarithmic scale. Y-axis: ms, 100 to 10,000,000. X-axis: 6 Edges / 3 Vars, 7 Edges / 4 Vars, 8 Edges / 3 Vars, 9 Edges / 3 Vars, 10 Edges / 3 Vars, 11 Edges / 4 Vars, 11 Edges / 5 Vars, 14 Edges / 5 Vars, 16 Edges / 7 Vars, 17 Edges / 5 Vars, 23 Edges / 6 Vars. Legend: Cost Model A, Cost Model B, Cost Model C, No Cost Model; the legend also shows the labels 2.0/0.5, 1.2/0.1, and 8.0/5.0.]
COSI_heur does very well, answering
pretty complex queries in under a second.
X-axis shows number of edges and variable vertices.
36. COSI
Partitioning Effect
[Figure: chart on a logarithmic scale. Y-axis: Time (ms), 100 to 100,000. X-axis: Size of the query (# edges / # vertices): 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V. Series: COSI Batch Partition, Individual Edge Insertion.]
COSI_heur does very well, answering
pretty complex queries in under a second.
38. COSI
Related Work
Approach | Systems | Pros | Cons
Single Machine | Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc | Latency, Speed | Limited size, Limited Throughput
Orchestrated Distribution | YARS 2, system extensions | Size Scalability | Latency, Limited Throughput
Asynchronous Cloud oriented | COSI | Size Scalability, Throughput Scalability, Resource Elasticity | Latency
39. COSI
Conclusion
• COSI is a general, scalable and fast graph database framework for social network analysis
• Demonstrated scalability and speed on the problem of subgraph identification