Complex networks, such as biological, social, and communication
networks, often entail uncertainty and can therefore be
modeled as probabilistic graphs. As with similarity search
in standard graphs, a fundamental problem
for probabilistic graphs is to efficiently answer k-nearest
neighbor (k-NN) queries, that is, to compute
the k nodes closest to a given query node.
In this paper we introduce a framework for processing
k-NN queries in probabilistic graphs. We propose novel distance
functions that extend well-known graph concepts, such
as shortest paths. In order to compute them in probabilistic
graphs, we design algorithms based on sampling. During
k-NN query processing we efficiently prune the search space
using novel techniques.
Our experiments indicate that our distance functions outperform
previously used alternatives in identifying true neighbors
in real-world biological data. We also demonstrate
that our algorithms scale for graphs with tens of millions
of edges.
2. Thesis
• Many complex networks are modeled as
probabilistic (i.e., uncertain) graphs.
• The probabilistic treatment of such graphs leads
to better understanding of real data.
Nearest Neighbors in Uncertain Graphs @ VLDB 2010 2
3. Probabilistic Protein-Protein Interaction Networks
Possible interactions between proteins are established through biological experiments that entail uncertainty.
The edge probability represents that uncertainty.
[Figure: example probabilistic graph on nodes A, B, C, D with edge probabilities 0.2, 0.6, 0.4, 0.3, 0.7]
Source: Asthana et al., Genome Research 2004
4. Probabilistic Protein-Protein Interaction Networks
• Neighbors of a given node in a standard graph?
– Nodes close in terms of shortest path distance!
• How do we define neighbors in probabilistic graphs?
• How do we define the distance?
– Treat them as weighted graphs (N06)
– Nodes with high reliability (GR04)
– Most probable path (BI03)
– …shortest paths? (VLDB10)
[Figure: the example probabilistic graph on nodes A, B, C, D]
5. Probabilistic Protein-Protein Interaction Networks
• Why is it important to find good neighbors of
proteins in PPI networks?
– Detection of candidate co-complex relationships.
– Actual co-complex relationships can be
established through experiments in the lab.
8. Distance Definition
[Figure: the example probabilistic graph on nodes A, B, C, D and its 32 possible worlds, one for each subset of the five edges]
11. Distance Definition
[Figure: the graph (left) and a world (right); in this world only the edges (A,B) and (B,D) exist]
$\Pr(\text{world}) = p(A,B)\, p(B,D)\, (1 - p(B,C))\, (1 - p(C,D))\, (1 - p(A,D))$
[Figure: PDF of the shortest path length d(B,D): Pr(1) = .3, Pr(2) = .26, Pr(inf) = .44]
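The possible-world definition can be checked by brute force on the small example. The sketch below is illustrative only: it assumes the edge assignment p(A,B) = 0.2, p(A,D) = 0.6, p(B,C) = 0.4, p(B,D) = 0.3, p(C,D) = 0.7 (a reading of the figure, not stated explicitly on the slides), treats distance as hop count, and uses hypothetical helper names `hop_distance` and `shortest_path_pdf`.

```python
from collections import deque
from itertools import product

# Assumed edge probabilities for the four-node example (a reading of the figure).
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def hop_distance(present_edges, source, target):
    """BFS hop count from source to target; infinity if they are disconnected."""
    adjacency = {}
    for u, v in present_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adjacency.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")


def shortest_path_pdf(edges, source, target):
    """Exact PDF of the hop distance, enumerating all 2^|E| possible worlds."""
    edge_list = list(edges)
    pdf = {}
    for mask in product([False, True], repeat=len(edge_list)):
        prob = 1.0
        present = []
        for edge, exists in zip(edge_list, mask):
            # Pr(world) multiplies p(e) for present edges and 1 - p(e) for absent ones.
            prob *= edges[edge] if exists else 1.0 - edges[edge]
            if exists:
                present.append(edge)
        d = hop_distance(present, source, target)
        pdf[d] = pdf.get(d, 0.0) + prob
    return pdf


pdf_bd = shortest_path_pdf(EDGES, "B", "D")
```

Under the assumed edge assignment, `pdf_bd` rounds to the .3, .26, .44 values shown for d(B,D). The loop visits 2^5 = 32 worlds here, which is why exact enumeration does not scale beyond toy graphs.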
12. Distance Definition
• Use well-known statistics of the shortest path PDF:
– Median
– Majority (mode)
– ExpectedReliable
• Infinity problem
• Hard! They require explicit enumeration of possible worlds: resort to sampling!
[Figure: PDF of the shortest path length d(B,D) with Pr(1) = .3, Pr(2) = .26, Pr(inf) = .44, giving d_med = 2, d_maj = inf, d_exp = 1.46]
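As a worked check of the three statistics, the sketch below recomputes them from the PDF values shown for d(B,D). The function names are hypothetical. The median is taken as the smallest distance whose cumulative probability reaches 1/2, and ExpectedReliable is read here as the expected distance over the worlds in which the target is reachable, reported together with that reachability probability. Both readings are assumptions chosen because they reproduce the values on the slide.

```python
import math


def median_distance(pdf):
    """Smallest distance whose cumulative probability reaches 1/2."""
    cumulative = 0.0
    for d in sorted(pdf):
        cumulative += pdf[d]
        if cumulative >= 0.5:
            return d
    return math.inf


def majority_distance(pdf):
    """Most probable distance (the mode of the PDF)."""
    return max(pdf, key=pdf.get)


def expected_reliable_distance(pdf):
    """Expected distance given reachability, and the reachability probability."""
    reliability = sum(p for d, p in pdf.items() if d != math.inf)
    if reliability == 0.0:
        return math.inf, 0.0
    expected = sum(d * p for d, p in pdf.items() if d != math.inf) / reliability
    return expected, reliability


# PDF of d(B,D) as read from the slide.
PDF_BD = {1: 0.30, 2: 0.26, math.inf: 0.44}

d_med = median_distance(PDF_BD)
d_maj = majority_distance(PDF_BD)
d_exp, reliability = expected_reliable_distance(PDF_BD)
```

Conditioning on reachability is one way around the infinity problem: a plain expectation would be infinite whenever the disconnected case has nonzero probability.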
14. Sampling Algorithms
1. sample (a small number of) worlds
2. compute sample median (approximation)
3. output result
– Median (Chernoff bound)
– ExpectedReliable (Hoeffding inequality)
– Majority (No bound)
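The three steps can be sketched for the median as follows. This is a minimal illustration, not the authors' implementation: it assumes hop-count distances, the same assumed edge probabilities for the four-node example, and hypothetical helper names. Each edge is kept independently with its probability, and the sample median is the smallest distance covering at least half of the sampled worlds.

```python
import math
import random
from collections import deque

# Assumed edge probabilities for the four-node example (a reading of the figure).
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def sample_world(edges, rng):
    """Step 1: keep each edge independently with its probability."""
    return [edge for edge, p in edges.items() if rng.random() < p]


def bfs_distances(present_edges, source):
    """Hop distances from source to every node reachable in one world."""
    adjacency = {}
    for u, v in present_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sampled_median_distance(edges, source, target, num_worlds, rng):
    """Steps 2 and 3: sample median of the distance over num_worlds worlds."""
    samples = []
    for _ in range(num_worlds):
        dist = bfs_distances(sample_world(edges, rng), source)
        samples.append(dist.get(target, math.inf))  # unreachable counts as inf
    samples.sort()
    return samples[(num_worlds - 1) // 2]


estimate = sampled_median_distance(EDGES, "B", "D", 5000, random.Random(0))
```

The number of worlds (5000) and the seed are arbitrary choices for the example; the slides only say that a small number of worlds suffices for the median.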
15. Sampling Algorithms
BIOMINE: database of biological entities and uncertain interactions from UHelsinki; 1M nodes, 10M edges
FLICKR: users from flickr.com; edges have been created assuming homophily, based on Jaccard of Flickr groups; 77K nodes, 20M edges
17. kNN Pruning
• Query: Given a probabilistic graph and a source node, find the set of k nodes closest to the source.
• Naïve algorithm:
1. sample worlds
2. run Dijkstra traversals and compute a PDF of the shortest path (sp) distance per node
3. calculate the median distance to all nodes using the PDFs
4. compute k-NN
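The four naive steps might look like the sketch below. It is a simplified illustration under stated assumptions: distances are hop counts, so a BFS stands in for Dijkstra; the graph is the four-node example with an assumed edge assignment; and `naive_knn` and its helpers are hypothetical names.

```python
import math
import random
from collections import deque

# Assumed edge probabilities for the four-node example (a reading of the figure).
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def sample_world(edges, rng):
    """Keep each edge independently with its probability."""
    return [edge for edge, p in edges.items() if rng.random() < p]


def bfs_distances(present_edges, source):
    """Hop distances from source to every node reachable in one world."""
    adjacency = {}
    for u, v in present_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def naive_knn(edges, source, k, num_worlds, rng):
    """Full traversal of every sampled world, then rank nodes by median distance."""
    nodes = {n for edge in edges for n in edge} - {source}
    samples = {n: [] for n in nodes}
    for _ in range(num_worlds):                       # step 1: sample worlds
        dist = bfs_distances(sample_world(edges, rng), source)
        for n in nodes:                               # step 2: one sample per node
            samples[n].append(dist.get(n, math.inf))
    medians = {}
    for n, values in samples.items():                 # step 3: median per node
        values.sort()
        medians[n] = values[(len(values) - 1) // 2]
    ranked = sorted(medians.items(), key=lambda item: (item[1], item[0]))
    return ranked[:k]                                 # step 4: k closest nodes


nearest = naive_knn(EDGES, "A", 1, 2000, random.Random(0))
```

Every world is traversed completely and a distance sample is stored for every node, which is the cost the pruning slides aim to avoid.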
18. kNN Pruning naive
[Figure: animation of the naive 1-NN (median) query from node A over 5 sampled worlds of an example graph on nodes A to G with edge probabilities 0.5, 0.6, 0.8, 0.3, 0.9, 0.3, 0.7, 0.4; every world is traversed in full and a distance histogram is built for each of the nodes B, C, D, E, F, G]
26. kNN Pruning
• algorithm
– sample worlds on the fly
– increase the horizon of each Dijkstra one hop at a time
– maintain truncated pdf histograms
[Figure: 1-NN (median) query from node A with 5 sampled worlds on the example graph on nodes A to G]
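One way to read the three bullets is sketched below. This is not the authors' code: it assumes hop-count distances (so each Dijkstra becomes a level-by-level BFS), stops as soon as k nodes have a known median, and uses hypothetical names throughout. Edge coins are flipped only when a traversal first examines an edge, and the per-node counter plays the role of a truncated histogram, recording only how many worlds reached the node within the current horizon.

```python
import random

# Assumed edge probabilities for the four-node example (a reading of the figure).
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def build_adjacency(edges):
    """Undirected adjacency lists of (neighbor, probability) pairs."""
    adjacency = {}
    for (u, v), p in edges.items():
        adjacency.setdefault(u, []).append((v, p))
        adjacency.setdefault(v, []).append((u, p))
    return adjacency


def pruned_knn(adjacency, source, k, num_worlds, rng):
    """Median-distance k-NN that expands all worlds one hop at a time."""
    visited = [{source} for _ in range(num_worlds)]   # per-world traversal state
    frontier = [[source] for _ in range(num_worlds)]
    reached = {}   # node -> worlds that reached it within the current horizon
    median = {}    # node -> median distance, once it is known
    need = (num_worlds + 1) // 2                      # worlds needed for a median
    horizon = 0
    while len(median) < k and any(frontier):
        horizon += 1
        for w in range(num_worlds):
            next_frontier = []
            for u in frontier[w]:
                for v, p in adjacency.get(u, []):
                    if v in visited[w]:
                        continue
                    # Each edge is examined at most once per world, from its
                    # first visited endpoint, so the coin is flipped lazily here.
                    if rng.random() < p:
                        visited[w].add(v)
                        next_frontier.append(v)
                        reached[v] = reached.get(v, 0) + 1
            frontier[w] = next_frontier
        for v, count in reached.items():
            if v not in median and count >= need:
                median[v] = horizon                   # median is exactly this hop
    ranked = sorted(median.items(), key=lambda item: (item[1], item[0]))
    return ranked[:k]  # may hold fewer than k nodes if the rest have median inf


nearest = pruned_knn(build_adjacency(EDGES), "A", 1, 2000, random.Random(0))
```

Nodes without a known median after a given horizon must have a median beyond it, so the search can stop once k medians are known without instantiating the rest of each world. The price is that the traversal state of every world stays in memory at once.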
27. kNN Pruning
[Figure: animation of the pruned 1-NN (median) query from node A over 5 worlds; the worlds are instantiated one hop at a time around A, only nodes B and C are discovered, and their truncated histograms give B median distance 1 and C median distance >1]
34. kNN Pruning
• B has distance 1
• C has distance greater than 1
• D, E, F, G, … were not discovered (d>1)
• 1NN set is complete with B – no need to continue
• just 2 nodes visited (and 2 histograms maintained)
• worlds were only partially instantiated
• same answer as the naive
• with a small cost: Dijkstra state needs to be maintained in memory for all worlds
[Figure: 1-NN (median) query from node A with 5 sampled worlds; B is annotated with median distance 1 and C with >1]
35. kNN Pruning
For 200 worlds and 5-NN the speedups were: 247x (BIOMINE), 111x (FLICKR), 269x (DBLP)
BIOMINE: database of biological entities and uncertain interactions from UHelsinki; 1M nodes, 10M edges
FLICKR: users from flickr.com; edges have been created assuming homophily, based on Jaccard of Flickr groups; 77K nodes, 20M edges
DBLP: authors from DBLP; probabilities have been assigned based on number of coauthored papers; 226K nodes, 1.4M edges
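38. Less uncertainty, more pruning
• boost probabilities of edges by giving each edge d chances: p becomes $1 - (1 - p)^d$
• d=1: original graph
• increasing d, p goes to 1
[Figure: the example graph before and after boosting; the edge probabilities 0.2, 0.6, 0.4, 0.3, 0.7 become $1 - 0.8^d$, $1 - 0.4^d$, $1 - 0.6^d$, $1 - 0.7^d$, $1 - 0.3^d$]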
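The boosting rule is small enough to state directly in code. In this sketch `boost` is a hypothetical name, and the edge assignment is again an assumed reading of the example figure. Giving an edge d independent chances means it is absent only if all d chances fail.

```python
# Assumed edge probabilities for the four-node example (a reading of the figure).
EDGES = {("A", "B"): 0.2, ("A", "D"): 0.6, ("B", "C"): 0.4,
         ("B", "D"): 0.3, ("C", "D"): 0.7}


def boost(p, d):
    """Probability that an edge appears in at least one of d independent chances."""
    return 1.0 - (1.0 - p) ** d


# d = 1 leaves the graph unchanged; larger d pushes every probability toward 1.
boosted = {d: {edge: boost(p, d) for edge, p in EDGES.items()} for d in (1, 2, 4)}
```

With d = 2, for instance, the 0.2 edge becomes 1 - 0.8^2 = 0.36.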
42. Experiments
• Dataset
– Probabilistic PPI network
[Krogan et al, Nature 06]
– Protein co-complex
relationships (ground truth)
[Mewes et al, Nuc Acids Res 04]
• Experiment
– Choose a ground truth edge
(A,B)
– Choose a node C s.t. there is
no ground truth edge (A,C)
– Classification task: Distinguish
between the two types of
edges: (A,B) and (A,C)
44. Conclusion
• Probabilistic graph analysis benefits from
possible-world semantics.
– Extended standard graph concepts to
probabilistic graphs and designed
approximation algorithms to compute them
– Introduced novel pruning algorithms for kNN
in probabilistic graphs
– Confirmed the efficacy of our framework on
real data.
45. Future Work
• Enrich model
– Node probabilities
– Arbitrary PDFs
• Explore random walks further