Information networks are a popular way to represent information, especially in domains where the emphasis lies on the structural relationships between the entities rather than their features. Notable examples are online social networks and road networks. This special focus on network topology has led to the development of specialized graph databases. However, few of these databases offer a high-level declarative interface suited for analyzing information networks.
In this talk I present our work on developing a query language for analyzing networks. I will focus on the general principles we followed in the design of this language, and the main challenges related to developing it into a scalable tool for network analysis.
A query language for analyzing networks
1. A query language for analyzing networks
Anton Dries
(based on joint work with Siegfried Nijssen)
2. Idea
Declarative language for manipulating and analyzing information networks
“Query language” – cf. SQL
with special focus on querying connections
simplicity / expressivity / flexibility
6. Process
Common tasks
TOP DOWN APPROACH
Query language
Operational model (algebra)
Implementation & Optimization
Data management & storage
7. Process
Common tasks
TOP DOWN APPROACH
Query language [CIKM 2009]
Operational model (algebra) [MLG 2010]
Implementation & Optimization ?
Data management & storage
Graph databases (DEX, Neo, ...)
10. [Figure: example bibliographic network with author, publication and keyword nodes (keywords include "graphs", "data mining", "machine learning" and "probabilities"), connected by "author of" and "has keyword" edges]
11. [Figure: the same bibliographic network (author, publication and keyword nodes; "author of" and "has keyword" edges), now with "co-author" edges added between authors to represent co-authorship]
12. Manipulation
“query language”
SQL-style: loosely based on SQL syntax
One type of query: create set of (new) objects
CREATE/UPDATE Domain<Vars> { Properties }
FROM Path Expression
WHERE Constraints
13. Example
[Figure: bibliographic network of author, publication and keyword nodes, with "co-author" edges between authors who share a publication]
CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P
<− AuthorOf <− Author B
14. Example
[Figure: bibliographic network of author, publication and keyword nodes, with "co-author" edges between authors who share a publication]
“object creation” – output specification
CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P
<− AuthorOf <− Author B
“path expression” – structural selection
(+ other operations)
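The CoAuthor query can be read as "match the path expression, then create one object per distinct (A, B) combination". Below is a minimal Python sketch of that reading, not the authors' implementation. The toy data, the `author_of` mapping and the function name are hypothetical, and the sketch assumes that A and B must be different authors, which the slides do not state.

```python
from itertools import permutations

# Hypothetical toy data: publication -> set of its authors (the AuthorOf edges).
author_of = {
    "P1": {"ann", "bob"},
    "P2": {"ann", "bob", "cat"},
}

def coauthor_pairs(author_of):
    """Match Author A -> AuthorOf -> Publication P <- AuthorOf <- Author B
    and return the distinct (A, B) combinations, one per CoAuthor object."""
    pairs = set()
    for authors in author_of.values():
        # Assumption: A and B are bound to different authors.
        pairs.update(permutations(sorted(authors), 2))
    return pairs

coauthors = coauthor_pairs(author_of)
```

Because the result is a set, two authors who share several publications still yield a single CoAuthor object per direction, which matches the "for each combination of values" reading of `<A,B>`.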
15. Structural selection
Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B,
Publication P −> HasKeyword −> Keyword K
[Figure: pattern graph in which Author A and Author B are each linked through AuthorOf to Publication P, and P is linked through HasKeyword to Keyword K]
Author A −> CoAuthor −> Author B −> CoAuthor −> Author C −> CoAuthor −> Author A
[Figure: pattern graph in which Authors A, B and C form a triangle of CoAuthor edges]
16. Structural selection
regular expressions
Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
list variables
each expansion of the regular expression should lead to a valid (simple) path expression defining the same variables
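To make the expansion rule concrete, the following Python sketch unrolls the starred group a bounded number of times and prints the resulting simple path expressions. It is only an illustration: the `expand` helper and the ASCII `->` arrows are assumptions, and the real language presumably does not materialize expansions as strings.

```python
def expand(max_repeats):
    """Yield the simple path expressions obtained by unrolling the starred
    group (Node -> Edge [E] ->)* between 0 and max_repeats times."""
    for k in range(max_repeats + 1):
        yield "Node A -> Edge [E] -> " + "Node -> Edge [E] -> " * k + "Node B"

expansions = list(expand(2))
for expression in expansions:
    print(expression)
```

Every expansion defines the same variables A, B and the list variable E. Only the number of edges collected in E differs between expansions.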
18. Output specification
CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
CREATE (or UPDATE) CoAuthor<A,B> { A <−>, B <−> }
CREATE / UPDATE: update/create objects
CoAuthor: put them in this domain
<A,B>: for each combination of values
{ A <−>, B <−> }: with these properties
27. Co-authorship
adding a new relationship
[Figure: authors A and B joined by a CoAuthor edge (strength: 3, start: 2008, end: 2010), derived from their shared publications P1 (year: 2008), P2 (year: 2008) and P3 (year: 2010)]
CREATE CoAuthor<A,B>
{ A −>, B −>, <− P,
start: min<P>(P.year),
end: max<P>(P.year),
strength: count<P> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
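The aggregates `min<P>`, `max<P>` and `count<P>` group the matched publications per (A, B) pair. A rough Python equivalent, using the three publications from the slide's figure, might look as follows. The dictionary layout and the function name are hypothetical, and the sketch assumes A and B are distinct.

```python
from collections import defaultdict
from itertools import permutations

# Publications from the slide's figure: id -> (year, authors).
publications = {
    "P1": (2008, {"A", "B"}),
    "P2": (2008, {"A", "B"}),
    "P3": (2010, {"A", "B"}),
}

def coauthor_edges(publications):
    """Group shared publications per ordered author pair and aggregate them."""
    years_by_pair = defaultdict(list)
    for year, authors in publications.values():
        for a, b in permutations(sorted(authors), 2):
            years_by_pair[(a, b)].append(year)
    return {
        pair: {
            "start": min(years),     # start: min<P>(P.year)
            "end": max(years),       # end: max<P>(P.year)
            "strength": len(years),  # strength: count<P>
        }
        for pair, years in years_by_pair.items()
    }

edges = coauthor_edges(publications)
```

For the pair (A, B) this reproduces the values shown on the slide: strength 3, start 2008, end 2010.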
28. Size of neighborhood
transitive closure
UPDATE <A> { netsize: count<B> }
FROM Author A −> (CoAuthor [co] <− Author −>)*
CoAuthor [co] <− Author B
WHERE length(co) < 4
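One way to read this query is "count the distinct authors B reachable from A over one to three CoAuthor edges". The Python sketch below follows that reading by expanding a frontier hop by hop. The adjacency dictionary and the `netsize` function are hypothetical. Since the slides do not say whether a path may revisit nodes, the sketch uses walk semantics, so A itself is counted when a walk returns to it.

```python
# Hypothetical co-author graph as an undirected adjacency dict: a - b - c - d - e.
coauthor = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

def netsize(coauthor, a, max_hops=3):
    """Count distinct authors reachable from `a` by a walk of 1..max_hops
    CoAuthor edges (length(co) < 4 corresponds to max_hops=3)."""
    frontier, reached = {a}, set()
    for _ in range(max_hops):
        # Authors reachable with exactly one more CoAuthor edge.
        frontier = {b for v in frontier for b in coauthor.get(v, ())}
        reached |= frontier
    return len(reached)

size_a = netsize(coauthor, "a")
```

If A should not count towards its own neighborhood, calling `reached.discard(a)` before returning is enough.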
29. Distance
based on shortest path
CREATE Connection<A,B>
{ A −>, −> B, distance: min<E>(length(E)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
distance: min<E>(length(E))
distance: min<E>(sum(E.weight))
distance: max<E>(product(E.probability))
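The three `distance` variants differ only in how one path's edge list E is reduced (length, sum, product) and how the per-path values are combined (min or max). The Python sketch below spells this out by naively enumerating paths on a toy graph. The graph, the attribute values and the helper name are hypothetical, and restricting the enumeration to simple paths is an assumption made to keep it finite.

```python
from math import prod

# Hypothetical directed edges with attributes.
edges = {
    ("a", "b"): {"weight": 1.0, "probability": 0.9},
    ("b", "c"): {"weight": 1.0, "probability": 0.9},
    ("a", "c"): {"weight": 5.0, "probability": 0.5},
}

def edge_lists(edges, src, dst, visited=frozenset()):
    """Yield each simple path src -> dst as its list of edge attribute dicts,
    i.e. one binding of the list variable E per path."""
    visited = visited | {src}
    for (u, v), attrs in edges.items():
        if u != src or v in visited:
            continue
        if v == dst:
            yield [attrs]
        else:
            for rest in edge_lists(edges, v, dst, visited):
                yield [attrs] + rest

paths = list(edge_lists(edges, "a", "c"))
hops = min(len(E) for E in paths)                                  # min<E>(length(E))
cost = min(sum(e["weight"] for e in E) for E in paths)             # min<E>(sum(E.weight))
best = max(prod(e["probability"] for e in E) for E in paths)       # max<E>(product(E.probability))
```

Enumerating every path is exponential in general. The sketch only shows what the aggregates mean, not how they should be evaluated.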
30. Centrality measures
degree centrality
UPDATE <A> { Cdegree: count<B>/(count<N>-1) }
FROM Node A −− Edge −− Node B, Node N
$C_D(v) = \frac{\deg(v)}{n - 1}$
closeness centrality
UPDATE <A> { closeness: 1/sum<B>(min<AB>(AB.distance))}
FROM Node A −> Connection AB −> Node B
$C_C(v) = \frac{1}{\sum_{t \in V} \mathrm{dist}(v, t)}$
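For comparison, here is a short Python sketch of both measures on a small undirected, unweighted graph. The graph and function names are hypothetical, hop counts from a breadth-first search stand in for the Connection distances, and the graph is assumed to be connected with at least two nodes.

```python
from collections import deque

# Hypothetical undirected graph: b - a - c.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}

def degree_centrality(graph, v):
    """C_D(v) = deg(v) / (n - 1)."""
    return len(graph[v]) / (len(graph) - 1)

def bfs_distances(graph, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(graph, v):
    """C_C(v) = 1 / sum over t of dist(v, t); assumes a connected graph."""
    return 1 / sum(bfs_distances(graph, v).values())

cd_a = degree_centrality(graph, "a")
cc_b = closeness_centrality(graph, "b")
```

Here node a is adjacent to every other node, so its degree centrality is 1. Node b is at distance 1 from a and 2 from c, so its closeness is 1/3.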
32. Operational model
Query algebra operators:
Evaluate path expression (graph –> tuple)
Relational algebra (tuple –> tuple)
Construction operator (tuple –> graph)
Used by prototype implementation
33. Operational model
Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
“Pattern match” operator is too broad
Enumerates all paths
exponential
e.g. even when only shortest path is requested
Need for atomic graph operations (open question)
34. Pattern matching
Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
Homomorphism matching (no cycle check)
more efficient than isomorphism
cycles could lead to unbounded solutions
Use constraints and algebraic solutions to avoid infinite processing
operator interaction – “pattern match” operator not atomic enough
35. Avoiding unbounded solutions
CREATE Distance<A,B>
{ A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
CREATE ConnectionWeight<A,B>
{ A −>, −> B, distance: sum<E>(product(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
CREATE PathCount<A,B>
{ A −>, −> B, numP: count<E> }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
36. Fletcher’s algorithm
[FLETCHER, 1980]
[BATAGELJ, 1994]
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} ⊕ (C_{k-1,i,k} ⊙ C_{k-1,k,k}* ⊙ C_{k-1,k,j})
  C_{k,k,k} = e⊙ ⊕ C_{k,k,k}
where C_{0,i,j}: weighted adjacency matrix
(S, ⊕, ⊙, e⊕, e⊙): an algebraic semiring
a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...: closure operator
n: number of nodes in the graph
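The recurrence translates almost line by line into code once the semiring is passed in as a parameter. The following Python sketch is one possible rendering, not the prototype's implementation. The `Semiring` class, its field names and the example graph are assumptions, and the min-plus closure shown assumes non-negative edge weights.

```python
from dataclasses import dataclass
from typing import Any, Callable

INF = float("inf")

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # the ⊕ operation
    times: Callable[[Any, Any], Any]  # the ⊙ operation
    one: Any                          # e⊙, the identity of ⊙
    star: Callable[[Any], Any]        # closure a*

def fletcher(adjacency, s):
    """Apply the recurrence to a weighted adjacency matrix C_0.
    Missing edges must be encoded as e⊕ of the chosen semiring."""
    n = len(adjacency)
    c = [row[:] for row in adjacency]
    for k in range(n):
        prev = [row[:] for row in c]   # C_{k-1}
        loop = s.star(prev[k][k])      # C_{k-1,k,k}*
        for i in range(n):
            for j in range(n):
                via_k = s.times(s.times(prev[i][k], loop), prev[k][j])
                c[i][j] = s.plus(prev[i][j], via_k)
        c[k][k] = s.plus(s.one, c[k][k])
    return c

# Shortest paths: (min, +) with e⊕ = INF and e⊙ = 0.
# Assumption: weights are non-negative, so a* = min(0, a, a+a, ...) = 0.
shortest = Semiring(plus=min, times=lambda a, b: a + b, one=0.0, star=lambda a: 0.0)

adjacency = [
    [INF, 1.0, 5.0],
    [INF, INF, 2.0],
    [INF, INF, INF],
]
distances = fletcher(adjacency, shortest)
```

In this example the direct edge 0 -> 2 costs 5, while the route through node 1 costs 3, so `distances[0][2]` ends up as 3.0. Copying the matrix once per `k` keeps the code close to the `C_{k-1}` / `C_k` notation at the cost of a second n × n buffer.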
39. Fletcher’s algorithm
(S, ⊕, ⊙, e⊕, e⊙) = ([0,1], +, ·, 0, 1)
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}
a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...
C_{k,k}* = 1 + C_{k,k} + C_{k,k}^2 + C_{k,k}^3 + ... = 1 / (1 − C_{k,k})   (|C_{k,k}| < 1)
sum of all path weights
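A small numeric check makes the role of the closed form 1 / (1 − C_{k,k}) visible. The Python sketch below specializes the recurrence to ordinary addition and multiplication on a hypothetical two-node cycle. The function name and the example weights are illustrative only.

```python
def sum_of_path_weights(adjacency):
    """Recurrence over (+, ·): entry [i][j] becomes the sum, over all paths
    from i to j (including the empty path when i == j), of the product of
    the edge weights along the path."""
    n = len(adjacency)
    c = [row[:] for row in adjacency]
    for k in range(n):
        prev = [row[:] for row in c]
        if abs(prev[k][k]) >= 1:
            raise ValueError("closure diverges: |C[k][k]| must be < 1")
        loop = 1.0 / (1.0 - prev[k][k])   # geometric series 1 + a + a^2 + ...
        for i in range(n):
            for j in range(n):
                c[i][j] = prev[i][j] + prev[i][k] * loop * prev[k][j]
        c[k][k] = 1.0 + c[k][k]
    return c

# Hypothetical cycle 0 -> 1 -> 0 with weight 0.5 on both edges.
totals = sum_of_path_weights([[0.0, 0.5], [0.5, 0.0]])
```

The paths from node 0 to node 1 have weights 0.5, 0.5^3, 0.5^5, and so on, which sum to 0.5 / (1 − 0.25) = 2/3. The algorithm reaches the same value in two iterations even though the number of paths is infinite.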
40. Fletcher’s algorithm
(S, ⊕, ⊙, e⊕, e⊙) = (N, +, ·, 0, 1)
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}
a* = 1 + a + a^2 + a^3 + ...
C_{k,k}* = 1 (C_{k,k} = 0): no cycle k –> k
C_{k,k}* = ∞ (C_{k,k} > 0): cycle k –> k
number of paths
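The path-counting instance can be sketched in Python as well. The function names and the two example graphs are hypothetical. One detail the slide leaves implicit is how 0 · ∞ is treated: the sketch assumes it is 0, since an infinite number of loops does not help when no edge leads to or from node k.

```python
INF = float("inf")

def times(a, b):
    # Assumption: 0 · ∞ = 0 (plain float arithmetic would give NaN).
    return 0 if a == 0 or b == 0 else a * b

def star(a):
    # a* = 1 + a + a^2 + ...: 1 without a cycle through k, otherwise infinite.
    return 1 if a == 0 else INF

def count_paths(adjacency):
    """Entry [i][j] becomes the number of paths from i to j, counting the
    empty path when i == j; INF signals infinitely many paths."""
    n = len(adjacency)
    c = [row[:] for row in adjacency]
    for k in range(n):
        prev = [row[:] for row in c]
        loop = star(prev[k][k])
        for i in range(n):
            for j in range(n):
                c[i][j] = prev[i][j] + times(times(prev[i][k], loop), prev[k][j])
        c[k][k] = 1 + c[k][k]
    return c

acyclic = count_paths([[0, 1, 1], [0, 0, 1], [0, 0, 0]])   # edges 0->1, 0->2, 1->2
cyclic = count_paths([[0, 1], [1, 0]])                     # edges 0->1, 1->0
```

In the acyclic graph there are two paths from node 0 to node 2 (direct and via node 1). In the cyclic graph every pair of nodes is connected by infinitely many paths.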
41. Fletcher’s algorithm
Generalized algorithm for several connectivity problems
O(n^3) time complexity, O(n^3) or O(n^2) space complexity
for many problems: best known time complexity (exact, for arbitrary graphs)
also in the presence of cycles (thanks to the C_{k-1,k,k}* term)
Applicability depends on constraints on path
42. Fletcher’s algorithm
(S, ⊕, ⊙, e⊕, e⊙) = (ℝ, min, +, ∞, 0)
CREATE Connection<A,B>
{ A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
WHERE A.color = ‘blue’
if the concatenation e1 e2 matches the path expression, then e1 and e2 must each match the path expression
[Figure: a path decomposed into two sub-paths]
=> has to compute all-pairs shortest paths
43. Conclusion
A query language for analyzing networks
Focused on path-based analysis
Raises interesting questions
Some ideas on implementation and optimization
44. Future work
Need for atomic graph operations
Fletcher’s algorithm:
interaction with constraints
complex path expressions (not just Node-Edge-Node)
Approximate answers – O(n^3) is very bad
Other metrics: flow-based, pagerank, ... mining