A query language for analyzing networks

Anton Dries
(based on joint work with Siegfried Nijssen)

Idea

Declarative language for manipulating and
analyzing information networks
“Query language” – cf. SQL

with special focus on querying connections

simplicity / expressivity / flexibility

Information networks

Objects (“nodes”)

Connections between objects (“edges”)

Focus on structure (“topology”)

a.k.a. “large single graph”


HTTP://SPIKEDMATH.COM/382.HTML

Examples:

World Wide Web

Social networks

Bibliographical

Transportation

Biological

Process
Common tasks
TOP DOWN APPROACH

Query language [CIKM 2009]

Operational model (algebra) [MLG 2010]

Implementation & Optimization ?

Data management & storage
Graph databases (DEX, Neo, ...)

Common tasks
Feature-based queries

Structure-based queries

Aggregation

Basic graph problems e.g. degree, shortest path

Network analysis (e.g. centrality measures)

...
Mainly path-based queries

BiQL
“The BISON Query Language”

[Figure: example bibliographic network. Author nodes connect to publication nodes through "author of" edges; publication nodes connect to keyword nodes ("graphs", "data mining", "machine learning") through "has keyword" edges. Slide label: probabilities.]

[Figure: bibliographic network of author, publication, and keyword nodes, now with "co-author" edges added between author nodes. Slide labels: co-authorship, probabilities.]


Manipulation
“query language”
SQL-style: loosely based on SQL syntax

One type of query: create set of (new) objects

CREATE/UPDATE Domain<Vars> { Properties }
FROM Path Expression
WHERE Constraints

Example
[Figure: bibliographic network of author, publication, and keyword nodes, with "co-author" edges between author nodes.]

CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P
<− AuthorOf <− Author B

Example
[Figure: bibliographic network of author, publication, and keyword nodes, with "co-author" edges between author nodes.]

CREATE CoAuthor<A,B> { A <−>, B <−> }

“object creation” – output specification


FROM Author A −> AuthorOf −> Publication P
<− AuthorOf <− Author B

“path expression” – structural selection

(+ other operations)

Structural selection
Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B,
Publication P −> HasKeyword −> Keyword K

[Figure: pattern graph. Author A and Author B each connect through AuthorOf to Publication P; Publication P connects through HasKeyword to Keyword K.]

Author A −> CoAuthor −> Author B −> CoAuthor −> Author C −> CoAuthor −> Author A

[Figure: pattern graph. Authors A, B, and C form a triangle of CoAuthor edges.]

regular expressions
Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

list variables

each expansion of the regular expression should
lead to a valid (simple) path expression defining
the same variables

Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1..n4 and edges e1 (n1 −> n2), e2 (n2 −> n3), e3 (n1 −> n3), e4 (n2 −> n4), e5 (n3 −> n4).]

Node A −> Edge [E] −> Node B
(A,E,B) = (n1, [e1], n2)
          (n1, [e3], n3)
          (n2, [e2], n3)
          (n2, [e4], n4)
          (n3, [e5], n4)

Node A −> Edge [E] −> Node −> Edge [E] −> Node B
(A,E,B) = (n1, [e1,e2], n3)
          (n1, [e1,e4], n4)
          (n1, [e3,e5], n4)
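
The expansion of the starred pattern can be sketched in Python. This is only an illustration, not the BiQL implementation: the `EDGES` dictionary encodes the example graph as read off the tuples above, and the function name `matches` and the `max_edges` bound are assumptions introduced here. Like homomorphism matching, the sketch does no cycle check, so the bound is what keeps it finite on cyclic graphs.

```python
# Example graph from the slide: edge name -> (source node, target node).
EDGES = {
    "e1": ("n1", "n2"),
    "e2": ("n2", "n3"),
    "e3": ("n1", "n3"),
    "e4": ("n2", "n4"),
    "e5": ("n3", "n4"),
}


def matches(edges, max_edges):
    """Yield (A, E, B) for every expansion of
    Node A -> Edge [E] -> (Node -> Edge [E] ->)* Node B
    that uses at most max_edges edges (hypothetical bound)."""
    outgoing = {}
    for name, (src, dst) in edges.items():
        outgoing.setdefault(src, []).append((name, dst))

    def extend(start, node, path):
        for name, nxt in outgoing.get(node, []):
            new_path = path + [name]
            yield (start, new_path, nxt)       # one match per expansion
            if len(new_path) < max_edges:      # expand the starred group again
                yield from extend(start, nxt, new_path)

    nodes = sorted({n for pair in edges.values() for n in pair})
    for a in nodes:
        yield from extend(a, a, [])


for a, e, b in matches(EDGES, 2):
    print(a, e, b)
```

On this graph the enumeration with two edges also yields (n2, [e2,e5], n4), and with three edges (n1, [e1,e2,e5], n4), which the listing above does not show.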

Output speciﬁcation
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B

CREATE / UPDATE: update/create objects
Domain: put them in this domain
<Vars>: for each combination of values
{ Properties }: with these properties

[Figure: example graph with nodes n1..n4 and edges e1..e5.]

UPDATE <A> { nr_reach: count }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

Step 1: matches (A, E, B) of the path expression
(n1, [e1], n2)
(n1, [e3], n3)
(n2, [e2], n3)
(n2, [e4], n4)
(n3, [e5], n4)
(n1, [e1,e2], n3)
(n1, [e1,e4], n4)
(n1, [e3,e5], n4)

Step 2: group by <A>, keeping (E, B) in each group
n1: ([e1], n2), ([e3], n3), ([e1,e2], n3), ([e1,e4], n4), ([e3,e5], n4)
n2: ([e2], n3), ([e4], n4)
n3: ([e5], n4)

Step 3: within each group, collect the edge lists per B
n1: n2: ([e1])
    n3: ([e3]), ([e1,e2])
    n4: ([e1,e4]), ([e3,e5])
n2: n3: ([e2])
    n4: ([e4])
n3: n4: ([e5])

Step 4: count
n1: 3
n2: 2
n3: 1

Step 5: UPDATE
n1  nr_reach: 3
n2  nr_reach: 2
n3  nr_reach: 1
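
The grouping steps can be mimicked in a few lines of Python. This is a sketch under an assumption read off the slide's result: `count` is taken to count the distinct values of B that remain for each A once the edge lists are collected. The `MATCHES` list copies the eight tuples shown on the slide; the function name `nr_reach` is hypothetical.

```python
from collections import defaultdict

# The eight (A, E, B) matches listed on the slide.
MATCHES = [
    ("n1", ("e1",), "n2"),
    ("n1", ("e3",), "n3"),
    ("n2", ("e2",), "n3"),
    ("n2", ("e4",), "n4"),
    ("n3", ("e5",), "n4"),
    ("n1", ("e1", "e2"), "n3"),
    ("n1", ("e1", "e4"), "n4"),
    ("n1", ("e3", "e5"), "n4"),
]


def nr_reach(matches):
    """Group matches by A and count the distinct B values per group."""
    reached = defaultdict(set)
    for a, _edge_list, b in matches:
        reached[a].add(b)          # the edge lists collapse per (A, B)
    return {a: len(bs) for a, bs in reached.items()}


print(nr_reach(MATCHES))
```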

Object properties

Attribute definition

strength: count
start: min(P.year)

Link definition
A −>, B −>, P <−

Co-authorship
adding a new relationship

[Figure: Authors A and B joined by a CoAuthor object (strength: 3, start: 2008, end: 2010), derived from three shared publications P1 (year: 2008), P2 (year: 2008), and P3 (year: 2010).]

CREATE CoAuthor<A,B>
{ A −>, B −>, <− P,
start: min(P.year),
end: max(P.year),
strength: count }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
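
As a rough illustration of what this query computes, the following Python sketch aggregates shared publications per ordered author pair. The `PUBLICATIONS` data mirrors the three publications in the figure; the dictionary layout and the function name `co_authors` are assumptions made for this example, and the sketch additionally assumes that A and B are distinct authors.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical encoding of the figure: three publications shared by A and B.
PUBLICATIONS = {
    "P1": {"year": 2008, "authors": ["A", "B"]},
    "P2": {"year": 2008, "authors": ["A", "B"]},
    "P3": {"year": 2010, "authors": ["A", "B"]},
}


def co_authors(publications):
    """Build one CoAuthor record per ordered pair (A, B) with A != B."""
    years = defaultdict(list)
    for pub in publications.values():
        # Every ordered pair of distinct authors of the same publication
        # is one match of the path expression.
        for a, b in permutations(pub["authors"], 2):
            years[(a, b)].append(pub["year"])
    return {
        pair: {"strength": len(ys), "start": min(ys), "end": max(ys)}
        for pair, ys in years.items()
    }


print(co_authors(PUBLICATIONS)[("A", "B")])
```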

Size of neighborhood
transitive closure

UPDATE <A> { netsize: count }
FROM Author A −> (CoAuthor [co] <− Author −>)*
CoAuthor [co] <− Author B
WHERE length(co) < 4

Distance
based on shortest path

CREATE Connection<A,B>
{ A −>, −> B, distance: min<E>(length(E)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

distance: min<E>(length(E))
distance: min<E>(sum(E.weight))
distance: max<E>(product(E.probability))

Centrality measures
degree centrality
UPDATE <A> { Cdegree: count/(count<N>-1) }
FROM Node A −− Edge −− Node B, Node N
C_D(v) = deg(v) / (n - 1)
closeness centrality
UPDATE <A> { closeness: 1/sum(min<AB>(AB.distance))}
FROM Node A −> Connection AB −> Node B
C_C(v) = 1 / \sum_{t \in V} dist(v, t)
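
Both measures can be checked on a small graph with plain Python. This is a minimal sketch, not the BiQL evaluation: the undirected `EDGES` list (the earlier example graph with directions ignored) and all function names are assumptions, distances are unweighted hop counts from a breadth-first search, and the graph is assumed to be connected.

```python
from collections import deque

# Hypothetical undirected graph: the n1..n4 example with directions ignored.
EDGES = [("n1", "n2"), ("n1", "n3"), ("n2", "n3"), ("n2", "n4"), ("n3", "n4")]


def neighbours(edges):
    """Build an undirected adjacency map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """C_D(v) = deg(v) / (n - 1)."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj):
    """C_C(v) = 1 / sum of distances from v to all other nodes."""
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        result[v] = 1 / total if total else 0.0
    return result


adj = neighbours(EDGES)
print(degree_centrality(adj))
print(closeness_centrality(adj))
```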

Operational model

Query algebra operators:
Evaluate path expression (graph –> tuple)

Relational algebra (tuple –> tuple)

Construction operator (tuple –> graph)

Used by prototype implementation

Operational model
Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

“Pattern match” operator is too broad

Enumerates all paths
exponential

e.g. even when only shortest path is requested

Need for atomic graph operations (open question)

Pattern matching
Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Homomorphism matching (no cycle check)
more efficient than isomorphism

cycles could lead to unbounded solutions

Use constraints and algebraic solutions to avoid
infinite processing
operator interaction – “pattern match” operator not
atomic enough

Avoiding unbounded
solutions
CREATE Distance<A,B>
{ A −>, −> B, distance: min<E>(sum(E.weight)) }

CREATE ConnectionWeight<A,B>
{ A −>, −> B, distance: sum<E>(product(E.weight)) }

CREATE PathCount<A,B>
{ A −>, −> B, numP: count<E> }

Fletcher’s algorithm
[FLETCHER, 1980]
[BATAGELJ, 1994]
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} ⊕ (C_{k-1,i,k} ⊙ C_{k-1,k,k}* ⊙ C_{k-1,k,j})
  C_{k,k,k} = e⊙ ⊕ C_{k,k,k}
where
  C_{0,i,j}: weighted adjacency matrix
  (S, ⊕, ⊙, e⊕, e⊙): an algebraic semiring
  a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...: closure operator
  n: number of nodes in the graph


Dynamic programming approach

At step k: C_{k,i,j} contains the solution using paths
whose intermediate nodes are only nodes 1...k
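
The recurrence can be written generically over a semiring. The sketch below is one possible Python rendering, not the prototype's code: the `Semiring` class, the `fletcher` and `adjacency` helpers, and the edge weights in `WEIGHTED` are all assumptions introduced for illustration. Each step builds a fresh matrix so that only values from step k-1 are read.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # ⊕: combine alternative paths
    times: Callable[[Any, Any], Any]  # ⊙: extend a path
    zero: Any                         # e⊕: value for "no path"
    one: Any                          # e⊙: value of the empty path
    star: Callable[[Any], Any]        # closure a*


def fletcher(matrix, s):
    """Apply the recurrence for k = 1..n (0-based here)."""
    n = len(matrix)
    c = [row[:] for row in matrix]
    for k in range(n):
        loop = s.star(c[k][k])  # closure of the cycles through k
        nxt = [[s.plus(c[i][j], s.times(s.times(c[i][k], loop), c[k][j]))
                for j in range(n)] for i in range(n)]
        nxt[k][k] = s.plus(s.one, nxt[k][k])  # add the empty path at k
        c = nxt
    return c


def adjacency(n, edges, zero):
    """Weighted adjacency matrix C_0; missing edges get e⊕."""
    m = [[zero] * n for _ in range(n)]
    for i, j, w in edges:
        m[i][j] = w
    return m


# Shortest paths: (min, +, inf, 0); a* = 0 for non-negative weights.
SHORTEST = Semiring(min, lambda a, b: a + b, math.inf, 0, lambda a: 0)
# Path counting: (+, *, 0, 1); a* = 1 without a cycle, infinite otherwise.
PATH_COUNT = Semiring(lambda a, b: a + b,
                      lambda a, b: 0 if a == 0 or b == 0 else a * b,
                      0, 1,
                      lambda a: 1 if a == 0 else math.inf)

# Hypothetical weights on the n1..n4 example graph (indices 0..3).
WEIGHTED = [(0, 1, 1), (1, 2, 1), (0, 2, 5), (1, 3, 4), (2, 3, 1)]
UNIT = [(i, j, 1) for i, j, _ in WEIGHTED]

dist = fletcher(adjacency(4, WEIGHTED, SHORTEST.zero), SHORTEST)
paths = fletcher(adjacency(4, UNIT, PATH_COUNT.zero), PATH_COUNT)
print(dist[0][3], paths[0][3])
```

Swapping the semiring changes the problem being solved while the three nested loops stay the same, which is the point of the generalized algorithm.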

Some examples ...

(S, ⊕, ⊙, e⊕, e⊙) = (ℝ+, min, +, ∞, 0)

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = min(C_{k-1,i,j}, C_{k-1,i,k} + C_{k-1,k,j})
  C_{k,k,k} = 0

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...
C_{k,k}* = min(0, C_{k,k}, 2C_{k,k}, 3C_{k,k}, ...) = 0 (C_{k,k} >= 0)

Floyd-Warshall shortest path algorithm

(S, ⊕, ⊙, e⊕, e⊙) = ([0,1], +, ·, 0, 1)

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...
C_{k,k}* = 1 + C_{k,k} + C_{k,k}^2 + C_{k,k}^3 + ... = 1 / (1 - C_{k,k}) (|C_{k,k}| < 1)

sum of all path weights

(S, ⊕, ⊙, e⊕, e⊙) = (ℕ, +, ·, 0, 1)

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}

a* = 1 + a + a^2 + a^3 + ...
C_{k,k}* = 1 (C_{k,k} = 0) no cycle k –> k
C_{k,k}* = ∞ (C_{k,k} > 0) cycle k –> k

number of paths

Generalized algorithm for several connectivity
problems
O(n^3) time complexity, O(n^3) or O(n^2) space complexity

for many problems: best known time complexity
(exact, for arbitrary graphs)

also in the presence of cycles (thanks to the C_{k-1,k,k}* term)

Applicability depends on constraints on path

(S, ⊕, ⊙, e⊕, e⊙) = (ℝ, min, +, ∞, 0)

CREATE Connection<A,B>
{ A −>, −> B, distance: min<E>(sum(E.weight)) }
WHERE A.color = 'blue'

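if the concatenation e1 e2 matches the path expression, then
e1 and e2 must each match the path expression

[Figure: a matching path equals the concatenation (+) of two matching sub-paths.]

=> has to compute all-pairs shortest paths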

Conclusion

A query language for analyzing networks

Focused on path-based analysis

Raises interesting questions

Some ideas on implementation and optimization

Future work
Need for atomic graph operations

Fletcher’s algorithm:
interaction with constraints

complex path expressions (not just Node-Edge-Node)

Approximate answers – O(n^3) is very bad

Other metrics: flow-based, pagerank, ... mining
