Given a large graph, the authors aim to produce a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation.
Graph Summarization with Quality Guarantees
1. Graph Summarization with Quality Guarantees
Matteo Riondato – Labs, Two Sigma Investments LP
DEI – UNIPD – September 26, 2016
1 / 32
2. Who am I?
Laurea in Ing. Informazione and Laurea Spec. in Ing. Informatica @ DEI-UNIPD;
Ph.D. in Computer Science at Brown U. with Eli Upfal;
Working at
Labs, Two Sigma Investments LP (Research Scientist);
CS Dept., Brown U. (Visiting Asst. Prof.);
Research in algorithmic data science: algorithms plus data, algorithms for data
(used to be data mining, but then data miners forgot about algorithms...);
Tweeting @teorionda;
“Living” at http://matteo.rionda.to.
2 / 32
3. What is my approach to research?
Not This:
Theory → Practice
3 / 32
4. What is my approach to research?
Not This:
Theory ← Practice
3 / 32
5. What is my approach to research?
Not This:
Theory ↔ Practice
3 / 32
6. What is my approach to research?
Not This:
Theory + Practice
3 / 32
7. What is my approach to research?
This:
(Theory × Practice)^(Theory × Practice)
Marry Theory and Practice: make them one, leverage individual/combined strengths!
We can’t look at them separately to solve larger-than-anything data problems.
3 / 32
8. What is summarization?
Motivation: Speeding up the answering of queries on a large graph G = ([n], E).
Examples of query:
“What’s the degree of v ∈ [n]?”, “How many triangles are there in G?”
Summarization: creation and use of a summary data structure to answer the queries;
Summary: a lossy representation of G that fits in main memory (i.e., a sketch);
We also need algorithms using the summary to approximately answer the queries;
We want a general-purpose summary, not specialized for only one query.
Summarization is not:
•graph compression: store a lossless compressed version of the graph on disk;
•graph partitioning: partition the nodes into k classes, while minimizing the number
of edges from one class to the other.
4 / 32
9. What am I going to talk about?
We present the first constant-factor
approximation algorithm for building
graph summaries according to natural
error measures;
Among the contributions: a practical use
of extremal graph theory, with the cut
norm and the algorithmic version of
Szemerédi’s Regularity Lemma.
Joint work with
Francesco Bonchi (ISI / Eurecat) and
David García-Soriano (Eurecat).
IEEE ICDM’14;
[First page of the IEEE ICDM’14 paper shown: “Graph Summarization with Quality Guarantees”, by Matteo Riondato (Stanford University), David García-Soriano and Francesco Bonchi (Yahoo Labs, Barcelona, Spain).]
Data Mining and Knowledge Discovery,
in press;
[First page of the Data Mining and Knowledge Discovery manuscript shown, same title and authors.]
5 / 32
10. What does a summary look like?
G = ([n], E): an undirected weighted graph with adjacency matrix AG;
Summary S of G: a complete weighted graph with self-loops: S = (V, V × V);
The supernodes Vi , i = 1, . . . , k, in V form a partitioning of [n];
The superedge weights are the edge densities between the corresponding sets of nodes:

dG(i, j) = dG(Vi, Vj) = eG(Vi, Vj) / (|Vi| |Vj|),

where

eG(S, T) = Σ_{i∈S, j∈T} AG(i, j) for all S, T ⊆ V.
Technical point: if S and T overlap, each non-loop edge with both endpoints in
S ∩ T is counted twice for eG(S, T).
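Computing the densities is straightforward from the adjacency matrix; a minimal NumPy sketch (the function name and the partition representation are mine, not from the talk):

```python
import numpy as np

def density_matrix(A, partition):
    """k x k density matrix A_S of the summary induced by `partition`:
    entry (i, j) is d_G(V_i, V_j) = e_G(V_i, V_j) / (|V_i| |V_j|),
    where e_G sums A over all ordered pairs in V_i x V_j.
    `A` is the symmetric (possibly weighted) adjacency matrix of G;
    `partition` is a list of k lists of vertex indices covering [n]."""
    k = len(partition)
    D = np.zeros((k, k))
    for i, Vi in enumerate(partition):
        for j, Vj in enumerate(partition):
            # e_G(Vi, Vj): sum of A over the whole Vi x Vj block
            D[i, j] = A[np.ix_(Vi, Vj)].sum() / (len(Vi) * len(Vj))
    return D
```

On K4 (the complete graph on 4 nodes, no self-loops) with parts {0,1} and {2,3}, every within-part density is 1/2 and every cross-part density is 1.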
6 / 32
11. Shall we look at an example?
A graph with 7 nodes, and a summary of it with 3 supernodes:
[Figure: nodes 1–7 partitioned into supernodes {1,2}, {3,4,5}, {6,7}; the superedge densities shown are 7/9, 3/4, 1/4, 1/3, 1/3, and 0.]
7 / 32
12. How can we answer queries using a summary?
Preliminary question:
What are the desired semantics of an answer computed on the summary?
A summary S is a lossy representation of G, compatible with many other graphs;
We adopt expected-value semantics for query answers:
Let π be a distribution over the set of graphs compatible with S;
Given a query q and S, we want

answer(q, S) = Eπ[answer(q, G′)], where G′ ∼ π is a graph compatible with S.

We most often use the uniform distribution as π.
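One way to make this semantic concrete is to sample graphs whose expected adjacency matrix matches the summary. The independent-edges distribution below is an illustrative choice of π (not necessarily the uniform distribution over compatible graphs), and the function name is mine:

```python
import numpy as np

def sample_compatible_graph(D, labels, seed=0):
    """Sample an unweighted graph from one natural pi: each edge {u, w},
    u != w, is present independently with probability
    D[labels[u], labels[w]], i.e., the corresponding superedge density.
    The expected adjacency matrix then equals the lifted matrix
    (self-loops are ignored here for simplicity)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    lifted = D[np.ix_(labels, labels)]            # n x n matrix of edge probabilities
    n = len(labels)
    U = rng.random((n, n))
    upper = (np.triu(U, 1) < np.triu(lifted, 1)).astype(float)
    return upper + upper.T                        # symmetric, zero diagonal
```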
8 / 32
13. What is the lifted adjacency matrix?
Key object to compute query answers and evaluate the quality of the summary;
Preliminaries:
Given a summary S, let AS be the k × k matrix such that AS(i, j) = dG(i, j);
Given a node v ∈ [n], let s(v) be the unique supernode of S containing v;
Lifted adjacency matrix A↑S ∈ R^{n×n} of S:

A↑S(u, w) = AS(s(u), s(w)).

A↑S has a nice regular structure (example coming up);
We call matrices like A↑S “P-constant”, where P is the partitioning of [n] used in S.
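Lifting is just a double index lookup; a minimal sketch (names are mine):

```python
import numpy as np

def lift(D, labels):
    """Lifted adjacency matrix A_S^up in R^{n x n}:
    A_S^up(u, w) = A_S(s(u), s(w)), where labels[v] gives the index of
    the supernode s(v).  The result is P-constant: constant on every
    block V_i x V_j of the partition P."""
    labels = np.asarray(labels)
    return D[np.ix_(labels, labels)]
```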
9 / 32
15. How do we use the lifted adjacency matrix to answer queries?
Query → Answer (computed from the summary):
•Adjacency of u and w: A↑S(u, w)
•Degree of u: Σ_{i=1}^n A↑S(u, i)
•Eigenvector centrality of u: Σ_{i=1}^n A↑S(u, i) / (2|E|)
•No. of triangles in G:
Σ_{i=1}^k [ C(|Vi|, 3) A↑S(i, i)³ + Σ_{j=i+1}^k A↑S(i, j)² ( C(|Vi|, 2) |Vj| A↑S(i, i) + C(|Vj|, 2) |Vi| A↑S(j, j) ) + Σ_{w=j+1}^k |Vi||Vj||Vw| A↑S(i, j) A↑S(j, w) A↑S(w, i) ]
(C(a, b) denotes the binomial coefficient “a choose b”.)

This is a slight simplification of how to answer queries: there is some additional weighting of A↑S involved.
Takeaway: Answering queries takes time O(poly(k)).
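The first rows of the table translate directly into code; a hedged sketch assuming the simplified (unweighted) answers above, with function names of my own choosing:

```python
import numpy as np

def approx_adjacency(D, labels, u, w):
    """Expected-value answer to 'are u and w adjacent?'."""
    return D[labels[u], labels[w]]

def approx_degree(D, labels, sizes, u):
    """Expected degree of u: sum_j |V_j| * A_S(s(u), j).  Runs in O(k),
    illustrating the poly(k) takeaway; the additional weighting of A_S^up
    mentioned on the slide is ignored in this sketch."""
    return float(np.dot(D[labels[u]], sizes))
```

On the K4 example (parts {0,1} and {2,3}, densities 1/2 within and 1 across) the approximate degree equals the true degree, 3.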
11 / 32
16. What is a good summary?
Given G = ([n], E), each partitioning P of [n] corresponds to a summary SP;
Which partition of [n] gives the best summary?
We need an error metric to evaluate the partitions, i.e., the summaries;
Intuition: compare the lifted adjacency matrix of SP with the adjacency matrix AG;
Error metrics:
• p-reconstruction error, based on the p norm;
•cut-norm error, based on the cut norm.
12 / 32
17. What is the reconstruction error?
The entry-wise ℓp-norm of the difference between AG and A↑S:

errp(AG, A↑S) = ||AG − A↑S||p = ( Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} |AG(i, j) − A↑S(i, j)|^p )^{1/p} .

The reconstruction error is such that errp(AG, A↑S) ∈ [0, n^{2/p}];
For the summary in the example we had:

err1(AG, A↑S) = 329/18 ≈ 18.28 and err2(AG, A↑S) = √(11844/1296) ≈ 3.0231 .
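The definition translates directly into code; a small sketch (naming is mine):

```python
import numpy as np

def err_p(AG, lifted, p):
    """Entry-wise l_p reconstruction error ||A_G - A_S^up||_p."""
    return float((np.abs(AG - lifted) ** p).sum() ** (1.0 / p))
```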
13 / 32
18. How do we compute the p-reconstruction error?
In time O(k²) from the summary (for an unweighted graph):

errp(A↑S, AG)^p = Σ_{i,j∈[k]} |Vi||Vj| dG(i, j)(1 − dG(i, j)) [ (1 − dG(i, j))^{p−1} + dG(i, j)^{p−1} ] .

Interesting consequence: the choice of p does not matter much. E.g.:

err1(A↑S, AG) = 2 · err2(A↑S, AG)²

I.e., a partition that minimizes the 2-reconstruction error also minimizes the
1-reconstruction error;
(similarly for any errp and errq, with p, q ∈ O(1), up to constant factors)
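A sketch of the O(k²) computation, under the unweighted-graph assumption (the function name is mine); the example also exhibits the tight relation between err1 and err2:

```python
import numpy as np

def err_p_from_summary(D, sizes, p):
    """errp(A_S^up, A_G) in O(k^2) from the densities alone, assuming an
    unweighted graph (entries of A_G in {0, 1}): within block (i, j) a
    d_ij fraction of the entries differ by 1 - d_ij and the rest differ
    by d_ij, giving |Vi||Vj| d_ij (1-d_ij) ((1-d_ij)^(p-1) + d_ij^(p-1))."""
    sizes = np.asarray(sizes, dtype=float)
    W = np.outer(sizes, sizes)                     # |Vi| |Vj| for every block
    total = (W * D * (1 - D) * ((1 - D) ** (p - 1) + D ** (p - 1))).sum()
    return float(total ** (1.0 / p))
```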
14 / 32
19. What’s wrong with the p-reconstruction error?
Achieving low (e.g., o(n²)) 1-reconstruction error may require k very close to n;
Example:
Consider an Erdős–Rényi graph G ∈ G_{n,1/2};
W.h.p., for any two vertices u, v, we have |N(u) △ N(v)| = n/2 ± O(√(n log n));
Assigning two vertices to the same supernode thus costs n/2 − o(n);
Hence, even for k = n/2, the 1-reconstruction error of any k-summary is ≥ n²/9.
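This concentration claim is easy to probe empirically; a small simulation (parameters and seed are my choices, and this is an illustration, not a proof):

```python
import numpy as np

# Empirical check: in G(n, 1/2) the symmetric difference of the
# neighborhoods of any two vertices concentrates around n/2, so putting
# any two vertices in the same supernode already costs about n/2 in l1 error.
rng = np.random.default_rng(42)
n = 1000
upper = np.triu((rng.random((n, n)) < 0.5).astype(int), 1)
A = upper + upper.T
# |N(0) symdiff N(1)| (positions 0 and 1 themselves contribute at most 2)
sym_diff = int(np.sum(A[0] != A[1]))
```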
15 / 32
20. How can we solve this drawback?
When G is not random, we hope to find structure to exploit for building the summary;
Look at adjacency matrix AG as:
AG = S + E
where
•S: “structured” matrix captured by summary;
•E: “residual” matrix representing the error;
The “smaller” E is, the better the summary;
The cut norm error tries to quantify E.
16 / 32
21. What is the cut norm?
The cut norm of B ∈ R^{n×n} is the maximum absolute sum of the entries of any of its submatrices:

||B||□ = max_{S,T⊆[n]} |B(S, T)| = max_{S,T⊆[n]} | Σ_{i∈S,j∈T} B_{i,j} | .

The cut-norm error of a summary S w.r.t. G is:

err□(AG, A↑S) = ||AG − A↑S||□ = max_{S,T⊆V} |eG(S, T) − eS↑(S, T)| ;

Interpretation: maximum absolute difference between (twice) the sum of edge weights between any two sets of vertices in G and the same sum in A↑S;
Computing the cut norm is NP-hard, but a constant-factor approximation exists via SDP.
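For intuition, the cut norm can be computed exactly by brute force on tiny matrices (in practice one would use the SDP-based approximation); a sketch:

```python
import numpy as np
from itertools import product

def cut_norm_bruteforce(B):
    """Exact cut norm of a small matrix: max over all S, T subsets of [n]
    of |sum_{i in S, j in T} B[i, j]|.  Exponential (4^n subset pairs),
    so for sanity checks only."""
    n = B.shape[0]
    best = 0.0
    for S in product([0.0, 1.0], repeat=n):     # indicator vector of S
        s = np.array(S)
        for T in product([0.0, 1.0], repeat=n): # indicator vector of T
            best = max(best, abs(float(s @ B @ np.array(T))))
    return best
```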
17 / 32
22. What is GraSS?
GraSS: Graph Structure Summarization, by LeFevre and Terzi, SIAM SDM’10;
Introduces the 1-reconstruction error;
Defines a variant of the lifted adjacency matrix;
Shows how to answer some queries using the summary;
Gives an algorithm to build a summary using agglomerative hierarchical clustering:
1) Put each vertex in a separate supernode;
2) Until the number of supernodes is k:
Merge the two supernodes whose merging minimizes the 1-reconstruction error;
3) Output the resulting k supernodes;
GraSS does not offer any quality guarantees, and it is extremely slow.
18 / 32
23. Why use the lifted adjacency matrix?
Given p ≥ 1, a graph G = ([n], E) with adjacency matrix AG, and a partitioning
P = {S1, . . . , Sk} of [n], which P-constant matrix B^{*,p}_P minimizes ||AG − B^{*,p}_P||p?
Lemma: A↑S is a (small) constant-factor approximation of B^{*,p}_P.
19 / 32
24. Why use the lifted adjacency matrix?
Given p ≥ 1, a graph G = ([n], E) with adjacency matrix AG, and a partitioning
P = {S1, . . . , Sk} of [n], which P-constant matrix B^{*,p}_P minimizes ||AG − B^{*,p}_P||p?
Lemma: A↑S is a (small) constant-factor approximation of B^{*,p}_P.
Intuition (this is not a proof):
Sample u.a.r. a pair of vertices (u, w) from Vi × Vj, and let X be the
r.v. with value AG(u, w). Which real number a minimizes E[|X − a|^p]?
For p = 1 it is median(X), for p = 2 it is E[X]. We have:

A↑S(i, j) = ( Σ_{(u,w)∈Vi×Vj} AG(u, w) ) / (|Vi||Vj|) = E[X] .

Hence, B^{*,2}_P = A↑S;
Also, we show that ||AG − B^{*,1}_P||1 ≤ ||AG − A↑S||1 ≤ 2 ||AG − B^{*,1}_P||1 .
19 / 32
25. What is our algorithm for graph summarization?
As simple as:
•For the ℓp-reconstruction error, perform ℓp-clustering of the rows of AG (p = 1:
k-median, p = 2: k-means). If row i is in cluster j, then vertex i is in supernode Vj.
Lemma: The summary obtained from the optimal ℓ1 (resp. ℓ2) clustering is an 8
(resp. 4) approximation of the optimal summary for the 1- (resp. 2-) reconstruction
error.
...not really so simple to prove!
•For the cut-norm error, perform ℓ1-clustering (k-median) of the rows of A²G, with a
specific distance function;
Lemma: (a tad more complex, let’s skip this.)
20 / 32
26. Is there some intuition behind this connection with clustering?
Given a partition P = {V1, . . . , Vk}, consider the orthonormal vectors vi ∈ R^n such
that vi(j) = 1/√|Vi| if j ∈ Vi, and 0 otherwise;
Consider the n × n projection matrix P = Σ_{i=1}^k vi vi^T;
We have A↑S = P AG P;
The ℓp-cost of the clustering associated with P is ||AG − AG P||p;
For the ℓp norms, ||AG − AG P||p / 2 ≤ ||AG − P AG P||p ≤ 2 ||AG − AG P||p;
Let S̄ be the summary arising from the optimal ℓ2-clustering, and S* be the optimal
summary with minimal 2-reconstruction error. Then:

err2(AG, A↑S̄) = ||AG − P_S̄ AG P_S̄||2 ≤ 2 ||AG − AG P_S̄||2 ≤ 2 ||AG − AG P_{S*}||2
≤ 4 ||AG − P_{S*} AG P_{S*}||2 = 4 · err2(AG, A↑S*) .
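The identity A↑S = P AG P can be sanity-checked numerically; a small sketch (the example graph and variable names are mine):

```python
import numpy as np

# Check A_S^up = P A_G P on a small graph, with P = sum_i v_i v_i^T and
# v_i(j) = 1/sqrt(|V_i|) for j in V_i (so the v_i are orthonormal).
AG = np.array([[0., 1., 1., 0.],
               [1., 0., 1., 0.],
               [1., 1., 0., 1.],
               [0., 0., 1., 0.]])
partition = [[0], [1, 2, 3]]
n = AG.shape[0]
P = np.zeros((n, n))
for Vi in partition:
    v = np.zeros(n)
    v[Vi] = 1.0 / np.sqrt(len(Vi))
    P += np.outer(v, v)
# Lifted matrix computed directly as block averages of AG:
lifted = np.zeros((n, n))
for Vi in partition:
    for Vj in partition:
        lifted[np.ix_(Vi, Vj)] = AG[np.ix_(Vi, Vj)].mean()
```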
21 / 32
27. How can we build a summary efficiently?
Both k-means and k-median are NP-hard;
There are constant-factor approximation algorithms;
Bottleneck: computing all pairwise distances for the n rows of AG is expensive
(like matrix multiplication);
Solution: use a sketch of the adjacency matrix with n rows and O(log n) columns;
This incurs an additional constant factor of error;
Even with the sketch, the approximation algorithms take time Õ(n²);
Idea: select O(k) rows of the sketch adaptively, and compute a clustering using them;
In the end, the algorithm runs in time Õ(m + nk) and obtains a constant-factor
approximation.
Proving it is not fun. . . or is it?
22 / 32
28. So, what’s the algorithm?
Algorithm 1: Graph summarization with p-reconstruction error
Input: G = (V, E) with |V| = n, k ∈ N, p ∈ {1, 2}
Output: An O(1)-approximation to the best k-summary for G under the
p-reconstruction error
// Create the n × O(log n) sketch matrix (Indyk, 2006)
S ← createSketch(AG, O(log n), p)
// Select O(k) rows from the sketch (Aggarwal et al., 2009)
R ← reduceClustInstance(AG, S, k)
// Run the approximation algorithm by Mettu and Plaxton (2003) to obtain a partition
P ← getApproxClustPartition(p, k, R, S)
// Compute the densities for the summary
D ← computeDensities(P, AG)
return (P, D)
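A toy end-to-end version of the pipeline for p = 2, with plain Lloyd k-means and deterministic farthest-first seeding standing in for the sketching, row-selection, and Mettu-Plaxton steps (so this sketch carries no approximation guarantee, and all names are mine):

```python
import numpy as np

def summarize_lp(AG, k, iters=20):
    """Cluster the rows of A_G with Lloyd k-means, then compute the
    summary densities.  Returns (labels, D)."""
    n = AG.shape[0]
    # Deterministic farthest-first initialization
    centers = [AG[0].astype(float)]
    for _ in range(1, k):
        d2 = np.min([((AG - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(AG[int(d2.argmax())].astype(float))
    centers = np.array(centers)
    # Lloyd iterations
    for _ in range(iters):
        d2 = ((AG[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = AG[labels == c].mean(axis=0)
    # Densities of the resulting summary
    groups = [np.where(labels == c)[0] for c in range(k)]
    D = np.zeros((k, k))
    for i, Vi in enumerate(groups):
        for j, Vj in enumerate(groups):
            if len(Vi) and len(Vj):
                D[i, j] = AG[np.ix_(Vi, Vj)].sum() / (len(Vi) * len(Vj))
    return labels, D
```

On two disjoint triangles the clustering recovers the two cliques exactly, with internal density 2/3 and cross density 0.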
23 / 32
29. What is the story for the cut norm?
Szemerédi’s regularity lemma: in every large enough graph, there is a partitioning of the
nodes into subsets of about the same size such that the edges between different subsets
behave almost randomly.
Key result in extremal graph theory: if a property holds on random graphs, it holds on
large enough generic graphs;
The partitioning can be found in polynomial time, but may have a galactic size (a tower
of exponentials in 1/ε);
Frieze and Kannan defined weak regularity, which requires far fewer classes.
24 / 32
30. What is weak regularity?
Given G = ([n], E) and a partitioning P of [n], P is ε-weakly regular if the cut-norm
error of the summary corresponding to P is at most εn², i.e.:

err□(AG, A↑SP) = ||AG − A↑SP||□ = max_{S,T⊆V} |eG(S, T) − eS↑P(S, T)| ≤ εn² ;
Any unweighted graph with density α = 2|E|/|V|² ∈ [0, 1] admits an ε-weakly
regular partition with 2^{O(α log(1/α)/ε²)} parts (tight bound);
A specific graph may still admit much smaller partitions.
The optimal deterministic algorithm to build a weakly ε-regular partition with at most
2^{1/poly(ε)} parts takes time c(ε)n².
25 / 32
31. What’s wrong with this?
Existing algorithms take an error bound ε and always produce a partition of size
2^{Θ(1/ε²)};
We instead solve the problem of constructing a partition of a given size k that minimizes
the cut-norm error.
Algorithm:
1) Square AG;
2) Do k-median on the rows of A²G, using the distance function

δ(u, v) = (1/n²) Σ_{w∈V} |A²G(u, w) − A²G(v, w)| .
Lemma:
Let OPT□ be the cut-norm error of the optimal weakly-regular k-partition of [n].
The partitioning induced by the optimal k-median solution is a weakly-regular
partition with cost at most 16 · OPT□.
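The distance function is simple to implement (the k-median step itself is omitted here); a sketch with names of my own choosing:

```python
import numpy as np

def delta(A2, u, v, n):
    """Distance for the cut-norm construction: l1 distance between rows
    u and v of A_G squared (A2 = A_G @ A_G), normalized by n^2."""
    return float(np.abs(A2[u] - A2[v]).sum()) / (n * n)
```

On a 4-cycle, opposite vertices are structurally identical and get distance 0.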
26 / 32
32. How does our algorithm perform in practice?
Implementation: C++, available at http://github.com/rionda/graphsumm;
Datasets: from SNAP (http://snap.stanford.edu/datasets)
Facebook, Enron, Epinions1, SignEpinions, Stanford, Amazon0601
(from 4k to 400k vertices, 88k to 2.5M edges).
Goals:
•study the structure of summaries generated by our algorithm;
•compare its performances w.r.t. GraSS;
•evaluate the quality of approximate query answers.
27 / 32
33. What do the summaries look like?
Experiment: As k varies, build a summary with k supernodes, and measure:
•distribution of supernode sizes;
•distributions of internal and cross densities;
•reconstruction or cut-norm error;
•runtime.
28 / 32
34. What do the summaries look like?
Experiment: As k varies, build a summary with k supernodes, and measure:
•distribution of supernode sizes;
•distributions of internal and cross densities;
•reconstruction or cut-norm error;
•runtime.
Results:
•always at least one supernode containing a single node, but also supernodes with
10² or 10³ nodes: high standard deviation, decreasing superlinearly in k;
•Some supernodes contain large quasi-cliques, but the stddev of internal density is high;
•All cross densities are very low, some are zero: there are independent supernodes;
•The 2-reconstruction and cut-norm errors shrink linearly in k;
•The runtime increases linearly in k, and its stddev grows with k.
28 / 32
35. How does our algorithm compare to the state of the art?
Compared with GraSS (implemented by us, the authors lost their implementation);
Our algorithm is “better, faster, stronger”;
2-reconstruction error:

k    Alg.   c (GraSS)  avg    stdev       min    max    Runtime (s)
10   GraSS  0.1        0.168  1.4 × 10⁻³  0.167  0.170  495.122
10   GraSS  0.5        0.155  0.8 × 10⁻³  0.154  0.156  2669.961
10   GraSS  0.75       0.153  0.7 × 10⁻³  0.153  0.154  4401.213
10   GraSS  1.0        0.153  0.6 × 10⁻³  0.152  0.154  5516.915
10   Ours   —          0.152  1.6 × 10⁻³  0.150  0.154  0.440
250  GraSS  0.1        0.081  0.8 × 10⁻³  0.080  0.082  462.085
250  GraSS  0.5        0.072  0.4 × 10⁻³  0.072  0.073  2517.515
250  GraSS  0.75       0.070  0.5 × 10⁻³  0.070  0.071  4192.601
250  GraSS  1.0        0.069  0.5 × 10⁻³  0.069  0.070  5263.651
250  Ours   —          0.065  0.8 × 10⁻³  0.064  0.067  0.552

Table: comparison between our algorithm and GraSS on a random sample (n = 500) of ego-gplus.
29 / 32
36. How good is the query answering approximation?
Experiment: run queries on summaries of different size k, and measure the error;
Queries: Adjacency, Degree, Clustering Coefficient (Triangles)
Results:
•The error decreases very rapidly as k increases;
•Adjacency is well estimated on average, but with large stddev and max when two
supernodes are large and have very few edges connecting their nodes.
•Degree is extremely well approximated, with small stddev;
•Triangle counts are well approximated when k is not too small w.r.t. |V |, but
quality decreases very fast otherwise.
30 / 32
37. What did we learn from the experiments?
A supernode is allocated for a single node: this seems suboptimal;
The sweet spot for the number of supernodes is not easy to find;
Query performance is just ok, not great;
Question: Was this just a theoretical exercise?
Answer: Maybe not, but we aren’t “there” yet in graph summarization.
31 / 32
38. What did I tell you about?
The first constant-factor approximation algorithm for graph summarization;
We considered different error measures, including the cut norm, related to Szemerédi’s
Regularity Lemma;
We showed a practical application of highly theoretical concepts;
The algorithm crushes the competition under any possible metric;
Experimental inspection of the built summaries opens more questions about summaries
and error metrics;
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda
WWW: http://matteo.rionda.to
32 / 32
39. This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective judgments of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.