Graph Summarization with Quality Guarantees
Matteo Riondato – Labs, Two Sigma Investments LP
DEI – UNIPD – September 26, 2016
1 / 32
Who am I?
Laurea in Ing. Informazione and Laurea Spec. in Ing. Informatica @ DEI-UNIPD;
Ph.D. in Computer Science at Brown U. with Eli Upfal;
Working at
Labs, Two Sigma Investments LP (Research Scientist);
CS Dept., Brown U. (Visiting Asst. Prof.);
Research in algorithmic data science: algorithms plus data, algorithms for data
(used to be data mining, but then data miners forgot about algorithms...);
Tweeting @teorionda;
“Living” at http://matteo.rionda.to.
2 / 32
What is my approach to research?
Not This:
Theory → Practice
3 / 32
What is my approach to research?
Not This:
Theory ← Practice
3 / 32
What is my approach to research?
Not This:
Theory ↔ Practice
3 / 32
What is my approach to research?
Not This:
Theory + Practice
3 / 32
What is my approach to research?
This:
(Theory × Practice)^(Theory × Practice)
Marry Theory and Practice: make them one, leverage individual/combined strengths!
We can’t look at them separately to solve larger-than-anything data problems.
3 / 32
What is summarization?
Motivation: Speeding up the answering of queries on a large graph G = ([n], E).
Examples of query:
“What’s the degree of v ∈ [n]?”, “How many triangles are there in G?”
Summarization: creation and use of a summary data structure to answer the queries;
Summary: a lossy representation of G that fits in main memory (i.e., a sketch);
We also need algorithms using the summary to approximately answer the queries;
We want a general-purpose summary, not specialized for only one query.
Summarization is not:
•graph compression: store a lossless compressed version of the graph on disk;
•graph partitioning: partition the nodes into k classes, while minimizing the number
of edges from one class to the other.
4 / 32
What am I going to talk about?
We present the first constant-factor
approximation algorithm for building
graph summaries according to natural
error measures;
Among the contributions: a practical use
of extremal graph theory, with the cut
norm and the algorithmic version of
Szemerédi’s Regularity Lemma.
Joint work with
Francesco Bonchi (ISI / Eurecat) and
David García-Soriano (Eurecat).
IEEE ICDM’14;
[Image: first page of the IEEE ICDM’14 paper “Graph Summarization with Quality Guarantees” by Matteo Riondato (Stanford University), David García-Soriano (Yahoo Labs, Barcelona), and Francesco Bonchi (Yahoo Labs, Barcelona).]
Data Mining and Knowledge Discovery,
in press;
[Image: first page of the Data Mining and Knowledge Discovery manuscript “Graph Summarization with Quality Guarantees” by Matteo Riondato, David García-Soriano, and Francesco Bonchi.]
5 / 32
What does a summary look like?
G = ([n], E): an undirected weighted graph with adjacency matrix AG;
Summary S of G: a complete weighted graph with self-loops: S = (V, V × V);
The supernodes Vi , i = 1, . . . , k, in V form a partitioning of [n];
The superedge weights are the edge densities between the corresponding sets of nodes:

dG(i, j) = dG(Vi, Vj) = eG(Vi, Vj) / (|Vi| |Vj|) ,

where

eG(S, T) = Σ_{i∈S, j∈T} AG(i, j) for all S, T ⊆ V .

Technical point: if S and T overlap, each non-loop edge with both endpoints in S ∩ T is counted twice in eG(S, T).
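To make the definitions concrete, here is a minimal sketch, assuming an unweighted graph stored as a dense numpy adjacency matrix and a partition given as lists of vertex indices (the helper names are mine, not from the paper’s code):

```python
import numpy as np

def e_G(A, S, T):
    """Sum of A(i, j) over i in S, j in T; edges with both endpoints
    in the overlap of S and T are naturally counted twice."""
    return A[np.ix_(S, T)].sum()

def density_matrix(A, parts):
    """k x k matrix of superedge densities dG(i, j)."""
    sizes = np.array([len(P) for P in parts])
    E = np.array([[e_G(A, Pi, Pj) for Pj in parts] for Pi in parts])
    return E / np.outer(sizes, sizes)
```

For instance, with parts = [[0, 1], [2, 3, 4], [5, 6]] (the example on the next slide, 0-indexed), density_matrix(AG, parts) returns the 3 × 3 density matrix.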
6 / 32
Shall we look at an example?
A graph with 7 nodes, and a summary of it with 3 supernodes:
[Figure: a graph on nodes 1–7 and a summary with supernodes {1,2}, {3,4,5}, {6,7}; superedge densities: d({1,2}, {1,2}) = 0, d({3,4,5}, {3,4,5}) = 7/9, d({6,7}, {6,7}) = 3/4, d({1,2}, {3,4,5}) = 1/3, d({1,2}, {6,7}) = 1/4, d({3,4,5}, {6,7}) = 1/3.]
7 / 32
How can we answer queries using a summary?
Preliminary question:
What are the desired semantics of an answer computed on the summary?
A summary S is a lossy representation of G, compatible with many other graphs;
We adopt expected-value semantics for query answers:
Let π be a distribution over the set of graphs compatible with S;
Given q and S, we want

answer(q, S) = Eπ[answer(q, G′)], where G′ is a graph compatible with S.
We most often use the uniform distribution as π.
8 / 32
What is the lifted adjacency matrix?
Key object to compute query answers and evaluate the quality of the summary;
Preliminaries:
Given a summary S, let AS be the k × k matrix such that AS(i, j) = dG(i, j);
Given a node v ∈ [n], let s(v) be the unique supernode of S containing v;
Lifted adjacency matrix A↑S ∈ R^{n×n} of S:

A↑S(u, w) = AS(s(u), s(w)) .

A↑S has a nice regular structure (example coming up);
We call matrices like A↑S “P-constant”, where P is the partitioning of [n] used in S.
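A minimal sketch of the lifting, reusing the conventions of the earlier sketch (the function name is mine):

```python
import numpy as np

def lift(D, parts, n):
    """Lifted adjacency matrix: A_lift(u, w) = D(s(u), s(w))."""
    s = np.empty(n, dtype=int)
    for i, part in enumerate(parts):
        s[part] = i                       # s[v] = index of the supernode of v
    return D[np.ix_(s, s)]                # n x n and P-constant by construction
```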
9 / 32
What does a lifted adjacency matrix look like?
[Figure: the 7-node graph and its 3-supernode summary from the previous example.]
Its lifted adjacency matrix A↑S:
1 2 3 4 5 6 7
1 0 0 1/3 1/3 1/3 1/4 1/4
2 0 0 1/3 1/3 1/3 1/4 1/4
3 1/3 1/3 7/9 7/9 7/9 1/3 1/3
4 1/3 1/3 7/9 7/9 7/9 1/3 1/3
5 1/3 1/3 7/9 7/9 7/9 1/3 1/3
6 1/4 1/4 1/3 1/3 1/3 3/4 3/4
7 1/4 1/4 1/3 1/3 1/3 3/4 3/4
10 / 32
How do we use the lifted adjacency matrix to answer queries?
Query                          Answer
Adjacency of u and w           A↑S(u, w)
Degree of u                    Σ_{i=1}^{n} A↑S(u, i)
Eigenvector centrality of u    Σ_{i=1}^{n} A↑S(u, i) / (2|E|)
No. of triangles in G          Σ_{i=1}^{k} [ C(|Vi|, 3) A↑S(i, i)³
                                 + Σ_{j=i+1}^{k} ( A↑S(i, j)² ( C(|Vi|, 2) |Vj| A↑S(i, i) + C(|Vj|, 2) |Vi| A↑S(j, j) )
                                 + Σ_{w=j+1}^{k} |Vi| |Vj| |Vw| A↑S(i, j) A↑S(j, w) A↑S(w, i) ) ]

(C(m, r) denotes the binomial coefficient; the densities A↑S(i, j) are indexed by supernodes here.)

This is a slight simplification of how to answer queries: there is some additional weighting of A↑S involved.
Takeaway: Answering queries takes time O(poly(k)).
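For example, a degree query touches only the k densities of the queried node’s supernode, so it runs in O(k), and adjacency is O(1); a sketch under the conventions of the earlier helpers (and ignoring the additional weighting mentioned above):

```python
import numpy as np

def adjacency_query(D, s, u, w):
    """Approximate adjacency of u and w: A_lift(u, w)."""
    return D[s[u], s[w]]

def degree_query(D, sizes, s, u):
    """Approximate degree of u: sum_j |Vj| * D(s(u), j), in O(k)."""
    return float(np.dot(sizes, D[s[u]]))
```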
11 / 32
What is a good summary?
Given G = ([n], E), each partitioning P of [n] corresponds to a summary SP;
Which partition of [n] gives the best summary?
We need an error metric to evaluate the partitions, i.e., the summaries;
Intuition: compare the lifted adjacency matrix of SP with the adjacency matrix AG;
Error metrics:
• ℓp-reconstruction error, based on the ℓp norm;
•cut-norm error, based on the cut norm.
12 / 32
What is the reconstruction error?
The entry-wise ℓp-norm of the difference between AG and A↑S:

errp(AG, A↑S) = ‖AG − A↑S‖p = ( Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} |AG(i, j) − A↑S(i, j)|^p )^{1/p} .

The reconstruction error is such that errp(AG, A↑S) ∈ [0, n^{2/p}];
For the summary in the example we had:

err1(AG, A↑S) = 329/18 ≈ 18.28 and err2(AG, A↑S) = √(11844/1296) ≈ 3.0231 .
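Computed naively from the two n × n matrices, the reconstruction error is a one-liner (O(n²) time; a minimal sketch):

```python
import numpy as np

def err_p(A, A_lift, p):
    """Entry-wise l_p norm of A - A_lift."""
    return (np.abs(A - A_lift) ** p).sum() ** (1.0 / p)
```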
13 / 32
How do we compute the ℓp-reconstruction error?
In time O(k²) from the summary (for an unweighted G):

errp(A↑S, AG)^p = Σ_{i,j∈[k]} |Vi| |Vj| dG(i, j) (1 − dG(i, j)) [ (1 − dG(i, j))^{p−1} + dG(i, j)^{p−1} ] .

Interesting consequence: the choice of p does not matter much. E.g.:

err1(A↑S, AG) = 2 · err2(A↑S, AG)²

(check on the example: 2 · 11844/1296 = 329/18);
I.e., a partition that minimizes the ℓ2-reconstruction error also minimizes the ℓ1-reconstruction error;
(similarly for any errp and errq, with p, q ∈ O(1), up to constant factors).
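A sketch of the O(k²) computation for an unweighted graph, with a check of the p = 1 vs. p = 2 relation on a random graph (density_matrix is the helper from the earlier sketch):

```python
import numpy as np

def err_p_pow_from_summary(D, sizes, p):
    """errp(A_lift, AG)^p in O(k^2); valid for 0/1 adjacency matrices."""
    S = np.outer(sizes, sizes)                                  # |Vi| * |Vj|
    return (S * D * (1 - D) * ((1 - D) ** (p - 1) + D ** (p - 1))).sum()

rng = np.random.default_rng(0)
n = 8
A = np.triu(rng.random((n, n)) < 0.4, 1).astype(float)
A = A + A.T                                                     # random simple graph
parts = [np.arange(0, 3), np.arange(3, 5), np.arange(5, 8)]
sizes = np.array([len(P) for P in parts])
D = density_matrix(A, parts)
assert np.isclose(err_p_pow_from_summary(D, sizes, 1),          # err1 ...
                  2 * err_p_pow_from_summary(D, sizes, 2))      # ... = 2 * err2^2
```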
14 / 32
What’s wrong with the ℓp-reconstruction error?
Achieving low (e.g., o(n²)) ℓ1-reconstruction error may require k very close to n;
Example:
Consider an Erdős–Rényi graph G ∈ G_{n,1/2};
W.h.p., any two vertices u, v have neighborhoods whose symmetric difference has size |N(u) △ N(v)| = n/2 ± √(n log n);
Assigning two vertices to the same supernode then has cost n/2 − o(n);
Hence, even for k = n/2, the reconstruction error of any k-summary is ≥ n²/9.
15 / 32
How can we solve this drawback?
When G is not random, we hope to find structure to exploit for building the summary;
Look at adjacency matrix AG as:
AG = S + E
where
•S: “structured” matrix captured by summary;
•E: “residual” matrix representing the error;
The “smaller” E is, the better the summary;
The cut-norm error tries to quantify E.
16 / 32
What is the cut norm?
The cut norm of B ∈ R^{n×n} is the maximum absolute sum of the entries of any of its submatrices:

‖B‖□ = max_{S,T⊆[n]} |B(S, T)| = max_{S,T⊆[n]} | Σ_{i∈S, j∈T} B(i, j) | .

The cut-norm error of a summary S w.r.t. G is:

err□(AG, A↑S) = ‖AG − A↑S‖□ = max_{S,T⊆V} |eG(S, T) − eS↑(S, T)| ;

Interpretation: the maximum absolute difference between (twice) the sum of the edge weights between any two sets of vertices in G and the same sum in A↑S;
Computing the cut norm is NP-hard, but a constant-factor approximation can be computed via SDP.
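On tiny matrices the cut norm can be computed exactly by enumerating all subset pairs; a brute-force sketch for illustration only (exponential time; in practice one would use the SDP-based approximation):

```python
import itertools
import numpy as np

def cut_norm_bruteforce(B):
    """max over nonempty S, T of |sum_{i in S, j in T} B[i, j]|."""
    n = B.shape[0]
    subsets = [list(S) for r in range(1, n + 1)
               for S in itertools.combinations(range(n), r)]
    best = 0.0
    for S in subsets:
        col_sums = B[S, :].sum(axis=0)        # column sums restricted to rows S
        for T in subsets:
            best = max(best, abs(col_sums[T].sum()))
    return best
```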
17 / 32
What is GraSS?
GraSS: Graph Structure Summarization, by LeFevre and Terzi, SIAM SDM’10;
Introduces the ℓ1-reconstruction error;
Defines a variant of the lifted adjacency matrix;
Shows how to answer some queries using the summary;
Gives an algorithm to build a summary using agglomerative hierarchical clustering:
1) Put each vertex in a separate supernode;
2) Until the number of supernodes is k:
merge the two supernodes whose merging minimizes the ℓ1-reconstruction error;
3) Output the resulting k supernodes;
GraSS does not offer any quality guarantees, and it is extremely slow.
18 / 32
Why use the lifted adjacency matrix?
Given p ≥ 1, a graph G = ([n], E) with adjacency matrix AG, and a partitioning P = {S1, . . . , Sk} of [n], what P-constant matrix B^{*,p}_P minimizes ‖AG − B^{*,p}_P‖p?

Lemma: A↑S is a (small) constant-factor approximation of B^{*,p}_P.
19 / 32
Why use the lifted adjacency matrix?
Given p ≥ 1, a graph G = ([n], E) with adjacency matrix AG, and a partitioning P = {S1, . . . , Sk} of [n], what P-constant matrix B^{*,p}_P minimizes ‖AG − B^{*,p}_P‖p?

Lemma: A↑S is a (small) constant-factor approximation of B^{*,p}_P.

Intuition (this is not a proof):
Suppose we sample u.a.r. a pair of vertices (u, w) from Vi × Vj. Let X be the r.v. with value AG(u, w). What is the real a that minimizes E[|X − a|^p]?
For p = 1 it is median(X); for p = 2 it is E[X]. We have:

A↑S(i, j) = Σ_{(u,w)∈Vi×Vj} AG(u, w) / (|Vi| |Vj|) = E[X] .

Hence, B^{*,2}_P = A↑S;
Also, we show that ‖AG − B^{*,1}_P‖1 ≤ ‖AG − A↑S‖1 ≤ 2 ‖AG − B^{*,1}_P‖1 .
19 / 32
What is our algorithm for graph summarization?
As simple as:
• For the ℓp-reconstruction error, perform ℓp-clustering of the rows of AG (p = 1: k-median, p = 2: k-means). If row i is in cluster j, then vertex i is in supernode Vj (see the sketch after this list).
Lemma: The summary obtained from the optimal ℓ1 (resp. ℓ2) clustering is an 8- (resp. 4-) approximation of the optimal summary for the ℓ1 (resp. ℓ2) reconstruction error.
...not really so simple to prove!
• For the cut-norm error, perform ℓ1-clustering of the rows of A²G, with a specific distance function;
Lemma: (a tad more complex, let’s skip this.)
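A minimal end-to-end sketch of the p = 2 case, using scikit-learn’s k-means as a stand-in for a constant-factor approximation algorithm (so the lemma’s guarantee does not literally carry over to this sketch; density_matrix is from the earlier sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize_l2(A, k, seed=0):
    """k-summary from k-means on the rows of A (the p = 2 case)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(A)
    parts = [np.flatnonzero(labels == j) for j in range(k)]
    return parts, density_matrix(A, parts)
```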
20 / 32
Is there some intuition behind this connection with clustering?
Given a partition P = {V1, . . . , Vk}, consider the orthonormal vectors vi ∈ R^n such that vi(j) = 1/√|Vi| if j ∈ Vi, and 0 otherwise;
Consider the n × n projection matrix P = Σ_{i=1}^{k} vi viᵀ;
We have A↑S = P AG P;
The ℓp-cost of the clustering associated to P is ‖AG − AG P‖p;
For the ℓp norms, ‖AG − AG P‖p / 2 ≤ ‖AG − P AG P‖p ≤ 2 ‖AG − AG P‖p;
Let S̄ be the summary arising from the optimal ℓ2-clustering, and S* be the optimal summary with minimal ℓ2-reconstruction error. Then:

err2(AG, A↑S̄) = ‖AG − PS̄ AG PS̄‖2 ≤ 2 ‖AG − AG PS̄‖2 ≤ 2 ‖AG − AG PS*‖2 ≤ 4 ‖AG − PS* AG PS*‖2 = 4 · err2(AG, A↑S*) .
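The identity A↑S = P AG P is easy to verify numerically; a sketch building P from the scaled indicator vectors (lift and density_matrix are the earlier helpers):

```python
import numpy as np

def projection_matrix(parts, n):
    """P = sum_i vi vi^T with vi = 1/sqrt(|Vi|) on Vi and 0 elsewhere."""
    P = np.zeros((n, n))
    for part in parts:
        v = np.zeros(n)
        v[part] = 1.0 / np.sqrt(len(part))
        P += np.outer(v, v)
    return P

# Check: P @ A @ P == lift(density_matrix(A, parts), parts, n)  (up to rounding)
```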
21 / 32
How can we build a summary efficiently?
Both k-means and k-median are NP-hard;
There are constant-factor approximation algorithms;
Bottleneck: computing all pairwise distances between the n rows of AG is expensive (comparable to matrix multiplication);
Solution: use a sketch of the adjacency matrix with n rows and O(log n) columns;
This incurs an additional constant factor in the error;
Even with the sketch, the approximation algorithms take time Õ(n²);
Idea: adaptively select O(k) rows of the sketch, and compute a clustering using them;
In the end, the algorithm runs in time Õ(m + nk) and obtains a constant-factor approximation.
Proving it is not fun. . . or is it?
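For p = 2 the sketch can be a plain Johnson–Lindenstrauss random projection, which preserves the pairwise ℓ2 distances between rows up to 1 ± ε w.h.p.; a sketch of the idea only (the paper relies on Indyk’s p-stable sketches, which also handle p = 1 via Cauchy variables and a median estimator):

```python
import numpy as np

def l2_sketch(A, eps=0.5, seed=0):
    """n x O(log(n)/eps^2) Gaussian sketch of the rows of A."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    d = int(np.ceil(8 * np.log(n) / eps ** 2))
    R = rng.standard_normal((m, d)) / np.sqrt(d)
    return A @ R
```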
22 / 32
So, what’s the algorithm?
Algorithm 1: Graph summarization with ℓp-reconstruction error
Input : G = (V, E) with |V| = n, k ∈ N, p ∈ {1, 2}
Output: an O(1)-approximation to the best k-summary for G under the ℓp-reconstruction error
// Create the n × O(log n) sketch matrix (Indyk, 2006)
S ← createSketch(AG, O(log n), p)
// Select O(k) rows from the sketch (Aggarwal et al., 2009)
R ← reduceClustInstance(AG, S, k)
// Run the approximation algorithm by Mettu and Plaxton (2003) to obtain a partition
P ← getApproxClustPartition(p, k, R, S)
// Compute the densities for the summary
D ← computeDensities(P, AG)
return (P, D)
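A runnable approximation of this pipeline, with off-the-shelf stand-ins for each step (k-means instead of Mettu–Plaxton, the JL sketch above instead of Indyk’s, and no adaptive row selection), so no formal guarantee is claimed for this sketch; l2_sketch and density_matrix are the earlier helpers:

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(A, k, eps=0.5, seed=0):
    """Sketch-then-cluster graph summarization for p = 2 (stand-in pipeline)."""
    S = l2_sketch(A, eps, seed)                  # sketch the rows of AG
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(S)
    parts = [np.flatnonzero(labels == j) for j in range(k)]
    return parts, density_matrix(A, parts)       # densities for the summary
```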
23 / 32
What is the story for the cut norm?
Szemerédi’s regularity lemma: in every large enough graph, there is a partitioning of the nodes into subsets of about the same size such that the edges between different subsets behave almost randomly.
Key result in extremal graph theory: if a fact F holds on random graphs, it holds on large enough generic graphs;
The partitioning can be found in polynomial time, but it may have a galactic size (a tower of exponentials in 1/ε);
Frieze and Kannan defined weak regularity, which requires fewer classes.
24 / 32
What is weak regularity?
Given G = ([n], E) and a partitioning P of [n], P is ε-weakly regular if the cut-norm error of the summary corresponding to P is at most εn², i.e.:

err□(AG, A↑SP) = ‖AG − A↑SP‖□ = max_{S,T⊆V} |eG(S, T) − eS↑P(S, T)| ≤ εn² ;

Any unweighted graph with density α = 2|E|/|V|² ∈ [0, 1] admits an ε-weakly regular partition with 2^{O(α log(1/α)/ε²)} parts (tight bound);
A specific graph may still admit much smaller partitions.
The optimal deterministic algorithm to build a weakly ε-regular partition with at most 2^{1/poly(ε)} parts takes time c(ε)·n².
25 / 32
What’s wrong with this?
Existing algorithms take an error bound ε and always produce a partition of size 2^{Θ(1/ε²)};
We solve the problem of constructing a partition of a given size k that minimizes the cut-norm error.
Algorithm:
1) Square AG;
2) Do k-median on the rows of A²G, using the distance function (see the sketch below)

δ(u, v) = Σ_{w∈V} |A²G(u, w) − A²G(v, w)| / n² .
Lemma:
Let OPT□ be the cut-norm error of the optimal weakly-regular k-partition of [n].
The partitioning induced by the optimal k-median solution is a weakly-regular partition with cost at most 16·√OPT□.
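With B = A²G, δ is (up to the 1/n² normalization) just the ℓ1 distance between rows of B, so the clustering runs directly on the rows of the squared adjacency matrix; a minimal sketch of the distance computation:

```python
import numpy as np

def delta(A2, u, v):
    """delta(u, v) = sum_w |A2[u, w] - A2[v, w]| / n^2."""
    return np.abs(A2[u] - A2[v]).sum() / (A2.shape[0] ** 2)

# Usage: A2 = A @ A, then run k-median (l1-clustering) on the rows of A2 / n^2.
```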
26 / 32
How does our algorithm perform in practice?
Implementation: C++, available at http://github.com/rionda/graphsumm;
Datasets: from SNAP (http://snap.stanford.edu/datasets)
Facebook, Enron, Epinions1, SignEpinions, Stanford, Amazon0601
(from 4k to 400k vertices, 88k to 2.5M edges).
Goals:
•study the structure of summaries generated by our algorithm;
•compare its performance with that of GraSS;
•evaluate the quality of approximate query answers.
27 / 32
What do the summaries look like?
Experiment: As k varies, build a summary with k supernodes, and measure:
•distribution of supernode sizes;
•distributions of internal and cross densities;
•reconstruction or cut-norm error;
•runtime.
28 / 32
What do the summaries look like?
Experiment: As k varies, build a summary with k supernodes, and measure:
•distribution of supernode sizes;
•distributions of internal and cross densities;
•reconstruction or cut-norm error;
•runtime.
Results:
•always at least one supernode containing a single node, but also supernodes with 10² or 10³ nodes: high standard deviation, decreasing superlinearly in k;
•some supernodes contain large quasi-cliques, but the stddev of the internal density is high;
•all cross densities are very low, and some are zero: there are independent supernodes;
•the ℓ2-reconstruction and cut-norm errors shrink linearly in k;
•the runtime increases linearly in k, and its stddev grows with k.
28 / 32
How does our algorithm compare to the state of the art?
Compared with GraSS (implemented by us, the authors lost their implementation);
Our algorithm is “better, faster, stronger”;
k    Alg.   c (for GraSS)  avg    stdev       min    max    Runtime (s)
10   GraSS  0.1            0.168  14 × 10⁻³   0.167  0.170  495.122
10   GraSS  0.5            0.155  0.8 × 10⁻³  0.154  0.156  2669.961
10   GraSS  0.75           0.153  0.7 × 10⁻³  0.153  0.154  4401.213
10   GraSS  1.0            0.153  0.6 × 10⁻³  0.152  0.154  5516.915
10   Ours   —              0.152  1.6 × 10⁻³  0.150  0.154  0.440
250  GraSS  0.1            0.081  0.8 × 10⁻³  0.080  0.082  462.085
250  GraSS  0.5            0.072  0.4 × 10⁻³  0.072  0.073  2517.515
250  GraSS  0.75           0.070  0.5 × 10⁻³  0.070  0.071  4192.601
250  GraSS  1.0            0.069  0.5 × 10⁻³  0.069  0.070  5263.651
250  Ours   —              0.065  0.8 × 10⁻³  0.064  0.067  0.552

Table: ℓ2-reconstruction error, comparison between our algorithm and GraSS on a random sample (n = 500) of ego-gplus.
29 / 32
How good is the query answering approximation?
Experiment: run queries on summaries of different size k, and measure the error;
Queries: Adjacency, Degree, Clustering Coefficient (Triangles)
Results:
•The error decreases very rapidly as k increases;
•Adjacency is well estimated on average, but with large stddev and max when two
supernodes are large and have very few edges connecting their nodes.
•Degree is extremely well approximated, with small stddev;
•Triangle counts are well approximated when k is not too small w.r.t. |V |, but
quality decreases very fast otherwise.
30 / 32
What did we learn from the experiments?
A supernode is allocated to a single node: this seems suboptimal;
The sweet spot for the number of supernodes is not easy to find;
Query performance is just ok, not great;
Question: Was this just a theoretical exercise?
Answer: Maybe not, but we aren’t “there” yet in graph summarization.
31 / 32
What did I tell you about?
The first constant-factor approximation algorithm for graph summarization;
We considered different error measures, including the cut norm, related to Szemerédi’s Regularity Lemma;
We showed a practical application of highly theoretical concepts;
The algorithm crushes the competition under any possible metric;
Experimental inspection of the built summaries opens more questions about summaries and error metrics;
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda
WWW: http://matteo.rionda.to
32 / 32
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective views of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Weitere ähnliche Inhalte

Was ist angesagt?

Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
Saad Liaqat
 
Core–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectorsCore–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectors
Francesco Tudisco
 
elliptic-curves-modern
elliptic-curves-modernelliptic-curves-modern
elliptic-curves-modern
Eric Seifert
 
ACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems Review
ACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems ReviewACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems Review
ACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems Review
Roman Elizarov
 

Was ist angesagt? (20)

Towards a stable definition of Algorithmic Randomness
Towards a stable definition of Algorithmic RandomnessTowards a stable definition of Algorithmic Randomness
Towards a stable definition of Algorithmic Randomness
 
Lec 2-2
Lec 2-2Lec 2-2
Lec 2-2
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
 
Asymptotic analysis
Asymptotic analysisAsymptotic analysis
Asymptotic analysis
 
Cs6402 design and analysis of algorithms may june 2016 answer key
Cs6402 design and analysis of algorithms may june 2016 answer keyCs6402 design and analysis of algorithms may june 2016 answer key
Cs6402 design and analysis of algorithms may june 2016 answer key
 
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...
 
Fractal dimension versus Computational Complexity
Fractal dimension versus Computational ComplexityFractal dimension versus Computational Complexity
Fractal dimension versus Computational Complexity
 
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
 
Core–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectorsCore–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectors
 
Lecture warshall floyd
Lecture warshall floydLecture warshall floyd
Lecture warshall floyd
 
Lecture26
Lecture26Lecture26
Lecture26
 
Cordial Labelings in the Context of Triplication
Cordial Labelings in the Context of  TriplicationCordial Labelings in the Context of  Triplication
Cordial Labelings in the Context of Triplication
 
Graph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First SearchGraph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First Search
 
testpang
testpangtestpang
testpang
 
Information Content of Complex Networks
Information Content of Complex NetworksInformation Content of Complex Networks
Information Content of Complex Networks
 
elliptic-curves-modern
elliptic-curves-modernelliptic-curves-modern
elliptic-curves-modern
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNN
 
ACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems Review
ACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems ReviewACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems Review
ACM ICPC 2013 NEERC (Northeastern European Regional Contest) Problems Review
 
ACM ICPC 2015 NEERC (Northeastern European Regional Contest) Problems Review
ACM ICPC 2015 NEERC (Northeastern European Regional Contest) Problems ReviewACM ICPC 2015 NEERC (Northeastern European Regional Contest) Problems Review
ACM ICPC 2015 NEERC (Northeastern European Regional Contest) Problems Review
 

Ähnlich wie Graph Summarization with Quality Guarantees

Formulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on GraphFormulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on Graph
ijtsrd
 
Fixed point and common fixed point theorems in complete metric spaces
Fixed point and common fixed point theorems in complete metric spacesFixed point and common fixed point theorems in complete metric spaces
Fixed point and common fixed point theorems in complete metric spaces
Alexander Decker
 
lecture 17
lecture 17lecture 17
lecture 17
sajinsc
 
Scalable Constrained Spectral Clustering
Scalable Constrained Spectral ClusteringScalable Constrained Spectral Clustering
Scalable Constrained Spectral Clustering
1crore projects
 

Ähnlich wie Graph Summarization with Quality Guarantees (20)

An analysis between exact and approximate algorithms for the k-center proble...
An analysis between exact and approximate algorithms for the  k-center proble...An analysis between exact and approximate algorithms for the  k-center proble...
An analysis between exact and approximate algorithms for the k-center proble...
 
Comparative Analysis of Algorithms for Single Source Shortest Path Problem
Comparative Analysis of Algorithms for Single Source Shortest Path ProblemComparative Analysis of Algorithms for Single Source Shortest Path Problem
Comparative Analysis of Algorithms for Single Source Shortest Path Problem
 
JAVA BASED VISUALIZATION AND ANIMATION FOR TEACHING THE DIJKSTRA SHORTEST PAT...
JAVA BASED VISUALIZATION AND ANIMATION FOR TEACHING THE DIJKSTRA SHORTEST PAT...JAVA BASED VISUALIZATION AND ANIMATION FOR TEACHING THE DIJKSTRA SHORTEST PAT...
JAVA BASED VISUALIZATION AND ANIMATION FOR TEACHING THE DIJKSTRA SHORTEST PAT...
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
50120130406039
5012013040603950120130406039
50120130406039
 
Formulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on GraphFormulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on Graph
 
Fixed point and common fixed point theorems in complete metric spaces
Fixed point and common fixed point theorems in complete metric spacesFixed point and common fixed point theorems in complete metric spaces
Fixed point and common fixed point theorems in complete metric spaces
 
Using Cauchy Inequality to Find Function Extremums
Using Cauchy Inequality to Find Function ExtremumsUsing Cauchy Inequality to Find Function Extremums
Using Cauchy Inequality to Find Function Extremums
 
Dijkstra Shortest Path Visualization
Dijkstra Shortest Path VisualizationDijkstra Shortest Path Visualization
Dijkstra Shortest Path Visualization
 
50120130405020
5012013040502050120130405020
50120130405020
 
On solving fuzzy delay differential equationusing bezier curves
On solving fuzzy delay differential equationusing bezier curves  On solving fuzzy delay differential equationusing bezier curves
On solving fuzzy delay differential equationusing bezier curves
 
lecture 17
lecture 17lecture 17
lecture 17
 
Planted Clique Research Paper
Planted Clique Research PaperPlanted Clique Research Paper
Planted Clique Research Paper
 
A Study on Differential Equations on the Sphere Using Leapfrog Method
A Study on Differential Equations on the Sphere Using Leapfrog MethodA Study on Differential Equations on the Sphere Using Leapfrog Method
A Study on Differential Equations on the Sphere Using Leapfrog Method
 
Survival and hazard estimation of weibull distribution based on
Survival and hazard estimation of weibull distribution based onSurvival and hazard estimation of weibull distribution based on
Survival and hazard estimation of weibull distribution based on
 
50120130406008
5012013040600850120130406008
50120130406008
 
Design and Implementation of Mobile Map Application for Finding Shortest Dire...
Design and Implementation of Mobile Map Application for Finding Shortest Dire...Design and Implementation of Mobile Map Application for Finding Shortest Dire...
Design and Implementation of Mobile Map Application for Finding Shortest Dire...
 
Symbolic Computation via Gröbner Basis
Symbolic Computation via Gröbner BasisSymbolic Computation via Gröbner Basis
Symbolic Computation via Gröbner Basis
 
Scalable Constrained Spectral Clustering
Scalable Constrained Spectral ClusteringScalable Constrained Spectral Clustering
Scalable Constrained Spectral Clustering
 
Regression on gaussian symbols
Regression on gaussian symbolsRegression on gaussian symbols
Regression on gaussian symbols
 

Mehr von Two Sigma

Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
Two Sigma
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
Two Sigma
 

Mehr von Two Sigma (18)

The State of Open Data on School Bullying
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
 
Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
 
BeakerX - Tiezheng Li
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng Li
 
Engineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee JooEngineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee Joo
 
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
 
Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
 
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia YeResponsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
 
Archival Storage at Two Sigma - Josh Leners
Archival Storage at Two Sigma - Josh LenersArchival Storage at Two Sigma - Josh Leners
Archival Storage at Two Sigma - Josh Leners
 
Smooth Storage - A distributed storage system for managing structured time se...
Smooth Storage - A distributed storage system for managing structured time se...Smooth Storage - A distributed storage system for managing structured time se...
Smooth Storage - A distributed storage system for managing structured time se...
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
 
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane AdamsIdentifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane Adams
 
HUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For SparkHUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For Spark
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
 
Credit-Implied Volatility
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied Volatility
 
Principles of REST API Design
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API Design
 

Kürzlich hochgeladen

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Kürzlich hochgeladen (20)

Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

Graph Summarization with Quality Guarantees

  • 1. Graph Summarization with Quality Guarantees Matteo Riondato – Labs, Two Sigma Investments LP DEI – UNIPD – September 26, 2016 1 / 32
  • 2. Who am I? Laurea in Ing. Informazione and Laurea Spec. in Ing. Informatica @ DEI-UNIPD; Ph.D. in Computer Science at Brown U. with Eli Upfal; Working at Labs, Two Sigma Investments LP (Research Scientist); CS Dept., Brown U. (Visiting Asst. Prof.); Research in algorithmic data science: algorithms plus data, algorithms for data (used to be data mining, but then data miners forgot about algorithms...); Tweeting @teorionda; “Living” at http://matteo.rionda.to. 2 / 32
  • 3. What is my approach to research? Not This: Theory → Practice 3 / 32
  • 4. What is my approach to research? Not This: Theory ← Practice 3 / 32
  • 5. What is my approach to research? Not This: Theory ↔ Practice 3 / 32
  • 6. What is my approach to research? Not This: Theory + Practice 3 / 32
  • 7. What is my approach to research? This: (Theory × Practice)Theory × Practice Marry Theory and Practice: make them one, leverage individual/combined strengths! We can’t look at them separately to solve larger-than-anything data problems. 3 / 32
  • 8. What is summarization? Motivation: Speeding up the answering of queries on a large graph G = ([n], E). Examples of query: “What’s the degree of v ∈ [n]?”, “How many triangles are there in G?” Summarization: creation and use of a summary data structure to answer the queries; Summary: a lossy representation; of G that fits in main memory (i.e., sketch); We also need algorithms using the summary to approximately answer the queries; We want a general-purpose summary, not specialized for only one query. Summarization is not: •graph compression: store a lossless compressed version of the graph on disk; •graph partitioning: partition the nodes into k classes, while minimizing the number of edges from one class to the other. 4 / 32
  • 9. What am I going to talk about? We present the first constant-factor approximation algorithm for building graph summaries according to natural error measures; Among the contributions: a practical use of extreme graph theory, with the cut norm and the algorithmic version of Szemerédi’s Regularity Lemma. Joint work with Francesco Bonchi (ISI / Eurecat) and David García-Soriano (Eurecat). IEEE ICDM’14; Graph Summarization with Quality Guarantees Matteo Riondato Stanford University rionda@cs.stanford.edu David Garc´ıa-Soriano Yahoo Labs, Barcelona, Spain davidgs@yahoo-inc.com Francesco Bonchi Yahoo Labs, Barcelona, Spain bonchi@yahoo-inc.com Abstract—We study the problem of graph summarization. Given a large graph we aim at producing a concise lossy representation that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this paper we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. The goal is to produce a summary that minimizes the reconstruction error w.r.t. the original graph. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithm to compute the best possible summary of a given size. I. INTRODUCTION Data analysts in several application domains (e.g., social networks, molecular biology, communication networks, and many others) routinely face graphs with millions of vertices and billions of edges. In principle, this abundance of data should allow for a more accurate analysis of the phenomena under study. However, as the graphs under analysis grow, min- ing and visualizing them become computationally challenging tasks. In fact, the running time of most graph algorithms depends on the size of the input: executing them on huge graphs might be impractical, especially when the input is too large to fit in main memory. Graph summarization speeds up the analysis by creating a lossy concise representation of the graph that fits into main memory. Answers to otherwise expensive queries can then be computed using the summary without accessing the exact representation on disk. Query answers computed on the summary incur in a minimal loss of accuracy. Summaries can also be used for privacy purposes [1], to create easily interpretable visualizations of the graph [2], and to store a compressed representation of the graph. LeFevre and Terzi [1] propose an enriched “supergraph” as a summary, associating an integer to each supernode and to each The GraSS algorithm presented in [1] follows a greedy heuris- tic resembling an agglomerative hierarchical clustering using Ward’s method [3] and as such can not give any guarantee on the quality of the summary. In this paper instead, we propose efficient algorithms to compute summaries of guaranteed quality (a constant factor from the optimal). This theoretical property is also verified empirically: our algorithms build more representative summaries and are much more efficient and scalable than GraSS in building those summaries. II. PROBLEM DEFINITION We consider an undirected graph G = (V, E) with |V | = n. In the rest of the paper, the key concepts are defined from the standpoint of the symmetric adjacency matrix AG of G. 
We allow the edges to be weighted (so the adjacency matrix is not necessarily binary) and we allow self-loops (so the diagonal of the adjacency matrix is not necessarily all-zero). Given a graph G = (V, E) and k 2 N, a k-summary S of G is a complete undirected weighted graph S = (V 0 , V 0 ⇥V 0 ) that is uniquely identified by a k-partition V 0 of V (i.e., V 0 = {V1, . . . , Vk}, s.t. [i2[1,k]Vi = V and 8i, j 2 [1, k], i 6= j, it holds Vi Vj = ;). The vertices of S are called supernodes. There is a superedge eij for each unordered pair of supernodes (Vi, Vj), including (Vi, Vi) (i.e., each supernode has a self-loop eii). Each superedge eij is weighted by the density of edges between Vi and Vj: dG(i, j) = dG(Vi, Vj) = eG(Vi, Vj)/(|Vi||Vj|), where for any two sets of vertices S, T ✓ V , we denote eG(S, T) = X i2S,j2T AG(i, j). We define the density matrix of S as the k⇥k matrix AS with entries AS(i, j) = dG(i, j), 1  i, j  k. For each v 2 V , we also denote by s(v) the unique element of S (a supernode) such that v 2 s(v). The density matrix AS 2 Rk⇥k can be lifted1 to the matrix A" S 2 Rn⇥n defined by A" S(v, w) = AS(s(v), s(w)). Data Mining and Knowledge Discovery, in press; Data Mining and Knowledge Discovery manuscript No. (will be inserted by the editor) Graph Summarization with Quality Guarantees Matteo Riondato · David Garc´ıa-Soriano · Francesco Bonchi 5 / 32
  • 10. What does a summary look like? G = ([n], E): an undirected weighted graph with adjacency matrix AG; Summary S of G: a complete weighted graph with self-loops: S = (V, V × V); The supernodes Vi , i = 1, . . . , k, in V form a partitioning of [n]; The superedges weights are the edge densities between the corresponding sets of nodes: dG(i, j) = dG(Vi , Vj) = eG(Vi , Vj) |Vi ||Vj| , where eG(S, T) = i∈S,j∈T AG(i, j) for all S, T ⊆ V . Technical point: if S and T overlap, each non-loop edge with both endpoints in S ∩ T is counted twice for eG(S, T). 6 / 32
  • 11. Shall we look at an example? A graph with 7 nodes, and a summary of it with 3 supernodes: 2 1 3 4 75 0 7/9 3/4 1/4 1/3 1/3 {3,4,5} {6,7} {1,2} 6 7 / 32
  • 12. How can we answer queries using a summary? Preliminary question: What is the desired semantic of an answer computed on the summary? A summary S is a lossy representation of G, compatible with many other graphs; We adopt an expected-value semantic for query answers: Let π be a distribution over the set of graphs compatible with S; Given q and S, we want answer(q, S) = Eπ[answer(q, compatible G )] . We most often use the uniform distribution as π. 8 / 32
  • 13. What is the lifted adjacency matrix? Key object to compute query answers and evaluate the quality of the summary; Preliminaries: Given a summary S, let AS be the k × k matrix such that AS(i, j) = dG(i, j); Given a node v ∈ [n], let s(v) be the unique supernode of S containing v; Lifted adjacency matrix A↑ S ∈ Rn×n of S: A↑ S(u, w) = AS(s(u), s(w)) . A↑ S has a nice regular structure (example coming up); We call matrices like A↑ S “P-constant”, where P is the partitioning of [n] used in S. 9 / 32
  • 14. What does a lifted adjacency matrix look like? 2 1 3 4 75 0 7/9 3/4 1/4 1/3 1/3 {3,4,5} {6,7} {1,2} 6 1 2 3 4 5 6 7 1 0 0 1/3 1/3 1/3 1/4 1/4 2 0 0 1/3 1/3 1/3 1/4 1/4 3 1/3 1/3 7/9 7/9 7/9 1/3 1/3 4 1/3 1/3 7/9 7/9 7/9 1/3 1/3 5 1/3 1/3 7/9 7/9 7/9 1/3 1/3 6 1/4 1/4 1/3 1/3 1/3 3/4 3/4 7 1/4 1/4 1/3 1/3 1/3 3/4 3/4 10 / 32
  • 15. How do we use the lifted adjacency matrix to answer queries? Query Answer Adjacency of u and w A↑ S(u, w) Degree of u n i=1 A↑ S(u, i) Eigenvector centrality of u n i=1 A↑ S(u, i)/2|E| No. of triangles in G k i=1 |Vi | 3 A↑ S(i, i)3+ k j=i+1 A↑ S(i, j)2 |Vi | 2 |Vj|A↑ S(i, i) + |Vj | 2 |Vi |A↑ S(j, j) + + k w=j+1 |Vi ||Vj||Vw |A↑ S(i, j)A↑ S(j, w)A↑ S(w, j) This is a slight simplification of how to answer queries: there is some additional weighting of A ↑ S involved. Takeaway: Answering queries takes time O(poly(k)). 11 / 32
  • 16. What is a good summary? Given G = ([n], E), each partitioning P of [n] corresponds to a summary SP; Which partition of [n] gives the best summary? We need an error metric to evaluate the partitions, i.e., the summaries; Intuition: compare the lifted adjacency matrix of SP with the adjacency matrix AG; Error metrics: • p-reconstruction error, based on the p norm; •cut-norm error, based on the cut norm. 12 / 32
  • 17. What is the reconstruction error? The entry-wise p-norm of the difference between AG and A↑ S: errp(AG, A↑ S) = AG − A↑ S p =   |V | i=1 |V | j=1 |AG(i, j) − A↑ S(i, j)|p   1/p . The reconstruction error is such that errp(AG, A↑ S) ∈ [0, n2/p]; For the summary in the example we had: err1(AG, A↑ S) = 329 18 = 18.27 and err2(AG, A↑ S) = 11844 1296 = 3.023059525 . 13 / 32
  • 18. How do we compute the p-reconstruction error? In time O(k2) from the summary: errp(A↑ S, AG)p = i,j∈[k] |Vi ||Vj|dG(i, j)(1 − dG(i, j)) (1 − dG(i, j))p−1 + dG(i, j)p−1 . Interesting consequence: The choice of p does not matter much. E.g.,: err2(A↑ S, AG)2 = 2 · err1(A↑ S, AG) I.e., a partition that minimizes the 2-reconstruction error, also minimizes the 1-reconstruction error; (similarly also for any errp and errq, with p, q ∈ O(1), up to constant factors) 14 / 32
  • 19. What’s wrong with the p-reconstruction error? Achieving low (e.g., o(n2)) 1-reconstruction error may require k very close to n; Example: Consider a Erdős-Renyi graph G ∈ Gn,1/2; W.h.p., for any two vertices u, v, we have N(v) − N(v) = n 2 ± √ n log n; Assigning two vertices to the same supernode has cost n 2 − o(n); Hence, even for k = n/2, the reconstruction error of any k-summary is ≥ n2/9. 15 / 32
  • 20. How can we solve this drawback? When G is not random, we hope to find structure to exploit for building the summary; Look at adjacency matrix AG as: AG = S + E where •S: “structured” matrix captured by summary; •E: “residual” matrix representing the error; The “smaller” E is , the better the summary; The cut norm error tries to quantify E. 16 / 32
  • 21. What is the cut norm? The cut norm of B ∈ Rn×n is the maximum absolute sum of the entries of any of its submatrices: B = max S,T⊆[n] |B(S, T)| = max S,T⊆[n] i∈S,j∈T Bi,j . The cut-norm error of a summary S w.r.t. G is: err (AG, A↑ S) = AG − A↑ S = max S,T⊆V |eG(S, T) − eS↑ (S, T)|, Interpretation: maximum absolute difference between (twice) the sum of edges weights between any two sets of vertices in G and the same sum in A↑ S; Computing the cut norm is NP-hard, but exists constant-factor approximation via SDP. 17 / 32
• 22. What is GraSS?
GraSS: Graph Structure Summarization, by LeFevre and Terzi, SIAM SDM'10.
•Introduces the 1-reconstruction error;
•Defines a variant of the lifted adjacency matrix;
•Shows how to answer some queries using the summary;
•Gives an algorithm to build a summary using agglomerative hierarchical clustering (a naive sketch follows below):
1) Put each vertex in a separate supernode;
2) Until the number of supernodes is k: merge the two supernodes whose merging minimizes the 1-reconstruction error;
3) Output the resulting k supernodes.
GraSS does not offer any quality guarantees, and it is extremely slow.
18 / 32
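A naive sketch of that greedy loop (reusing lift() from above) makes the cost obvious: every step re-evaluates the error of every candidate merge. This is an illustrative reconstruction, not LeFevre and Terzi's actual code.

    def grass_greedy(A, k):
        # Agglomerative baseline: start from singletons and repeatedly apply
        # the merge that minimizes the 1-reconstruction error. Each step
        # rebuilds the summary for every candidate pair: extremely slow.
        parts = [[v] for v in range(A.shape[0])]
        while len(parts) > k:
            best = None
            for i in range(len(parts)):
                for j in range(i + 1, len(parts)):
                    cand = [p for a, p in enumerate(parts) if a not in (i, j)]
                    cand.append(parts[i] + parts[j])
                    _, A_lift = lift(A, cand)
                    err = np.abs(A - A_lift).sum()
                    if best is None or err < best[0]:
                        best = (err, cand)
            parts = best[1]
        return parts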
• 24. Why use the lifted adjacency matrix?
Given $p \geq 1$, a graph $G = ([n], E)$ with adjacency matrix $A_G$, and a partitioning $P = \{S_1, \ldots, S_k\}$ of $[n]$, which P-constant matrix $B^{*,p}_P$ minimizes $\|A_G - B^{*,p}_P\|_p$?
Lemma: $A^{\uparrow}_S$ is a (small) constant-factor approximation of $B^{*,p}_P$.
Intuition (this is not a proof): Sample u.a.r. a pair of vertices $(u,w)$ from $V_i \times V_j$, and let X be the r.v. with value $A_G(u,w)$. Which real a minimizes $\mathbb{E}[|X - a|^p]$? For $p = 1$ it is $\mathrm{median}(X)$; for $p = 2$ it is $\mathbb{E}[X]$. We have
$$A^{\uparrow}_S(i,j) = \frac{\sum_{(u,w) \in V_i \times V_j} A_G(u,w)}{|V_i||V_j|} = \mathbb{E}[X].$$
Hence $B^{*,2}_P = A^{\uparrow}_S$. Also, we show that
$$\|A_G - B^{*,1}_P\|_1 \leq \|A_G - A^{\uparrow}_S\|_1 \leq 2\,\|A_G - B^{*,1}_P\|_1.$$
19 / 32
• 25. What is our algorithm for graph summarization?
As simple as:
•For the ℓp-reconstruction error, perform ℓp-clustering of the rows of $A_G$ ($p = 1$: k-median, $p = 2$: k-means). If row i is in cluster j, then vertex i is in supernode $V_j$. (A sketch follows below.)
Lemma: The summary obtained from the optimal ℓ1 (resp. ℓ2) clustering is an 8- (resp. 4-) approximation of the optimal summary for the 1- (resp. 2-) reconstruction error.
...not really so simple to prove!
•For the cut-norm error, perform ℓ1-clustering of the rows of $A_G^2$, with a specific distance function.
Lemma: (a tad more complex; let's skip this.)
20 / 32
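For p = 2, the whole reduction fits in a few lines. The sketch below uses scikit-learn's KMeans as a stand-in for a clustering algorithm with provable guarantees (Lloyd's iterations give no worst-case approximation factor, so this illustrates the reduction, not the lemma).

    import numpy as np
    from sklearn.cluster import KMeans

    def summarize_l2(A, k, seed=0):
        # p = 2: cluster the rows of A_G with k-means; the cluster of row i
        # becomes the supernode of vertex i. (p = 1 would need k-median.)
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(A)
        return [list(np.flatnonzero(labels == c)) for c in range(k)
                if np.any(labels == c)]

Usage, combined with the earlier helpers: parts = summarize_l2(A, 3), then A_S, A_lift = lift(A, parts).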
• 26. Is there some intuition behind this connection with clustering?
Given a partition $P = \{V_1, \ldots, V_k\}$, consider the orthonormal vectors $v_i \in \mathbb{R}^n$ with $v_i(j) = 1/\sqrt{|V_i|}$ if $j \in V_i$, and 0 otherwise, and the $n \times n$ projection matrix $P = \sum_{i=1}^{k} v_i v_i^{\mathsf{T}}$. We have $A^{\uparrow}_S = P A_G P$ (a numerical check follows below).
The ℓp-cost of the clustering associated to P is $\|A_G - A_G P\|_p$, and for the ℓp norms
$$\|A_G - A_G P\|_p / 2 \;\leq\; \|A_G - P A_G P\|_p \;\leq\; 2\,\|A_G - A_G P\|_p.$$
Let $\bar{S}$ be the summary arising from the optimal ℓ2-clustering, and $S^*$ the optimal summary with minimal 2-reconstruction error. Then (the middle step uses the optimality of $\bar{S}$ as a clustering):
$$\mathrm{err}_2(A_G, A^{\uparrow}_{\bar{S}}) = \|A_G - P_{\bar{S}} A_G P_{\bar{S}}\|_2 \leq 2\,\|A_G - A_G P_{\bar{S}}\|_2 \leq 2\,\|A_G - A_G P_{S^*}\|_2 \leq 4\,\|A_G - P_{S^*} A_G P_{S^*}\|_2 = 4 \cdot \mathrm{err}_2(A_G, A^{\uparrow}_{S^*}).$$
21 / 32
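The identity $A^{\uparrow}_S = P A_G P$ is easy to verify numerically, reusing lift() and the test graph from the earlier fragment; note the $1/\sqrt{|V_i|}$ normalization, which is what makes the $v_i$ orthonormal:

    def projection(parts, n):
        # P = sum_i v_i v_i^T with v_i(j) = 1/sqrt(|V_i|) for j in V_i:
        # each Vi x Vi block of P holds the constant 1/|V_i|.
        P = np.zeros((n, n))
        for Vi in parts:
            P[np.ix_(Vi, Vi)] = 1.0 / len(Vi)
        return P

    P = projection(parts, A.shape[0])     # parts, A from the earlier check
    _, A_lift = lift(A, parts)
    assert np.allclose(P @ A @ P, A_lift) # A_lift = P A_G P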
• 27. How can we build a summary efficiently?
Both k-means and k-median are NP-hard, but constant-factor approximation algorithms exist.
Bottleneck: computing all pairwise distances between the n rows of $A_G$ is expensive (essentially matrix multiplication).
Solution: Use a sketch of the adjacency matrix with n rows and $O(\log n)$ columns; this incurs an additional constant-factor error. Even with the sketch, the approximation algorithms take time $\tilde{O}(n^2)$.
Idea: select $O(k)$ rows of the sketch adaptively, and compute a clustering using them.
In the end, the algorithm runs in time $\tilde{O}(m + nk)$ and obtains a constant-factor approximation. Proving it is not fun... or is it? (A sketching example follows below.)
22 / 32
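For p = 2, the row sketch can be as simple as a Gaussian random projection (Johnson–Lindenstrauss); the p = 1 sketch we cite (Indyk, 2006) instead uses Cauchy variables with a median estimator, which this fragment does not attempt. A minimal sketch under those assumptions:

    import numpy as np

    def l2_row_sketch(A, eps=0.5, seed=0):
        # Project the n coordinates of each row down to O(log n / eps^2)
        # dimensions; pairwise l2 distances between rows are preserved up
        # to a (1 +- eps) factor with high probability.
        n = A.shape[0]
        d = int(np.ceil(8 * np.log(n) / eps ** 2))
        R = np.random.default_rng(seed).normal(size=(A.shape[1], d)) / np.sqrt(d)
        return A @ R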
• 28. So, what's the algorithm?
Algorithm 1: Graph summarization with ℓp-reconstruction error
Input: $G = (V, E)$ with $|V| = n$, $k \in \mathbb{N}$, $p \in \{1, 2\}$
Output: An O(1)-approximation to the best k-summary for G under the ℓp-reconstruction error
// Create the n × O(log n) sketch matrix (Indyk, 2006)
S ← createSketch(A_G, O(log n), p)
// Select O(k) rows from the sketch (Aggarwal et al., 2009)
R ← reduceClustInstance(A_G, S, k)
// Run the approximation algorithm by Mettu and Plaxton (2003) to obtain a partition
P ← getApproxClustPartition(p, k, R, S)
// Compute the densities for the summary
D ← computeDensities(P, A_G)
return (P, D)
23 / 32
• 29. What is the story for the cut norm?
Szemerédi's regularity lemma: in every large enough graph, there is a partitioning of the nodes into subsets of about the same size such that the edges between different subsets behave almost randomly.
Key result in extremal graph theory: many properties that hold on random graphs also hold on large enough generic graphs.
The partitioning can be found in polynomial time, but may have a galactic size (a tower of exponentials in $1/\varepsilon$).
Frieze and Kannan defined weak regularity, which requires fewer classes.
24 / 32
• 30. What is weak regularity?
Given $G = ([n], E)$ and a partitioning P of $[n]$, P is ε-weakly regular if the cut-norm error of the summary corresponding to P is at most $\varepsilon n^2$, i.e.:
$$\mathrm{err}_\square(A_G, A^{\uparrow}_{S_P}) = \|A_G - A^{\uparrow}_{S_P}\|_\square = \max_{S,T \subseteq V} |e_G(S,T) - e_{S^{\uparrow}_P}(S,T)| \leq \varepsilon n^2.$$
Any unweighted graph with edge density $\alpha = 2|E|/|V|^2 \in [0,1]$ admits an ε-weakly regular partition with $2^{O(\alpha \log(1/\alpha)/\varepsilon^2)}$ parts (tight bound); a specific graph may still have much smaller partitions.
The optimal deterministic algorithm to build a weakly ε-regular partition with at most $2^{1/\mathrm{poly}(\varepsilon)}$ parts takes time $c(\varepsilon)\, n^2$.
25 / 32
• 31. What's wrong with this?
Existing algorithms take an error bound ε and always produce a partition of size $2^{\Theta(1/\varepsilon^2)}$. We solve the problem of constructing a partition of a given size k that minimizes the cut-norm error.
Algorithm (a sketch follows below):
1) Square $A_G$;
2) Run k-median on the rows of $A_G^2$, using the distance function
$$\delta(u,v) = \frac{\sum_{w \in V} |A_G^2(u,w) - A_G^2(v,w)|}{n^2}.$$
Lemma: Let $\mathrm{OPT}_\square$ be the error of the optimal weakly-regular k-partition of $[n]$. The partitioning induced by the optimal k-median solution is a weakly-regular partition with cost at most $16\sqrt{\mathrm{OPT}_\square}$.
26 / 32
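A hedged end-to-end sketch of this variant: a crude k-medoids loop stands in for the constant-factor k-median approximation algorithm actually used (an assumption of the sketch), and the $n^2$ normalization follows the distance $\delta$ above. Suitable only for small graphs, since it materializes large intermediate arrays.

    def summarize_cutnorm(A, k, iters=20, seed=0):
        # Cut-norm variant: k-median on the rows of A^2 under
        # delta(u, v) = sum_w |A2[u, w] - A2[v, w]| / n^2.
        n = A.shape[0]
        A2 = (A @ A) / n ** 2
        rng = np.random.default_rng(seed)
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(iters):
            # assign each row to the l1-nearest medoid row
            D = np.abs(A2[:, None, :] - A2[medoids][None, :, :]).sum(axis=2)
            labels = D.argmin(axis=1)
            for c in range(k):
                members = np.flatnonzero(labels == c)
                if members.size:
                    # new medoid: the member with minimum total l1 cost
                    costs = np.abs(A2[members][:, None, :]
                                   - A2[members][None, :, :]).sum(axis=(1, 2))
                    medoids[c] = members[costs.argmin()]
        return [list(np.flatnonzero(labels == c)) for c in range(k)
                if np.any(labels == c)]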
• 32. How does our algorithm perform in practice?
Implementation: C++, available at http://github.com/rionda/graphsumm.
Datasets: from SNAP (http://snap.stanford.edu/datasets): Facebook, Enron, Epinions1, SignEpinions, Stanford, Amazon0601 (from 4k to 400k vertices, 88k to 2.5M edges).
Goals:
•study the structure of the summaries generated by our algorithm;
•compare its performance with GraSS;
•evaluate the quality of approximate query answers.
27 / 32
• 34. What do the summaries look like?
Experiment: As k varies, build a summary with k supernodes, and measure:
•distribution of supernode sizes;
•distributions of internal and cross densities;
•reconstruction or cut-norm error;
•runtime.
Results:
•There is always at least one supernode containing a single node, but also supernodes with 10² or 10³ nodes: high standard deviation, decreasing superlinearly in k;
•Some supernodes contain large quasi-cliques, but the stddev of the internal density is high;
•All cross densities are very low, and some are zero: there are independent supernodes;
•The 2-reconstruction and cut-norm errors shrink linearly in k;
•The runtime increases linearly in k, and its stddev grows with k.
28 / 32
• 35. How does our algorithm compare to the state of the art?
Compared with GraSS (implemented by us; the authors lost their implementation). Our algorithm is "better, faster, stronger":

                                2-reconstruction error
    k    Alg.   c (GraSS)   avg     stdev        min     max     Runtime (s)
    10   GraSS  0.1         0.168   1.4 × 10⁻³   0.167   0.170    495.122
         GraSS  0.5         0.155   0.8 × 10⁻³   0.154   0.156   2669.961
         GraSS  0.75        0.153   0.7 × 10⁻³   0.153   0.154   4401.213
         GraSS  1.0         0.153   0.6 × 10⁻³   0.152   0.154   5516.915
         Ours   n/a         0.152   1.6 × 10⁻³   0.150   0.154      0.440
    250  GraSS  0.1         0.081   0.8 × 10⁻³   0.080   0.082    462.085
         GraSS  0.5         0.072   0.4 × 10⁻³   0.072   0.073   2517.515
         GraSS  0.75        0.070   0.5 × 10⁻³   0.070   0.071   4192.601
         GraSS  1.0         0.069   0.5 × 10⁻³   0.069   0.070   5263.651
         Ours   n/a         0.065   0.8 × 10⁻³   0.064   0.067      0.552

Table: Comparison between our algorithm and GraSS on a random sample (n = 500) of ego-gplus.
29 / 32
• 36. How good is the query-answering approximation?
Experiment: run queries on summaries of different sizes k, and measure the error.
Queries: Adjacency, Degree, Clustering Coefficient (Triangles).
Results:
•The error decreases very rapidly as k increases;
•Adjacency is well estimated on average, but with large stddev and max error when two supernodes are large and have very few edges connecting their nodes;
•Degree is extremely well approximated, with small stddev;
•Triangle counts are well approximated when k is not too small w.r.t. |V|, but the quality decreases very fast otherwise.
30 / 32
• 37. What did we learn from the experiments?
A whole supernode is often allocated to a single node: this seems suboptimal;
The sweet spot for the number of supernodes is not easy to find;
Query performance is just OK, not great.
Question: Was this just a theoretical exercise?
Answer: Maybe not, but we aren't "there" yet in graph summarization.
31 / 32
• 38. What did I tell you about?
The first constant-factor approximation algorithm for graph summarization.
We considered different error measures, including the cut norm, related to Szemerédi's Regularity Lemma, showing a practical application of highly theoretical concepts.
The algorithm beats the competition on every metric we measured.
Experimental inspection of the built summaries opens more questions about summaries and error metrics.
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda WWW: http://matteo.rionda.to
32 / 32
• 39. This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, "Two Sigma"). Such views reflect significant assumptions and subjective judgments of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made as to the accuracy of such information, and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.