On Sampling from Massive Graph Streams
1. Joint work with:
Nick Duffield – Texas A&M University
Ted Willke – Intel Labs
Ryan Rossi – PARC research
VLDB’17, Germany
August 31st, 2017
Nesreen K. Ahmed
Research Scientist, Intel Labs
2. Social network
Human Disease Network [Barabasi 2007]
Food Web [2007]
Terrorist Network [Krebs 2002]
Internet (AS) [2005]
Gene Regulatory Network [Decourty 2008]
Protein Interactions [breast cancer]
Political blogs
Power grid
4. Studying and analyzing complex networks
is a challenging and computationally intensive task
§ Today’s networks are dynamic/streaming over time
- e.g., Twitter streams, email communications
§ Today’s networks are massive in size
- e.g., online social networks have billions of users
5. Due to these challenges, we usually need to sample
Statistical sampling: Graph G → Sample S (e.g., uniform random sampling)
6. Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
8. Graph properties of interest include:
• Number of triangles
• Number of wedges
• Frequent connected subsets of edges
• Transitivity
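For reference, these properties have simple exact definitions when the whole graph fits in memory. The sketch below is not from the slides: the function name is illustrative, it assumes an undirected simple graph with each edge listed once, and it uses the common definition of transitivity as 3 × triangles / wedges.

```python
def graph_properties(edges):
    """Return (triangles, wedges, transitivity) for a small undirected graph.

    Assumes each undirected edge appears exactly once and there are no self-loops.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Every triangle is found once from each of its three edges.
    triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
    # A wedge is a pair of edges sharing a vertex: C(deg, 2) per vertex.
    wedges = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    transitivity = 3 * triangles / wedges if wedges else 0.0
    return triangles, wedges, transitivity
```

This exact computation needs the full graph, which is what the sampling methods in the following slides try to avoid.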
10. § Random Sampling
• Uniform random sampling – [Tsourakakis et al. KDD’09]
— Graph Sparsification with probability p
— Chance of sampling a subgraph (e.g., triangle) is very low
— Estimates suffer from high variance
• Wedge Sampling – [Seshadhri et al. SDM’13]
— Sample vertices, then sample pairs of incident edges (wedges)
— Output the estimate of the closed wedges (triangles)
11. § Random Sampling (limitations)
• These methods assume we have access to the full graph
• Not a good fit for massive streaming graphs
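To make the uniform sparsification idea concrete, here is a minimal sketch (not the cited authors' code; function names are illustrative). It keeps each edge independently with probability p, counts triangles in the sparsified graph, and rescales by 1/p^3, since a triangle survives only if all three of its edges are kept. It assumes the full edge list is available and each undirected edge appears once.

```python
import random


def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once per edge, i.e. three times in total.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def sparsified_triangle_estimate(edges, p, rng):
    """Keep each edge with probability p, then rescale the count by 1 / p**3."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```

With small p most triangles lose at least one edge, which is why individual estimates vary widely even though their average is correct.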
12. § Assume specific order of the stream – [Buriol et al. 2006]
• Incidence stream model – vertex neighbors arrive together in the stream
§ Use multiple passes over the stream – [Becchetti et al. KDD’08]
§ Single-pass algorithms for arbitrary-ordered graph streams
13. § Single-pass algorithms for arbitrary-ordered graph streams
• Streaming-Triangles – [Jha et al. KDD’13]
— Sample edges using reservoir sampling, then sample pairs of incident
edges (wedges), and finally scan for closed wedges (triangles)
• Neighborhood Sampling – [Pavan et al. VLDB’13]
— Sampling vectors of wedge estimators, scan the stream for closed wedges
(triangles)
• TRIEST – [De Stefani et al. KDD’16]
— Uses standard reservoir sampling to maintain the edge sample
• MASCOT – [Lim et al. KDD’15]
— Independent edge sampling with probability p
• Graph Sample & Hold – [Ahmed et al. KDD’14]
— Conditionally independent edge sampling
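Several of these methods build on standard reservoir sampling. As a reminder of how that primitive works, here is a minimal sketch (the classic algorithm, not any of the cited systems; the function name is illustrative). It keeps a uniform random sample of m items from a stream of unknown length in a single pass.

```python
import random


def reservoir_sample(stream, m, rng):
    """Return a uniform random sample of up to m items from an iterable."""
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= m:
            # The first m items fill the reservoir.
            reservoir.append(item)
        else:
            # Item t replaces a random slot with probability m / t.
            j = rng.randrange(t)
            if j < m:
                reservoir[j] = item
    return reservoir
```

Every item is retained with the same probability, so the sample is fixed-size but not weight-sensitive.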
14. Summary of previous work
Sampling designs for specific graph properties (triangles)
Not generally applicable to other properties
Uniform-based Sampling
Obtain variable-size sample
We propose a generic unbiased sampling framework: Graph Priority Sampling
• Weight-sensitive
• Fixed-size sample
• Single-pass
• Applicable for general graph properties
• Use topological information that we wish to estimate as auxiliary variables
• Variance-‐optimal sampling (cost optimization approach)
16. Graph Priority Sampling Framework: GPS(m)
Input: edge stream $k_1, k_2, \ldots, k, \ldots$
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge $k$:
• Generate a random number $u(k) \sim \mathrm{Uni}(0, 1]$
• Compute edge weight $w(k) = W(k, \hat{K})$
• Compute edge priority $r(k) = w(k)/u(k)$
• Add the edge to the sample: $\hat{K} = \hat{K} \cup \{k\}$
17. Graph Priority Sampling Framework: GPS(m)
Input: edge stream $k_1, k_2, \ldots, k, \ldots$
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge $k$:
• Find edge with lowest priority: $k^* = \arg\min_{k' \in \hat{K}} r(k')$
• Update sample threshold: $z^* = \max\{z^*, r(k^*)\}$
• Remove lowest priority edge: $\hat{K} = \hat{K} \setminus \{k^*\}$
Use a priority queue with O(log m) updates
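The sampling loop from the two slides above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the class and attribute names are made up, the weight function is passed in by the caller, edges are assumed to be distinct hashable and comparable tuples, and the eviction step is assumed to run only once the sample holds more than m edges.

```python
import heapq
import random


class GPSSampler:
    """Fixed-size priority sample of an edge stream (illustrative sketch)."""

    def __init__(self, m, weight_fn, rng=None):
        self.m = m
        self.weight_fn = weight_fn   # plays the role of W(k, K_hat)
        self.rng = rng or random.Random()
        self.heap = []               # min-heap of (priority, edge)
        self.weights = {}            # sampled edge -> weight w(k)
        self.z_star = 0.0            # sample threshold z*

    def process(self, edge):
        u = 1.0 - self.rng.random()              # uniform on (0, 1]
        w = self.weight_fn(edge, self.weights)   # weight given current sample
        r = w / u                                # priority r(k) = w(k) / u(k)
        self.weights[edge] = w
        heapq.heappush(self.heap, (r, edge))     # provisionally include the edge
        if len(self.heap) > self.m:
            # Evict the lowest-priority edge and raise the threshold.
            r_min, k_min = heapq.heappop(self.heap)
            self.z_star = max(self.z_star, r_min)
            del self.weights[k_min]
```

A sampler with uniform weights could be created as `GPSSampler(1000, lambda edge, sample: 1.0)` and fed one edge at a time through `process`. The heap gives O(log m) insertion and eviction per arriving edge.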
18. § We use edge weights $w(k) = W(k, \hat{K})$ to express the role of the
arriving edge in the sampled graph
• e.g., no. subgraphs completed by the arriving edge, and/or other
auxiliary variables
§ Computational feasibility
• Efficient implementation by using a priority queue
• Implemented as a min-heap with O(log m) insertion/deletion
• O(1) access to the edge with minimum priority
19. Edge Estimation
For each edge $i$, we construct a sequence of edge estimators $\hat{S}_{i,t}$:
$\hat{S}_{i,t} = I(i \in \hat{K}_t) / \min\{1, w_i/z^*\}$
where $\hat{S}_{i,t}$ are unbiased estimators of the corresponding edge indicators $S_{i,t}$
and $\hat{K}_t$ is the sample at time $t$
We achieve unbiasedness by establishing that the sequence is a martingale (Theorem 1):
$E[\hat{S}_{i,t}] = S_{i,t}$
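As a small worked example of the edge estimator, the sketch below evaluates the formula directly. The function names are illustrative, and treating the inclusion probability as 1 while the threshold is still zero (no edge discarded yet) is an assumption made here.

```python
def inclusion_probability(w, z_star):
    """min{1, w / z*}; taken to be 1 while no edge has been discarded (z* = 0)."""
    if z_star <= 0:
        return 1.0
    return min(1.0, w / z_star)


def edge_estimator(in_sample, w, z_star):
    """Inverse-probability estimate: 1 / min{1, w / z*} if sampled, else 0."""
    if not in_sample:
        return 0.0
    return 1.0 / inclusion_probability(w, z_star)
```

For instance, a sampled edge with weight 2 under threshold 8 has inclusion probability 0.25 and therefore counts as 4 edges in expectation-correcting terms, while an edge whose weight exceeds the threshold counts as exactly 1.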
20. Subgraph Estimation
For each subgraph $J \subset [t]$,
we define the sequence of subgraph estimators as
$\hat{S}_{J,t} = \prod_{i \in J} \hat{S}_{i,t}$
We prove the sequence is a martingale (Theorem 2):
$E[\hat{S}_{J,t}] = S_{J,t}$
21. Subgraph Counting
For any set $\mathcal{J}$ of subgraphs of $G$,
$\hat{N}_t(\mathcal{J}) = \sum_{J \in \mathcal{J} : J \subset K_t} \hat{S}_{J,t}$
is an unbiased estimator of $N_t(\mathcal{J}) = |\mathcal{J}_t|$
(Theorem 2)
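Specialized to triangles, the subgraph count estimator sums, over every triangle fully contained in the sample, the product of its three edge estimators. A minimal post-stream sketch follows; the function name and the input layout (a dict from sampled edge to weight, plus the final threshold) are assumptions made for illustration.

```python
def estimate_triangles(sample_weights, z_star):
    """Estimate the triangle count from a priority sample.

    sample_weights: dict mapping each sampled edge (u, v) to its weight w.
    z_star: final sample threshold; 0 means no edge was ever discarded.
    """
    def q(w):
        return 1.0 if z_star <= 0 else min(1.0, w / z_star)

    adj = {}
    inv_q = {}
    for (u, v), w in sample_weights.items():
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        inv_q[frozenset((u, v))] = 1.0 / q(w)   # edge estimator for a sampled edge

    total = 0.0
    for u, v in sample_weights:
        for x in adj[u] & adj[v]:
            # Product of the three edge estimators of triangle {u, v, x}.
            total += (inv_q[frozenset((u, v))]
                      * inv_q[frozenset((u, x))]
                      * inv_q[frozenset((v, x))])
    return total / 3.0   # each triangle was visited once per edge
```

A single sampled triangle whose edges each have inclusion probability 0.5 contributes 2 × 2 × 2 = 8 to the estimate.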
22. How should the ranks $r_{i,t}$ be distributed in order to minimize
the variance of the unbiased estimator of $N_t(\mathcal{J})$?
§ We provide a cost minimization approach
• inspired by IPPS sampling [Duffield et al. 2005]
§ By minimizing the conditional variance of the increment
in $N_t(\mathcal{J})$ incurred by the arriving edge
23. § Post-stream Estimation
• enables retrospective subgraph queries
• after any number t of edge arrivals have taken place, we can
compute an unbiased estimator for any subgraph
§ In-stream Estimation
• we can take “snapshots” of estimates of specific sampled subgraphs
at arbitrary times during the stream
• Still Unbiased!
• Lightweight online/incremental update of unbiased estimates of
subgraph counts
• Same sampling procedure
• Uses a stopped martingale
24. Graph Priority Sampling Framework: GPS(m)
Input: edge stream $k_1, k_2, \ldots, k, \ldots$
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge $k$:
• Compute edge priority $r(k) = w(k)/u(k)$
• Update the sample
• Update unbiased estimates of subgraph counts
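One way to picture in-stream estimation for triangles is sketched below. This is a hedged reading rather than the authors' code: names are illustrative, the stream is assumed to contain no duplicate edges or self-loops, node ids are assumed comparable, and the running estimate is assumed to be updated with the threshold in force just before the arriving edge enters the sample. The weight 9 × (triangles completed) + 1 is the one reported later in the talk.

```python
import heapq
import random


class GPSInStream:
    """Priority sampling with a running triangle-count estimate (sketch)."""

    def __init__(self, m, rng=None):
        self.m = m
        self.rng = rng or random.Random()
        self.heap = []         # min-heap of (priority, edge)
        self.weight = {}       # sampled edge -> weight
        self.adj = {}          # adjacency sets of the sampled graph
        self.z_star = 0.0      # sample threshold
        self.triangles = 0.0   # running estimate of the triangle count

    @staticmethod
    def _key(u, v):
        return (u, v) if u <= v else (v, u)

    def _q(self, edge):
        w = self.weight[edge]
        return 1.0 if self.z_star <= 0 else min(1.0, w / self.z_star)

    def process(self, u, v):
        k = self._key(u, v)
        common = self.adj.get(u, set()) & self.adj.get(v, set())
        # Snapshot: each triangle closed by k adds 1 / (q(k1) * q(k2)).
        for x in common:
            self.triangles += 1.0 / (self._q(self._key(u, x))
                                     * self._q(self._key(v, x)))
        # Sampling step with triangle-based weight.
        w = 9.0 * len(common) + 1.0
        r = w / (1.0 - self.rng.random())
        self.weight[k] = w
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)
        heapq.heappush(self.heap, (r, k))
        if len(self.heap) > self.m:
            r_min, (a, b) = heapq.heappop(self.heap)
            self.z_star = max(self.z_star, r_min)
            del self.weight[(a, b)]
            self.adj[a].discard(b)
            self.adj[b].discard(a)
```

When m is at least the stream length nothing is discarded, every inclusion probability is 1, and the running estimate equals the exact triangle count.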
25. In-stream Estimation
We define a snapshot as an edge subset $J$, with a family of
stopping times $T$ such that $T = \{T_j : j \in J\}$
$\hat{S}^{T}_{J,t} = \prod_{j \in J} \hat{S}^{T_j}_{j,t} = \prod_{j \in J} \hat{S}_{j,\min\{T_j, t\}}$
We prove the sequence is a stopped martingale (Theorem 4):
$E[\hat{S}^{T}_{J,t}] = S_{J,t}$
26. § We use GPS for the estimation of
• Triangle counts
• Wedge counts
• Global clustering coefficient
• And their unbiased variance (Theorem 3 in the paper)
• Weight function: $W(k, \hat{K}) = 9 \cdot \hat{\triangle}(k) + 1$,
where $\hat{\triangle}(k)$ is the number of triangles completed by edge $k$
whose other edges are in $\hat{K}$
• Used a large set of graphs from a variety of domains (social, web,
tech, etc.), with up to 49B edges; data is available at http://networkrepository.com/
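The weight function can be evaluated from the adjacency sets of the sampled graph. A minimal sketch, assuming the sample is stored as a dict from node to its set of sampled neighbors (the function name and data layout are illustrative):

```python
def triangle_weight(edge, sampled_adj):
    """W(k, K_hat) = 9 * (triangles completed by k within the sample) + 1."""
    u, v = edge
    # A triangle is completed for every sampled common neighbor of u and v.
    completed = len(sampled_adj.get(u, set()) & sampled_adj.get(v, set()))
    return 9 * completed + 1
```

An arriving edge that closes no sampled triangle gets weight 1, so it can still be sampled, while edges that close triangles get proportionally higher priority.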
27. - GPS accurately estimates various properties simultaneously
- Consistent performance across graphs from various domains
- A key advantage of GPS in-stream estimation is its smaller variance and tighter confidence bounds
28. Results for triangle counts
Using massive real-world and synthetic graphs of up to 49B edges
GPS is shown to be accurate with <0.01 error
Sample size = 1M edges, in-stream estimation
95% confidence intervals
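A 95% confidence interval can be formed from an estimate and its estimated variance. The sketch below assumes a normal approximation, so the bounds are the estimate plus or minus 1.96 standard deviations; the function name is illustrative and the slides do not spell out this formula.

```python
import math


def confidence_interval_95(estimate, variance):
    """Return (lower, upper) bounds under a normal approximation."""
    half_width = 1.96 * math.sqrt(variance)
    return estimate - half_width, estimate + half_width
```

For example, an estimate of 100 with variance 25 has a standard deviation of 5 and a half-width of 9.8.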
31. GPS in-stream estimates over time
Sample size = 80K edges
95% confidence intervals
[Figure: soc-orkut. Panels plot triangles at time t and clustering coefficient at time t against stream size at time t (|K_t|); curves: Actual, Estimate, Upper Bound, Lower Bound]
32. GPS In-stream Estimation, sample size 100K edges
GPS accurately estimates both triangle and wedge counts
simultaneously with a single sample
[Figure: scatter plot with both axes ranging from 0.994 to 1.006; datasets: ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google]
34. We observe accurate results with no significant difference in error between
the ordering schemes
35. § We used three schemes for weighting edges during sampling
§ Goal: estimate triangle counts for Friendster social network
with sample size=1M (0.1% of the graph)
1. triangle-based weights (3% relative error)
2. wedge-based weights (25% relative error)
3. uniform weights for all incoming edges (43% relative error)
- this is equivalent to simple random sampling
The estimator variance was 3.8x higher using wedge-based weights, and
6.2x higher using uniform weights compared to triangle-based weights.
36. § A sample is representative if graph properties of interest can be
estimated with a known degree of accuracy
§ We proposed a generic framework Graph Priority Sampling (GPS)
- GPS is an efficient single-pass streaming framework
- GPS selects a representative sample and computes unbiased estimates of
counts of connected subsets of edges (e.g., triangles, wedges, …)
- Theoretical properties of GPS are supported by empirical analysis
§ GPS admits generalizations by allowing the sampling process to
depend on the stored state and/or auxiliary variables
§ GPS is a variance-minimizing sampling approach
§ GPS has a relative estimation error < 1%