On Sampling from Massive Graph Streams
Joint work with:
Nick Duffield, Texas A&M University
Ted Willke, Intel Labs
Ryan Rossi, PARC Research
VLDB'17, Germany
August 31st, 2017
Nesreen K. Ahmed
Research Scientist, Intel Labs
[Figure: examples of complex networks: social network; human disease network [Barabasi 2007]; food web [2007]; terrorist network [Krebs 2002]; Internet (AS) [2005]; gene regulatory network [Decourty 2008]; protein interactions [breast cancer]; political blogs; power grid]
[Figure: graph mining across social network, Internet (AS), biological, and political blog graphs]
Studying and analyzing complex networks is a challenging and computationally intensive task
§ Today's networks are dynamic/streaming over time
• e.g., Twitter streams, email communications
§ Today's networks are massive in size
• e.g., online social networks have billions of users
Due to these challenges, we usually need to sample
[Figure: statistical sampling maps graph G to sample S, e.g., uniform random sampling]
Given a large graph G represented as a stream of edges e1, e2, e3, ...
We show how to efficiently sample from G within limited memory, and how to calculate unbiased estimates of various graph properties from the sample
Example properties: frequent connected subsets of edges
• No. of triangles
• No. of wedges
• Transitivity
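To make the target quantities concrete, here is a minimal sketch that computes them exactly on a small in-memory graph. It assumes the standard definitions (a wedge is a pair of edges sharing a vertex, and transitivity is 3 x triangles / wedges); the function name `exact_counts` and the toy graph are illustrative and not taken from the talk.

```python
from itertools import combinations

def exact_counts(edges):
    """Return (triangles, wedges, transitivity) for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # A wedge is a pair of edges sharing a vertex: C(deg, 2) wedges per vertex.
    wedges = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # A closed wedge has adjacent endpoints; each triangle closes three wedges.
    closed = sum(1 for n in adj.values()
                 for a, b in combinations(sorted(n), 2) if b in adj[a])
    triangles = closed // 3
    transitivity = 3 * triangles / wedges if wedges else 0.0
    return triangles, wedges, transitivity

# Toy graph: one triangle (1, 2, 3) plus a pendant edge (3, 4).
toy = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(exact_counts(toy))  # prints (1, 5, 0.6)
```

Exact counting like this needs the whole graph in memory, which is precisely what the streaming setting rules out.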
§ Random Sampling
• Uniform random sampling [Tsourakakis et al., KDD'09]
- Graph sparsification with probability p
- Chance of sampling a subgraph (e.g., triangle) is very low
- Estimates suffer from high variance
• Wedge sampling [Seshadhri et al., SDM'13]
- Sample vertices, then sample pairs of incident edges (wedges)
- Output the estimate of the closed wedges (triangles)
These methods assume we have access to the full graph
Not a good fit for massive streaming graphs
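The sparsification idea can be sketched as follows: keep each edge independently with probability p, count triangles in the sparsified graph, and scale by 1/p^3, since a triangle survives only if all three of its edges do. This is a hedged illustration, not the cited authors' code; the helper names and the complete-graph example are assumptions made here.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    closed = sum(1 for n in adj.values()
                 for a, b in combinations(sorted(n), 2) if b in adj[a])
    return closed // 3  # each triangle closes three wedges

def sparsify_estimate(edges, p, seed=0):
    """Keep each edge with probability p, then scale the count by 1 / p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3

# Complete graph on 30 vertices: C(30, 3) = 4060 triangles.
k30 = list(combinations(range(30), 2))
print(count_triangles(k30), sparsify_estimate(k30, 0.5))
```

Running the estimator with several seeds and a small p shows the high variance noted above: few triangles survive, and each survivor is scaled up by a large factor.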
§ Assume a specific order of the stream [Buriol et al. 2006]
• Incidence stream model: vertex neighbors arrive together in the stream
§ Use multiple passes over the stream [Becchetti et al., KDD'08]
§ Single-pass algorithms for arbitrary-ordered graph streams
• Streaming-Triangles [Jha et al., KDD'13]
- Sample edges using reservoir sampling, then sample pairs of incident edges (wedges), and finally scan for closed wedges (triangles)
• Neighborhood Sampling [Pavan et al., VLDB'13]
- Sample vectors of wedge estimators, then scan the stream for closed wedges (triangles)
• TRIEST [De Stefani et al., KDD'16]
- Uses standard reservoir sampling to maintain the edge sample
• MASCOT [Lim et al., KDD'15]
- Independent edge sampling with probability p
• Graph Sample & Hold [Ahmed et al., KDD'14]
- Conditionally independent edge sampling
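Several of the single-pass methods above build on standard reservoir sampling, which keeps a uniform fixed-size sample of a stream of unknown length. The sketch below shows only that building block (the classic "Algorithm R"), not any of the cited triangle estimators; the function name and the `seed` parameter are illustrative assumptions.

```python
import random

def reservoir_sample(stream, m, seed=0):
    """Keep a uniform random sample of at most m items from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= m:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(t)        # uniform integer in [0, t)
            if j < m:                   # happens with probability m / t
                reservoir[j] = item     # evict a uniformly chosen slot
    return reservoir

edges = [(i, i + 1) for i in range(1000)]
print(len(reservoir_sample(edges, 10)))
```

Every edge is treated alike here, which is the "uniform-based" limitation that weight-sensitive sampling is meant to address.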
Summary of previous work
• Sampling designs for specific graph properties (triangles)
• Not generally applicable to other properties
• Uniform-based sampling
• Obtain a variable-size sample
We propose a generic unbiased sampling framework: Graph Priority Sampling
• Weight-sensitive
• Fixed-size sample
• Single-pass
• Applicable to general graph properties
• Uses topological information that we wish to estimate as auxiliary variables
• Variance-optimal sampling (cost optimization approach)
Graph Priority Sampling Framework GPS(m)
Input: edge stream $k_1, k_2, \ldots, k, \ldots$
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge k:
• Generate a random number $u(k) \sim \mathrm{Uni}(0, 1]$
• Compute the edge weight $w(k) = W(k, \hat{K})$
• Compute the edge priority $r(k) = w(k)/u(k)$
• Add the edge to the sample: $\hat{K} = \hat{K} \cup \{k\}$
When the sample exceeds m edges:
• Find the edge with the lowest priority: $k^* = \arg\min_{k' \in \hat{K}} r(k')$
• Update the sample threshold: $z^* = \max\{z^*, r(k^*)\}$
• Remove the lowest-priority edge: $\hat{K} = \hat{K} \setminus \{k^*\}$
Use a priority queue with O(log m) updates
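The sampling loop can be sketched with Python's `heapq` module as the priority queue. This is a minimal illustration under stated assumptions, not the authors' implementation: edges are pairs of comparable vertex ids with no repeats, `weight_fn(u, v, adj)` is a hypothetical callback that sees the adjacency of the current sample, and the weight defaults to 1 when no callback is given.

```python
import heapq
import random

def gps_sample(edge_stream, m, weight_fn=None, seed=0):
    """Return ({edge: weight}, z_star) after one pass over the stream."""
    rng = random.Random(seed)
    heap = []       # min-heap of (priority, edge, weight)
    adj = {}        # adjacency of the sampled graph, passed to weight_fn
    z_star = 0.0    # sample threshold
    for u, v in edge_stream:
        w = weight_fn(u, v, adj) if weight_fn else 1.0
        uk = 1.0 - rng.random()                  # uniform on (0, 1]
        heapq.heappush(heap, (w / uk, (u, v), w))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        if len(heap) > m:
            # Evict the lowest-priority edge and raise the threshold.
            r_min, (a, b), _ = heapq.heappop(heap)
            z_star = max(z_star, r_min)
            adj[a].discard(b)
            adj[b].discard(a)
    return {edge: w for _, edge, w in heap}, z_star

path = [(i, i + 1) for i in range(100)]
sample, z_star = gps_sample(path, 10)
print(len(sample), z_star)
```

Because `random.random()` returns values in [0, 1), the code uses `1.0 - rng.random()` to draw from (0, 1] and avoid dividing by zero.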
§ We use edge weights to express the role of the arriving edge in the sampled graph: $w(k) = W(k, \hat{K})$
• e.g., no. of subgraphs completed by the arriving edge, and/or other auxiliary variables
§ Computational feasibility
• Efficient implementation by using a priority queue
• Implemented as a min-heap with O(log m) insertion/deletion
• O(1) access to the edge with minimum priority
Edge Estimation
For each edge i, we construct a sequence of edge estimators $\hat{S}_{i,t}$:
$\hat{S}_{i,t} = I(i \in \hat{K}_t) / \min\{1, w_i/z^*\}$
where $\hat{K}_t$ is the sample at time t, and $\hat{S}_{i,t}$ are unbiased estimators of the corresponding edge indicators $S_{i,t}$
We achieve unbiasedness by establishing that the sequence is a martingale (Theorem 1):
$E[\hat{S}_{i,t}] = S_{i,t}$
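As a small worked example, the edge estimator is an inverse-probability weight: a sampled edge with weight w counts as 1 / min{1, w / z*} edges, and an unsampled edge counts as 0. The sketch below is illustrative only; the function names are assumptions, and treating z* = 0 (nothing discarded yet) as probability 1 is a convention chosen here to avoid dividing by zero.

```python
def inclusion_probability(w, z_star):
    """min{1, w / z*}; with z* = 0 every edge seen so far is still sampled."""
    if z_star <= 0:
        return 1.0
    return min(1.0, w / z_star)

def edge_estimator(in_sample, w, z_star):
    """Inverse-probability estimate for one edge: 1/p if sampled, else 0."""
    return 1.0 / inclusion_probability(w, z_star) if in_sample else 0.0

def estimate_edge_count(sample_weights, z_star):
    """Sum of edge estimators over the sample: estimates the edges seen."""
    return sum(edge_estimator(True, w, z_star) for w in sample_weights)

# Two light edges (p = 1/4 each) and one heavy edge (p = 1): 4 + 4 + 1.
print(estimate_edge_count([1.0, 1.0, 8.0], 4.0))  # prints 9.0
```

Heavier edges are more likely to stay in the sample and therefore contribute less per sampled edge, which is how the weights trade off against the estimate.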
Subgraph Estimation
For each subgraph $J \subset [t]$, we define the sequence of subgraph estimators as
$\hat{S}_{J,t} = \prod_{i \in J} \hat{S}_{i,t}$
We prove the sequence is a martingale (Theorem 2), so that
$E[\hat{S}_{J,t}] = S_{J,t}$
Subgraph Counting
For any set $\mathcal{J}$ of subgraphs of G,
$\hat{N}_t(\mathcal{J}) = \sum_{J \in \mathcal{J} : J \subset K_t} \hat{S}_{J,t}$
is an unbiased estimator of $N_t(\mathcal{J}) = |\mathcal{J}_t|$ (Theorem 2)
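For triangles, the product form means that each triangle found entirely inside the sample contributes the product of the inverse probabilities of its three edges. The following is a hedged post-stream sketch: the input format (a dict from edge to weight plus the threshold z*) and the function name are assumptions made for illustration.

```python
from itertools import combinations

def estimate_triangles(sample, z_star):
    """sample maps an undirected edge (u, v) to its weight w."""
    def p(w):
        return 1.0 if z_star <= 0 else min(1.0, w / z_star)

    prob, adj = {}, {}
    for (u, v), w in sample.items():
        prob[frozenset((u, v))] = p(w)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0.0
    for u in adj:
        for v, x in combinations(sorted(adj[u]), 2):
            # Count each triangle once, at its smallest vertex u < v < x.
            if u < v and x in adj[v]:
                total += 1.0 / (prob[frozenset((u, v))]
                                * prob[frozenset((u, x))]
                                * prob[frozenset((v, x))])
    return total

# Triangle (1, 2, 3) with probabilities 1/4, 1/4, 1, plus a pendant edge.
sample = {(1, 2): 1.0, (2, 3): 1.0, (1, 3): 8.0, (3, 4): 1.0}
print(estimate_triangles(sample, 4.0))  # prints 16.0
```

The single sampled triangle stands in for 16 triangles of the full graph, because the chance of retaining all three of its edges is 1/4 x 1/4 x 1.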
How should the ranks $r_{i,t}$ be distributed in order to minimize the variance of the unbiased estimator of $N_t(\mathcal{J})$?
§ We provide a cost minimization approach
• inspired by IPPS sampling [Duffield et al. 2005]
§ By minimizing the conditional variance of the increment in $N_t(\mathcal{J})$ incurred by the arriving edge
§ Post-stream Estimation
• enables retrospective subgraph queries
• after any number t of edge arrivals have taken place, we can compute an unbiased estimator for any subgraph
§ In-stream Estimation
• we can take "snapshots" of estimates of specific sampled subgraphs at arbitrary times during the stream
• Still unbiased!
• Lightweight online/incremental update of unbiased estimates of subgraph counts
• Same sampling procedure
• Using a stopped martingale
Graph Priority Sampling Framework GPS(m)
Input: edge stream $k_1, k_2, \ldots, k, \ldots$
Output: sampled edge stream $\hat{K}$
Stored state: $m = O(|\hat{K}|)$
For each edge k:
• Compute the edge priority $r(k) = w(k)/u(k)$
• Update the sample
• Update unbiased estimates of subgraph counts
In-stream Estimation
We define a snapshot as an edge subset J, with a family of stopping times $T = \{T_j : j \in J\}$
$\hat{S}^T_{J,t} = \prod_{j \in J} \hat{S}^{T_j}_{j,t} = \prod_{j \in J} \hat{S}_{j,\min\{T_j, t\}}$
We prove the sequence is a stopped martingale (Theorem 4), so that
$E[\hat{S}^T_{J,t}] = S_{J,t}$
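One way to read the snapshot idea for triangles: when an edge arrives and closes a triangle with two edges currently in the sample, freeze that triangle's estimate at that moment using the current inclusion probabilities of the two sampled edges, and add it to a running total. The sketch below is an illustration under that reading, not the paper's algorithm verbatim; it assumes comparable vertex ids, no repeated edges, and the triangle-based weight 9 x (triangles completed) + 1.

```python
import heapq
import random

def gps_in_stream_triangles(edge_stream, m, seed=0):
    """Single pass: maintain the sample and a running triangle estimate."""
    rng = random.Random(seed)
    heap, adj, weight = [], {}, {}
    z_star, estimate = 0.0, 0.0

    def p(a, b):
        # Current inclusion probability of sampled edge {a, b}.
        w = weight[frozenset((a, b))]
        return 1.0 if z_star <= 0 else min(1.0, w / z_star)

    for u, v in edge_stream:
        common = adj.get(u, set()) & adj.get(v, set())
        # Snapshot every triangle closed by (u, v) before updating the sample.
        for x in common:
            estimate += 1.0 / (p(u, x) * p(v, x))
        w = 9.0 * len(common) + 1.0
        heapq.heappush(heap, (w / (1.0 - rng.random()), (u, v)))
        weight[frozenset((u, v))] = w
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        if len(heap) > m:
            r_min, (a, b) = heapq.heappop(heap)
            z_star = max(z_star, r_min)
            adj[a].discard(b)
            adj[b].discard(a)
            del weight[frozenset((a, b))]
    return estimate

# Complete graph on 5 vertices has C(5, 3) = 10 triangles.
k5 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(gps_in_stream_triangles(k5, m=100))  # prints 10.0
```

With m at least the stream length nothing is evicted, so the estimate equals the exact count; with a smaller m the same loop returns a randomized estimate.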
§ We use GPS for the estimation of
• Triangle counts
• Wedge counts
• Global clustering coefficient
• And their unbiased variance (Theorem 3 in the paper)
§ Weight function
$W(k, \hat{K}) = 9 \cdot \hat{\triangle}(k) + 1$
where $\hat{\triangle}(k)$ is the number of triangles completed by edge k whose other edges are in $\hat{K}$
§ Used a large set of graphs from a variety of domains (social, web, tech, etc.); data is available at http://networkrepository.com/
• Up to 49B edges
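The weight function is cheap to evaluate when the sample is stored as adjacency sets: the triangles completed by an arriving edge (u, v) are the common neighbors of u and v in the sample. A minimal sketch, assuming a dict-of-sets adjacency (a representation chosen here for illustration, with a hypothetical function name):

```python
def triangle_weight(u, v, adj):
    """W(k, K) = 9 * (triangles completed by edge k = (u, v)) + 1."""
    common = adj.get(u, set()) & adj.get(v, set())
    return 9 * len(common) + 1

# Sampled edges (1, 2) and (1, 3): the arriving edge (2, 3) closes a triangle.
adj = {1: {2, 3}, 2: {1}, 3: {1}}
print(triangle_weight(2, 3, adj), triangle_weight(1, 4, adj))  # prints 10 1
```

The "+ 1" keeps every weight positive, so an edge that completes no triangle can still be sampled.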
• GPS accurately estimates various properties simultaneously
• Consistent performance across graphs from various domains
• A key advantage of GPS in-stream estimation is its smaller variance and tight confidence bounds
Results for triangle counts
Using massive real-world and synthetic graphs of up to 49B edges
GPS is shown to be accurate with <0.01 error
Sample size = 1M edges, in-stream estimation
95% confidence intervals
[Figure: soc-twitter-2010, estimated/actual ratio vs. sample size |K| (10^4 to 10^5), with the actual value and upper and lower confidence bounds; panels: global clustering coefficient, triangle count, wedge count]
Sample size = 40K edges
Accurate estimates for the large Twitter graph: ~265M edges and 17.2B triangles
95% confidence intervals
[Figure: soc-orkut, estimated/actual ratio vs. sample size |K| (10^4 to 10^6), with the actual value and upper and lower confidence bounds; panels: global clustering coefficient, triangle count, wedge count]
Sample size = 40K edges
Accurate estimates for the large social network Orkut: ~120M edges and 630M triangles
95% confidence intervals
[Figure: soc-orkut, triangles at time t and clustering coefficient at time t vs. stream size at time t (|K_t|, up to 12 x 10^7); curves: actual, estimate, upper bound, lower bound]
GPS in-stream estimates over time
Sample size = 80K edges
95% confidence intervals
[Figure: scatter plot with both axes ranging from 0.994 to 1.006; graphs shown: ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google]
GPS in-stream estimation, sample size 100K edges
GPS accurately estimates both triangle and wedge counts simultaneously with a single sample
We observe accurate results with no significant difference in error between the ordering schemes
§ We used three schemes for weighting edges during sampling
§ Goal: estimate triangle counts for the Friendster social network with sample size = 1M (0.1% of the graph)
1. triangle-based weights (3% relative error)
2. wedge-based weights (25% relative error)
3. uniform weights for all incoming edges (43% relative error)
• this is equivalent to simple random sampling
The estimator variance was 3.8x higher using wedge-based weights, and 6.2x higher using uniform weights, compared to triangle-based weights.
§ A sample is representative if graph properties of interest can be estimated with a known degree of accuracy
§ We proposed a generic framework, Graph Priority Sampling (GPS)
• GPS is an efficient single-pass streaming framework
• GPS selects a representative sample and computes unbiased estimates of counts of connected subsets of edges (e.g., triangles, wedges, ...)
• Theoretical properties of GPS are supported by empirical analysis
§ GPS admits generalizations by allowing the sampling process to depend on the stored state and/or auxiliary variables
§ GPS is a variance-minimizing sampling approach
§ GPS has a relative estimation error < 1%
Thank you!
Questions?

 

Kürzlich hochgeladen (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

On Sampling from Massive Graph Streams

  • 1. Nesreen K. Ahmed, Research Scientist, Intel Labs. Joint work with: Nick Duffield (Texas A&M University), Ted Willke (Intel Labs), Ryan Rossi (PARC). VLDB'17, Germany, August 31st, 2017.
  • 2. Social network; Human Disease Network [Barabasi 2007]; Food Web [2007]; Terrorist Network [Krebs 2002]; Internet (AS) [2005]; Gene Regulatory Network [Decourty 2008]; Protein Interactions [breast cancer]; Political blogs; Power grid.
  • 3. Graph mining: Social Network, Internet (AS), Biological, Political Blogs.
  • 4. Studying and analyzing complex networks is a challenging and computationally intensive task.
      - Today's networks are dynamic/streaming over time (e.g., Twitter streams, email communications)
      - Today's networks are massive in size (e.g., online social networks have billions of users)
  • 5. Due to these challenges, we usually need to sample. Statistical sampling takes a graph G to a sample S, e.g., by uniform random sampling.
  • 6. Given a large graph G represented as a stream of edges $e_1, e_2, e_3, \ldots$, we show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties.
  • 8. Graph properties of interest: number of triangles, number of wedges, frequent connected subsets of edges.
  • 9. Graph properties of interest: frequent connected subsets of edges, transitivity, number of triangles, number of wedges.
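To make these target quantities concrete, here is a minimal sketch that computes exact wedge and triangle counts and the transitivity (3 × triangles / wedges) for a small in-memory graph. The function name `exact_counts` and the edge-list input are illustrative assumptions, not part of the slides; a sampling framework would estimate these values rather than compute them exactly.

```python
from collections import defaultdict


def exact_counts(edges):
    """Return (wedges, triangles, transitivity) for a simple undirected graph.

    `edges` lists each undirected edge exactly once, with no self-loops.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # A wedge is a pair of edges sharing an endpoint: C(deg, 2) per vertex.
    wedges = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Every triangle is seen once from each of its three edges.
    triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
    transitivity = 3 * triangles / wedges if wedges else 0.0
    return wedges, triangles, transitivity
```

This exact computation needs the whole graph in memory, which is the cost the streaming approach is meant to avoid.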
  • 10. Random sampling:
      - Uniform random sampling [Tsourakakis et al. KDD'09]: graph sparsification with probability p; the chance of sampling a subgraph (e.g., a triangle) is very low; estimates suffer from high variance
      - Wedge sampling [Seshadhri et al. SDM'13]: sample vertices, then sample pairs of incident edges (wedges); output the estimate of the closed wedges (triangles)
  • 11. Random sampling methods (uniform random sampling, wedge sampling) assume we have access to the full graph, so they are not a good fit for massive streaming graphs.
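The uniform sparsification idea on slide 10 can be sketched as follows: keep each edge independently with probability p, count triangles in what remains, and scale by 1/p³, since a triangle survives only if all three of its edges are kept. This is a simplified illustration under our own assumptions (function names, in-memory edge list), not the cited authors' implementation.

```python
import random
from collections import defaultdict


def count_triangles(edges):
    """Exact triangle count; each undirected edge is listed once."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Each triangle is counted once per edge, so divide by 3.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def sparsified_triangle_estimate(edges, p, rng=None):
    """Keep each edge with probability p, then rescale the count by 1/p**3."""
    rng = rng or random.Random()
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```

With small p, most triangles lose at least one edge, which is why the slide notes that such estimates suffer from high variance.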
  • 12. Prior streaming work either assumes a specific order of the stream [Buriol et al. 2006] (incidence stream model: vertex neighbors arrive together in the stream), uses multiple passes over the stream [Becchetti et al. KDD'08], or uses single-pass algorithms for arbitrary-ordered graph streams.
  • 13. Single-pass algorithms for arbitrary-ordered graph streams:
      - Streaming-Triangles [Jha et al. KDD'13]: sample edges using reservoir sampling, then sample pairs of incident edges (wedges), and finally scan for closed wedges (triangles)
      - Neighborhood Sampling [Pavan et al. VLDB'13]: sample vectors of wedge estimators, then scan the stream for closed wedges (triangles)
      - TRIEST [De Stefani et al. KDD'16]: uses standard reservoir sampling to maintain the edge sample
      - MASCOT [Lim et al. KDD'15]: independent edge sampling with probability p
      - Graph Sample & Hold [Ahmed et al. KDD'14]: conditionally independent edge sampling
  • 14. Summary of previous work: sampling designs for specific graph properties (triangles), not generally applicable to other properties; uniform-based sampling; variable-size samples. We propose a generic unbiased sampling framework, Graph Priority Sampling:
      - Weight-sensitive
      - Fixed-size sample
      - Single-pass
      - Applicable to general graph properties
      - Uses topological information that we wish to estimate as auxiliary variables
      - Variance-optimal sampling (cost optimization approach)
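Several of the methods on slide 13 build on standard reservoir sampling. As a reminder of what that primitive does, here is a minimal sketch of the classic single-pass reservoir algorithm, which keeps a uniform random sample of fixed size from a stream of unknown length. The function name and parameters are our own; the cited systems add their own estimation logic on top.

```python
import random


def reservoir_sample(stream, size, seed=None):
    """Return a uniform random sample of up to `size` items from `stream`."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= size:
            # Fill the reservoir with the first `size` items.
            reservoir.append(item)
        else:
            # Item t replaces a random slot with probability size / t.
            j = rng.randrange(t)
            if j < size:
                reservoir[j] = item
    return reservoir
```

Every item has the same inclusion probability here, which is the "uniform-based" design that the weight-sensitive approach in the following slides moves away from.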
  • 15. Graph Priority Sampling framework GPS(m). Input: edge stream $k_1, k_2, \ldots, k, \ldots$ Output: sampled edge stream $\hat{K}$. Stored state: $m = O(|\hat{K}|)$.
  • 16. GPS(m), for each edge $k$:
      - Generate a random number $u(k) \sim \mathrm{Uni}(0, 1]$
      - Compute the edge weight $w(k) = W(k, \hat{K})$
      - Compute the edge priority $r(k) = w(k)/u(k)$
      - Add the edge to the sample: $\hat{K} = \hat{K} \cup \{k\}$
  • 17. GPS(m), for each edge $k$:
      - Find the edge with the lowest priority: $k^* = \arg\min_{k' \in \hat{K}} r(k')$
      - Update the sample threshold: $z^* = \max\{z^*, r(k^*)\}$
      - Remove the lowest-priority edge: $\hat{K} = \hat{K} \setminus \{k^*\}$
      - Use a priority queue with $O(\log m)$ updates
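The per-edge steps on slides 16 and 17 can be sketched with Python's `heapq` as the priority queue. This is an illustrative sketch under our own assumptions: the class and method names are hypothetical, the stream is assumed to contain distinct edges, the weight function defaults to a constant, and eviction is assumed to happen only once the sample holds more than m edges.

```python
import heapq
import random


class GPSSampler:
    """Illustrative sketch of GPS(m); names and defaults are assumptions."""

    def __init__(self, m, weight_fn=None, seed=None):
        self.m = m
        # weight_fn(edge, sampled_edges) -> positive weight W(k, K-hat)
        self.weight_fn = weight_fn or (lambda edge, sample: 1.0)
        self.rng = random.Random(seed)
        self.heap = []      # min-heap of (priority, edge)
        self.weights = {}   # sampled edge -> weight; keys are the sample
        self.z_star = 0.0   # sample threshold z*

    def update(self, edge):
        u = 1.0 - self.rng.random()            # uniform on (0, 1]
        w = self.weight_fn(edge, self.weights)
        r = w / u                              # priority r(k) = w(k) / u(k)
        self.weights[edge] = w
        heapq.heappush(self.heap, (r, edge))
        if len(self.heap) > self.m:
            # Evict the lowest-priority edge and raise the threshold.
            r_min, evicted = heapq.heappop(self.heap)
            self.z_star = max(self.z_star, r_min)
            del self.weights[evicted]

    def inclusion_prob(self, edge):
        """min{1, w / z*} for a sampled edge (1 while nothing was evicted)."""
        if self.z_star == 0:
            return 1.0
        return min(1.0, self.weights[edge] / self.z_star)
```

A caller would feed edges one at a time with `update` and read the sample from `weights`; the stored state stays at m edges plus the single threshold value.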
  • 18. We use edge weights $w(k) = W(k, \hat{K})$ to express the role of the arriving edge in the sampled graph, e.g., the number of subgraphs completed by the arriving edge and/or other auxiliary variables. Computational feasibility:
      - Efficient implementation by using a priority queue
      - Implemented as a min-heap with $O(\log m)$ insertion/deletion
      - $O(1)$ access to the edge with minimum priority
  • 19. Edge estimation: for each edge $i$, we construct a sequence of edge estimators $\hat{S}_{i,t} = I(i \in \hat{K}_t)/\min\{1, w_i/z^*\}$, where $\hat{K}_t$ is the sample at time $t$. The $\hat{S}_{i,t}$ are unbiased estimators of the corresponding edge indicators: $E[\hat{S}_{i,t}] = S_{i,t}$. We achieve unbiasedness by establishing that the sequence is a martingale (Theorem 1).
  • 20. Subgraph estimation: for each subgraph $J \subset [t]$, we define the sequence of subgraph estimators as $\hat{S}_{J,t} = \prod_{i \in J} \hat{S}_{i,t}$, with $E[\hat{S}_{J,t}] = S_{J,t}$. We prove the sequence is a martingale (Theorem 2).
  • 21. Subgraph counting: for any set $\mathcal{J}$ of subgraphs of $G$, $\hat{N}_t(\mathcal{J}) = \sum_{J \in \mathcal{J} : J \subset K_t} \hat{S}_{J,t}$ is an unbiased estimator of $N_t(\mathcal{J}) = |\mathcal{J}_t|$ (Theorem 2).
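As a worked illustration of the estimators on slides 19 to 21, the sketch below computes a post-stream triangle count estimate from a stored sample: each triangle whose three edges are all in the sample contributes the product of the inverse inclusion probabilities $1/\min\{1, w_i/z^*\}$ of its edges. The function names and the dictionary layout (edge tuple `(u, v)` with `u < v` mapped to its weight) are assumptions made for this example.

```python
from collections import defaultdict


def inclusion_prob(weight, z_star):
    """min{1, w / z*}; taken as 1 while the threshold is still zero."""
    return 1.0 if z_star == 0 else min(1.0, weight / z_star)


def estimate_triangles(sample_weights, z_star):
    """Sum of product estimators over triangles fully inside the sample.

    `sample_weights` maps each sampled edge (u, v), u < v, to its weight.
    """
    adj = defaultdict(set)
    for u, v in sample_weights:
        adj[u].add(v)
        adj[v].add(u)
    total = 0.0
    for u, v in sample_weights:
        for w in adj[u] & adj[v]:
            if w > v:  # visit each triangle u < v < w exactly once
                q = 1.0
                for edge in ((u, v), (u, w), (v, w)):
                    q *= inclusion_prob(sample_weights[edge], z_star)
                total += 1.0 / q
    return total
```

For example, a single sampled triangle whose edges each have weight 1 under threshold $z^* = 2$ has inclusion probability 0.5 per edge and contributes $1/0.5^3 = 8$ to the estimate.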
  • 22. We provide a cost minimization approach, inspired by IPPS sampling [Duffield et al. 2005]. Question: how should the ranks $r_{i,t}$ be distributed in order to minimize the variance of the unbiased estimator of $N_t(\mathcal{J})$? The approach minimizes the conditional variance of the increment in $N_t(\mathcal{J})$ incurred by the arriving edge.
  • 23. Two estimation modes:
      - Post-stream estimation: enables retrospective subgraph queries; after any number $t$ of edge arrivals have taken place, we can compute an unbiased estimator for any subgraph
      - In-stream estimation: we can take "snapshots" of estimates of specific sampled subgraphs at arbitrary times during the stream, and the estimates are still unbiased; lightweight online/incremental update of unbiased estimates of subgraph counts; same sampling procedure; uses a stopped martingale
  • 24. Graph Priority Sampling framework GPS(m). Input: edge stream $k_1, k_2, \ldots, k, \ldots$ Output: sampled edge stream $\hat{K}$. Stored state: $m = O(|\hat{K}|)$. For each edge $k$: compute the edge priority $r(k) = w(k)/u(k)$, update the sample, and update the unbiased estimates of subgraph counts.
  • 25. In-stream estimation: we define a snapshot as an edge subset $J$ with a family of stopping times $T = \{T_j : j \in J\}$. The snapshot estimator is $\hat{S}^T_{J,t} = \prod_{j \in J} \hat{S}^{T_j}_{j,t} = \prod_{j \in J} \hat{S}_{j,\min\{T_j,t\}}$, with $E[\hat{S}^T_{J,t}] = S_{J,t}$. We prove the sequence is a stopped martingale (Theorem 4).
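One way to read the snapshot idea for triangles is sketched below: when an edge arrives and closes a triangle with two edges already in the sample, the estimate is incremented immediately using those two edges' inverse inclusion probabilities under the current threshold, before the sample is updated. This is our interpretation, not the authors' code. The function name is hypothetical, the stream is assumed to contain distinct edges, and the weight $9 \cdot (\text{triangles completed}) + 1$ is the one quoted on slide 26.

```python
import heapq
import random
from collections import defaultdict


def gps_in_stream_triangles(stream, m, seed=None):
    """Illustrative in-stream triangle estimate with a sample of m edges."""
    rng = random.Random(seed)
    heap = []                # min-heap of (priority, edge)
    weights = {}             # sampled edge (u, v), u < v -> weight
    adj = defaultdict(set)   # adjacency of the current sample
    z_star = 0.0
    estimate = 0.0

    def q(a, b):
        # Inclusion probability of sampled edge {a, b} under the current z*.
        w = weights[(min(a, b), max(a, b))]
        return 1.0 if z_star == 0 else min(1.0, w / z_star)

    for u, v in stream:
        edge = (min(u, v), max(u, v))
        common = adj[u] & adj[v]
        # Snapshot: count each triangle closed by this arrival right away.
        for c in common:
            estimate += 1.0 / (q(u, c) * q(v, c))
        # Sampling step: weight, priority, insert, evict if over capacity.
        w = 9 * len(common) + 1
        r = w / (1.0 - rng.random())   # u(k) is uniform on (0, 1]
        weights[edge] = w
        adj[u].add(v)
        adj[v].add(u)
        heapq.heappush(heap, (r, edge))
        if len(heap) > m:
            r_min, (a, b) = heapq.heappop(heap)
            z_star = max(z_star, r_min)
            del weights[(a, b)]
            adj[a].discard(b)
            adj[b].discard(a)
    return estimate
```

When m is at least the stream length, nothing is evicted, every inclusion probability is 1, and the result equals the exact triangle count.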
  • 26. We use GPS for the estimation of triangle counts, wedge counts, the global clustering coefficient, and their unbiased variance (Theorem 3 in the paper). Weight function: $W(k, \hat{K}) = 9 \cdot \hat{\triangle}(k) + 1$, where $\hat{\triangle}(k)$ is the number of triangles completed by edge $k$ whose other edges are in $\hat{K}$. We used a large set of graphs from a variety of domains (social, web, tech, etc.), with up to 49B edges; the data is available at http://networkrepository.com/
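The weight function above can be sketched directly: count the common neighbors of the arriving edge's endpoints within the sampled edges, then apply $9 \cdot \hat{\triangle}(k) + 1$. The function name and the plain iterable of sampled edges are assumptions for illustration; a real implementation would keep an adjacency index instead of scanning the sample on every arrival.

```python
def triangle_weight(edge, sampled_edges):
    """Return 9 * (triangles completed by `edge` within the sample) + 1."""
    u, v = edge
    # Neighbors of u and of v according to the sampled edges only.
    nbrs_u = {b if a == u else a for a, b in sampled_edges if u in (a, b)}
    nbrs_v = {b if a == v else a for a, b in sampled_edges if v in (a, b)}
    # Each common neighbor closes one triangle with the arriving edge.
    return 9 * len(nbrs_u & nbrs_v) + 1
```

An edge that closes no sampled triangle still gets weight 1, so every arriving edge keeps a nonzero chance of being sampled.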
  • 27. GPS accurately estimates various properties simultaneously, with consistent performance across graphs from various domains. A key advantage of GPS in-stream estimation is that it has smaller variance and tight confidence bounds.
  • 28. Results for triangle counts, using massive real-world and synthetic graphs of up to 49B edges: GPS is shown to be accurate with <0.01 error (sample size = 1M edges, in-stream estimation, 95% confidence intervals).
  • 29. [Figure: estimated/actual ratio vs. sample size |K| for soc-twitter-2010, with actual value and upper and lower confidence bounds; panels: Global Clustering Coeff, Triangle Count, Wedge Count.] Sample size = 40K edges. Accurate estimates for the large Twitter graph (~265M edges and 17.2B triangles). 95% confidence intervals.
  • 30. [Figure: estimated/actual ratio vs. sample size |K| for soc-orkut, with actual value and upper and lower confidence bounds; panels: Global Clustering Coeff, Triangle Count, Wedge Count.] Sample size = 40K edges. Accurate estimates for the large social network Orkut (~120M edges and 630M triangles). 95% confidence intervals.
  • 31. [Figure: triangles at time t and clustering coefficient at time t vs. stream size at time t (|K_t|) for soc-orkut; curves: actual, estimate, upper bound, lower bound.] GPS in-stream estimates over time. Sample size = 80K edges. 95% confidence intervals.
  • 32. [Figure: estimation results, with axes ranging from 0.994 to 1.006, for ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google.] GPS in-stream estimation, sample size 100K edges. GPS accurately estimates both triangle and wedge counts simultaneously with a single sample.
  • 33.
  • 34. We observe accurate results with no significant difference in error between the ordering schemes.
  • 35. We used three schemes for weighting edges during sampling. Goal: estimate triangle counts for the Friendster social network with sample size = 1M (0.1% of the graph).
      1. Triangle-based weights (3% relative error)
      2. Wedge-based weights (25% relative error)
      3. Uniform weights for all incoming edges (43% relative error); this is equivalent to simple random sampling
    The estimator variance was 3.8x higher using wedge-based weights and 6.2x higher using uniform weights, compared to triangle-based weights.
  • 36. § A  sample  is  representative if  graph  properties  of  interest  can  be   estimated  with  a  known  degree  of  accuracy   § We  proposed  a  generic  framework  Graph  Priority  Sampling  (GPS) -­‐ GPS  is  an  efficient single-­‐pass  streaming  framework -­‐ GPS  selects  a  representative sample  and  computes  unbiased estimates  of   counts  of  connected  subsets  of  edges  (e.g.,  triangles,  wedges  …)     -­‐ Theoretical  properties  of  GPS  are  supported  by  empirical  analysis       § GPS  admits  generalizations  by  allowing  the  dependence of  the   sampling  process  as  a  function  of  the  stored  state  and/or  auxiliary   variables § GPS  is  variance  minimizing  sampling  approach   § GPS  has  a  relative  estimation  error  <  1%