Overlapping Clusters for Distributed Computation
David F. Gleich, Purdue University, Computer Science Department
Reid Andersen, Microsoft Corp.
Vahab Mirrokni, Google Research, NYC




                                 David Gleich · Purdue
   WSDM2012
Problem 
Find a good way to distribute a big graph 
    for solving things like linear systems and simulating random walks

Contributions
Theoretical demonstration that overlap helps
Proof of concept procedure to find overlapping
partitions to reduce communication (~20%)

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping





The problem
     WHAT OUR NETWORKS LOOK LIKE
     WHAT OUR OTHER NETWORKS LOOK LIKE




The problem
     COMBINING NETWORKS AND GRAPHS IS A MESS




“Good” data distributions are a fundamental problem in distributed computation.

How to divide the communication graph
Balance work
Balance communication
Balance data
Balance programming complexity too




Current solutions
                             Work        Comm.               Data          Programming
Disjoint vertex partitions   Excellent   Okay to Good        Excellent     “Think like a vertex”
2d or Edge Partitions        Excellent   Excellent           Good          “Impossible”

Where we fit
Overlapping partitions       Okay        Good to Excellent   “Let’s see”   “Think like a cached vertex”




Goals
Find a set of overlapping clusters where

random walks stay in a cluster for a long time

solving diffusion-like problems requires little communication
 (think PageRank, Katz, hitting times, semi-supervised learning)




Related work
Domain decomposition, Schwarz methods
 How to solve a linear system with overlap. Szyld et al.
Communication avoiding algorithms
 k-step matrix-vector products (Demmel et al.) and
 growing overlap around partitions (Fritzsche, Frommer, Szyld)
Overlapping communities and link partitioning
algorithms for social network analysis
 Link communities (Ahn et al.); surveys by Fortunato and Schaeffer
P2P based PageRank algorithms
 Parreira, Castillo, Donato et al. 




Overlapping clusters
Each vertex
   is in at least one cluster
   has one home cluster

Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where
   $\mathcal{C} = \{\ldots\}$ = set of clusters
   $\tau : V \mapsto \mathcal{C}$ = map to homes
   $\tau$ is a partition
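
As an illustration of the definition, here is a minimal sketch of how an overlapping cover could be stored and checked. The class name `OverlappingCover`, its fields, and the six-vertex example are assumptions made for this sketch, not names taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, List

Vertex = Hashable


@dataclass(frozen=True)
class OverlappingCover:
    """Hypothetical container: clusters (may overlap) plus a home index per vertex."""
    clusters: List[FrozenSet[Vertex]]
    home: Dict[Vertex, int]  # tau: vertex -> index of its home cluster

    def validate(self, vertices) -> None:
        """Check that every vertex has a home cluster and belongs to it."""
        for v in vertices:
            if v not in self.home:
                raise ValueError(f"vertex {v!r} has no home cluster")
            if v not in self.clusters[self.home[v]]:
                raise ValueError(f"vertex {v!r} is not in its home cluster")

    def home_sets(self) -> List[FrozenSet[Vertex]]:
        """Vertices grouped by home cluster; these groups partition V."""
        groups = [set() for _ in self.clusters]
        for v, i in self.home.items():
            groups[i].add(v)
        return [frozenset(g) for g in groups]


# Two clusters that share vertices 2 and 3; each vertex has exactly one home.
cover = OverlappingCover(
    clusters=[frozenset({0, 1, 2, 3}), frozenset({2, 3, 4, 5})],
    home={0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
)
cover.validate(range(6))
print(cover.home_sets())
```

Keeping the home map separate from the cluster membership mirrors the slide: clusters may overlap, while the homes alone form a disjoint partition of the vertices.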




Random walks in overlapping clusters
Each vertex
   is in at least one cluster
   has one home cluster

Random walks go to the home cluster after leaving.

Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster.
                                      
                                      




An evaluation metric: swapping probability
Is $(\mathcal{C}, \tau)$ a good overlapping cover?
Does a random walk swap clusters often?

$\rho_\infty$ = probability that a walk changes clusters on each step
   (computable expression in the paper)

Figure labels: red cluster keeps the walk; red cluster sends the walk to gray cluster.
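
The swapping probability can also be estimated by simulation. The sketch below is a Monte Carlo stand-in, not the closed-form expression from the paper. It assumes this reading of the slides: the walk stays with its current cluster while it lands on vertices of that cluster, and otherwise moves to the home cluster of the vertex it lands on. The function names and the 12-vertex cycle are illustrative choices.

```python
import random


def cycle_graph(n):
    """Adjacency lists of an n-cycle."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


def estimate_swapping_probability(adj, clusters, home, steps=200_000, seed=0):
    """Fraction of random-walk steps on which the walk changes clusters."""
    rng = random.Random(seed)
    v = next(iter(adj))          # arbitrary start vertex
    current = home[v]            # the walk starts in that vertex's home cluster
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])
        if v not in clusters[current]:
            current = home[v]    # left the cluster: go to the new vertex's home
            swaps += 1
    return swaps / steps


n, block = 12, 4
adj = cycle_graph(n)
home = {v: v // block for v in range(n)}

# Disjoint partition: each cluster is exactly its block of home vertices.
partition = [{v for v in range(n) if home[v] == i} for i in range(n // block)]

# Overlapping cover: grow every block by two vertices on each side.
overlap = [{(v + d) % n for v in part for d in range(-2, 3)} for part in partition]

rho_partition = estimate_swapping_probability(adj, partition, home)
rho_overlap = estimate_swapping_probability(adj, overlap, home)
print(rho_partition, rho_overlap)
```

On this small cycle the partition estimate should land near 1/4 (one over the block length), and the overlapping cover should swap noticeably less often, at the price of storing each block's neighbors twice.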




Overlapping clusters
Each vertex
   is in at least one cluster
   has one home cluster

Vol(C) = sum of degrees of vertices in cluster C
MaxVol = upper bound on Vol(C)
TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters C in $\mathcal{C}$
VolRatio = TotalVol($\mathcal{C}$) / Vol(G) = how much extra data
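
These volume quantities are straightforward to compute from adjacency lists. The following sketch uses a made-up graph (two triangles joined by one edge) and hypothetical function names; `largest_vol` is the smallest value MaxVol could take for the given clusters.

```python
def vol(adj, cluster):
    """Vol(C): sum of the degrees of the vertices in cluster C."""
    return sum(len(adj[v]) for v in cluster)


def volume_stats(adj, clusters):
    """Largest cluster volume, TotalVol, and VolRatio for a set of clusters."""
    graph_vol = sum(len(nbrs) for nbrs in adj.values())    # Vol(G)
    total_vol = sum(vol(adj, c) for c in clusters)         # TotalVol
    return {
        "largest_vol": max(vol(adj, c) for c in clusters),  # MaxVol must be >= this
        "total_vol": total_vol,
        "vol_ratio": total_vol / graph_vol,                 # VolRatio
    }


# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

disjoint_stats = volume_stats(adj, [{0, 1, 2}, {3, 4, 5}])
overlap_stats = volume_stats(adj, [{0, 1, 2, 3}, {2, 3, 4, 5}])
print(disjoint_stats)
print(overlap_stats)
```

A disjoint partition always has a volume ratio of exactly 1; any overlap pushes it above 1, which is the extra data that has to be stored.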




Swapping probability & partitioning
(No overlap in this figure.)

$\mathcal{P}$ is a partition

$$\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$$

Much like a classical graph partitioning metric
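
For a disjoint partition, the formula on this slide (sum of Cut(P) over the parts, divided by Vol(G)) can be evaluated directly. A minimal sketch, assuming an undirected graph stored as adjacency lists; the function names and both test graphs are illustrative.

```python
def cut(adj, part):
    """Cut(P): number of edges with exactly one endpoint in P."""
    part = set(part)
    return sum(1 for u in part for w in adj[u] if w not in part)


def partition_swapping_probability(adj, parts):
    """Sum of Cut(P) over all parts, divided by Vol(G)."""
    graph_vol = sum(len(nbrs) for nbrs in adj.values())
    return sum(cut(adj, p) for p in parts) / graph_vol


# Two triangles joined by the edge 2-3: one cut edge, counted once per side.
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
rho_triangles = partition_swapping_probability(triangles, [{0, 1, 2}, {3, 4, 5}])

# 12-cycle split into three contiguous blocks of four vertices.
cycle = {v: [(v - 1) % 12, (v + 1) % 12] for v in range(12)}
blocks = [set(range(start, start + 4)) for start in (0, 4, 8)]
rho_cycle = partition_swapping_probability(cycle, blocks)

print(rho_triangles, rho_cycle)
```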




Overlapping clusters vs. partitioning in theory
Take a cycle graph
   M groups of $\ell$ vertices
   MaxVol = $2\ell$

for partitioning
   $\rho_\infty = \frac{1}{\ell}$ (Optimal!)
for overlapping
   $\rho_\infty = \frac{4}{\Omega(\ell^2)}$
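
The $1/\ell$ value for the partition case can be checked from the partition formula for the swapping probability, under the assumption that the $M$ groups are contiguous blocks of the cycle (with $M \ge 2$), so that exactly two edges leave each block:

```latex
\begin{align*}
\mathrm{Vol}(G) &= 2M\ell, \qquad \mathrm{Cut}(P) = 2 \text{ for every block } P, \\
\rho_\infty(\mathcal{P}) &= \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
  = \frac{2M}{2M\ell} = \frac{1}{\ell}.
\end{align*}
```

For the overlapping case, one rough intuition (not the authors' proof): a cluster of volume $2\ell$ holds $\ell$ consecutive vertices, and a walk that starts about $\ell/2$ steps from both ends needs on the order of $(\ell/2)^2 = \ell^2/4$ steps to escape, which is consistent with a swapping rate of order $4/\ell^2$.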




Heuristics for finding good overlapping clusters
(NP-hard for optimal solution)

      Our multi-stage heuristic
      1.  Find a large set of good clusters
          Use personalized PageRank clusters, or metis
          (each cluster takes “< MaxVol” work)
      2.  Find “well contained” nodes (cores)
          Compute expected “leave time”
          (takes O(Vol) work per cluster)
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
          (fast enough)
      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
          (fast enough)
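
Step 3 of the heuristic approximately solves a minimum set-cover problem. The slides do not say which approximation is used; the sketch below assumes the standard greedy rule (repeatedly pick the candidate that covers the most uncovered vertices), and the function name and toy core sets are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Return indices of candidate sets chosen greedily until all of universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered elements
        # (ties go to the lowest index).
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(uncovered & candidate_sets[i]))
        gain = uncovered & candidate_sets[best]
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy example: the core vertices of five candidate clusters on eight vertices.
cores = [{0, 1, 2, 3}, {2, 3, 4}, {4, 5, 6, 7}, {0, 7}, {1, 5}]
chosen = greedy_set_cover(range(8), cores)
print(chosen)  # → [0, 2]
```

The greedy rule is a common choice because it is simple and carries a logarithmic approximation guarantee, which fits the “fast enough” remark for this step.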

      




Demo!




Solving linear systems
 Like PageRank, Katz, and semi-supervised learning




All nodes solve locally using the coordinate descent method.




All nodes solve locally using the coordinate descent method.




A core vertex for the
gray cluster.




All nodes solve locally using the coordinate descent method.

Red sends residuals to white.
White sends residuals to red.




White then uses the coordinate descent method to adjust its solution.
This will cause communication to red/blue.




That algorithm is called restricted additive Schwarz.

  PageRank (we look at PageRank!)
  Katz scores
  semi-supervised learning
  any spd or M-matrix linear system
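
To make the method concrete, here is a small synchronous sketch of restricted additive Schwarz applied to the PageRank linear system $(I - \alpha P^T) x = (1 - \alpha)/n$. It is an illustration under stated assumptions, not the authors' distributed implementation: every cluster solves its local block with Gauss-Seidel sweeps (a coordinate descent method), only home vertices keep their corrections, and all function names, the 12-vertex cycle, and the parameter values are chosen for this example.

```python
def pagerank_system(adj, alpha):
    """Sparse rows of A = I - alpha * P^T and right-hand side b = (1 - alpha) / n."""
    n = len(adj)
    A = {v: {v: 1.0} for v in adj}
    for u, nbrs in adj.items():
        for w in nbrs:
            # Mass at u moves to each neighbor w with probability 1 / deg(u).
            A[w][u] = A[w].get(u, 0.0) - alpha / len(nbrs)
    b = {v: (1.0 - alpha) / n for v in adj}
    return A, b


def residual(A, b, x):
    """r = b - A x, one entry per vertex."""
    return {v: b[v] - sum(a * x[u] for u, a in A[v].items()) for v in A}


def local_solve(A, r, cluster, sweeps=100):
    """Gauss-Seidel sweeps on the cluster's diagonal block: A_CC y = r_C."""
    y = {v: 0.0 for v in cluster}
    for _ in range(sweeps):
        for v in cluster:
            off_diag = sum(a * y[u] for u, a in A[v].items() if u != v and u in y)
            y[v] = (r[v] - off_diag) / A[v][v]
    return y


def restricted_additive_schwarz(adj, clusters, home, alpha=0.85,
                                tol=1e-10, max_iters=500):
    """Iterate until the largest residual entry drops below tol."""
    A, b = pagerank_system(adj, alpha)
    x = {v: 0.0 for v in adj}
    for iteration in range(max_iters):
        r = residual(A, b, x)
        if max(abs(val) for val in r.values()) < tol:
            return x, iteration
        for i, cluster in enumerate(clusters):
            y = local_solve(A, r, cluster)      # solve on the whole (overlapping) cluster
            for v in cluster:
                if home[v] == i:                # "restricted": keep home-vertex updates only
                    x[v] += y[v]
    return x, max_iters


# 12-cycle, three home blocks of four vertices, each grown by two vertices per side.
n, block = 12, 4
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
home = {v: v // block for v in range(n)}
clusters = [sorted({(v + d) % n for v in range(n) if home[v] == i for d in range(-2, 3)})
            for i in range(n // block)]

x, iterations = restricted_additive_schwarz(adj, clusters, home)
print(iterations, x[0])
```

Because the cycle is regular, its PageRank vector is uniform, so every entry of `x` should be close to 1/12; that gives a simple sanity check for the sketch.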




It works!
[Figure: relative work and communication (vertical axis, 0 to 2) against volume ratio (horizontal axis, 1 to 1.7), i.e. how much more of the graph we need to store. Series: swapping probability and PageRank communication for usroads and web-Google. The Metis partitioner is the partitioning baseline, marked at relative value 1.]




Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.

Graph         Vertices |V|   Edges |E|   Max deg   Density |E|/|V|
onera                85567      419201         5               4.9
usroads             126146      323900         7               2.6
annulus             500000     2999258        19               6.0

email-Enron          33696      361622      1383              10.7
soc-Slashdot         77360     1015667      2540              13.1
dico                111982     2750576     68191              24.6
lcsh                144791      394186      1025               2.7
web-Google          855802     8582704      6332              10.0
as-skitter         1694616    22188418     35455              13.1
cit-Patents        3764117    33023481       793               8.8

[Figure: three panels, each plotting conductance (0 to 1) on the vertical axis.]
The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The 0 communication result is not a bug.

Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
onera          18654                48                 0.003         2.82
usroads        3256                 0                  0.000         1.49
annulus        12074                2                  0.000         0.01
email-Enron    194536*              235316             1.210         1.7
soc-Slashdot   875435*              1.3 × 10^6         1.480         1.78
dico           1.5 × 10^6*          2.0 × 10^6         1.320         1.53
lcsh           73000*               48777              0.668         2.17
web-Google     201159*              167609             0.833         1.57
as-skitter     2.4 × 10^6           3.9 × 10^6         1.645         1.93
cit-Patents    8.7 × 10^6           7.3 × 10^6         0.845         1.34

* means Graclus gave a better partition than Metis

Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume ...
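
The performance ratios in the communication table above are the overlap communication divided by the partition communication. The short check below recomputes them from the table's own numbers; entries that the slide gives only in rounded scientific notation (such as 1.3 × 10^6) reproduce the printed ratios only approximately. The variable names are illustrative.

```python
# (comm. of partition, comm. of overlap), copied from the table above.
comm = {
    "onera": (18654, 48),
    "usroads": (3256, 0),
    "annulus": (12074, 2),
    "email-Enron": (194536, 235316),
    "soc-Slashdot": (875435, 1.3e6),
    "dico": (1.5e6, 2.0e6),
    "lcsh": (73000, 48777),
    "web-Google": (201159, 167609),
    "as-skitter": (2.4e6, 3.9e6),
    "cit-Patents": (8.7e6, 7.3e6),
}

# Perf. ratio = overlap communication / partition communication.
ratios = {graph: overlap / partition for graph, (partition, overlap) in comm.items()}

# Graphs where the overlapping cover communicates less than the partition.
winners = [graph for graph, ratio in ratios.items() if ratio < 1]

for graph, ratio in ratios.items():
    print(f"{graph:13s} {ratio:.3f}")
print(len(winners), "graphs with perf. ratio < 1")
```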
Summary
Overlap helps reduce communication in a distributed process!
Proof of concept procedure to find overlapping partitions to reduce communication

Future work
Truly distributed implementation and evaluation
Can we exploit data redundancy to solve problems on large graphs faster?
[Figure: two copies of the same edge list (src -> dst), labeled Copy 1 and Copy 2]

All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping




                                                                           27

                                    David Gleich · Purdue
   WSDM2012

Weitere ähnliche Inhalte

Was ist angesagt?

Query optimization
Query optimizationQuery optimization
Query optimizationdixitdavey
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query OptimizationAli Usman
 
Analysis and Design of Algorithms
Analysis and Design of AlgorithmsAnalysis and Design of Algorithms
Analysis and Design of AlgorithmsBulbul Agrawal
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMSkoolkampus
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
Hashing In Data Structure
Hashing In Data Structure Hashing In Data Structure
Hashing In Data Structure Meghaj Mallick
 
Graph Data Structure
Graph Data StructureGraph Data Structure
Graph Data StructureKeno benti
 
Binary search in data structure
Binary search in data structureBinary search in data structure
Binary search in data structureMeherul1234
 
Query optimization
Query optimizationQuery optimization
Query optimizationPooja Dixit
 
8 query processing and optimization
8 query processing and optimization8 query processing and optimization
8 query processing and optimizationKumar
 
Graph representation
Graph representationGraph representation
Graph representationTech_MX
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory SystemsAnkit Gupta
 

Was ist angesagt? (20)

Query optimization
Query optimizationQuery optimization
Query optimization
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
 
Spanning trees
Spanning treesSpanning trees
Spanning trees
 
Heaps & priority queues
Heaps & priority queuesHeaps & priority queues
Heaps & priority queues
 
Analysis and Design of Algorithms
Analysis and Design of AlgorithmsAnalysis and Design of Algorithms
Analysis and Design of Algorithms
 
Normalization
NormalizationNormalization
Normalization
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMS
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
Sorting network
Sorting networkSorting network
Sorting network
 
Hashing In Data Structure
Hashing In Data Structure Hashing In Data Structure
Hashing In Data Structure
 
Graph Data Structure
Graph Data StructureGraph Data Structure
Graph Data Structure
 
Binary search in data structure
Binary search in data structureBinary search in data structure
Binary search in data structure
 
Query optimization
Query optimizationQuery optimization
Query optimization
 
8 query processing and optimization
8 query processing and optimization8 query processing and optimization
8 query processing and optimization
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Graphs
GraphsGraphs
Graphs
 
Searching techniques
Searching techniquesSearching techniques
Searching techniques
 
Graph representation
Graph representationGraph representation
Graph representation
 
Parallel Database
Parallel DatabaseParallel Database
Parallel Database
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory Systems
 

Andere mochten auch

Graph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcGraph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcDavid Gleich
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresDavid Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph miningDavid Gleich
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsDavid Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structuresDavid Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...David Gleich
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignmentDavid Gleich
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveDavid Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...David Gleich
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 

Andere mochten auch (20)

Graph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcGraph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimc
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulants
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignment
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architectures
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignment
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspective
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architectures
 

Overlapping clusters for distributed computation

  • 1. Overlapping Clusters for Distributed Computation. David F. Gleich (Purdue University, Computer Science Department), Reid Andersen (Microsoft Corp.), Vahab Mirrokni (Google Research, NYC). David Gleich · Purdue, WSDM2012
  • 2. Problem: Find a good way to distribute a big graph for solving things like linear systems and simulating random walks. Contributions: Theoretical demonstration that overlap helps. Proof of concept procedure to find overlapping partitions to reduce communication (~20%). All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
  • 3. The problem: what our networks look like, and what our other networks look like.
  • 4. The problem: combining networks and graphs is a mess.
  • 5. “Good” data distributions are a fundamental problem in distributed computation. How to divide the communication graph: balance work, balance communication, balance data, and balance programming complexity too.
  • 6. Current solutions

                                    Work       Comm.              Data         Programming
      Disjoint vertex partitions    Excellent  Okay to Good       Excellent    “Think like a vertex”
      2d or edge partitions         Excellent  Excellent          Good         “Impossible”
      Overlapping partitions        Okay       Good to Excellent  “Let’s see”  “Think like a cached vertex”

      Where we fit: overlapping partitions.
  • 7. Goals: Find a set of overlapping clusters where random walks stay in a cluster for a long time and where solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning).
  • 8. Related work
      Domain decomposition, Schwarz methods: how to solve a linear system with overlap (Szyld et al.).
      Communication avoiding algorithms: k-step matrix-vector products (Demmel et al.) and growing overlap around partitions (Fritzsche, Frommer, Szyld).
      Overlapping communities and link partitioning algorithms for social network analysis: link communities (Ahn et al.); surveys by Fortunato and Satu.
      P2P based PageRank algorithms: Parreira, Castillo, Donato et al.
  • 9. Overlapping clusters: each vertex is in at least one cluster and has one home cluster. Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where $\mathcal{C}$ is the set of clusters and $\tau : V \mapsto \mathcal{C}$ maps each vertex to its home cluster. $\tau$ is a partition!
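
The cover definition on slide 9 can be sketched as a small data structure. This is only an illustration, not the authors' code: the names `is_valid_cover`, `clusters`, and `home` are hypothetical, and the check that a vertex's home cluster actually contains it is an assumption that the slide implies but does not state.

```python
from typing import Dict, List, Set


def is_valid_cover(vertices: Set[int],
                   clusters: List[Set[int]],
                   home: Dict[int, int]) -> bool:
    """Check the two properties of an overlapping cover (C, tau).

    clusters plays the role of C; home plays the role of tau and maps
    each vertex to the index of its home cluster.
    """
    for v in vertices:
        # Every vertex must lie in at least one cluster.
        if not any(v in c for c in clusters):
            return False
        # Every vertex must have exactly one home cluster (a valid index).
        if v not in home or not (0 <= home[v] < len(clusters)):
            return False
        # Assumption: the home cluster contains the vertex.
        if v not in clusters[home[v]]:
            return False
    return True


# Hypothetical example: a path 0-1-2-3 covered by two clusters
# that share vertices 1 and 2.
vertices = {0, 1, 2, 3}
clusters = [{0, 1, 2}, {1, 2, 3}]
home = {0: 0, 1: 0, 2: 1, 3: 1}
cover_ok = is_valid_cover(vertices, clusters, home)
```

Because `home` assigns exactly one cluster index to each vertex, grouping vertices by `home[v]` yields a partition of the vertex set, which is the sense in which the slide calls $\tau$ a partition.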
  • 10. Random walks in overlapping clusters: each vertex is in at least one cluster and has one home cluster. Random walks go to the home cluster after leaving. [Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.]
  • 11. An evaluation metric: swapping probability. Is $(\mathcal{C}, \tau)$ a good overlapping cover? Does a random walk swap clusters often? $\rho_\infty$ = probability that a walk changes clusters on each step (computable expression in the paper). [Figure labels: the red cluster keeps the walk; the red cluster sends the walk to the gray cluster.]
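
The walk rule from slides 10 and 11 can be simulated directly to get an empirical estimate of how often a walk swaps clusters. This is a minimal Monte Carlo sketch, not the computable expression from the paper. The function name `estimate_swap_rate`, the 24-vertex cycle, and both covers are hypothetical choices made for illustration.

```python
import random


def estimate_swap_rate(adj, clusters, home, steps=20000, seed=0):
    """Fraction of steps on which a simulated walk changes clusters."""
    rng = random.Random(seed)
    v = next(iter(adj))          # arbitrary start vertex
    current = home[v]            # the walk starts in that vertex's home cluster
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])   # uniform step to a neighbor
        if v not in clusters[current]:
            current = home[v]    # left the cluster: go to the new vertex's home
            swaps += 1
    return swaps / steps


# Hypothetical test graph: a cycle on 24 vertices.
n = 24
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

# Partition: 2 disjoint arcs of 12 vertices each.
partition = [set(range(0, 12)), set(range(12, 24))]
partition_home = {v: v // 12 for v in range(n)}

# Overlapping cover: 4 arcs of 12 vertices each; every vertex is homed
# in the arc whose middle 6 vertices contain it.
overlap = [{u % n for u in range(6 * g - 3, 6 * g + 9)} for g in range(4)]
overlap_home = {v: v // 6 for v in range(n)}

partition_rate = estimate_swap_rate(adj, partition, partition_home)
overlap_rate = estimate_swap_rate(adj, overlap, overlap_home)
```

Both covers use clusters of 12 vertices, so they respect the same volume bound. The overlapping cover stores each vertex twice and, in exchange, the simulated walk swaps clusters less often.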
  • 12. Overlapping clusters: each vertex is in at least one cluster and has one home cluster. Vol(C) = sum of degrees of vertices in cluster C. MaxVol = upper bound on Vol(C). TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters. VolRatio = TotalVol($\mathcal{C}$) / Vol(G): how much extra data!
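
The volume quantities on slide 12 are simple sums over degrees. The sketch below computes them for a small hypothetical graph; the function names and the adjacency-list representation are assumptions made for illustration.

```python
def volume(adj, cluster):
    """Vol(C): sum of the degrees of the vertices in the cluster."""
    return sum(len(adj[v]) for v in cluster)


def total_volume(adj, clusters):
    """TotalVol: sum of Vol(C) over all clusters in the cover."""
    return sum(volume(adj, c) for c in clusters)


def volume_ratio(adj, clusters):
    """VolRatio: TotalVol divided by Vol(G), the sum of all degrees."""
    return total_volume(adj, clusters) / volume(adj, adj.keys())


# Hypothetical example: the path 0-1-2-3 with two overlapping clusters.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clusters = [{0, 1, 2}, {1, 2, 3}]

total_vol = total_volume(adj, clusters)   # 5 + 5
ratio = volume_ratio(adj, clusters)       # 10 / 6
```

A ratio of 1 corresponds to a partition with no duplicated data; anything above 1 measures how much of the graph is stored more than once.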
  • 13. Swapping probability & partitioning. No overlap in this figure! If $\mathcal{P}$ is a partition, then $\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$. Much like a classical graph partitioning metric.
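
For a partition, the formula on slide 13 reduces to counting cut edges. The following sketch evaluates it on a hypothetical 12-vertex cycle split into three arcs; `cut` and `swap_probability_partition` are illustrative names, and Cut(P) is assumed to count the edges leaving P.

```python
def cut(adj, part):
    """Cut(P): number of edges with exactly one endpoint in the part."""
    return sum(1 for u in part for w in adj[u] if w not in part)


def swap_probability_partition(adj, parts):
    """Sum of Cut(P) over all parts, divided by Vol(G)."""
    vol_g = sum(len(neighbors) for neighbors in adj.values())
    return sum(cut(adj, p) for p in parts) / vol_g


# Hypothetical example: a 12-cycle split into 3 arcs of 4 vertices.
n = 12
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
parts = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]

rho = swap_probability_partition(adj, parts)   # (2 + 2 + 2) / 24
```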
  • 14. Overlapping clusters vs. partitioning in theory. Take a cycle graph with $M$ groups of $\ell$ vertices and MaxVol = $2\ell$. For partitioning, $\rho_\infty = \frac{1}{\ell}$ (optimal!). For overlapping, $\rho_\infty = \frac{4}{\Omega(\ell^2)}$.
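
The partitioning value on slide 14 follows from the partition formula on slide 13. As a worked check, assume the cycle has $M\ell$ vertices and each group is a contiguous arc of $\ell$ vertices (an assumption consistent with MaxVol = $2\ell$, since every vertex has degree 2):

```latex
\mathrm{Vol}(G) = 2M\ell, \qquad \mathrm{Cut}(P) = 2 \ \text{for each arc } P,
\qquad
\rho_\infty(\mathcal{P})
  = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
  = \frac{2M}{2M\ell}
  = \frac{1}{\ell}.
```

The overlapping bound on the slide decays like $1/\ell^2$ instead, which is the theoretical sense in which overlap helps.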
  • 15. Heuristics for finding good overlapping clusters. NP-hard for optimal solution. Our multi-stage heuristic:
      1. Find a large set of good clusters: use personalized PageRank clusters.
      2. Find “well contained” nodes (cores): compute expected “leave time”.
      3. Cover the graph with core vertices: approximately solve a min set-cover problem.
      4. Combine clusters up to MaxVol: the swapping probability is sub-modular.
  • 16. Heuristics for finding good overlapping clusters. NP-hard for optimal solution. Our multi-stage heuristic:
      1. Find a large set of good clusters: use personalized PageRank clusters, or metis. Each cluster takes “< MaxVol” work.
      2. Find “well contained” nodes (cores): compute expected “leave time”. Takes O(Vol) work per cluster.
      3. Cover the graph with core vertices: approximately solve a min set-cover problem. Fast enough.
      4. Combine clusters up to MaxVol: the swapping probability is sub-modular. Fast enough.
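
Step 3 of the heuristic asks for an approximate solution to a min set-cover problem. The slides do not say which approximation is used, so the sketch below shows the standard greedy rule as one plausible choice. The function name and the toy candidate sets are hypothetical.

```python
def greedy_set_cover(universe, candidates):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements. Returns the indices of the chosen sets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the candidate with the largest gain (first one on ties).
        best = max(range(len(candidates)),
                   key=lambda i: len(candidates[i] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            raise ValueError("candidates do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical example: each candidate stands for the core vertices
# of one cluster found in steps 1 and 2.
universe = range(6)
candidates = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
chosen = greedy_set_cover(universe, candidates)
```

In the setting of the slides, the universe would be the vertex set and each candidate the core of a cluster, so the chosen clusters cover every vertex with a well contained copy.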
  • 17. Demo!
  • 18. Solving linear systems, like PageRank, Katz, and semi-supervised learning.
  • 19. All nodes solve locally using the coordinate descent method.
  • 20. All nodes solve locally using the coordinate descent method. [Figure label: a core vertex for the gray cluster.]
  • 21. All nodes solve locally using the coordinate descent method. Red sends residuals to white. White sends residuals to red.
  • 22. White then uses the coordinate descent method to adjust its solution. This will cause communication to red/blue.
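
Slides 19 to 22 describe local solves by coordinate descent that pass residuals between clusters. A single-machine sketch of that kind of update for PageRank is shown below. It is a coordinate-relaxation ("push") variant on an undirected graph with uniform teleportation; whether it matches the authors' exact update is an assumption, and the function name, `alpha`, and `tol` are illustrative.

```python
from collections import deque


def pagerank_push(adj, alpha=0.85, tol=1e-10):
    """Solve (I - alpha*P) x = (1 - alpha)/n * 1 by relaxing one
    coordinate at a time and pushing its residual to its neighbors.
    Assumes every vertex has at least one neighbor."""
    n = len(adj)
    x = {v: 0.0 for v in adj}
    r = {v: (1.0 - alpha) / n for v in adj}   # residual of the zero solution
    queue = deque(adj)
    queued = set(adj)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        ru = r[u]
        if ru < tol:
            continue
        x[u] += ru                 # absorb the residual into the solution
        r[u] = 0.0
        share = alpha * ru / len(adj[u])
        for w in adj[u]:           # neighbors receive the pushed residual
            r[w] += share
            if r[w] >= tol and w not in queued:
                queue.append(w)
                queued.add(w)
    return x, r


# Hypothetical example: an 8-cycle, where the exact PageRank is uniform.
n = 8
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
x, r = pagerank_push(adj)
```

In a distributed setting, the residual entries `r[w]` for vertices owned by another cluster are what would be communicated, which is the exchange pictured on slide 21.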
  • 23. That algorithm is called restricted additive Schwarz. It applies to PageRank, Katz scores, semi-supervised learning, and any spd or M-matrix linear system. We look at PageRank!
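
A restricted additive Schwarz iteration for PageRank can be sketched as follows: every cluster solves its local system with the current values on its boundary, and only the entries homed in that cluster are written back. This is a sequential illustration under several assumptions: the graph is undirected, teleportation is uniform, and the local solves are done inexactly with a fixed number of Gauss-Seidel sweeps. The function name and the parameters `outer` and `inner` are hypothetical.

```python
def ras_pagerank(adj, clusters, home, alpha=0.85, outer=200, inner=50):
    """Restricted additive Schwarz sketch for
    x_u = (1 - alpha)/n + alpha * sum_{w ~ u} x_w / deg(w)."""
    n = len(adj)
    b = (1.0 - alpha) / n
    x = {v: 0.0 for v in adj}
    for _ in range(outer):
        new_x = dict(x)
        for c, members in enumerate(clusters):
            # Local copy of the cluster's entries; values outside the
            # cluster are read from the previous global iterate.
            local = {u: x[u] for u in members}
            for _ in range(inner):        # inexact local solve
                for u in members:
                    s = sum(local.get(w, x[w]) / len(adj[w]) for w in adj[u])
                    local[u] = b + alpha * s
            # Restricted update: keep only the entries homed here.
            for u in members:
                if home[u] == c:
                    new_x[u] = local[u]
        x = new_x
    return x


# Hypothetical example: an 8-cycle with two overlapping clusters.
n = 8
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
clusters = [{7, 0, 1, 2, 3, 4}, {3, 4, 5, 6, 7, 0}]
home = {v: 0 if v < 4 else 1 for v in range(n)}
x = ras_pagerank(adj, clusters, home)
```

Each cluster's local solve reads only the previous global iterate, so the loop over clusters could run in parallel, with one exchange of boundary values per outer iteration.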
  • 24. It works! [Figure: relative work versus volume ratio, showing swapping probability and PageRank communication for usroads and web-Google, with the Metis partitioner as the partitioning baseline.] Volume ratio: how much more of the graph we need to store.
  • 25. Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.

      Graph         Vertices |V|  Edges |E|  Max degree  Density |E|/|V|
      onera                85567     419201           5              4.9
      usroads             126146     323900           7              2.6
      annulus             500000    2999258          19              6.0

      email-Enron          33696     361622        1383             10.7
      soc-Slashdot         77360    1015667        2540             13.1
      dico                111982    2750576       68191             24.6
      lcsh                144791     394186        1025              2.7
      web-Google          855802    8582704        6332             10.0
      as-skitter         1694616   22188418       35455             13.1
      cit-Patents        3764117   33023481         793              8.8

      [Figure: three conductance plots; axis values not recoverable.]
  • 26. The communication ratio of our best result for the PageRank communication volume compared to METIS or GRACLUS shows that the method works for 6 of them (perf. ratio < 1). The communication result is not a bug.

      Graph         Comm. of Partition  Comm. of Overlap  Perf. Ratio  Vol. Ratio
      onera                      18654                48        0.003        2.82
      usroads                     3256                 0        0.000        1.49
      annulus                    12074                 2        0.000        0.01
      email-Enron              194536*            235316        1.210        1.7
      soc-Slashdot             875435*        1.3 × 10^6        1.480        1.78
      dico                 1.5 × 10^6*        2.0 × 10^6        1.320        1.53
      lcsh                      73000*             48777        0.668        2.17
      web-Google               201159*            167609        0.833        1.57
      as-skitter            2.4 × 10^6        3.9 × 10^6        1.645        1.93
      cit-Patents           8.7 × 10^6        7.3 × 10^6        0.845        1.34

      * means Graclus gave a better partition than Metis.

      [Clipped excerpt from the paper: Finally, we evaluate our heuristic. At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume ...]
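
The Perf. Ratio column on slide 26 appears to be the overlap communication divided by the partition communication. That reading is inferred from the numbers, not stated on the slide. The sketch below recomputes it for the rows whose counts are given as exact integers; rows reported only in rounded scientific notation are left out.

```python
# (Comm. of Partition, Comm. of Overlap) copied from the slide.
rows = {
    "onera": (18654, 48),
    "usroads": (3256, 0),
    "annulus": (12074, 2),
    "email-Enron": (194536, 235316),
    "lcsh": (73000, 48777),
    "web-Google": (201159, 167609),
}

# Assumed definition: overlap communication / partition communication.
perf_ratio = {g: overlap / part for g, (part, overlap) in rows.items()}

# Graphs where the overlapping cover communicates less than the partition.
wins = [g for g, ratio in perf_ratio.items() if ratio < 1]
```

Among these six rows, five have a ratio below 1; cit-Patents, whose counts are only given in rounded form, is the sixth graph the caption refers to.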
  • 27. Summary: Overlap helps reduce communication in a distributed process! Proof of concept procedure to find overlapping partitions to reduce communication. All code available: http://www.cs.purdue.edu/~dgleich/codes/overlapping
      Future work: Truly distributed implementation and evaluation. Can we exploit data redundancy to solve problems on large graphs faster? [Figure: two copies ("Copy 1", "Copy 2") of a src -> dst edge list.]