Counting Triangles &
The Curse of the Last Reducer
Siddharth Suri
Sergei Vassilvitskii
Yahoo! Research
Why Count Triangles?

Clustering Coefficient:
 Given an undirected graph G = (V, E),
 cc(v) = fraction of v's neighbors who are neighbors themselves
       = |{(u, w) ∈ E | u ∈ Γ(v) ∧ w ∈ Γ(v)}| / (d_v choose 2)
       = (#∆s incident on v) / (d_v choose 2)

 (Figure: four example nodes with cc values N/A, 1/3, 1 and 1.)
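To make the definition concrete, here is a minimal Python sketch (not from the slides) that computes cc(v) for every node; the dict-of-neighbor-sets graph representation and the function name are assumptions made for this example.

    from itertools import combinations

    def clustering_coefficients(adj):
        """adj: dict mapping each node to the set of its neighbors (undirected graph)."""
        cc = {}
        for v, nbrs in adj.items():
            d = len(nbrs)
            if d < 2:
                cc[v] = None  # cc(v) = N/A when v has fewer than 2 neighbors
                continue
            # Count neighbor pairs (u, w) that are themselves connected by an edge.
            closed = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
            cc[v] = closed / (d * (d - 1) / 2)  # divide by (d_v choose 2)
        return cc

    # Tiny example: a triangle 1-2-3 plus a pendant node 4 attached to 3.
    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(clustering_coefficients(graph))  # {1: 1.0, 2: 1.0, 3: 0.333..., 4: None}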
Why Clustering Coefficient?

 Captures how tight-knit the network is around a node.

 (Figure: two example networks around a node, one with cc = 0.5 vs. one with cc = 0.1.)

 Network Cohesion:
  – Tightly knit communities foster more trust, social norms. [Coleman ’88, Portes ’88]
 Structural Holes:
  – Individuals benefit from bridging. [Burt ’04, ’07]
Why MapReduce?

 De facto standard for parallel computation on large data
 – Widely used at: Yahoo!, Google, Facebook, ...
 – Also at: New York Times, Amazon.com, Match.com, ...
 – Commodity hardware
 – Reliable infrastructure
 – Data continues to outpace available RAM!
How to Count Triangles

 Sequential Version:
  foreach v in V
    foreach u,w in Adjacency(v)
      if (u,w) in E
        Triangles[v]++

 (Figure: starting from Triangles[v] = 0, each pair u, w of v's neighbors is checked and Triangles[v] is incremented whenever the edge (u, w) exists.)

 Running time: Σ_{v∈V} d_v²

 Even for sparse graphs this can be quadratic if one vertex has high degree.
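The pseudocode above translates directly into a runnable Python sketch (illustrative, not the authors' code); it reuses the same adjacency-set representation as the earlier example and, like the slide's loop, counts each triangle once per incident vertex, i.e. three times in total.

    from itertools import combinations

    def count_triangles_naive(adj):
        """Sequential version from the slide: for each v, test every pair of v's neighbors."""
        triangles = {v: 0 for v in adj}
        for v, nbrs in adj.items():
            for u, w in combinations(nbrs, 2):
                if w in adj[u]:          # is (u, w) an edge?
                    triangles[v] += 1    # counted once per vertex, so 3 times per triangle overall
        return triangles

    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(count_triangles_naive(graph))  # {1: 1, 2: 1, 3: 1, 4: 0}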
Parallel Version

 Parallelize the edge checking phase:
 – Map 1: For each v, send (v, Γ(v)) to a single machine.
 – Reduce 1: Input: v; Γ(v)
   Output: all 2-paths (v1, v2); v where v1, v2 ∈ Γ(v)
 – Map 2: Send each (v1, v2); u record, and (v1, v2); $ for every (v1, v2) ∈ E, to the same machine.
 – Reduce 2: Input: (v, w); u1, u2, . . . , uk, $?
   Output: if $ is part of the input, then: ui = ui + 1/3

 (Figure: the three vertices of a closed triangle each receive +1/3; a 2-path whose endpoints are not joined by an edge contributes nothing.)
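Below is a minimal in-memory Python simulation of the two rounds, purely for illustration: the dictionaries stand in for the MapReduce shuffle, edge keys are canonicalized as sorted pairs, and the reducer credits 1/3 to all three vertices of a closed 2-path, which is one consistent reading of the slide's "+1/3" figure (it makes the final per-node counts match the sequential version).

    from collections import defaultdict
    from itertools import combinations

    def parallel_triangle_count(adj):
        """Two-round MapReduce-style simulation of the naive parallel algorithm."""
        # Round 1: each "reducer" receives (v; Γ(v)) and emits every 2-path ((v1, v2); v).
        round2_input = defaultdict(list)
        for v, nbrs in adj.items():
            for v1, v2 in combinations(sorted(nbrs), 2):
                round2_input[(v1, v2)].append(v)

        # Map 2: for every real edge (v1, v2), also emit the marker ((v1, v2); '$').
        for v, nbrs in adj.items():
            for w in nbrs:
                if v < w:
                    round2_input[(v, w)].append('$')

        # Round 2: if a key received the '$' marker, each listed center closes a triangle;
        # credit 1/3 to every vertex of that triangle so each triangle adds 1 per vertex in total.
        triangles = defaultdict(float)
        for (v1, v2), values in round2_input.items():
            if '$' in values:
                for u in values:
                    if u != '$':
                        for x in (u, v1, v2):
                            triangles[x] += 1 / 3
        return dict(triangles)

    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(parallel_triangle_count(graph))  # ≈ {1: 1.0, 2: 1.0, 3: 1.0}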
Data skew

 How much parallelization can we achieve?
 – Generate all the paths to check in parallel
 – The running time becomes max_{v∈V} d_v²

 Naive parallelization does not help with data skew:
 – Some nodes will have very high degree
 – Example: a node with 3.2 million followers forces us to generate 10 trillion (10^13) potential edges to check.
 – Even when generating 100M edges per second, that is 100K seconds ≈ 27 hours.
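The arithmetic behind that example, spelled out in a few lines of Python (using the slide's figures; squaring the degree is the same approximation the slide makes for the number of candidate pairs):

    d = 3_200_000         # followers of one very high-degree node (from the slide)
    pairs = d * d         # candidate neighbor pairs to check, ~d^2
    rate = 100_000_000    # edges checked per second (slide's assumption)
    seconds = pairs / rate
    print(f"{pairs:.2e} pairs, {seconds:,.0f} s ≈ {seconds / 3600:.0f} hours")
    # 1.02e+13 pairs, 102,400 s ≈ 28 hours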
“Just 5 more minutes”

 Running the naive algorithm on the LiveJournal graph:
 – 80% of reducers done after 5 min
 – 99% done after 35 min
Adapting the Algorithm

 Approach 1: Dealing with skew directly
 – Currently every triangle is counted 3 times (once per vertex)
 – Running time is quadratic in the degree of the vertex
 – Idea: count each triangle once, from the perspective of its lowest-degree vertex
 – Does this heuristic work?

 Approach 2: Divide & Conquer
 – Equally divide the graph between machines
 – But any edge partition is bound to miss triangles
 – Divide into overlapping subgraphs, account for the overlap
How to Count Triangles Better

 Sequential Version [Schank ’07]:

  foreach v in V
    foreach u,w in Adjacency(v)
      if deg(u) > deg(v) && deg(w) > deg(v)
        if (u,w) in E
          Triangles[v]++
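A runnable Python sketch of this idea (illustrative, not the authors' code); because two neighbors may have equal degree, the sketch breaks ties by node id so that every triangle is still counted exactly once, a detail that goes slightly beyond the strict comparison on the slide.

    from itertools import combinations

    def count_triangles_schank(adj):
        """Count each triangle once, from the perspective of its lowest-degree vertex."""
        deg = {v: len(nbrs) for v, nbrs in adj.items()}
        rank = lambda v: (deg[v], v)      # total order on vertices: by degree, ties broken by id
        triangles = {v: 0 for v in adj}
        for v, nbrs in adj.items():
            # Only consider pairs of neighbors that rank strictly higher than v.
            higher = [u for u in nbrs if rank(u) > rank(v)]
            for u, w in combinations(higher, 2):
                if w in adj[u]:
                    triangles[v] += 1     # v is the lowest-ranked vertex of this triangle
        return triangles

    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(count_triangles_schank(graph))  # {1: 1, 2: 0, 3: 0, 4: 0}: the triangle is counted once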
Does it make a difference?




Dealing with Skew

 Why does it help?
 – Partition nodes into two groups:
   • Low: L = {v : d_v ≤ √m}
   • High: H = {v : d_v > √m}
 – There are at most n low nodes; each produces at most O(m) paths
 – There are at most 2√m high nodes
   • Each produces paths only to other high nodes: O(m) paths per node

 – These two bounds are identical!
 – Therefore, no mapper can produce substantially more work than the others.
 – Total work is O(m^{3/2}), which is optimal
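For completeness, one standard way to make this counting argument precise, writing m = |E| (a reconstruction of the bound sketched above, not a transcription of the slide):

    \begin{align*}
    % Low-degree nodes (d_v \le \sqrt{m}): node v generates at most
    % d_v^2 \le d_v\sqrt{m} candidate pairs, so summing over all of them:
    \sum_{v \in L} d_v^2 &\le \sqrt{m}\sum_{v \in V} d_v = 2m^{3/2}, \\
    % High-degree nodes: degrees exceed \sqrt{m} and sum to 2m, so |H| \le 2\sqrt{m}.
    % Each high node generates pairs only among its higher-degree (hence high)
    % neighbors, i.e. at most |H|^2 \le 4m pairs per node, so:
    \sum_{v \in H} |H|^2 &\le 2\sqrt{m}\cdot 4m = 8m^{3/2}.
    \end{align*}
    % Both classes therefore contribute O(m^{3/2}) candidate pairs in total.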
Approach 2: Graph Split

 Partitioning the nodes:
 – The previous algorithm shows one way to achieve better parallelization
 – But what if even O(m) is too much? Is it possible to divide the input into smaller chunks?

 Graph Split Algorithm:
 – Partition vertices into p equal-sized groups V1, V2, . . . , Vp.
 – Consider all possible triples (Vi, Vj, Vk) and the induced subgraph:
     Gijk = G[Vi ∪ Vj ∪ Vk]
 – Compute the triangles on each Gijk separately.
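A small Python sketch of the split step (illustrative only; the hash-based assignment of vertices to groups and the edge-level emission are details the slide does not specify):

    from itertools import combinations

    def split_subgraphs(adj, p):
        """Assign each vertex to one of p groups, then collect the edge list of the
        induced subgraph G_ijk for every triple of groups {V_i, V_j, V_k}."""
        group = {v: hash(v) % p for v in adj}          # pseudo-random, roughly equal-sized groups
        subgraphs = {trip: [] for trip in combinations(range(p), 3)}
        for v, nbrs in adj.items():
            for w in nbrs:
                if v < w:                              # each undirected edge once
                    gv, gw = group[v], group[w]
                    # The edge (v, w) belongs to every triple that contains both of its groups.
                    for trip in subgraphs:
                        if gv in trip and gw in trip:
                            subgraphs[trip].append((v, w))
        return subgraphs

    # Each subgraph can now be handed to a separate machine and counted with the
    # sequential algorithm; the overlap between subgraphs is corrected afterwards.
    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    pieces = split_subgraphs(graph, p=4)
    # Here the triple of groups (1, 2, 3) receives all three edges of the triangle {1, 2, 3}.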
Approach 2: Graph Split

 Some triangles are present in multiple subgraphs (figure):
 – a triangle with vertices in three different parts Vi, Vj, Vk lies in exactly 1 subgraph
 – a triangle with vertices in two different parts lies in p−2 subgraphs
 – a triangle entirely inside one part lies in ~p² subgraphs

 We can count exactly how many subgraphs each triangle will be in.
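Because the multiplicity depends only on how many distinct parts a triangle touches, each per-subgraph count can be rescaled exactly. The closed form below is a reconstruction of the figure's 1, p−2 and ~p² annotations, not something stated on the slide:

    from math import comb

    def multiplicity(parts, p):
        """Number of subgraphs G_ijk that contain a triangle whose vertices
        fall into the given collection of parts (1, 2 or 3 distinct parts)."""
        k = len(set(parts))           # distinct parts touched by the triangle
        return comb(p - k, 3 - k)     # the remaining groups of the triple can be chosen freely

    # With p = 10: a triangle spanning 3 parts is in comb(7, 0) = 1 subgraph,
    # one spanning 2 parts is in comb(8, 1) = 8 = p - 2 subgraphs,
    # and one inside a single part is in comb(9, 2) = 36 ≈ p²/2 subgraphs.
    # Dividing each local count by this multiplicity yields exact global counts.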
Approach 2: Graph Split

 Analysis:
 – Each subgraph has O(m/p²) edges in expectation.
 – Very balanced running times
 – p controls the memory needed per machine
 – Total work: p³ · O((m/p²)^{3/2}) = O(m^{3/2}), independent of p

 (Figure: the choice of p is a tradeoff; if p is too small the input per machine is too big and causes paging, if p is too large the shuffle time increases with the duplication.)
Overall

 Naive parallelization doesn't help with data skew.
Related Work

 • Tsourakakis et al. [09]:
   – Count the global number of triangles by estimating the trace of the cube of the adjacency matrix
   – Don't specifically deal with skew; obtain high-probability approximations

 • Becchetti et al. [08]:
   – Approximate the number of triangles per node
   – Use multiple passes to obtain a better and better approximation
Conclusions

 Think about data skew.... and avoid the curse
 – Get programs to run faster
 – Publish more papers
 – Get more sleep
 – ...
 – The possibilities are endless!
Thank You
