XXL Graph Algorithms__HadoopSummit2010

XXL Graph Algorithms
Sergei Vassilvitskii
Yahoo! Research

With help from Jake Hofman, Siddharth Suri, Cong Yu and many others

Introduction
XXL Graphs are everywhere:
– Web graph
– Friend graphs
– Advertising graphs...

But we have Hadoop!
– Few algorithms have been ported (no Hadoop Algorithms book)
– Few general algorithmic approaches
– Active area of research


Outline
Today:
– Act 1: Crawl before you walk
• Counting connected components
– Act 2: The curse of the last reducer
• Finding tight knit friend groups


Act 1: Connected Components
Given a graph, how many components does it have?

[Figure: example graph on vertices a through h, stored as (edge, 1) records: (a,b) 1, (a,c) 1, (b,c) 1, (b,d) 1, (b,e) 1, (c,d) 1, (c,e) 1, (d,e) 1, (f,g) 1, (f,h) 1, (g,h) 1]

Data too big to fit on one reducer!


CC Overview
Outline for Connected Components
– Partition the input into several chunks (map 1)
– Summarize the connectivity on each chunk (reduce 1)
– Combine all of the (small) summaries (map 2)
– Find the number of connected components
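
The outline above can be simulated on a single machine. This is a minimal sketch, not the speaker's implementation: the names `spanning_forest` and `count_components`, the union-find summarization, and the `num_chunks` and `seed` parameters are assumptions made for illustration, and each reducer is stood in for by an ordinary function call. The edge list is the example graph on vertices a through h.

```python
import random
from collections import defaultdict

def spanning_forest(edges):
    """Return a subset of the edges that forms a spanning forest (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:              # edge joins two components: keep it
            parent[ru] = rv
            kept.append((u, v))
    return kept                   # cycle-closing (redundant) edges are dropped

def count_components(edges, num_chunks=2, seed=0):
    rng = random.Random(seed)
    # Map 1: assign every edge to a random chunk.
    chunks = defaultdict(list)
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    # Reduce 1: summarize each chunk by a spanning forest.
    summaries = [spanning_forest(chunk) for chunk in chunks.values()]
    # Map 2: send all (small) summaries to a single reducer.
    combined = [edge for summary in summaries for edge in summary]
    # Reduce 2: a forest with V vertices and F edges has V - F components.
    forest = spanning_forest(combined)
    vertices = {v for edge in edges for v in edge}
    return len(vertices) - len(forest)

EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("b", "e"),
         ("c", "d"), ("c", "e"), ("d", "e"),
         ("f", "g"), ("f", "h"), ("g", "h")]

print(count_components(EDGES))  # 2
```

Vertices with no edges never appear in an edge list, so this sketch does not count them.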


Connected Components
1. Partition (randomly) across Reduce 1 and Reduce 2
2. Summarize (retain < n edges)
3. Recombine in Round 2

[Figure: the example graph on vertices a through h, partitioned across Reduce 1 and Reduce 2, summarized, and recombined in Round 2. Edges remaining after summarization: (a,c) 1, (a,b) 1, (c,d) 1, (f,g) 1, (d,e) 1, (g,h) 1]

Small enough to fit!

The summarization does not affect connectivity
– Drops redundant edges
– Dramatically reduces data size
– Takes two MapReduce rounds

Similar approach works in other situations:
– Consider vertices connected only if there are k edges between them
– Consider vertices connected if their similarity score is above a threshold
• E.g. approximate Jaccard similarity computed for recommendation systems
– Find minimum spanning trees
• Summarize by computing an MST on the subset graph
– Clustering
• Cluster each partition, then aggregate the clusters
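
As one illustration of the list above, the minimum spanning tree variant can be sketched in the same partition-and-summarize style. The function names `kruskal` and `two_round_mst`, the `num_chunks` and `seed` parameters, and the edge weights are all hypothetical. The sketch relies on the cycle property: an edge that is the heaviest on some cycle within a chunk is also the heaviest on that cycle in the full graph, so dropping it cannot change the result when weights are distinct.

```python
import random

def kruskal(edges):
    """Minimum spanning forest of weighted edges given as (weight, u, v)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):      # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def two_round_mst(edges, num_chunks=3, seed=0):
    rng = random.Random(seed)
    # Round 1: partition the edges and keep only each chunk's spanning forest.
    chunks = [[] for _ in range(num_chunks)]
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    summaries = [edge for chunk in chunks for edge in kruskal(chunk)]
    # Round 2: one reducer computes the spanning forest of the summaries.
    return kruskal(summaries)

# Hypothetical weights, chosen to be distinct.
WEIGHTED = [(1, "a", "b"), (4, "a", "c"), (2, "b", "c"), (7, "b", "d"),
            (3, "c", "d"), (5, "d", "e"), (6, "c", "e")]

print(sum(w for w, _, _ in two_round_mst(WEIGHTED)))  # 11
```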

Act 2: Clustering Coefficient
Finding tight knit groups of friends

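[Figure: two example friend networks compared side by side (vs.), one with clustering coefficient 2/15 ≈ 0.13 and the other with 8/15 ≈ 0.53]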

CC(v) = Fraction of pairs of v’s friends who know each other
– Count: number of triangles incident on v
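
A small worked example of this definition, as a sketch: the helper names `build_adjacency` and `clustering_coefficient` are assumptions, as is the convention of returning 0.0 for a node with fewer than two friends. The sample graph gives node 0 six friends, two pairs of whom know each other, which reproduces the 2/15 figure.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, w) pairs."""
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    return adj

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's friends that are themselves connected."""
    friends = sorted(adj[v])
    possible = len(friends) * (len(friends) - 1) // 2
    if possible == 0:
        return 0.0                      # assumed convention for degree < 2
    # Each connected pair of friends closes one triangle incident on v.
    present = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
    return present / possible

# Node 0 has 6 friends; only (1,2) and (3,4) know each other.
EDGES = [(0, i) for i in range(1, 7)] + [(1, 2), (3, 4)]
ADJ = build_adjacency(EDGES)

print(round(clustering_coefficient(ADJ, 0), 2))  # 0.13
```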


Finding CC For Each Node
Attempt 1:
– Look at each node
– Enumerate all possible triangles (Pivot)

Attempt 1:
– Check which of those edges exist:

[Figure: possible edges among the node's friends ∩ edges of the graph = edges present; 15 edges possible, 2 edges present]

Attempt 1:
– Enumerate all possible triangles

Amount of intermediate data
– Quadratic in the degree of the nodes
– 6 friends: 15 possible triangles
– n friends, n(n-1)/2 possible triangles
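
To make the data volume concrete, here is a single-machine sketch of the pivot approach that also counts the candidate records it generates. The function name `pivot_triangles` and the two-stage layout are assumptions for illustration, not code from the talk.

```python
from collections import defaultdict
from itertools import combinations

def pivot_triangles(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Stage 1 (pivot): every node proposes each pair of its neighbors.
    candidates = defaultdict(list)            # (a, b) -> pivots proposing it
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            candidates[(a, b)].append(v)
    intermediate = sum(len(pivots) for pivots in candidates.values())
    # Stage 2: a candidate is a triangle only if its closing edge exists.
    triangles = defaultdict(int)              # node -> triangles incident on it
    for (a, b), pivots in candidates.items():
        if b in adj[a]:
            for v in pivots:
                triangles[v] += 1
    return dict(triangles), intermediate

# Node 0 has 6 friends, so it alone proposes 15 candidate pairs.
EDGES = [(0, i) for i in range(1, 7)] + [(1, 2), (3, 4)]
counts, records = pivot_triangles(EDGES)
print(counts[0], records)  # 2 19

# A single node with 1000 friends already produces 1000 * 999 / 2 records.
STAR = [(0, i) for i in range(1, 1001)]
print(pivot_triangles(STAR)[1])  # 499500
```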

There’s always “that guy”:
– tens of thousands of friends
– tens of thousands of movie ratings (really!)
– millions of followers

Attempt 1:
– Look at each node
– Enumerate all possible triangles
Does not scale

Attempt 2:
– There is a limited number of High degree nodes
– Count LLL, LLH, LHH, and HHH triangles differently
– If a triangle has at least one Low node
• Pivot on Low node to count the triangles
– If a triangle has all High nodes
• Pivot but only on other neighboring High nodes (not all nodes)

Algorithm in Pictures
When looking at Low degree nodes
– Check for all triangles

When looking at High degree nodes
– Check for triangles with other High degree nodes


Clustering Coefficient Discussion
Attempt 2:
– Main idea: treat High and Low degree nodes differently
• Limit the amount of data generated (No more than O(n) per node)
– All triangles accounted for
– Can set High-Low threshold to balance the two cases
• Rule of thumb: threshold around square root of number of vertices
– A bit more complex, but still easy to code
• Doesn’t suffer from the one high degree node problem
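
The High-Low decomposition discussed above can be sketched on one machine as follows. Several details are assumptions, since the slides do not spell them out: the name `high_low_triangles`, treating a node as High when its degree exceeds the threshold, defaulting the threshold to the integer square root of the number of vertices, and using a set to avoid counting a triangle more than once when several of its nodes pivot on it.

```python
import math
from collections import defaultdict
from itertools import combinations

def high_low_triangles(edges, threshold=None):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if threshold is None:
        threshold = math.isqrt(len(adj))      # rule of thumb from the slide
    high = {v for v in adj if len(adj[v]) > threshold}
    triangles = set()
    candidates = 0                            # candidate pairs generated
    for v, nbrs in adj.items():
        # Low pivots check all neighbors; High pivots only High neighbors.
        pool = nbrs & high if v in high else nbrs
        for a, b in combinations(sorted(pool), 2):
            candidates += 1
            if b in adj[a]:
                triangles.add(tuple(sorted((v, a, b))))
    per_node = defaultdict(int)
    for triangle in triangles:
        for v in triangle:
            per_node[v] += 1
    return dict(per_node), candidates

# Node 0 has 6 friends and is the only High node (threshold isqrt(7) = 2).
EDGES = [(0, i) for i in range(1, 7)] + [(1, 2), (3, 4)]
counts, records = high_low_triangles(EDGES)
print(counts[0], records)  # 2 4
```

On this example the triangle counts match what full pivoting on every node would find, but the High node no longer enumerates all 15 pairs of its friends.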

XXL Graphs: Conclusions
Algorithm Design
– Prove performance guarantees independent of input data
• Input skew (e.g. high degree nodes) should not severely affect
algorithm performance
• Number of rounds fixed (and hopefully small)

Rethink graph algorithms:
– Connected Components: Two round approach
– Clustering Coefficient: High-Low node decomposition
– (Breaking News) Matchings: Two round sampling technique


Thank You
sergei@yahoo-inc.com
