Compute "Closeness" in Graphs using Apache Giraph.

… using probabilistic data structures.
Today: Validation

IMPRO-3, TU Berlin, Winter 13/14
Robert Metzger, Robert Waury
13.1.2014

DIMA - TU Berlin

Quick Recap on our Task
● Measure the number of nodes reachable
within s steps from a node n
in a graph.
→ N(n,s).
N(“Robert”,1)=80
N(“Robert”,2)=10413
…
● The largest s at which any N(n,s) still
grows is the graph diameter.
(Figure: Robert’s Xing network)
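
As a rough illustration of the quantity N(n,s), the following sketch computes it exactly with a breadth-first search on a small in-memory graph. The function name, the adjacency-dict representation and the toy graph are assumptions for illustration, as is the choice not to count the start node itself; this is not the Giraph implementation.

```python
from collections import deque

def neighbourhood_sizes(adj, source, max_steps):
    """Return [N(source, 1), ..., N(source, max_steps)].

    adj maps each node to a list of neighbours (hypothetical representation).
    The start node itself is not counted (an assumption).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand beyond the requested radius
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [
        sum(1 for d in dist.values() if 1 <= d <= s)
        for s in range(1, max_steps + 1)
    ]

# Toy path-like graph: c - a - b - d - e
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
print(neighbourhood_sizes(toy, "a", 4))
```

For node "a" the counts stop growing after s = 3, which is the number of steps needed to reach the farthest node "e".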

What happened so far ...

● Giraph Implementation:
○ a) Bitfield
○ b) Flajolet Martin Sketch
■ 32 bit with Thomas Wang’s integer hash
■ 64 bit MurmurHash 2.0
○ c) HyperLogLogSketch with MurmurHash 2.0
● Drafted Stratosphere “Spargel” implementation
● Benchmarked a) and b) for AIM-3
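
The idea behind variant b) can be sketched as follows: every node starts with a sketch of itself, and in each step it ORs in the sketches of its neighbours, so after s steps the sketch summarises all nodes within s steps (including the node itself). This is a minimal Flajolet-Martin style sketch under stated assumptions: the hash is BLAKE2 from Python's hashlib rather than Thomas Wang's integer hash or MurmurHash 2.0, the number of bitmasks (32) is arbitrary, and all function names are hypothetical. It is not the Giraph code from the project.

```python
import hashlib
from functools import reduce

PHI = 0.77351  # Flajolet-Martin correction constant

def _hash(item, seed):
    # Stand-in hash (assumption): 64-bit BLAKE2b digest of "seed:item".
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def fm_singleton(item, num_sketches=32, bits=32):
    """One bitmask per seed; each has exactly one bit set for the item."""
    masks = []
    for seed in range(num_sketches):
        h = _hash(item, seed)
        # Index of the lowest set bit of the hash, capped to the mask width.
        rho = (h & -h).bit_length() - 1 if h else bits - 1
        masks.append(1 << min(rho, bits - 1))
    return masks

def fm_merge(a, b):
    """Union of two sketches is a bitwise OR of their masks."""
    return [x | y for x, y in zip(a, b)]

def fm_estimate(masks):
    """Estimate cardinality from the mean index of the lowest zero bit."""
    total = 0
    for m in masks:
        r = 0
        while m & (1 << r):
            r += 1
        total += r
    return 2 ** (total / len(masks)) / PHI

def propagate(adj, sketches):
    """One step: each node merges the sketches of its neighbours into its own."""
    return {
        n: reduce(fm_merge, (sketches[nb] for nb in adj[n]), sketches[n])
        for n in adj
    }

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
sketches = {n: fm_singleton(n) for n in adj}
after_two = propagate(adj, propagate(adj, sketches))
print(fm_estimate(after_two["a"]))
```

Because merging is a plain OR, it is commutative and idempotent, which is what makes the sketches safe to combine in any message order.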

Validating the correctness of the
implementation ...

● Approach: Assume the “bitfield” implementation
as the reference and measure the correlation
with the results from the other implementations.
● On two (small) datasets:
○ General Relativity and Quantum Cosmology collaboration
network (coauthor relationships). Largest connected component (CC): 4,158 nodes.
○ Enron email network. Largest CC: 33,696 nodes.


Statistical Methods to determine correlation
● Kendall's τ (tau)
○ -1 ≤ τ ≤ 1
○ expects an order (ranking), e.g. Comparable interface ;-)

● Spearman's ρ (rho)
○ same properties as Kendall's τ, but checks
whether the relation is monotonic (not just linear)

● Pearson’s r
○ checks for linear correlation
○ uses the actual values (not just ranks)
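
The three coefficients can be sketched in plain Python as below. This is an illustration, not the code used for the results on the following slides: the function names are hypothetical, Kendall's τ is computed in its simple tau-a form (no tie correction, quadratic in the number of nodes), and Spearman's ρ is computed as Pearson's r on average ranks.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Linear correlation on the actual values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson's r applied to the ranks instead of the values."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / all pairs, ignoring ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)

# A monotonic but non-linear relation: rank-based measures give 1, Pearson does not.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(kendall_tau_a(x, y), spearman_rho(x, y), pearson_r(x, y))
```

In the validation setting, x would hold the per-node values from the bitfield reference and y the values from one of the sketch-based implementations, in the same node order.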


Coauthorship Results (I)

        Kendall’s τ         Spearman’s ρ        Pearson’s r
FM32    0.906881050538273   0.98765689317449    0.991695076216846
FM64    0.905736944670186   0.987400738579957   0.991700042774567
HLL     0.931782793461063   0.993272573234886   0.9956213651786

→ High (linear) correlation with all metrics ✔
→ HyperLogLog has highest correlation and has best memory
properties


Coauthorship Results (II)

        Top10   Top100   Top1000    Last1   Last100
FM32    6/10    76/100   891/1000   1/1     94/100
FM64    5/10    69/100   881/1000   1/1     94/100
HLL     8/10    80/100   932/1000   1/1     95/100

→ HLL is the best approximation
→ Outliers can be identified with higher confidence than central nodes
→ Nodes with the highest closeness tend to have similar values, so small estimation errors can reorder the top ranks
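
One plausible reading of the TopK and LastK columns is the number of nodes that appear in the first (or last) k positions of both the reference ranking and the sketch-based ranking. The helper below sketches that reading. The function name, the score dictionaries and the tie-breaking by node id are assumptions for illustration, not the evaluation code behind the table.

```python
def top_k_overlap(reference, estimate, k, largest=True):
    """Count nodes that are among the k largest (or smallest) scores in both maps."""
    def pick(scores):
        # Sort by score, breaking ties by node id so the result is deterministic.
        ranked = sorted(scores, key=lambda n: (scores[n], n), reverse=largest)
        return set(ranked[:k])
    return len(pick(reference) & pick(estimate))

# Hypothetical scores: "b" and "c" are close in the reference and swap in the estimate.
reference = {"a": 10.0, "b": 9.0, "c": 8.0, "d": 1.0}
estimate = {"a": 9.5, "b": 7.0, "c": 9.0, "d": 1.2}
print(top_k_overlap(reference, estimate, 2))                  # overlap in the top 2
print(top_k_overlap(reference, estimate, 1, largest=False))   # overlap in the last 1
```

The toy scores show the effect named above: the clear outlier "d" is matched, while two nodes with similar values trade places near the top.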

Enron Results (I)

        Kendall’s τ          Spearman’s ρ         Pearson’s r
FM32    0.9138299158409239   0.9880939188638478   0.9935462917118506
FM64    0.8894530452951206   0.9803803899254973   0.9902062846287614
HLL     0.9335364446051608   0.9927569721570411   0.9966840593148085

→ High (linear) correlation with all metrics ✔
→ HyperLogLog has highest correlation and has best memory
properties


Enron Results (II)

        Top10   Top100   Top1000    Last1   Last100
FM32    5/10    80/100   877/1000   1/1     96/100
FM64    7/10    66/100   839/1000   1/1     97/100
HLL     8/10    86/100   889/1000   1/1     97/100

→ HLL is again the best approximation
→ Outliers can be identified with higher confidence than central nodes

Validation Summary

● HyperLogLog exhibits the highest correlation
in all experiments. It also has the lowest
memory footprint.
● We assume that these results hold for larger
data sets.


Next step

● Benchmark implementations with larger datasets
(that require Giraph out-of-core execution)
● Datasets:

Description                                Name           Vertices      Edges           Text file size (GB)
Stanford's WebBase 2001 crawl as a graph   webbase-2001   118,142,155   1,019,903,190   9.46
Follower relationships                     twitter-2010   41,652,230    1,468,365,182   12.49


References
U Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. In SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
Formulas taken from Wikipedia.
