Compute "Closeness" in Graphs using Apache Giraph.
1. Compute “Closeness” in Graphs using Apache Giraph
… using probabilistic data structures.
Today: Validation
IMPRO-3, TU Berlin, Winter 13/14
Robert Metzger, Robert Waury
13.1.2014
DIMA - TU Berlin
2. Quick Recap on our Task
● Measure the number of nodes reachable within s steps from a node n in a graph.
→ N(n,s).
N(“Robert”,1)=80
N(“Robert”,2)=10413
…
● The smallest s after which N(n,s) no longer grows for any node n is the graph diameter.
Robert’s Xing Network
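The recap above can be sketched in a few lines. This is a minimal illustration of the exact "bitfield" idea, not the Giraph code from the project: every node keeps one bit per node in the graph and ORs in its neighbours' bitfields once per step. The function name `neighbourhood_sizes` and the choice not to count the start node itself are assumptions made for this example.

```python
def neighbourhood_sizes(adjacency, steps):
    """Return {node: [N(node,1), ..., N(node,steps)]} using exact bitfields.

    adjacency maps each node to an iterable of its neighbours; an undirected
    graph lists every edge in both directions.
    """
    index = {node: i for i, node in enumerate(adjacency)}
    # One bitfield per node; initially only the node's own bit is set.
    reach = {node: 1 << index[node] for node in adjacency}
    sizes = {node: [] for node in adjacency}
    for _ in range(steps):
        # One superstep: every node ORs in the bitfields of its neighbours.
        updated = {}
        for node, neighbours in adjacency.items():
            bits = reach[node]
            for neighbour in neighbours:
                bits |= reach[neighbour]
            updated[node] = bits
        reach = updated
        for node in adjacency:
            # Assumption: the start node is not counted, hence the "- 1".
            sizes[node].append(bin(reach[node]).count("1") - 1)
    return sizes


# Hypothetical toy graph: a path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sizes = neighbourhood_sizes(path, 3)
```

The memory cost is one bit per node per node, which is why the sketches on the following slides replace the exact bitfield.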
3. What happened so far ...
● Giraph Implementation:
○ a) Bitfield
○ b) Flajolet Martin Sketch
■ 32 bit with Thomas Wang’s integer hash
■ 64 bit MurmurHash 2.0
○ c) HyperLogLogSketch with MurmurHash 2.0
● Drafted Stratosphere “Spargel” implementation
● Benchmarked a) and b) for AIM-3
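To make variant b) concrete, here is a minimal sketch of a single Flajolet-Martin bitmap. It is not the project's Giraph code: `hash64` uses `hashlib.blake2b` as a stand-in for the hash functions named above, and the helper names `fm_add`, `fm_union` and `fm_estimate` are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def hash64(item):
    """Stand-in 64-bit hash (assumption: not the hash used in the project)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fm_add(bitmap, item):
    """Set the bit at the index of the lowest set bit of the item's hash."""
    h = hash64(item)
    rho = (h & -h).bit_length() - 1 if h else 63
    return bitmap | (1 << rho)


def fm_union(a, b):
    """The sketch of a set union is the bitwise OR of the two sketches."""
    return a | b


def fm_estimate(bitmap):
    """Estimate the distinct count as 2^R / PHI, R = index of the lowest zero bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI


# Each node would start with a sketch of itself and OR in its neighbours'
# sketches once per superstep, exactly like the bitfield variant.
left = fm_add(fm_add(0, "n1"), "n2")
right = fm_add(fm_add(0, "n2"), "n3")
merged = fm_union(left, right)
```

A single bitmap has high variance; practical implementations usually average over several bitmaps, which this sketch leaves out.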
4. Validating the correctness of the implementation ...
● Approach: Assume the “bitfield” implementation
as the reference and measure the correlation
with the results from the other implementations.
● On two (small) datasets:
○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest CC: 4,158 nodes.
○ Enron email network. Largest CC: 33,696 nodes.
5. Statistical Methods to determine correlation
● Kendall's τ (tau)
○ -1 ≤ τ ≤ 1
○ expects an order (ranking), e.g. Comparable interface ;-)
● Spearman's ρ (rho)
○ same properties as Kendall, but checks whether the relation is monotonic (not just linear)
● Pearson’s r
○ checks for linear correlation
○ uses the actual values (not just ranks)
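The three coefficients can be sketched in plain Python to show how they differ. This is an illustration only, not the tooling used for the results that follow. Two assumptions: `kendall_tau` is the simple tau-a variant with an O(n²) pair loop, and `ranks` assigns average ranks to ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: linear correlation of the actual values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical data: a monotonic but non-linear relation.
reference = [1, 2, 3, 4, 5]
estimate = [1, 4, 9, 16, 25]
```

On this toy data the two rank-based coefficients reach their maximum because the order is preserved, while Pearson's r stays below 1 because the relation is not linear.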
6. Coauthorship Results (I)
        Kendall’s τ          Spearman’s ρ         Pearson’s r
FM32    0.906881050538273    0.98765689317449     0.991695076216846
FM64    0.905736944670186    0.987400738579957    0.991700042774567
HLL     0.931782793461063    0.993272573234886    0.9956213651786
→ High (linear) correlation with all metrics ✔
→ HyperLogLog has highest correlation and has best memory
properties
8. Enron Results (I)
        Kendall’s τ           Spearman’s ρ          Pearson’s r
FM32    0.9138299158409239    0.9880939188638478    0.9935462917118506
FM64    0.8894530452951206    0.9803803899254973    0.9902062846287614
HLL     0.9335364446051608    0.9927569721570411    0.9966840593148085
→ High (linear) correlation with all metrics ✔
→ HyperLogLog has highest correlation and has best memory
properties
10. Validation Summary
● HyperLogLog exhibits the highest correlation
in all experiments. It also has the lowest
memory footprint.
● We assume that these results hold for larger
data sets.
11. Next step
● Benchmark implementations with larger datasets
(that require Giraph out-of-core execution)
● Datasets:
Description                                 Name          Vertices     Edges          Text File Size in GB
Stanford's WebBase 2001 crawl as a graph    webbase-2001  118,142,155  1,019,903,190   9.46
Follower relationships                      twitter-2010   41,652,230  1,468,365,182  12.49
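A quick back-of-the-envelope calculation shows why the exact bitfield cannot serve as the reference at this scale. It is a sketch under two assumptions: the bitfield keeps one bit per vertex for every vertex, and a sketch-based variant keeps a single 64-bit bitmap (8 bytes) per vertex. The function names are hypothetical.

```python
def bitfield_bytes(num_vertices):
    """Total bytes if every vertex stores one bit per vertex in the graph."""
    return num_vertices * num_vertices // 8


def sketch_bytes(num_vertices, bytes_per_sketch):
    """Total bytes if every vertex stores one fixed-size sketch."""
    return num_vertices * bytes_per_sketch


# Vertex counts taken from the dataset table above.
datasets = {"webbase-2001": 118_142_155, "twitter-2010": 41_652_230}
for name, vertices in datasets.items():
    exact_pb = bitfield_bytes(vertices) / 1e15
    sketch_gb = sketch_bytes(vertices, 8) / 1e9
    print(f"{name}: exact bitfields {exact_pb:.2f} PB, 8-byte sketches {sketch_gb:.2f} GB")
```

The exact state grows quadratically with the number of vertices, whereas the sketch state grows linearly.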
12. References
U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. In SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
Formulas taken from Wikipedia.