Compute “Closeness” in Graphs using Apache Giraph
… using probabilistic data structures.
Today: Validation

IMPRO-3, TU Berlin, Winter 13/14
Robert Metzger, Robert Waury
13.1.2014

DIMA - TU Berlin
Quick Recap on our Task
● Measure the number of nodes reachable within s steps
from a node n in a graph.
→ N(n,s)
N(“Robert”,1)=80
N(“Robert”,2)=10413
…
● The smallest s at which N(n,s) stops growing for every
node n is the graph diameter.
[Figure: Robert’s Xing network]
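
The recap above can be made concrete with a small exact computation. The following is a minimal sketch, not the Giraph code from this project: it uses a plain breadth-first search on a single machine, and it assumes that N(n,s) does not count n itself (the slides do not say). The function name `neighbourhood_sizes` and the toy graph are hypothetical.

```python
from collections import deque

def neighbourhood_sizes(adjacency, start, max_steps):
    """Return [N(start, 1), ..., N(start, max_steps)] for an unweighted graph.

    Assumption: N(start, s) counts the nodes other than `start` that are
    reachable in at most s steps.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return [
        sum(1 for d in distance.values() if 1 <= d <= s)
        for s in range(1, max_steps + 1)
    ]

# Hypothetical toy graph: a path a - b - c - d plus an extra edge b - e.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}
sizes = neighbourhood_sizes(toy_graph, "a", 4)  # [1, 3, 4, 4]
```

For node "a" the counts stop growing at s = 3, which matches the longest shortest path in this toy graph (a to d).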

What happened so far ...

● Giraph Implementation:
○ a) Bitfield
○ b) Flajolet-Martin sketch
■ 32-bit with Thomas Wang’s integer hash
■ 64-bit with MurmurHash 2.0
○ c) HyperLogLog sketch with MurmurHash 2.0
● Drafted Stratosphere “Spargel” implementation
● Benchmarked a) and b) for AIM-3
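
To illustrate how a probabilistic sketch can stand in for the bitfield, here is a rough single-machine sketch of the Flajolet-Martin idea, assuming one sketch per node and one OR-merge per step. It is not the Giraph implementation: SHA-256 from the standard library replaces the hash functions named above, all function names are hypothetical, and the sketch of a node includes the node itself. A single sketch has high variance, so practical implementations usually average several.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis

def fm_bit(item, seed=0):
    """Map an item to a one-bit mask; bit i is chosen with probability 2^-(i+1)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") | (1 << 63)  # guard: never zero
    trailing_zeros = (value & -value).bit_length() - 1
    return 1 << trailing_zeros

def fm_estimate(sketch):
    """Estimate the number of distinct items OR-ed into `sketch`."""
    lowest_zero = 0
    while sketch & (1 << lowest_zero):
        lowest_zero += 1
    return (2 ** lowest_zero) / PHI

def approximate_neighbourhoods(adjacency, steps, seed=0):
    """After `steps` rounds, sketches[v] summarises all nodes within `steps` hops of v.

    Assumption: every neighbour also appears as a key of `adjacency`.
    """
    sketches = {v: fm_bit(v, seed) for v in adjacency}
    for _ in range(steps):
        updated = {}
        for v, neighbours in adjacency.items():
            merged = sketches[v]
            for u in neighbours:
                merged |= sketches[u]  # set union becomes a bitwise OR
            updated[v] = merged
        sketches = updated
    return sketches

# Hypothetical toy graph: a path a - b - c - d plus an extra edge b - e.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}
sketches = approximate_neighbourhoods(toy_graph, steps=3)
estimate_for_a = fm_estimate(sketches["a"])
```

The point of the design is that merging two neighbourhoods costs one OR on a fixed-size word, regardless of how many nodes each neighbourhood contains.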

Validating the correctness of the
implementation ...

● Approach: Treat the “bitfield” implementation
as the reference and measure how strongly the results
of the other implementations correlate with it.
● On two (small) datasets:
○ General Relativity and Quantum Cosmology collaboration
network (coauthor relationships). Largest connected component: 4,158 nodes.
○ Enron email network. Largest connected component: 33,696 nodes.

Statistical Methods to determine correlation
● Kendall's τ (tau)
○ -1 ≤ τ ≤ 1
○ expects an order (ranking),
e.g. the Comparable interface ;-)

● Spearman's ρ (rho)
○ same properties as Kendall's τ, but checks
whether the relation is monotonic (not just linear)

● Pearson’s r
○ checks for linear correlation
○ uses the actual values (not just ranks)
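
The three coefficients can be sketched in a few lines of plain Python. This is an illustrative version, not the tooling used for the results in this deck: the slides do not say which Kendall variant was computed, so tau-a is assumed here, and the sample data is made up.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Linear correlation computed on the actual values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson's r applied to the ranks instead of the values."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

def kendall_tau_a(xs, ys):
    """(concordant - discordant) / number of pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Made-up data: the second series is a monotonic but non-linear
# function (the square) of the first.
exact = [1, 2, 3, 4, 5]
approx = [1, 4, 9, 16, 25]
tau = kendall_tau_a(exact, approx)
rho = spearman_rho(exact, approx)
r = pearson_r(exact, approx)
```

On this data the two rank-based coefficients are 1 because the order is preserved exactly, while Pearson's r is slightly below 1 because the relation is not linear.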

Coauthorship Results (I)

        Kendall’s τ          Spearman’s ρ         Pearson’s r
FM32    0.906881050538273    0.98765689317449     0.991695076216846
FM64    0.905736944670186    0.987400738579957    0.991700042774567
HLL     0.931782793461063    0.993272573234886    0.9956213651786

→ High (linear) correlation with all metrics ✔
→ HyperLogLog has the highest correlation and the best memory
properties

Coauthorship Results (II)

        Top10    Top100    Top1000     Last1    Last100
FM32    6/10     76/100    891/1000    1/1      94/100
FM64    5/10     69/100    881/1000    1/1      94/100
HLL     8/10     80/100    932/1000    1/1      95/100

→ HLL is the best approximation
→ outliers can be identified with higher confidence than central nodes
→ nodes with the highest closeness tend to have similar values
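
One plausible reading of the Top-k and Last-k columns is the size of the overlap between the k best (or worst) nodes of the reference and those of the approximation. The sketch below follows that reading; the function name, the tie-breaking by node id, and the scores are assumptions made for illustration.

```python
def top_k_overlap(reference, approximation, k, largest=True):
    """Count the nodes that both score tables place among their k highest
    (or, with largest=False, k lowest) scores."""
    def pick(scores):
        ranked = sorted(
            scores,
            key=lambda node: (scores[node], node),  # node id breaks ties
            reverse=largest,
        )
        return set(ranked[:k])
    return len(pick(reference) & pick(approximation))

# Made-up scores for five nodes.
reference = {"a": 10, "b": 9, "c": 8, "d": 2, "e": 1}
approximation = {"a": 11, "b": 7, "c": 8.5, "d": 2.5, "e": 0.5}

top2 = top_k_overlap(reference, approximation, 2)                  # 1 of 2
top3 = top_k_overlap(reference, approximation, 3)                  # 3 of 3
last1 = top_k_overlap(reference, approximation, 1, largest=False)  # 1 of 1
```

When many top nodes have nearly equal scores, small estimation errors are enough to swap them in and out of the top k, which lowers the overlap.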

Enron Results (I)

        Kendall’s τ           Spearman’s ρ          Pearson’s r
FM32    0.9138299158409239    0.9880939188638478    0.9935462917118506
FM64    0.8894530452951206    0.9803803899254973    0.9902062846287614
HLL     0.9335364446051608    0.9927569721570411    0.9966840593148085

→ High (linear) correlation with all metrics ✔
→ HyperLogLog has the highest correlation and the best memory
properties

Enron Results (II)

        Top10    Top100    Top1000     Last1    Last100
FM32    5/10     80/100    877/1000    1/1      96/100
FM64    7/10     66/100    839/1000    1/1      97/100
HLL     8/10     86/100    889/1000    1/1      97/100

→ HLL is again the best approximation
→ outliers can be identified with higher confidence than central nodes

Validation Summary

● HyperLogLog exhibits the highest correlation
in all experiments. It also has the lowest
memory footprint.
● We assume that these results hold for larger
data sets.

Next step

● Benchmark implementations with larger datasets
(that require Giraph out-of-core execution)
● Datasets:

Description                                             Name            Vertices       Edges            Text file size (GB)
The data of Stanford's WebBase 2001 crawl as a graph    webbase-2001    118,142,155    1,019,903,190    9.46
Follower relationships                                  twitter-2010    41,652,230     1,468,365,182    12.49

References
U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
U. Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. In SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
Formulas taken from Wikipedia.
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Compute "Closeness" in Graphs using Apache Giraph.

  • 1. Compute “Closeness” in Graphs using Apache Giraph
       … using probabilistic data structures.
       Today: Validation
       IMPRO-3, TU Berlin, Winter 13/14
       Robert Metzger, Robert Waury
       13.1.2014 DIMA - TU Berlin
  • 2. Quick Recap on our Task
       ● Measure the number of nodes reachable within s steps from a node n in a graph: N(n,s).
         Example (Robert’s Xing network): N(“Robert”,1) = 80, N(“Robert”,2) = 10413, …
       ● The largest s for which N(n,s) still grows for some node n is the graph diameter.
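As an illustration of the quantity N(n,s), the following is a minimal sketch of an exact computation with a breadth-first search. It is not the Giraph implementation from the slides; the function name `neighbourhood_size`, the adjacency-dict format, and the toy graph are assumptions made for this example, and the start node itself is not counted (a convention chosen here).

```python
from collections import deque


def neighbourhood_size(adj, start, s):
    """Return N(start, s): the number of nodes reachable from start in at most s steps.

    The start node itself is not counted (a convention assumed for this sketch).
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == s:
            continue  # do not expand nodes that are already s steps away
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return len(dist) - 1


# Hypothetical toy graph: undirected path a-b-c-d plus the edge b-e.
toy = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}

for steps in range(4):
    print(steps, neighbourhood_size(toy, "a", steps))
```

An exact search like this has to be repeated per node, which is why the slides turn to compact per-node set representations instead.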
  • 3. What happened so far ...
       ● Giraph implementation:
         ○ a) Bitfield
         ○ b) Flajolet-Martin sketch
           ■ 32 bit with Thomas Wang’s integer hash
           ■ 64 bit with MurmurHash 2.0
         ○ c) HyperLogLogSketch with MurmurHash 2.0
       ● Drafted Stratosphere “Spargel” implementation
       ● Benchmarked a) and b) for AIM-3
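To make the Flajolet-Martin variant more concrete, here is a minimal single-bitmask sketch in Python. It is only an illustration under stated assumptions: the hash is a stand-in built from `hashlib.md5` (not Thomas Wang's integer hash or MurmurHash 2.0 as named on the slide), all function names are hypothetical, and a single bitmask gives a high-variance estimate, so practical implementations usually combine several. The property that matters for neighbourhood counting is that two sketches merge with a bitwise OR.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


def _hash32(item):
    # Stand-in hash (assumption): first 4 bytes of MD5, not the hashes named on the slide.
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def _lowest_set_bit_index(x, width=32):
    # Index of the least significant 1 bit; width - 1 if x is 0.
    if x == 0:
        return width - 1
    return (x & -x).bit_length() - 1


def fm_add(sketch, item):
    # Set the bit selected by the item's hash; adding the same item twice changes nothing.
    return sketch | (1 << _lowest_set_bit_index(_hash32(item)))


def fm_sketch(items):
    sketch = 0
    for item in items:
        sketch = fm_add(sketch, item)
    return sketch


def fm_merge(a, b):
    # The sketch of a set union is the bitwise OR of the two sketches.
    return a | b


def fm_estimate(sketch):
    # R is the index of the lowest zero bit; the estimate is 2^R / PHI.
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / PHI


left = fm_sketch(range(0, 100))
right = fm_sketch(range(50, 150))
print(fm_estimate(fm_merge(left, right)))
```

In a vertex-centric setting, each node would keep such a sketch of the nodes it has seen and OR in its neighbours' sketches once per step.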
  • 4. Validating the correctness of the implementation ...
       ● Approach: Take the “bitfield” implementation as the reference and measure how well the results of the other implementations correlate with it.
       ● On two (small) datasets:
         ○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest connected component (CC): 4,158 nodes.
         ○ Enron email network. Largest CC: 33,696 nodes.
  • 5. Statistical Methods to determine correlation
       ● Kendall's τ (tau)
         ○ -1 ≤ τ ≤ 1
         ○ expects an order (ranking), e.g. Comparable interface ;-)
       ● Spearman's ρ (rho)
         ○ same properties as Kendall's τ, but checks whether the relation is monotonic (not just linear)
       ● Pearson’s r
         ○ checks for linear correlation
         ○ uses the actual values (not just ranks)
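The three coefficients can be sketched in plain Python as follows. This is an illustrative implementation, not the code used for the validation: the function names are hypothetical, Kendall's τ is computed in its simple tau-a form (tied pairs count as neither concordant nor discordant), and Spearman's ρ is computed as Pearson's r on average ranks.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    # Linear correlation on the actual values.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    # Pearson's r applied to ranks instead of values.
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    # (concordant pairs - discordant pairs) / number of pairs.
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


# Hypothetical data: a monotonic but non-linear relation.
exact = [1, 2, 3, 4, 5]
approx = [1, 4, 9, 16, 25]
print(kendall_tau_a(exact, approx), spearman(exact, approx), pearson(exact, approx))
```

On this toy data the two rank-based coefficients reach their maximum while Pearson's r stays below it, which is the difference between monotonic and linear correlation noted above.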
  • 6. Coauthorship Results (I)
              Kendall’s τ          Spearman’s ρ         Pearson’s r
       FM32   0.906881050538273    0.98765689317449     0.991695076216846
       FM64   0.905736944670186    0.987400738579957    0.991700042774567
       HLL    0.931782793461063    0.993272573234886    0.9956213651786
       → High (linear) correlation with all metrics ✔
       → HyperLogLog has the highest correlation and the best memory properties
  • 7. Coauthorship Results (II)
              Top10    Top100    Top1000     Last1    Last100
       FM32   6/10     76/100    891/1000    1/1      94/100
       FM64   5/10     69/100    881/1000    1/1      94/100
       HLL    8/10     80/100    932/1000    1/1      95/100
       → HLL is the best approximation
       → outliers can be identified with higher confidence than central nodes
       → nodes with highest closeness tend to have similar values
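One plausible reading of the Top-k and Last-k columns is the number of nodes that appear both in the reference ranking's k best (or worst) nodes and in the approximation's. The slides do not spell out the exact definition, so the sketch below is an assumption; the function name `top_k_overlap` and the score values are hypothetical.

```python
def top_k_overlap(reference, approx, k, largest=True):
    """Count nodes shared by the k highest (or lowest) scored nodes of two score dicts."""

    def pick(scores):
        # Sort by score, breaking ties by node name so the result is deterministic.
        ordered = sorted(scores, key=lambda node: (scores[node], node), reverse=largest)
        return set(ordered[:k])

    return len(pick(reference) & pick(approx))


# Hypothetical closeness scores from a reference and an approximate implementation.
reference = {"a": 10, "b": 9, "c": 8, "d": 2, "e": 1}
approx = {"a": 9.5, "b": 7.0, "c": 9.1, "d": 1.0, "e": 2.5}

print(top_k_overlap(reference, approx, 2))                  # shared nodes in the top 2
print(top_k_overlap(reference, approx, 2, largest=False))   # shared nodes in the last 2
```

Because the comparison is on sets, two rankings that contain the same k nodes in a different order still count as a full match.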
  • 8. Enron Results (I)
              Kendall’s τ           Spearman’s ρ          Pearson’s r
       FM32   0.9138299158409239    0.9880939188638478    0.9935462917118506
       FM64   0.8894530452951206    0.9803803899254973    0.9902062846287614
       HLL    0.9335364446051608    0.9927569721570411    0.9966840593148085
       → High (linear) correlation with all metrics ✔
       → HyperLogLog has the highest correlation and the best memory properties
  • 9. Enron Results (II)
              Top10    Top100    Top1000     Last1    Last100
       FM32   5/10     80/100    877/1000    1/1      96/100
       FM64   7/10     66/100    839/1000    1/1      97/100
       HLL    8/10     86/100    889/1000    1/1      97/100
       → HLL is again the best approximation
       → outliers can be identified with higher confidence than central nodes
  • 10. Validation Summary
       ● HyperLogLog exhibits the highest correlation in all experiments. It also has the lowest memory footprint.
       ● We assume that these results hold for larger data sets.
  • 11. Next step
       ● Benchmark implementations with larger datasets (that require Giraph out-of-core execution)
       ● Datasets:
         Description                                             Name            Vertices       Edges            Text file size in GB
         The data of Stanford's WebBase 2001 crawl as a graph    webbase-2001    118,142,155    1,019,903,190    9.46
         Follower relationships                                  twitter-2010    41,652,230     1,468,365,182    12.49
  • 12. References
       ● U Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
       ● U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
       ● Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
       ● Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
       ● Formulas taken from Wikipedia.