1. Large Scale Social Networks
Analysis – LS SNA
Rui Sarmento João Gama
Tiago Cunha Albert Bifet
LIAAD/INESC TEC
FEP - University of Porto
April 13, 2013
2. Outline – LS SNA 2/19
1. Motivation
2. Software Tools
– State of the art – Recent Evolution
– PEGASUS
– Graphlab
– Snap (Stanford Network Analysis Platform)
– Other Tools
3. Case Study
– Network of companies and financial organizations
– Some Numbers
– Algorithms and Used tools
– Processing Time
4. Summary & Conclusions
3. 1.Motivation – LS SNA 3/19
Generic Problem:
Nowadays, the huge amounts of data
available pose problems for analysis with
regular hardware and/or software.
Example Facts:
“We have produced more data in the last two
years than in all of prior history so we are
witnessing a Big Bang of Data” – Tim
McGuire, McKinsey
4. 1.Motivation – LS SNA 4/19
Solution:
Emerging technologies, like modern models
for parallel computing, multicore computers
or even clusters of computers, can be very
useful for analyzing massive network data.
5. 1.Motivation – LS SNA 5/19
Particular Case Study:
CrunchBase database (accessed May 2012)
• Network A of companies and financial organizations/funds, e.g.:
Y --- X
» Company Y has a connection to investment fund X
• Network B of persons and companies, e.g.:
A --- Y
» Person A has a connection to company Y
6. 1.Motivation – LS SNA 6/19
What can we do?
- We want to analyze the behavior of entities in terms of
their relationships or other influences.
- We want to determine characteristics of the network,
both from the point of view of a single entity
(ego-centered) and for the network as a whole.
What is the problem?
- It takes too much time (many hours or even days)
to do this with standard software such as Gephi or R,
even on a good PC.
7. 2. Software Tools – LS SNA 7/19
• State of the art – Recent Evolution
2001 – Boost Graph Library (C++)
2005 – Parallel BGL (C++), Hadoop (Java)
2007 – Development of GraphLab starts
2008 – SNAP: Small-world Network Analysis and
Partitioning (C, OpenMP)
.
.
2013 – Several Graph Frameworks using Hadoop
and/or HDFS
8. 2. Software Tools – LS SNA 8/19
• PEGASUS
– Computation framework written in Java
– An open-source graph-mining system with
massive scalability
– Depends on Hadoop
– Graph-oriented tool
9. 2. Software Tools – LS SNA 9/19
• GraphLab API
– Computation framework written in C++
– Computation in GraphLab is applied to dependent
records which are stored as vertices in a large
distributed data-graph
– Computation in GraphLab is expressed as vertex-
programs which are executed in parallel on each
vertex and can interact with neighboring vertices.
– GraphLab programs interact by directly reading the
state of neighboring vertices and by modifying the
state of adjacent edges.
– HDFS Integration: Access your data directly from HDFS
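The vertex-program idea described above can be illustrated with a toy, single-process sketch. This is not the GraphLab API: the function names, the PageRank-style update, and the example graph are all assumptions chosen for illustration. It only shows the pattern in which each vertex computes its new state by reading the state of its neighbors.

```python
# Toy, single-process illustration of the vertex-program idea.
# This is NOT the GraphLab API; all names here are hypothetical.

def pagerank_vertex_program(vertex, ranks, in_neighbors, out_degree, damping=0.85):
    """Compute a new rank for one vertex by reading its in-neighbors' state."""
    gathered = sum(ranks[u] / out_degree[u] for u in in_neighbors[vertex])
    return (1.0 - damping) + damping * gathered


def run_vertex_programs(edges, iterations=20):
    """Apply the vertex program to every vertex for a fixed number of rounds."""
    vertices = sorted({v for edge in edges for v in edge})
    in_neighbors = {v: [] for v in vertices}
    out_degree = {v: 0 for v in vertices}
    for src, dst in edges:
        in_neighbors[dst].append(src)
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # A real engine would run these updates in parallel across machines;
        # here they run sequentially over a snapshot of the previous ranks.
        ranks = {
            v: pagerank_vertex_program(v, ranks, in_neighbors, out_degree)
            for v in vertices
        }
    return ranks


# Hypothetical toy graph: a directed 3-cycle plus one extra edge.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
ranks = run_vertex_programs(edges)
```

In this sketch every vertex only needs its neighbors' previous values, which is what makes the per-vertex updates independent and therefore suitable for parallel execution.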
10. 2. Software Tools – LS SNA 10/19
• Snap (Stanford Network Analysis Platform)
– Not parallel, however…
– SNAP library is written in C++ and optimized for
maximum performance and compact graph
representation
– It easily scales to massive networks with hundreds
of millions of nodes, and billions of edges
– …although some algorithms in Snap might be slow
due to complexity
11. 2. Software Tools – LS SNA 11/19
• Other Tools (Summary)
– Several more tools are available:
• Giraph – graph oriented
• RHadoop (package for R and Hadoop) – generic tool
=> All of the tools listed here depend on Hadoop, which
seems to be more and more commonly adopted
12. 2. Software Tools – LS SNA 12/19
Algorithms available from software install (graph analysis)

Pegasus                          Graphlab                    Snap
Degree                           approximate diameter        Cascades
PageRank                         kcore                       Centrality
Random Walk with Restart (RWR)   pagerank                    Cliques
Radius                           connected component         Community
Connected Components             simple coloring             Concomp
                                 directed triangle count     Forestfire
                                 format convert              Graphgen
                                 sssp                        Graphhash
                                 undirected triangle count   Kcores
                                                             Kronem
                                                             Krongen
                                                             Kronfit
                                                             Maggen
                                                             Magfit
                                                             Motifs
                                                             Ncpplot
                                                             Netevol
                                                             Netinf
                                                             Netstat
                                                             Mkdatasets
                                                             infopath
13. 3. Case Study – LS SNA 13/19
=> Some Numbers
• Network of companies and financial organizations/funds
1. Number of firms: 88,269
2. Number of investment funds: 7,697
• Network of persons and companies
1. Number of persons: 118,394
14. 3. Case Study – LS SNA 14/19
=> Algorithms and Tools Used
– Node Degree with PEGASUS
– Friends of Friends with Hadoop Map-Reduce
– Centrality Measures with Snap (Stanford Network
Analysis Platform)
– Triangle Counting with Graphlab
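As an illustration of the Friends of Friends task listed above, the following is a minimal in-memory sketch of one common map/reduce formulation. It does not use Hadoop and is not necessarily the implementation used in this case study. The function names and the toy network are hypothetical. The map step emits every pair of people who share a common friend, and the reduce step discards pairs that are already directly connected.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(person, friends):
    """Emit candidate pairs: any two friends of `person` share `person`.

    Direct friendships are emitted with a None marker so that the
    reduce step can filter them out.
    """
    for friend in friends:
        yield tuple(sorted((person, friend))), None  # already friends
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), person  # a and b have `person` in common


def reduce_phase(pair, values):
    """Return the mutual friends of `pair`, or None if already friends."""
    if any(v is None for v in values):
        return None
    return sorted(set(values))


def friends_of_friends(adjacency):
    grouped = defaultdict(list)  # stands in for the shuffle step
    for person, friends in adjacency.items():
        for key, value in map_phase(person, friends):
            grouped[key].append(value)
    result = {}
    for pair, values in grouped.items():
        mutual = reduce_phase(pair, values)
        if mutual is not None:
            result[pair] = mutual
    return result


# Hypothetical undirected toy network (each edge listed in both directions).
adjacency = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat"],
}
fof = friends_of_friends(adjacency)
```

On a cluster, the `grouped` dictionary would be replaced by the framework's shuffle, so that each reducer receives all values for one pair.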
16. 4. Summary & Conclusions – LS SNA 16/19
• Summary & Conclusions
– This paper summarizes which tools to consider when
studying large graphs.
– We are witnessing a proliferation of software
tools aimed at the analysis of large-scale graphs.
– Networks that were once too difficult to handle
can now be analyzed with the right tools.
17. References I – LS SNA 17/19
• APACHE. 2012. Apache Giraph [Online]. The Apache Software Foundation.
Available: http://incubator.apache.org/giraph/.
• GRAPHLAB. Graphlab The Abstraction [Online]. Available:
http://graphlab.org/home/abstraction/ [Accessed 2012].
• GRAPHLAB. 2012. Graph Analytics Toolkit [Online]. Available:
http://graphlab.org/toolkits/graph-analytics/ [Accessed 2012].
• HOLMES, A. 2012. Hadoop In Practice, Manning.
• LESKOVEC, J. Stanford Network Analysis Platform [Online]. Available:
http://snap.stanford.edu/snap/ [Accessed 12-2012].
• MAZZA, G. 2012. FrontPage - Hadoop Wiki [Online]. Available:
http://wiki.apache.org/lucene-hadoop/ [Accessed 11-2012].
• THANEDAR, V. 2012. API Documentation [Online]. Available:
http://developer.crunchbase.com/docs [Accessed 04-2012].
18. References II – LS SNA 18/19
• UNIVERSITY, C. M. 2012. Project Pegasus [Online]. Available:
http://www.cs.cmu.edu/~pegasus/ [Accessed 2012].
• WASHINGTON, U. O. What is Hadoop? [Online]. Available:
http://escience.washington.edu/get-help-now/what-hadoop [Accessed
05-03-2013].
• OWENS, J. R. 2013. Hadoop Real-World Solutions Cookbook. PACKT
Publishing.
• MCGUIRE, T. Big Data Better Decisions [Online]. Available:
http://www.slideshare.net/McK_CMSOForum/big-data-and-advanced-analytics
[Accessed 05-03-2013].