1. GRAPH ANALYTICS AND
MACHINE LEARNING
STANLEY WANG
SOLUTION ARCHITECT, TECH LEAD
@SWANG68
http://www.linkedin.com/in/stanley-wang-a2b143b
2. Mathematics on Graph
• An abstract representation of a set of
entities where some pairs are connected
by links;
• Entity: also called a Vertex or Node;
• Link: also called an Edge or Relationship;
9. What is a Graph Database?
• A Database with an Explicit Graph
Structure;
• Each Node Knows its Adjacent Nodes;
• As the Number of Nodes Increases, the
Cost of a Local Step Remains the Same,
O(1) in the Size of the Graph;
• An Index for Lookups;
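The points above can be sketched with a toy in-memory structure. This is only an illustration, not how any particular graph database is implemented; the class name TinyGraph and its methods are hypothetical. Each node keeps its own neighbor set, so one local step touches only that node's neighbors, and a separate index is used once to find the starting node.

```python
from collections import defaultdict


class TinyGraph:
    """Toy graph store (hypothetical): every node knows its adjacent nodes."""

    def __init__(self):
        self.adjacent = defaultdict(set)  # node id -> set of adjacent node ids
        self.index = {}                   # name -> node id, used only for lookups

    def add_node(self, node, name):
        self.adjacent[node]               # touching the key creates an empty neighbor set
        self.index[name] = node

    def add_edge(self, a, b):
        self.adjacent[a].add(b)
        self.adjacent[b].add(a)

    def neighbors(self, node):
        # One local step: the cost depends on this node's degree,
        # not on the total number of nodes in the graph.
        return self.adjacent[node]


g = TinyGraph()
g.add_node(1, "alice")
g.add_node(2, "bob")
g.add_node(3, "carol")
g.add_edge(1, 2)
g.add_edge(1, 3)

start = g.index["alice"]        # index lookup finds the entry point
alice_neighbors = g.neighbors(start)  # then traversal follows stored adjacency
```

The index is consulted only to locate "alice"; every step after that follows the adjacency stored on the node itself.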
10. Relational Model vs Graph Model
Relational Model: Optimized for Aggregation
Graph Model: Optimized for Connections
11. SQL vs NOSQL
[Figure: data stores placed by Size and Complexity: Key-Value
Store, Big Table / Column Family, Document Databases, Graph
Databases, Relational Databases (RDBMS); annotation: 90% of
Use Cases]
13. Why Graph Databases?
[Figure: axis "Value in Relationships", Low to High; data
models shown: Key-Value (K V), BigTable (K V V V V), Document,
Relational, Graph]
14. NoSQL and Big Data
• Traditional databases handle big data sets too,
but mostly structured data;
• NoSQL databases have weak analytics support;
• HDFS and MapReduce often work from text files;
• NoSQL targets high throughput: in CAP theorem
terms, mostly AP rather than CP;
• In practice, Big Data is likely to be a mix of text
files, NoSQL, and SQL RDBMS;
15. Graph Terminology
• Graph Computation (Analytics):
o The whole graph is processed, typically over several
iterations of vertex-centric computation.
o Examples: Belief Propagation, PageRank,
Community Detection, Triangle Counting,
Matrix Factorization, Machine Learning…
• Graph Database (Queries):
o Selective graph queries (compare to SQL
queries)
o Traversals: shortest-path, friends-of-friends,…
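The two traversals named above can be sketched as follows. This is a minimal illustration assuming an unweighted, undirected graph held as a plain dict of adjacency lists; the function names and the sample data are hypothetical, and a real graph database would express these as queries rather than hand-written loops.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Return one fewest-hops path from source to target, or None."""
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def friends_of_friends(adj, person):
    """People exactly two hops away who are not already direct friends."""
    direct = set(adj.get(person, ()))
    two_hop = set()
    for friend in direct:
        two_hop.update(adj.get(friend, ()))
    return two_hop - direct - {person}


# Hypothetical sample data.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

path_ann_eve = shortest_path(friends, "ann", "eve")
fof_ann = friends_of_friends(friends, "ann")
```

Both functions visit only the part of the graph near the starting node, which is what distinguishes a selective query from whole-graph analytics.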
18. Graphs are Essential to ML
• Identify influential people and information;
• Discover communities;
• Understand the interests people have in common;
• Model complex real-life data dependencies;
It’s all about GRAPH: The Value of Data is Proportional
to the Number of Meaningful Relationships!
20. Graph Social Network Model
The model can easily be used in real-life applications for customer
classification, profiling, segmentation and product
recommendations.
25. Graph Analytics - PageRank
• PageRank is a Link Analysis measure of the
importance of nodes in a graph. A node's rank
is the probability of landing on that node,
which depends on:
o The probability of landing on one of
the node's neighbors;
o The probability of crossing the link
from that neighbor to the node;
• Typical use: identify the influential
leaders;
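The definition above can be sketched as a power iteration. This is a minimal illustration rather than a production implementation: the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from nodes without outgoing links are assumptions taken from the commonly used formulation, not from the slides. The function name and sample graph are hypothetical.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank for a directed graph given as node -> list of targets."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}   # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly (assumption).
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v, targets in out_links.items():
            if targets:
                # Probability of being on v, times probability of
                # crossing one particular link out of v.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a -> b, c; b -> c; c -> a; d -> c.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
top_node = max(ranks, key=ranks.get)
```

In this sample, node "c" collects links from three other nodes and ends up with the highest rank, while "d", which nothing links to, keeps only the baseline share.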
26. Graph Analytics - Triangle Count
• Clustering coefficient (CC) is a
measure of the degree to which
nodes in a graph tend to cluster
together;
• Calculating CC for a node reduces to
counting the number of triangles
around that particular node in the
graph;
• CC indicates the degree to which a
node’s neighbors are themselves
neighbors;
• CC of a graph is closely related to the
transitivity of a graph;
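The link between triangle counting and the clustering coefficient can be sketched for a single node. This is a minimal illustration assuming an undirected graph stored as a dict of neighbor sets; the function name and sample graph are hypothetical. The local coefficient is taken as the number of triangles around the node divided by the number of neighbor pairs, k(k-1)/2 for a node with k neighbors.

```python
from itertools import combinations


def triangles_and_cc(adj, node):
    """Return (triangle count, local clustering coefficient) for one node."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0, 0.0                    # no neighbor pair, so no triangle is possible
    # A triangle around `node` is a pair of its neighbors that are themselves linked.
    triangles = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    possible = k * (k - 1) // 2          # number of neighbor pairs
    return triangles, triangles / possible


# Hypothetical sample: a square a-b-c-d with the diagonal a-c.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

result_a = triangles_and_cc(graph, "a")
result_b = triangles_and_cc(graph, "b")
```

Node "b" has two neighbors that are linked to each other, so its coefficient is 1.0; node "a" has three neighbors of which only two pairs are linked.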
27. Graph Analytics - Connected Components
• A connected component is a subgraph in which any
two vertices are connected to each other by a path,
and which is connected to no additional vertices
in the supergraph;
• A directed graph is strongly connected if every
vertex is reachable from every other vertex. The
strongly connected components partition the graph
into subgraphs that are themselves strongly connected;
• A spanning tree is a subgraph of the original graph
that is a tree and connects all the vertices that
were originally connected;
• A minimum spanning tree (MST) is a spanning tree
such that the sum of the weights of its edges is no
greater than the sum for any other spanning tree;
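Connected components and minimum spanning trees can both be sketched on top of a union-find structure. This is a minimal illustration for undirected, weighted graphs only; it does not cover strongly connected components of directed graphs. Kruskal's algorithm is used here as one well-known way to build an MST, and the class, function names, and sample data are hypothetical.

```python
class UnionFind:
    """Disjoint sets with path halving."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return False if they were already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def connected_components(nodes, edges):
    """Group nodes of an undirected graph into connected components."""
    uf = UnionFind(nodes)
    for a, b, _weight in edges:
        uf.union(a, b)
    groups = {}
    for v in nodes:
        groups.setdefault(uf.find(v), set()).add(v)
    return list(groups.values())


def kruskal(nodes, edges):
    """Minimum spanning forest: take edges by increasing weight, skipping cycles."""
    uf = UnionFind(nodes)
    tree = []
    for a, b, weight in sorted(edges, key=lambda e: e[2]):
        if uf.union(a, b):               # edge joins two different components
            tree.append((a, b, weight))
    return tree


# Hypothetical sample: edges are (node, node, weight).
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 4), ("e", "f", 5)]

components = connected_components(nodes, edges)
forest = kruskal(nodes, edges)
total_weight = sum(w for _, _, w in forest)
```

Because the sample graph has two components, Kruskal's algorithm yields a spanning forest with one tree per component; the edge a-c is skipped since it would close a cycle.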
28. Graph Analytics - Betweenness centrality
• Betweenness centrality is an
indicator of a node's centrality in
a network: it counts the shortest
paths between all pairs of other
vertices that pass through that
node;
• A node with high betweenness
centrality has a large influence
on the transfer of items through
the network;
• Betweenness centrality is related
to a network's connectivity;
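The measure above can be sketched with Brandes' algorithm, a standard technique that is not named in the slides. This is a minimal illustration assuming an unweighted, undirected graph given as a dict of adjacency lists, returning unnormalized scores; when a pair of nodes has several shortest paths, each path contributes a proportional fraction. The function name and sample graph are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:       # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical sample: a simple path a - b - c - d.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness(path_graph)
```

On the path graph, the two inner nodes each sit on two shortest paths between other pairs, while the endpoints sit on none.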
30. Graph Computing Opportunity
Combined with leading tools such as Graph
Databases, Machine Learning, High Performance
Computing, Clustering, and Streaming, Graph
Computing Technology is ready to take off in the
Big Data Era!