Overview of GraphX
Presentation by @dougneedham
Introduction
 @dougneedham
 Data Guy - Started as a DBA in the Marine Corps, evolved to Architect,
now aspiring Data Scientist.
 Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
 I have a strong relational/traditional background.
 Perpetual Student
 Learning new things challenges our assumptions. Forces us to take a
new perspective on “old” problems. Eventually maybe even shows us
that there is a better way to solve a problem.
Graphs: What problems do they solve?
 Solving Crime
 Customers/Products
 Some examples: Introduction to Graph_Theory
 There are many ways of constructing networks, and how exactly you construct them
depends on the questions you are posing.
 Economics: You don’t participate in an economy by yourself, you make purchases
from others. Record enough transactions, you have a graph.
 Almost anything can be modeled as a graph. However, it does require a slight shift in
thinking.
 One of the most used examples is a citation network for academic publications.
 I publish a paper, then you cite my paper in your publication.
 This shows which paper (ultimately back through the tree) had the largest influence.
A little History
 The 7 Bridges of Königsberg
 Every tome on Graph theory or Network analysis devotes a small
portion of its time to the 7 Bridges of Königsberg.
 If I don’t cover this with you, the gods of mathematics will strike me
down, and never allow me to do analysis again in the future.
The Bridges
The Problem
 Folks enjoyed their Sunday afternoon strolls across the bridges, but
occasionally people would wonder whether any route crossed each of
the seven bridges exactly once.
 Eventually Leonhard Euler was brought into the debate.
 Euler used Vertices to represent the land masses and edges (or arcs, at
the time) to represent bridges. He realized that with an odd number of
edges at every vertex, no such route could exist.
 Sarada Herke provides one of the best explanations of the solution:
Solution to Königsberg
 And here is the cool thing about mathematicians. If we tell you
something is impossible, we have to tell you why in a way you can
understand. In doing so, Euler also invented the branch of mathematics
we now call Graph Theory.
 http://en.wikipedia.org/wiki/Leonhard_Euler
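Euler's degree argument is small enough to check by hand, or in a few lines of plain Scala. The sketch below is only an illustration: the land-mass labels A to D and the object name `Konigsberg` are assumptions made for this example, and the rule it applies (a connected graph has a walk using every edge exactly once only when zero or two vertices have odd degree) assumes the graph is connected.

```scala
object Konigsberg {
  // Each pair is one bridge between two land masses (repeated pairs are parallel bridges).
  // Labels are illustrative: A = central island, B and C = the two banks, D = eastern land mass.
  val bridges: Seq[(String, String)] = Seq(
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D")
  )

  // Degree of a vertex = number of edge endpoints that touch it.
  def degrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.flatMap { case (u, v) => Seq(u, v) }
      .groupBy(identity)
      .map { case (vertex, hits) => vertex -> hits.size }

  // Vertices with an odd number of edges.
  def oddVertices(edges: Seq[(String, String)]): Set[String] =
    degrees(edges).collect { case (v, d) if d % 2 == 1 => v }.toSet

  // Assumes a connected graph: a walk over every edge exactly once
  // is possible only when zero or two vertices have odd degree.
  def hasEulerWalk(edges: Seq[(String, String)]): Boolean = {
    val odd = oddVertices(edges).size
    odd == 0 || odd == 2
  }
}
```

Calling `Konigsberg.hasEulerWalk(Konigsberg.bridges)` reports that no such walk exists, because all four land masses have odd degree.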
A few terms
 Stand back, we are going to talk about math!
 Basically we are talking about a bunch of dots joined together by lines
 Vertex – Dot on a graph
 Edge – Line connecting two vertices
 Edge_Label – this is a term I coined originally related to Data Structure Graphs that
helps trace a path. If you label your edges, and you have multiple edges with the same
label in a Graph you can quite easily identify walks, paths, and cycles through your
graph.
 Triangle – 3 Vertices, 3 Edges
 Square – 4 Vertices, 4 edges
 Open Triangle - 3 Vertices, 2 edges
 A lot of things are networks if you look at them the right way.
 Mark Newman has given a number of really cool presentations about
Network analysis, available on YouTube.
 https://www.youtube.com/watch?v=lETt7IcDWLI
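To make the triangle and open-triangle terms concrete, here is a minimal plain-Scala sketch that finds both in a small undirected graph. The object name `GraphTerms`, the integer vertex ids, and the assumption that the edge list has no self-loops are all choices made for this illustration.

```scala
object GraphTerms {
  type Vertex = Int

  // Undirected adjacency: each edge is recorded in both directions.
  def adjacency(edges: Seq[(Vertex, Vertex)]): Map[Vertex, Set[Vertex]] =
    edges.flatMap { case (u, v) => Seq(u -> v, v -> u) }
      .groupBy(_._1)
      .map { case (u, pairs) => u -> pairs.map(_._2).toSet }

  // Triangle: three vertices with all three edges present.
  def triangles(edges: Seq[(Vertex, Vertex)]): Set[Set[Vertex]] = {
    val adj = adjacency(edges)
    (for {
      (u, nbrs) <- adj.toSeq
      v <- nbrs.toSeq
      w <- nbrs.toSeq
      if v < w && adj(v).contains(w)
    } yield Set(u, v, w)).toSet
  }

  // Open triangle: a centre vertex joined to two neighbours
  // that are not joined to each other (3 vertices, 2 edges).
  def openTriangles(edges: Seq[(Vertex, Vertex)]): Set[(Vertex, Set[Vertex])] = {
    val adj = adjacency(edges)
    (for {
      (centre, nbrs) <- adj.toSeq
      v <- nbrs.toSeq
      w <- nbrs.toSeq
      if v < w && !adj(v).contains(w)
    } yield (centre, Set(v, w))).toSet
  }
}
```

For the edges 1-2, 2-3, 3-1 and 3-4, this finds one triangle (1, 2, 3) and two open triangles, both centred on vertex 3.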
More terms
 Shortest path – How are two vertices connected?
 Longest Path – Tracing the flow of an interesting item through a large
collection of applications.
 What is a path?
 Centrality – Hub and Authority
 This is almost a whole topic by itself, since there are different types of Centrality:
 Degree Centrality, Eigenvector Centrality, PageRank, etc…
 Transitivity
 Homophily – how things are similar
 Directed Graphs – or Digraphs
 Contagion – How do things “spread” through a network?
 Let’s rearrange things, how does the layout affect understanding?
 Order of a graph – number of vertices
 Size of the graph – number of edges
 This is not just data visualization, it can also be used for prediction.
https://www.youtube.com/watch?v=rwA-y-XwjuU
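Order, size, and the simplest centrality measure can all be computed from nothing more than an edge list. The following is a sketch in plain Scala, not GraphX code; the object name `GraphStats` and the choice to normalize in-degree by (order - 1) are assumptions for illustration.

```scala
object GraphStats {
  // Directed edges, written as (from, to).
  type Edge = (String, String)

  // Order of a graph: number of distinct vertices.
  def order(edges: Seq[Edge]): Int =
    edges.flatMap { case (u, v) => Seq(u, v) }.distinct.size

  // Size of a graph: number of edges.
  def size(edges: Seq[Edge]): Int = edges.size

  // In-degree: how many edges point at each vertex.
  def inDegree(edges: Seq[Edge]): Map[String, Int] =
    edges.groupBy(_._2).map { case (v, incoming) => v -> incoming.size }

  // In-degree centrality: in-degree divided by the number of other vertices.
  // Vertices that nobody points at get 0.0.
  def inDegreeCentrality(edges: Seq[Edge]): Map[String, Double] = {
    val vertices = edges.flatMap { case (u, v) => Seq(u, v) }.distinct
    val in = inDegree(edges)
    val denom = math.max(vertices.size - 1, 1)
    vertices.map(v => v -> (in.getOrElse(v, 0).toDouble / denom)).toMap
  }
}
```

With three edges ann -> bob, cat -> bob, and bob -> dan, the order is 4, the size is 3, and bob has the highest in-degree centrality. Eigenvector centrality and PageRank refine this idea by also weighting who is doing the pointing.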
Samples
 Some Samples from Wiki.
 On the right, a basic graph; on the left, the languages used in Wikipedia
Little sidebar - Paths
 Now that we have some terms under our belt.
 What is the difference between shortest path, and longest path?
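One way to get a feel for the shortest-path half of that question is a breadth-first search, which returns a path with the fewest edges in an unweighted graph. This is a minimal sketch in plain Scala; the object name `Paths` and the directed, integer-labelled edge list are assumptions for the example. Finding the longest simple path is a much harder problem in general, with no comparably fast algorithm known.

```scala
import scala.collection.immutable.Queue

object Paths {
  // Breadth-first search over directed edges (from, to).
  // Returns the vertices along one shortest path, or None if target is unreachable.
  def shortestPath(edges: Seq[(Int, Int)], source: Int, target: Int): Option[List[Int]] = {
    val adj: Map[Int, Seq[Int]] =
      edges.groupBy(_._1).map { case (u, out) => u -> out.map(_._2) }

    // Each queue entry is a path stored in reverse (most recent vertex first).
    @annotation.tailrec
    def loop(frontier: Queue[List[Int]], seen: Set[Int]): Option[List[Int]] =
      frontier.dequeueOption match {
        case None => None
        case Some((path, rest)) =>
          val here = path.head
          if (here == target) Some(path.reverse)
          else {
            val next = adj.getOrElse(here, Seq.empty).filterNot(seen).distinct
            loop(rest ++ next.map(_ :: path), seen ++ next)
          }
      }

    loop(Queue(List(source)), Set(source))
  }
}
```

For the edges 1 -> 2, 2 -> 3, 1 -> 3, 3 -> 4, the shortest path from 1 to 4 goes through 3 and skips 2.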
The Math doesn’t change.
 One thing I like about Graphs –
 The Math does not change.
 The math behind Graph theory can be a little intense, but it does not
change regardless of the scale of the graph.
 Once you understand how to “do the math” on a small graph, the
same math applies whether it is a graph of the people in this
room, or a graph of the people on this planet.
Small Graphs
 What is a small graph?
 Friends on Facebook, or LinkedIn.
 Usually this can be displayed and analyzed rather easily.
 If the Graph continues to grow, you need better tools.
 Let’s do a quick demo of a small graph visualization.
Gephi
 http://gephi.github.io/
 From the website: “Gephi is an interactive visualization and exploration
platform for all kinds of networks and complex systems, dynamic and
hierarchical graphs.”
 To get this yourself go into Facebook and search for: Netvizz. (You have
to authorize it. You can de-authorize it later.)
 Click the application.
 Click “personal network”
 Click Start
 Download your gdf file
 Quick Demo – ( Vote time: If everyone is comfortable with general
graphs we can come back to this.)
Large Graphs
 What is a large graph?
 To me a large graph is one that cannot be easily visualized by software
such as Gephi.
 You have to use large tools to calculate the important statistics, such as
centrality, diameter, average degree, etc…
 Breaking a large graph down to a small graph is actually not as simple
as it sounds.
 This can be done reasonably easily with tools such as GraphX
 Now what we all came for:
GraphX
GraphX
 GraphX is Apache Spark's API for graphs and graph-parallel
computation.
 https://spark.apache.org/graphx/
 http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
 While GraphX is “just a library”, it is a library that exists within the Spark
environment, which provides a whole host of benefits like scaling,
clustering, storage, and other things that you don’t have to dwell on.
 As of right now, GraphX is Scala only.
Data Science Challenge
 Who should Follow whom?
 Winklr is a curiously popular social network for fans of the sitcom Happy
Days. Users can post photos, write messages, and most importantly,
follow each other’s posts and content. This helps users keep up with
new content from their favorite users on the site.
 Problem 3 of the data science challenge was a graph analysis
problem.
 Derive the top 70,000 connections that should be recommended.
Sample of the whole graph
My approach
 Type of problem: Graph Analysis
 Create a Master Graph.
 Run Page Rank to identify centrality.
 Create many small graphs for individual users.
 Mask the Master Graph, and PageRank Graph.
 Multiply together Centrality, the in-Degree of a candidate vertex,
and the inverse of the length of the path from this particular user
to that candidate.
 This code takes over 48 hours to run.
 Code: Problem3.sh, and AnalyzeGraph.scala
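The scoring step in the list above can be sketched without Spark. The following plain-Scala example is an illustration of the formula only, not the code from the repository: the `Candidate` case class, the `FollowScore` object, and the rule that a path length of zero (the user themselves) is skipped are assumptions made for this sketch.

```scala
object FollowScore {
  // One candidate to recommend, as seen from a single user (hypothetical record shape).
  final case class Candidate(id: Long, pageRank: Double, inDegree: Int, pathLength: Double)

  // Centrality * in-degree * inverse of the path length from the user.
  def score(c: Candidate): Double =
    c.pageRank * c.inDegree * (1.0 / c.pathLength)

  // Keep only candidates that are actually reachable, then rank by score.
  def recommend(candidates: Seq[Candidate], topN: Int): Seq[(Long, Double)] =
    candidates
      .filter(c => c.pathLength > 0 && c.pathLength < Double.PositiveInfinity)
      .map(c => c.id -> score(c))
      .sortBy { case (_, s) => -s }
      .take(topN)
}
```

A well-connected candidate that is unreachable from the user never appears in the result, however high its PageRank, which mirrors the infinite-distance filter in the snapshot of code.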
Now we will review github
 https://github.com/dougneedham/Cloudera-Data-Scientist-Challenge-3/tree/master/problem3
Snapshot of code:
// Restrict the shortest-path graph to the vertices and edges that also appear in ClickPairGraph
val PathGraph = ShortestPathFromSource.mask(ClickPairGraph)
// Influence = in-degree scaled by the inverse path length (1.0 guards against integer division)
val Influence = PathGraph.joinVertices(MasterGraph.inDegrees)((id, pathlength, indeg) => (1.0 / pathlength) * indeg)
// Weight that influence by each vertex's PageRank
val central_influence = Influence.joinVertices(MaskedMasterGraphPR.vertices)((id, dist, pagerank) => dist * pagerank)
//
// Drop vertices with an infinite value: only recommend someone there is in fact a path to
//
val reachable = central_influence.vertices.filter(_._2 < Double.PositiveInfinity)
println("Processing " + reachable.count())
//reachable.collect().foreach(record_to_list(SourceID, _))
val save_file_name = base_path + "/problem3/OutGraph/" + SourceID.trim() + ".data"
reachable.saveAsTextFile(save_file_name)
Expectations
 This is where we tie together the “small graphs” versus “big graphs”
 Creating a Sub-graph of a larger graph is not obvious.
 I was expecting to see one big clump of nodes tightly connected. This
would be the “Target” to follow.
 I was also expecting to see two smaller clumps of nodes, loosely
connected to the larger clump. These are the “followers”. As we
recommend that they follow the more popular node, they will
become more closely connected to this user.
 Here is the output from Gephi that shows whether the code worked or
not.
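Since carving a sub-graph out of a larger graph is not obvious, here is one possible way to do it on a plain edge list: collect everything within a few hops of a user, then keep only the edges between those vertices. This is a sketch under assumptions (the `EgoNetwork` object, the `maxHops` parameter, and the directed `(from, to)` edge pairs are all invented for illustration) and it is not the approach taken in the repository, which relies on GraphX's `mask` operator.

```scala
object EgoNetwork {
  type Edge = (Long, Long)

  // Vertices reachable from `user` by following at most `maxHops` directed edges.
  def neighbourhood(edges: Seq[Edge], user: Long, maxHops: Int): Set[Long] = {
    val adj: Map[Long, Set[Long]] =
      edges.groupBy(_._1).map { case (u, out) => u -> out.map(_._2).toSet }

    @annotation.tailrec
    def expand(frontier: Set[Long], seen: Set[Long], hopsLeft: Int): Set[Long] =
      if (hopsLeft == 0 || frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(v => adj.getOrElse(v, Set.empty[Long])) -- seen
        expand(next, seen ++ next, hopsLeft - 1)
      }

    expand(Set(user), Set(user), maxHops)
  }

  // Induced sub-graph: keep only edges whose two endpoints both survive.
  def subgraph(edges: Seq[Edge], keep: Set[Long]): Seq[Edge] =
    edges.filter { case (u, v) => keep(u) && keep(v) }
}
```

The resulting small graph is the kind of thing that can be exported and inspected in a tool such as Gephi.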
Gephi output
Where do I get data?
 How you construct the network
depends on the question(s) you
are posing.
 Chances are you have lots of
data already, it is simply a
matter of perspective.
 Apply Graphs to your own
company’s architecture
 Public social network data
 The example mentioned from
Gephi (netvizz)
Data Structure Graphs
 A DSG Level 1 can show you which of your tables are going to have
the most interesting query performance.
 A DSG Level 2 can show you where the most work is going
on in your Enterprise.
 Data Structure Graph Level 1 – This is roughly like an Entity Relationship
Diagram (ERD) Tables are Vertices, Foreign Keys are Edges.
 Data Structure Graph Level 2 – Each Vertex in this graph is an
application. Each Edge is data transfer. Roughly equivalent to what we
used to call Data Flow diagrams.
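A Level 1 Data Structure Graph can be built from nothing more than a list of foreign keys. The sketch below uses a made-up schema (the table names, the `DataStructureGraph` object, and the idea of ranking tables by how many foreign keys point at them are assumptions for illustration) to show tables as vertices and foreign keys as edges.

```scala
object DataStructureGraph {
  // Hypothetical schema: each pair is one foreign key, (childTable, parentTable).
  val foreignKeys: Seq[(String, String)] = Seq(
    ("order_line", "orders"),
    ("order_line", "product"),
    ("orders", "customer"),
    ("invoice", "orders"),
    ("invoice", "customer")
  )

  // In-degree of each table: how many foreign keys reference it.
  def referencedBy(fks: Seq[(String, String)]): Map[String, Int] =
    fks.groupBy(_._2).map { case (table, incoming) => table -> incoming.size }

  // Tables ordered from most to least referenced (ties broken by name).
  def mostReferenced(fks: Seq[(String, String)]): Seq[(String, Int)] =
    referencedBy(fks).toSeq.sortBy { case (table, n) => (-n, table) }
}
```

Heavily referenced tables are joined against most often, so under this assumption they are reasonable first candidates when looking for interesting query performance. A Level 2 graph has the same shape, with applications as vertices and data transfers as edges.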
SNAP
 SNAP – Stanford Network Analysis Project.
 If you want to learn about how to do Network Analysis and you can’t
find any data, go here.
Consider the following:
 Network/Graph Analysis is cool.
 It can show you some interesting things about your data that you may
not have considered.
 Due thought should be put towards a network analysis project.
 Organizing the data requires a bit of thought. (From -> To vertices is just
a start).
 Directed graph, undirected, bigraph? Some up-front setup work needs
to be done.
 Tools help with the detailed calculations, and show the paths, walks,
etc.
 If you need assistance, send a message to the group, or contact me
directly (I am easy to find @dougneedham)
Final Thoughts – Questions?

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 

Apache Spark GraphX highlights.

  • 2. Introduction
      • @dougneedham
      • Data Guy: started as a DBA in the Marine Corps, evolved to Architect, now aspiring Data Scientist.
      • Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
      • I have a strong relational/traditional background.
      • Perpetual Student
      • Learning new things challenges our assumptions and forces us to take a new perspective on “old” problems. Eventually it may even show us that there is a better way to solve a problem.
  • 3. Graphs: What problems do they solve?
      • Solving Crime
      • Customers/Products
      • Some examples: Introduction to Graph_Theory
      • There are many ways of constructing networks, and how exactly you construct them depends on the questions you are posing.
      • Economics: You don’t participate in an economy by yourself; you make purchases from others. Record enough transactions and you have a graph.
      • Almost anything can be modeled as a graph. However, it does require a slight shift in thinking.
      • One of the most used examples is a citation network for academic publications.
      • I publish a paper, then you cite my paper in your publication.
      • Tracing citations back through the tree shows which paper ultimately had the largest influence.
  • 4. A little History
      • The 7 Bridges of Königsberg
      • Every tome on Graph theory or Network analysis devotes a small portion of its time to the 7 Bridges of Königsberg.
      • If I don’t cover this with you, the gods of mathematics will strike me down and never allow me to do analysis again.
  • 6. The Problem
      • Folks enjoyed their Sunday afternoon strolls across the bridges, but occasionally people would wonder if one particular route was more efficient than another.
      • Eventually Leonhard Euler was brought into the debate about the efficiency problem.
      • Euler used vertices to represent the land masses and edges (or arcs, at the time) to represent bridges. He realized that the odd number of edges per vertex made the problem unsolvable: no walk can cross every bridge exactly once.
      • Sarada Herke provides one of the best explanations of the solution: Solution to Königsberg
      • And here is the cool thing about mathematicians: if we tell you something is impossible, we have to tell you why in a way you can understand. In doing so, Euler also invented the branch of mathematics we today call Graph Theory.
      • http://en.wikipedia.org/wiki/Leonhard_Euler
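
Euler's degree argument is small enough to check by hand, but here is a minimal sketch of it in plain Scala (no Spark needed). The land-mass labels A to D and the names `Konigsberg`, `degrees`, and `hasEulerWalk` are my own for illustration, and the check assumes the graph is connected.

```scala
object Konigsberg {
  // The seven bridges as undirected edges between four land masses.
  // Labels are hypothetical: A = the central island, B and C = the two
  // river banks, D = the eastern land mass.
  val bridges: Seq[(String, String)] = Seq(
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D")
  )

  // Degree of a vertex = number of edge endpoints that touch it.
  def degrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.flatMap { case (a, b) => Seq(a, b) }
      .groupBy(identity)
      .map { case (v, hits) => v -> hits.size }

  // In a connected graph, a walk that uses every edge exactly once
  // exists only when zero or two vertices have odd degree.
  def hasEulerWalk(edges: Seq[(String, String)]): Boolean = {
    val odd = degrees(edges).values.count(_ % 2 == 1)
    odd == 0 || odd == 2
  }
}
```

All four land masses come out with an odd degree, so `Konigsberg.hasEulerWalk(Konigsberg.bridges)` is false, which matches Euler's conclusion.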
  • 7. A few terms
      • Stand back, we are going to talk about math!
      • Basically we are talking about a bunch of dots joined together by lines.
      • Vertex: a dot on a graph
      • Edge: a line connecting two vertices
      • Edge_Label: a term I originally coined for Data Structure Graphs that helps trace a path. If you label your edges, and multiple edges in a graph share the same label, you can quite easily identify walks, paths, and cycles through your graph.
      • Triangle: 3 vertices, 3 edges
      • Square: 4 vertices, 4 edges
      • Open Triangle: 3 vertices, 2 edges
      • A lot of things are networks if you look at them the right way.
      • Mark Newman has done a number of really cool presentations on network analysis, available on YouTube.
      • https://www.youtube.com/watch?v=lETt7IcDWLI
  • 8. More terms
      • Shortest Path: how are two vertices connected?
      • Longest Path: tracing the flow of an interesting item through a large collection of applications.
      • What is a path?
      • Centrality: Hub and Authority
      • This is almost a whole topic by itself, since there are different types of centrality: Degree Centrality, Eigenvector Centrality, PageRank, etc.
      • Transitivity
      • Homophily: how things are similar
      • Directed Graphs, or Digraphs
      • Contagion: how do things “spread” through a network?
      • Let’s rearrange things: how does the layout affect understanding?
      • Order of a graph: number of vertices
      • Size of a graph: number of edges
      • This is not just data visualization; it can also be used for prediction. https://www.youtube.com/watch?v=rwA-y-XwjuU
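
To make a few of these terms concrete, here is a small sketch in plain Scala that computes the order and size of a toy directed graph and uses a breadth-first search to get shortest-path hop counts from one vertex. The edge list and the names `GraphTerms`, `order`, `size`, and `hops` are made up for illustration; this is not how GraphX does it internally.

```scala
import scala.collection.immutable.Queue

object GraphTerms {
  // A tiny directed graph as (source, destination) pairs; vertex ids are made up.
  val edges: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (1, 3), (3, 4), (4, 5))

  // Order = number of distinct vertices; size = number of edges.
  def order(es: Seq[(Int, Int)]): Int =
    es.flatMap { case (s, d) => Seq(s, d) }.distinct.size
  def size(es: Seq[(Int, Int)]): Int = es.size

  // Breadth-first search: hop count from `source` to every reachable vertex.
  def hops(es: Seq[(Int, Int)], source: Int): Map[Int, Int] = {
    // Adjacency list: source vertex -> destinations of its outgoing edges.
    val out: Map[Int, Seq[Int]] =
      es.groupBy(_._1).map { case (s, ps) => s -> ps.map(_._2) }

    @annotation.tailrec
    def loop(frontier: Queue[Int], dist: Map[Int, Int]): Map[Int, Int] =
      frontier.dequeueOption match {
        case None => dist
        case Some((v, rest)) =>
          // Neighbours we have not reached yet are one hop further than v.
          val fresh = out.getOrElse(v, Seq.empty).distinct.filterNot(dist.contains)
          loop(rest ++ fresh, dist ++ fresh.map(_ -> (dist(v) + 1)))
      }

    loop(Queue(source), Map(source -> 0))
  }
}
```

Vertices that never show up in the result of `hops` are unreachable from the source, which is the same situation the "infinite" path lengths represent later in the talk.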
  • 9. Samples
      • Some samples from Wiki.
      • On the right, a basic graph; on the left, the languages used in Wikipedia.
  • 10. Little sidebar - Paths
      • Now that we have some terms under our belt:
      • What is the difference between shortest path and longest path?
  • 11. The Math doesn’t change.
      • One thing I like about Graphs: the math does not change.
      • The math behind Graph theory can be a little intense, but it does not change regardless of the scale of the graph.
      • Once you understand how to “do the math” on a small graph, the same math applies whether it is a graph of the people in this room or a graph of the people on this planet.
  • 12. Small Graphs
      • What is a small graph?
      • Friends on Facebook or LinkedIn.
      • Usually this can be displayed and analyzed rather easily.
      • If the graph continues to grow, you need better tools.
      • Let’s do a quick demo of a small graph visualization.
  • 13. Gephi
      • http://gephi.github.io/
      • From the website: “Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.”
      • To get this yourself, go into Facebook and search for Netvizz. (You have to authorize it; you can de-authorize it later.)
      • Click the application.
      • Click “personal network”.
      • Click Start.
      • Download your gdf file.
      • Quick Demo (Vote time: if everyone is comfortable with general graphs we can come back to this.)
  • 14. Large Graphs
      • What is a large graph?
      • To me, a large graph is one that cannot be easily visualized by software such as Gephi.
      • You have to use large tools to calculate the important statistics, such as centrality, diameter, average degree, etc.
      • Breaking a large graph down to a small graph is actually not as simple as it sounds.
      • This can be done reasonably easily with tools such as GraphX.
      • Now what we all came for:
  • 16. GraphX
      • GraphX is Apache Spark's API for graphs and graph-parallel computation.
      • https://spark.apache.org/graphx/
      • http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
      • While GraphX is “just a library”, it is a library that exists within the Spark environment, which provides a whole host of benefits like scaling, clustering, storage, and other things that you don’t have to dwell on.
      • As of right now, GraphX is Scala only.
  • 17. Data Science Challenge
      • Who should follow whom?
      • Winklr is a curiously popular social network for fans of the sitcom Happy Days. Users can post photos, write messages, and most importantly, follow each other’s posts and content. This helps users keep up with new content from their favorite users on the site.
      • Problem 3 of the data science challenge was a graph analysis problem.
      • Derive the top 70,000 connections that should be recommended.
  • 18. Sample of the whole graph
  • 19. My approach
      • Type of problem: Graph Analysis
      • Create a Master Graph.
      • Run PageRank to identify centrality.
      • Create many small graphs for individual users.
      • Mask the Master Graph and the PageRank Graph.
      • For each candidate vertex to be followed, multiply together its centrality, its in-degree, and the inverse of the length of the path from this particular user to the candidate.
      • This code takes over 48 hours to run.
      • Code: Problem3.sh and AnalyzeGraph.scala
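
The scoring step can be illustrated without Spark. The following is only a sketch of the arithmetic described above, not the code from AnalyzeGraph.scala: the `Candidate` case class, the `FollowScore` object, and the idea of pre-filtering on path length are assumptions for the example, and any numbers you feed it are your own.

```scala
object FollowScore {
  // Hypothetical per-candidate inputs for one user.
  final case class Candidate(id: Long, pageRank: Double, inDegree: Int, pathLength: Double)

  // score = centrality * in-degree * (1 / path length)
  def score(c: Candidate): Double =
    c.pageRank * c.inDegree * (1.0 / c.pathLength)

  // Drop unreachable candidates (infinite path length) and the user itself
  // (path length 0), then rank the rest from highest to lowest score.
  def recommend(cs: Seq[Candidate], top: Int): Seq[Long] =
    cs.filter(c => c.pathLength > 0 && !c.pathLength.isInfinity)
      .sortBy(c => -score(c))
      .take(top)
      .map(_.id)
}
```

A nearby, well-followed, central user scores high; a distant one is discounted by the inverse path length, and one with no path at all never makes the list.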
  • 20. Now we will review github
      • https://github.com/dougneedham/Cloudera-Data-Scientist-Challenge-3/tree/master/problem3
  • 21. Snapshot of code:
      var PathGraph = ShortestPathFromSource.mask(ClickPairGraph)
      var Influence = PathGraph.joinVertices(MasterGraph.inDegrees)((id, pathlength, indeg) => (1 / pathlength) * indeg)
      var central_influence = Influence.joinVertices(MaskedMasterGraphPR.vertices)((id, dist, pagerank) => dist * pagerank)
      //
      // We want to eliminate the infinite, follow someone that there is in fact a path to
      //
      println("Processing " + central_influence.vertices.filter(_._2 < Double.PositiveInfinity).count())
      //central_influence.vertices.filter(_._2 < Double.PositiveInfinity).collect()foreach(record_to_list(SourceID,_))
      val save_file_name = base_path + "/problem3/OutGraph/" + SourceID.trim() + ".data"
      central_influence.vertices.filter(_._2 < Double.PositiveInfinity).saveAsTextFile(save_file_name)
  • 22. Expectations
      • This is where we tie together the “small graphs” versus “big graphs”.
      • Creating a sub-graph of a larger graph is not obvious.
      • I was expecting to see one big clump of tightly connected nodes. This would be the “Target” to follow.
      • I was also expecting to see two smaller clumps of nodes, loosely connected to the larger clump. These are the “followers”: as we recommend that they follow the more popular node, they will become more closely connected to this user.
      • Here is the output from Gephi that shows whether the code worked or not.
  • 24. Where do I get data?
      • How you construct the network depends on the question(s) you are posing.
      • Chances are you have lots of data already; it is simply a matter of perspective.
      • Apply graphs to your own company's architecture.
      • Public social network data
      • The example mentioned from Gephi (Netvizz)
  • 25. Data Structure Graphs
      • A DSG Level 1 can show you which of your tables are going to have the most interesting query performance.
      • A DSG Level 2 can show you where the most work is going on in your Enterprise.
      • Data Structure Graph Level 1: roughly like an Entity Relationship Diagram (ERD). Tables are vertices, foreign keys are edges.
      • Data Structure Graph Level 2: each vertex in this graph is an application, and each edge is a data transfer. Roughly equivalent to what we used to call Data Flow diagrams.
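
As a rough illustration of a DSG Level 1, the sketch below treats tables as vertices and foreign keys as directed edges, then counts how many foreign keys point at each table. The schema is invented, and using in-degree as the signal for "interesting" tables is my assumption for the example rather than a definition from the slides.

```scala
object DsgLevel1 {
  // Hypothetical schema: each pair is (referencing table, referenced table),
  // i.e. one foreign key.
  val foreignKeys: Seq[(String, String)] = Seq(
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("reviews", "products"),
    ("reviews", "customers")
  )

  // In-degree per table: how many foreign keys point at it.
  def inDegrees(fks: Seq[(String, String)]): Map[String, Int] =
    fks.groupBy(_._2).map { case (table, refs) => table -> refs.size }

  // Tables ordered from most to least referenced (ties broken by name).
  def mostReferenced(fks: Seq[(String, String)]): Seq[(String, Int)] =
    inDegrees(fks).toSeq.sortBy { case (t, n) => (-n, t) }
}
```

Tables that nothing references (here `order_items` and `reviews`) do not appear in the result at all; the heavily referenced ones are the tables most joins pass through.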
  • 26. SNAP
      • SNAP: Stanford Network Analysis Project.
      • If you want to learn how to do Network Analysis and you can’t find any data, go here.
  • 27. Consider the following:
      • Network/Graph Analysis is cool.
      • It can show you some interesting things about your data that you may not have considered.
      • Due thought should be put towards a network analysis project.
      • Organizing the data requires a bit of thought. (From -> To vertices is just a start.)
      • Directed graph, undirected, bigraph? Some up-front setup work needs to be done.
      • Tools help with the detailed calculations and show the paths, walks, etc.
      • If you need assistance, send a message to the group, or contact me directly (I am easy to find: @dougneedham).
  • 28. Final Thoughts – Questions?