SlideShare ist ein Scribd-Unternehmen logo
1 von 48
Downloaden Sie, um offline zu lesen
Scalable graph analysis with
Apache Giraph and Spark GraphX
Roman Shaposhnik rvs@apache.org @rhatr
Director of Open Source, Pivotal Inc.
Introduction into
scalable graph analysis with
Apache Giraph and Spark GraphX
Roman Shaposhnik rvs@apache.org @rhatr
Director of Open Source, Pivotal Inc.
Shameless plug #1
Shameless plug #1
Agenda:
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
2
1
Lets define some terms
•  Graph is a G = (V, E), where E VxV
•  Directed multigraphs with properties attached to each vertex and edge

foo
bar
fee
2
1 foo
bar
fee
42
fum
What kind of graphs are we talking about?
• Page ranking on Facebook social graph (mid 2013)
•  10^9 (billions) vertices
•  10^12 (trillion) edges
•  10^15 (petabtybe) cold storage data scale
•  200 servers
•  …all in under 4 minutes!
“On day one Doug created
HDFS and MapReduce”
Google papers that started it all
• GFS (file system)
•  distributed
•  replicated
•  non-POSIX"

• MapReduce (computational framework)
•  distributed
•  batch-oriented (long jobs; final results)
•  data-gravity aware
•  designed for “embarrassingly parallel” algorithms
HDFS pools and abstracts direct-attached storage
…
HDFS
MR MR
A Unix analogy
§ It is as though instead of:
$	
  grep	
  foo	
  bar.txt	
  |	
  tr	
  “,”	
  “	
  “	
  |	
  sort	
  -­‐u	
  
	
  
§ We are doing:
$	
  grep	
  foo	
  <	
  bar.txt	
  >	
  /tmp/1.txt	
  
$	
  tr	
  “,”	
  “	
  “	
  	
  <	
  /tmp/1.txt	
  >	
  /tmp/2.txt	
  
$	
  sort	
  –u	
  <	
  /tmp/2.txt	
  
Enter Apache Spark
RAM is the new disk, Disk is the new tape
Source: UC Berkeley Spark project (just the image)
RDDs instead of HDFS files, RAM instead of Disk
warnings = textFile(…).filter(_.contains(“warning”))
.map(_.split(‘ ‘)(1))
HadoopRDD
path = hdfs://
FilteredRDD
contains…
MappedRDD
split…
pooled RAM
RDDs: resilient, distributed, datasets
§ Distributed on a cluster in RAM
§ Immutable (mostly)
§ Can be evicted, snapshotted, etc.
§ Manipulated via parallel operators (map, etc.)
§ Automatically rebuilt on failure
§ A parallel ecosystem
§ A solution to iterative and multi-stage apps
What’s so special about Graphs and
big data?
Graph relationships
§ Entities in your data: tuples
-  customer data
-  product data
-  interaction data
§ Connection between entities: graphs
-  social network or my customers
-  clustering of customers vs. products
A word about Graph databases
§  Plenty available
-  Neo4J, Titan, etc.
§  Benefits
-  Query language
-  Tightly integrate systems with few moving parts
-  High performance on known data sets
§  Shortcomings
-  Not easy to scale horizontally
-  Don’t integrate with HDFS
-  Combine storage and computational layers
-  A sea of APIs
What’s the key API?
§ Directed multi-graph with labels attached to vertices and edges
§ Defining vertices and edges dynamically
§ Selecting sub-graphs
§ Mutating the topology of the graph
§ Partitioning the graph
§ Computing model that is
-  iterative
-  scalable (shared nothing)
-  resilient
-  easy to manage at scale
Bulk Synchronous Parallel
BSP compute model
BSP in a nutshell
time
communications
local
processing
barrier #1
barrier #2
barrier #3
Vertex-centric BSP application
@rhatr
@TheASF
@c0sin
“Think like a vertex”
•  I know my local state
•  I know my neighbors
•  I can send messages to vertices
•  I can declare that I am done
•  I can mutate graph topology
Local state, global messaging
time
communications
vertices are
doing local
computing
and pooling 
messages
superstep #1
all vertices are
done computing
superstep #2
Lets put it all together
Hadoop ecosystem view
HDFS
Pig
Sqoop Flume
MR
Hive
Tez
Giraph
Mahout
Spark
SparkSQL
MLib
GraphX
HAWQ
Kafka
YARN
MADlib
Spark view
HDFS, Ceph, GlusterFS, S3
Hive
Spark
SparkSQL
MLib
GraphX
Kafka
YARN, Mesos, MR
Enough boxology!
Lets look at some code
Our toy for the rest of this talk
Adjacency lists stored on HDFS
$ hadoop fs –cat /tmp/graph/1.txt
1
2 1 3
3 1 2
@rhatr
@TheASF
@c0sin
3
1
2
Graph modeling in GraphX
§  The property graph is parameterized over the vertex (VD) and edge (ED) types
class Graph[VD, ED] {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
}
§  Graph[(String, String), String]
Hello world in GraphX
$ spark*/bin/spark-shell
scala val inputFile = sc.textFile(“hdfs:///tmp/graph/1.txt”)
scala val edges = inputFile.flatMap(s = { // “2 1 3”
val l = s.split(t); // [ “2”, “1”, “3” ]
l.drop(1).map(x = (l.head.toLong, x.toLong)) // [ (2, 1), (2, 3) ]
})
scala val graph = Graph.fromEdgeTuples(edges, ) // Graph[String, Int]
scala val result = graph.collectNeighborIds(EdgeDirection.Out).map(x =
println(Hello world from the:  + x._1 +  :  + x._2.mkString( )) )
scala result.collect() // don’t try this @home
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Graph modeling in Giraph
BasicComputationI	
  extends	
  WritableComparable,	
  	
  	
  	
  	
  //	
  VertexID	
  	
  	
  -­‐-­‐	
  vertex	
  ref	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  V	
  extends	
  Writable,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  VertexData	
  -­‐-­‐	
  a	
  vertex	
  datum	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  E	
  extends	
  Writable,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  EdgeData	
  	
  	
  -­‐-­‐	
  an	
  edge	
  label	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  M	
  extends	
  Writable	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  MessageData-­‐–	
  message	
  payload	
  
	
  
	
  
V	
  is	
  sort	
  of	
  like	
  VD	
  
E	
  is	
  sort	
  of	
  like	
  ED	
  
Hello world in Giraph
public class GiraphHelloWorld extends
BasicComputationIntWritable, IntWritable, NullWritable, NullWritable {
public void compute(VertexIntWritable, IntWritable, NullWritable vertex,
IterableNullWritable messages) {
System.out.print(“Hello world from the: “ + vertex.getId() + “ : “);
for (EdgeIntWritable, NullWritable e : vertex.getEdges()) {
System.out.print(“ “ + e.getTargetVertexId());
}
System.out.println(“”);
vertex.voteToHalt();
}
}
How to run it
$ giraph target/*.jar giraph.GiraphHelloWorld 
-vip /tmp/graph/ 
-vif org.apache.giraph.io.formats.IntIntNullTextInputFormat 
-w 1 
-ca giraph.SplitMasterWorker=false,giraph.logLevel=error
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Anatomy of Giraph run
BSP assumes an exclusively vertex view
Turning Twitter into Facebook
@rhatr
@TheASF
@c0sin
@rhatr
@TheASF
@c0sin
Hello world in Giraph
public void compute(VertexText, DoubleWritable, DoubleWritable vertex, IterableText ms ){
if (getSuperstep() == 0) {
sendMessageToAllEdges(vertex, vertex.getId());
} else {
for (Text m : ms) {
if (vertex.getEdgeValue(m) == null) {
vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
}
}
}
vertex.voteToHalt();
}
BSP in GraphX
Single source shortest path
scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message
((id, dist, newDist) = math.min(dist, newDist), // Vertex Program
triplet = { // Send Message
if (triplet.srcAttr + triplet.attr  triplet.dstAttr) {
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
} else {
Iterator.empty
}
},
(a,b) = math.min(a,b)) // Merge Messages
scala println(sssp.vertices.collect.mkString(n))
2
42
0
3
Single source shortest path
scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message
((id, dist, newDist) = math.min(dist, newDist), // Vertex Program
triplet = { // Send Message
if (triplet.srcAttr + triplet.attr  triplet.dstAttr) {
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
} else {
Iterator.empty
}
},
(a,b) = math.min(a,b)) // Merge Messages
scala println(sssp.vertices.collect.mkString(n))
2
5
0
3
Operational views of the graph
Masking instead of mutation
§ def subgraph(
epred: EdgeTriplet[VD,ED] = Boolean = (x = true),
vpred: (VertexID, VD) = Boolean = ((v, d) = true))
: Graph[VD, ED]
§ def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
Built-in algorithms
§  def pageRank(tol: Double, resetProb: Double = 0.15):
Graph[Double, Double]
§  def connectedComponents(): Graph[VertexID, ED]
§  def triangleCount(): Graph[Int, ED]
§  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED]
Final thoughts
Giraph
§ An unconstrained BSP framework
§ Specialized fully mutable,
dynamically balanced in-memory
graph representation
§ Very procedural, vertex-centric
programming model
§ Genuine part of Hadoop ecosystem
§ Definitely a 1.0
GraphX
§ An RDD framework
§ Graphs are “views” on RDDs and
thus immutable
§ Functional-like, “declarative”
programming model
§ Genuine part of Spark ecosystem
§ Technically still an alpha
QA
Thanks!

Weitere ähnliche Inhalte

Was ist angesagt?

GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 

Was ist angesagt? (20)

Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Apache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done betterApache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done better
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Giraph
Apache GiraphApache Giraph
Apache Giraph
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015
 

Andere mochten auch

Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Turi, Inc.
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 

Andere mochten auch (13)

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 

Ähnlich wie Introduction into scalable graph analysis with Apache Giraph and Spark GraphX

Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for Mobile
BugSense
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Andrey Vykhodtsev
 

Ähnlich wie Introduction into scalable graph analysis with Apache Giraph and Spark GraphX (20)

Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for Mobile
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Cloud jpl
Cloud jplCloud jpl
Cloud jpl
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Scala+data
Scala+dataScala+data
Scala+data
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 

Mehr von rhatr

Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
rhatr
 

Mehr von rhatr (8)

Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
 
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
OSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear ofOSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear of
 
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platformApache Bigtop: a crash course in deploying a Hadoop bigdata management platform
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
 

Kürzlich hochgeladen

Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 

Kürzlich hochgeladen (20)

%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX

  • 1. Scalable graph analysis with Apache Giraph and Spark GraphX Roman Shaposhnik rvs@apache.org @rhatr Director of Open Source, Pivotal Inc.
  • 2. Introduction into scalable graph analysis with Apache Giraph and Spark GraphX Roman Shaposhnik rvs@apache.org @rhatr Director of Open Source, Pivotal Inc.
  • 6. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee
  • 7. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee
  • 8. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee 2 1
  • 9. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee 2 1 foo bar fee 42 fum
  • 10. What kind of graphs are we talking about? • Page ranking on Facebook social graph (mid 2013) •  10^9 (billions) vertices •  10^12 (trillion) edges •  10^15 (petabtybe) cold storage data scale •  200 servers •  …all in under 4 minutes!
  • 11. “On day one Doug created HDFS and MapReduce”
  • 12. Google papers that started it all • GFS (file system) •  distributed •  replicated •  non-POSIX" • MapReduce (computational framework) •  distributed •  batch-oriented (long jobs; final results) •  data-gravity aware •  designed for “embarrassingly parallel” algorithms
  • 13. HDFS pools and abstracts direct-attached storage … HDFS MR MR
  • 14. A Unix analogy § It is as though instead of: $  grep  foo  bar.txt  |  tr  “,”  “  “  |  sort  -­‐u     § We are doing: $  grep  foo  <  bar.txt  >  /tmp/1.txt   $  tr  “,”  “  “    <  /tmp/1.txt  >  /tmp/2.txt   $  sort  –u  <  /tmp/2.txt  
  • 16. RAM is the new disk, Disk is the new tape Source: UC Berkeley Spark project (just the image)
  • 17. RDDs instead of HDFS files, RAM instead of Disk warnings = textFile(…).filter(_.contains(“warning”)) .map(_.split(‘ ‘)(1)) HadoopRDD path = hdfs:// FilteredRDD contains… MappedRDD split… pooled RAM
  • 18. RDDs: resilient, distributed, datasets § Distributed on a cluster in RAM § Immutable (mostly) § Can be evicted, snapshotted, etc. § Manipulated via parallel operators (map, etc.) § Automatically rebuilt on failure § A parallel ecosystem § A solution to iterative and multi-stage apps
  • 19. What’s so special about Graphs and big data?
  • 20. Graph relationships § Entities in your data: tuples -  customer data -  product data -  interaction data § Connection between entities: graphs -  social network or my customers -  clustering of customers vs. products
  • 21. A word about Graph databases §  Plenty available -  Neo4J, Titan, etc. §  Benefits -  Query language -  Tightly integrate systems with few moving parts -  High performance on known data sets §  Shortcomings -  Not easy to scale horizontally -  Don’t integrate with HDFS -  Combine storage and computational layers -  A sea of APIs
  • 22. What’s the key API? § Directed multi-graph with labels attached to vertices and edges § Defining vertices and edges dynamically § Selecting sub-graphs § Mutating the topology of the graph § Partitioning the graph § Computing model that is -  iterative -  scalable (shared nothing) -  resilient -  easy to manage at scale
  • 24. BSP in a nutshell time communications local processing barrier #1 barrier #2 barrier #3
  • 25. Vertex-centric BSP application @rhatr @TheASF @c0sin “Think like a vertex” •  I know my local state •  I know my neighbors •  I can send messages to vertices •  I can declare that I am done •  I can mutate graph topology
  • 26. Local state, global messaging time communications vertices are doing local computing and pooling messages superstep #1 all vertices are done computing superstep #2
  • 27. Lets put it all together
  • 28. Hadoop ecosystem view HDFS Pig Sqoop Flume MR Hive Tez Giraph Mahout Spark SparkSQL MLib GraphX HAWQ Kafka YARN MADlib
  • 29. Spark view HDFS, Ceph, GlusterFS, S3 Hive Spark SparkSQL MLib GraphX Kafka YARN, Mesos, MR
  • 31. Our toy for the rest of this talk Adjacency lists stored on HDFS $ hadoop fs –cat /tmp/graph/1.txt 1 2 1 3 3 1 2 @rhatr @TheASF @c0sin 3 1 2
  • 32. Graph modeling in GraphX §  The property graph is parameterized over the vertex (VD) and edge (ED) types class Graph[VD, ED] { val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] } §  Graph[(String, String), String]
  • 33. Hello world in GraphX $ spark*/bin/spark-shell scala val inputFile = sc.textFile(“hdfs:///tmp/graph/1.txt”) scala val edges = inputFile.flatMap(s = { // “2 1 3” val l = s.split(t); // [ “2”, “1”, “3” ] l.drop(1).map(x = (l.head.toLong, x.toLong)) // [ (2, 1), (2, 3) ] }) scala val graph = Graph.fromEdgeTuples(edges, ) // Graph[String, Int] scala val result = graph.collectNeighborIds(EdgeDirection.Out).map(x = println(Hello world from the: + x._1 + : + x._2.mkString( )) ) scala result.collect() // don’t try this @home Hello world from the: 1 : Hello world from the: 2 : 1 3 Hello world from the: 3 : 1 2
  • 34. Graph modeling in Giraph BasicComputationI  extends  WritableComparable,          //  VertexID      -­‐-­‐  vertex  ref                                                                                                        V  extends  Writable,                              //  VertexData  -­‐-­‐  a  vertex  datum                                    E  extends  Writable,                              //  EdgeData      -­‐-­‐  an  edge  label                                    M  extends  Writable                              //  MessageData-­‐–  message  payload       V  is  sort  of  like  VD   E  is  sort  of  like  ED  
  • 35. Hello world in Giraph public class GiraphHelloWorld extends BasicComputationIntWritable, IntWritable, NullWritable, NullWritable { public void compute(VertexIntWritable, IntWritable, NullWritable vertex, IterableNullWritable messages) { System.out.print(“Hello world from the: “ + vertex.getId() + “ : “); for (EdgeIntWritable, NullWritable e : vertex.getEdges()) { System.out.print(“ “ + e.getTargetVertexId()); } System.out.println(“”); vertex.voteToHalt(); } }
  • 36. How to run it $ giraph target/*.jar giraph.GiraphHelloWorld -vip /tmp/graph/ -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -w 1 -ca giraph.SplitMasterWorker=false,giraph.logLevel=error Hello world from the: 1 : Hello world from the: 2 : 1 3 Hello world from the: 3 : 1 2
  • 38. BSP assumes an exclusively vertex view
  • 39. Turning Twitter into Facebook @rhatr @TheASF @c0sin @rhatr @TheASF @c0sin
  • 40. Hello world in Giraph public void compute(VertexText, DoubleWritable, DoubleWritable vertex, IterableText ms ){ if (getSuperstep() == 0) { sendMessageToAllEdges(vertex, vertex.getId()); } else { for (Text m : ms) { if (vertex.getEdgeValue(m) == null) { vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE)); } } } vertex.voteToHalt(); }
  • 42. Single source shortest path scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message ((id, dist, newDist) = math.min(dist, newDist), // Vertex Program triplet = { // Send Message if (triplet.srcAttr + triplet.attr triplet.dstAttr) { Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) } else { Iterator.empty } }, (a,b) = math.min(a,b)) // Merge Messages scala println(sssp.vertices.collect.mkString(n)) 2 42 0 3
  • 43. Single source shortest path scala val sssp = graph.pregel(Double.PositiveInfinity) // Initial message ((id, dist, newDist) = math.min(dist, newDist), // Vertex Program triplet = { // Send Message if (triplet.srcAttr + triplet.attr triplet.dstAttr) { Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) } else { Iterator.empty } }, (a,b) = math.min(a,b)) // Merge Messages scala println(sssp.vertices.collect.mkString(n)) 2 5 0 3
  • 44. Operational views of the graph
  • 45. Masking instead of mutation § def subgraph( epred: EdgeTriplet[VD,ED] = Boolean = (x = true), vpred: (VertexID, VD) = Boolean = ((v, d) = true)) : Graph[VD, ED] § def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  • 46. Built-in algorithms §  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double] §  def connectedComponents(): Graph[VertexID, ED] §  def triangleCount(): Graph[Int, ED] §  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED]
  • 47. Final thoughts Giraph § An unconstrained BSP framework § Specialized fully mutable, dynamically balanced in-memory graph representation § Very procedural, vertex-centric programming model § Genuine part of Hadoop ecosystem § Definitely a 1.0 GraphX § An RDD framework § Graphs are “views” on RDDs and thus immutable § Functional-like, “declarative” programming model § Genuine part of Spark ecosystem § Technically still an alpha