Streaming and Online Algorithms for GraphX 
Graph Analytics Team 
Xia (Ivy) Zhu 
Intel Confidential — Do Not Forward
Why Streaming Processing on Graph? 
Every day:
• New stores join
• New users join
• New users browse, click, and buy items
• Old users browse, click, and buy items
• New ads are added
• …
How to:
• Recommend products based on users' interests
• Recommend products based on users' shopping habits
• Recommend products based on users' purchasing capability
• Place ads that users are most likely to click
• …
A huge number of relationships are created each day; using them wisely is important.
Alibaba Is Not Alone, Graphs are Everywhere 
• 100B neurons, 100T relationships
• 1.23B users, 160B friendships
• 1 trillion pages, 100s of trillions of links
• Millions of products and users
• 50M users, 1B hours watched per month
• Large biological cell networks
… And Graphs Keep Evolving 
Streaming Processing Pipeline 
[Figure: pipeline stages Data Stream, ETL, Graph Creation, and ML, running over a Distributed Messaging System]
• Kafka provides the distributed messaging
• GraphX is the graph processing engine
What is GraphX 
• Graph processing engine on Spark 
• Supports Pregel-type vertex programming
• Unifies data-parallel and graph-parallel processing 
Picture Source: GraphX team
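The Pregel-type vertex programming model can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not GraphX code: the function `pregel` and its parameter names are assumptions chosen to echo the vertex program, send-message, and merge-message roles, and the shortest-path example is only one possible vertex program.

```python
import math

def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iter=50):
    """Run supersteps until no messages are sent.

    vertices: dict vertex_id -> attribute
    edges:    list of (src, dst, edge_attr)
    """
    # Superstep 0: every vertex receives the initial message.
    state = {v: vprog(v, attr, initial_msg) for v, attr in vertices.items()}
    active = set(state)
    for _ in range(max_iter):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active:          # only vertices updated last round send
                continue
            for target, msg in send_msg(src, state[src], dst, state[dst], attr):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break
        for v, msg in inbox.items():       # vertex program runs on receivers only
            state[v] = vprog(v, state[v], msg)
        active = set(inbox)
    return state

# Example vertex program: single-source shortest paths from vertex 1.
def sssp_send(src, src_dist, dst, dst_dist, weight):
    if src_dist + weight < dst_dist:
        yield dst, src_dist + weight

edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
start = {1: 0.0, 2: math.inf, 3: math.inf, 4: math.inf}
dist = pregel(start, edges, math.inf,
              vprog=lambda v, attr, msg: min(attr, msg),
              send_msg=sssp_send,
              merge_msg=min)
```

Only vertices that received a message in the previous superstep send in the next one, which is what lets the computation stop once nothing changes.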
Why GraphX 
• GraphLab performs well, but is standalone
• Giraph is open source and scales well, but its performance is not good
• GraphX supports both table and graph operations
• On the same platform, Spark Streaming provides a basic streaming framework
[Figure: the Spark stack. Spark core (RDDs, transformations, and actions) with Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs), MLlib (machine learning; RDD-based matrices), and GraphX (graph processing/machine learning; RDD-based graphs)]
Picture Source: Databricks
Naïve Streaming Does not Scale 
• Current GraphX is designed for static graphs 
• Current Spark streaming provides limited types of state DStreams 
• Naïve approach:
  • Merge table data before going to graph processing pipeline
  • Re-generate whole graph and re-run ML at each window
  • Minimal changes to GraphX and Spark Streaming
  • Straightforward, but does not scale well
[Figure: Throughput vs Latency of Naive Graph Streaming. Latency (s), on a 0 to 180 scale, plotted for sample points 1 to 9]
Our solution 
• Static algorithms -> Online algorithms 
• Merge information at graph phase 
• Efficient graph store for evolving graph 
• Better partitioning algorithms to reduce replicas 
• Static index -> On the fly indexing method (ongoing)
Static vs Online Algorithms 
• Static algorithms
  • Suited to recomputing the whole graph at each time instance and re-running ML
  • Become increasingly infeasible in the Big Data era, given the size and growth rate of graphs
• Online algorithms
  • Incremental machine learning is triggered by changes in the graph
  • We designed online algorithms based on delta updates
  • PageRank as an example
  • The same idea is applicable to other machine learning algorithms
Static vs Online PageRank
Static_PageRank
  // Initial vertex value
  (0.0, 0.0)
  // First message
  initialMessage:
    msg = alpha / (1.0 - alpha)
  // Broadcast to neighbors
  sendMessage:
    if (edge.srcAttr._2 > tol)
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // Aggregate messages for each vertex
  messageCombiner(a, b):
    sum = a + b
  // Update vertex
  vertexProgram(sum):
    updates = (1.0 - alpha) * sum
    (oldPR + updates, updates)
Online_PageRank
  // Initial vertex value
  base graph:
    (0.0, 0.0)
  incremental graph:
    old vertices:
      (lastWindowPR, lastWindowDelta)
    new vertices:
      (alpha, alpha)
  // First message
  initialMessage:
    base graph:
      msg = alpha / (1.0 - alpha)
    incremental graph:
      none
  // Broadcast to neighbors
  sendMessage:
    oldSrc->newDst:
      Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
    newSrc->newDst or not converged:
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // Aggregate messages for each vertex
  messageCombiner(a, b):
    sum = a + b
  // Update vertex
  vertexProgram(sum):
    updates = (1.0 - alpha) * sum
    (oldPR + updates, updates)
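The delta-update idea can be tried out in plain Python. The sketch below is not GraphX code and not a line-by-line port of the online rules above: `static_pagerank` follows the static pseudocode (after the first message every vertex holds (alpha, alpha)), while `incremental_pagerank` is a swapped-in, residual-based warm start that seeds each vertex's delta with the amount by which the previous window's ranks violate the PageRank equation on the new graph. All function names, the value alpha = 0.15, and the use of abs() to allow negative deltas are assumptions for illustration.

```python
from collections import defaultdict

ALPHA = 0.15  # reset probability (alpha in the pseudocode); an assumed value

def weighted(edges):
    """Attach edge.attr = 1 / outDegree(src) to every (src, dst) pair."""
    outdeg = defaultdict(int)
    for src, _ in edges:
        outdeg[src] += 1
    return [(src, dst, 1.0 / outdeg[src]) for src, dst in edges]

def propagate(edges, rank, delta, tol=1e-12, max_iter=10_000):
    """Push pending deltas along edges until none exceeds tol."""
    rank, delta = dict(rank), dict(delta)
    wedges = weighted(edges)
    for iteration in range(max_iter):
        msgs = defaultdict(float)
        for src, dst, w in wedges:
            if abs(delta[src]) > tol:            # sendMessage
                msgs[dst] += delta[src] * w      # messageCombiner: sum
        if not msgs:
            return rank, iteration
        delta = dict.fromkeys(rank, 0.0)
        for v, total in msgs.items():            # vertexProgram
            update = (1.0 - ALPHA) * total
            rank[v] += update
            delta[v] = update
    return rank, max_iter

def static_pagerank(edges):
    vertices = {v for e in edges for v in e}
    start = dict.fromkeys(vertices, ALPHA)       # state after the first message
    return propagate(edges, start, start)

def incremental_pagerank(edges, old_rank):
    """Warm start: reuse last window's ranks, propagate only the residual."""
    vertices = {v for e in edges for v in e}
    x0 = {v: old_rank.get(v, 0.0) for v in vertices}   # new vertices start at 0
    inflow = defaultdict(float)
    for src, dst, w in weighted(edges):
        inflow[dst] += x0[src] * w
    residual = {v: ALPHA + (1.0 - ALPHA) * inflow[v] - x0[v] for v in vertices}
    rank = {v: x0[v] + residual[v] for v in vertices}
    return propagate(edges, rank, residual)

base = [(1, 2), (2, 3), (3, 1), (3, 2)]
grown = base + [(4, 1), (1, 4), (2, 4)]          # next window: one new vertex
old_rank, _ = static_pagerank(base)
full_rank, full_iters = static_pagerank(grown)
inc_rank, inc_iters = incremental_pagerank(grown, old_rank)
```

Because the update is linear, the warm start reaches the same ranks as a full recomputation on the grown graph, up to the tolerance; how many iterations it saves depends on how much of the graph changed.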
GraphX Data Loading and Data Structure 
[Figure: GraphX data loading and data structures. Labels: Edge lists (SrcId, DstId); EdgeRDD (Data, Index); Re-HashPartition; RoutingTablePartition; VertexRDD; RoutingTableMessage (HasSrcId, HasDstId); Replicated Vertex View; GraphImpl; EdgePartition; VertexPartition (Vid, Data, Mask); Shippable Vertex Partition containing a VertexPartition (Vid, Data, Mask)]
GraphX Data Loading and Data Structure 
[Figure: the same diagram as on the previous slide, with two annotations: "Static Index" on the index, and "Partitioning algorithm can help reduce the replication factor"]
Partitioning Algorithm 
• Torus-based partitioning
  • Divide the partitions into an A x B matrix
  • A vertex's master partition is decided by a hash function
  • The replica set is the master partition's full column, plus B/2 + 1 elements of its row, starting from the master partition
  • The intersection of the source replica set and the target replica set decides where an edge is placed
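The placement rule above can be sketched as follows. This is a minimal illustration under stated assumptions: vertex ids are non-negative integers, the "hash function" is simply the id modulo A x B, the row segment has B // 2 + 1 cells and wraps around the row, and ties among candidate partitions are broken by taking the smallest cell. None of these choices are confirmed by the slides.

```python
def master(vid, a, b):
    """Master partition of a vertex as a (row, column) cell of the A x B grid."""
    return divmod(vid % (a * b), b)      # assumed hash: id modulo grid size

def replica_set(vid, a, b):
    """Full column of the master, plus B // 2 + 1 row cells starting at the master."""
    row, col = master(vid, a, b)
    column = {(r, col) for r in range(a)}
    row_part = {(row, (col + k) % b) for k in range(b // 2 + 1)}  # wraps: torus
    return column | row_part

def place_edge(src, dst, a, b):
    """An edge goes to a partition that holds replicas of both endpoints."""
    candidates = replica_set(src, a, b) & replica_set(dst, a, b)
    return min(candidates)               # assumed tie-break; could balance load instead

A, B = 3, 4
cell = place_edge(3, 9, A, B)
```

With a half row plus one cell, either the source's column crosses the target's row segment or the other way round, so the intersection is never empty, while each vertex is replicated to far fewer than A x B partitions.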
Index Structure for Graph Streaming 
• GraphX uses a CSR (Compressed Sparse Row) based index
  • Originated from sparse matrix compression
  • Good for finding all out-edges of a source vertex
  • No support for finding all in-edges of a target vertex; that needs a full table scan
• At a minimum, a CSC (Compressed Sparse Column) index must be added for in-edges
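Raw edge list (each edge also carries a Data value)
Src  Dst
3    2
3    5
3    9
5    2
5    3
7    3
8    5
8    6
10   6
CSR (edges sorted by Src; the Dst and Data columns are stored in that order)
Idx  Unique Src
0    3
3    5
5    7
6    8
8    10
CSC (edges sorted by Dst; the Src column is stored as 3, 5, 5, 7, 3, 8, 8, 10, 3)
Unique Dst  Idx
2           0
3           2
5           4
6           6
9           8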
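The two indexes in the example can be rebuilt with a short Python sketch. The helper names `build_index` and `neighbors` are assumptions for illustration, and the per-edge Data column is left out; the edge list is the one shown on the slide.

```python
import bisect

def build_index(edges, key):
    """Sort edges by one column and record where each distinct key starts.

    key = 0 indexes by Src (CSR); key = 1 indexes by Dst (CSC).
    Returns (unique_keys, offsets, other_column).
    """
    ordered = sorted(edges, key=lambda e: e[key])   # stable sort
    unique_keys, offsets = [], []
    for pos, edge in enumerate(ordered):
        if not unique_keys or unique_keys[-1] != edge[key]:
            unique_keys.append(edge[key])
            offsets.append(pos)
    other_column = [edge[1 - key] for edge in ordered]
    return unique_keys, offsets, other_column

def neighbors(index, vertex):
    """Look up all edges of `vertex` without scanning the whole edge list."""
    unique_keys, offsets, other_column = index
    i = bisect.bisect_left(unique_keys, vertex)
    if i == len(unique_keys) or unique_keys[i] != vertex:
        return []
    end = offsets[i + 1] if i + 1 < len(offsets) else len(other_column)
    return other_column[offsets[i]:end]

EDGES = [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3), (7, 3), (8, 5), (8, 6), (10, 6)]
csr = build_index(EDGES, key=0)   # out-edges by source
csc = build_index(EDGES, key=1)   # in-edges by target
```

Here `neighbors(csr, 3)` returns the targets of vertex 3's out-edges and `neighbors(csc, 3)` the sources of its in-edges. Both indexes require the sort up front, which is the cost that motivates building the index on the fly for streaming.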
Index Structure for Graph Streaming 
• Both CSR and CSC must first sort the edge list and then create the index
• An even better way is to build the index on the fly
• Graph streaming needs both fast insert/write and fast search/read
• HashMap
  • Good for exact-match point search
  • Fast on insert and search
  • Good for graphs with a fixed or known size
  • Needs to re-hash when the size surpasses capacity
• Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array)
  • Support both point search and range search
  • B-Tree: fast search, slow insert
  • LSM-Tree: fast insert, slow search
  • COLA achieves a good tradeoff: fast insert and good-enough search
COLA-based index for graph streaming
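To make the insert/search tradeoff concrete, here is a minimal sketch of the basic COLA layout in Python: level k is either empty or a sorted run of exactly 2^k keys, and an insert cascades merges the way a binary counter carries. It is a simplification, not the index used in the pipeline: the lookahead pointers that speed up searches in a full COLA are omitted, so each search does one binary search per level. The class name `BasicCOLA` and the choice of (src, dst) tuples as keys are assumptions.

```python
import bisect
from heapq import merge

class BasicCOLA:
    """Sorted runs of size 1, 2, 4, ...; each level is either full or empty."""

    def __init__(self):
        self.levels = []              # levels[k] holds 0 or 2**k sorted keys

    def insert(self, key):
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:    # free slot: park the run here
                self.levels[k] = carry
                return
            # Level is full: merge it into the carry and move one level down.
            carry = list(merge(self.levels[k], carry))
            self.levels[k] = []
            k += 1

    def contains(self, key):
        for level in self.levels:     # point search: binary search per level
            i = bisect.bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False

    def range(self, lo, hi):
        """All keys with lo <= key <= hi, in sorted order."""
        found = []
        for level in self.levels:
            i = bisect.bisect_left(level, lo)
            j = bisect.bisect_right(level, hi)
            found.extend(level[i:j])
        return sorted(found)

EDGES = [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3), (7, 3), (8, 5), (8, 6), (10, 6)]
index = BasicCOLA()
for edge in EDGES:                    # edges arrive one at a time, unsorted is fine
    index.insert(edge)
out_edges_of_3 = index.range((3, 0), (3, float("inf")))
```

Inserts only ever touch sequential sorted runs, which keeps writes cheap, while a range query over (src, dst) keys returns all out-edges of a vertex without a prior global sort.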
Putting Things Together: Our Streaming Pipeline 
[Figure: a chain of repeated "+" and "OML" stages, ending in "…"]
Performance - Convergence Rate 
[Figure: Convergence Rate. Series: Naive, Incremental. Y axis: Normalized Number of Iterations (0.0 to 1.2). X axis: Graph Size (Number of Edges): Base, +20%, +40%, +60%, +80%, +100%, +150%, +200%]

Weitere ähnliche Inhalte

Was ist angesagt?

FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 
04 2017 emea_roadshowmilan_mariadb columnstore
04 2017 emea_roadshowmilan_mariadb columnstore04 2017 emea_roadshowmilan_mariadb columnstore
04 2017 emea_roadshowmilan_mariadb columnstoremlraviol
 
Lens at apachecon
Lens at apacheconLens at apachecon
Lens at apacheconamarsri
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubesmister_zed
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuSpark Summit
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsJen Aman
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkVasia Kalavri
 
Amazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Web Services
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationVolodymyr Rovetskiy
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsDatabricks
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services
 

Was ist angesagt? (20)

FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
04 2017 emea_roadshowmilan_mariadb columnstore
04 2017 emea_roadshowmilan_mariadb columnstore04 2017 emea_roadshowmilan_mariadb columnstore
04 2017 emea_roadshowmilan_mariadb columnstore
 
Lens at apachecon
Lens at apacheconLens at apachecon
Lens at apachecon
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubes
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache Flink
 
Amazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and Optimization
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
 

Andere mochten auch

Graphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming EraGraphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming EraVasia Kalavri
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkKrishna Sankar
 
GraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkGraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkAshutosh Trivedi
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with SparkGhulam Imaduddin
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analyticsSigmoid
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Neo4j Makes Graphs Easy- GraphDays
Neo4j Makes Graphs Easy- GraphDaysNeo4j Makes Graphs Easy- GraphDays
Neo4j Makes Graphs Easy- GraphDaysNeo4j
 
IoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM InformixIoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM InformixPradeep Muthalpuredathe
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedGuido Schmutz
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 

Andere mochten auch (17)

Graphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming EraGraphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming Era
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
GraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkGraphX and Pregel - Apache Spark
GraphX and Pregel - Apache Spark
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with Spark
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Neo4j Makes Graphs Easy- GraphDays
Neo4j Makes Graphs Easy- GraphDaysNeo4j Makes Graphs Easy- GraphDays
Neo4j Makes Graphs Easy- GraphDays
 
IoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM InformixIoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM Informix
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 

Ähnlich wie Streaming and Online Algorithms for GraphX

Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large GraphsNishant Gandhi
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Nebula Graph nMeetup in Shanghai - Meet with Graph Technology Enthusiasts
Nebula Graph nMeetup in Shanghai - Meet with Graph Technology EnthusiastsNebula Graph nMeetup in Shanghai - Meet with Graph Technology Enthusiasts
Nebula Graph nMeetup in Shanghai - Meet with Graph Technology EnthusiastsNebula Graph
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsPetr Novotný
 

Ähnlich wie Streaming and Online Algorithms for GraphX (20)

Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
cb streams - gavin pickin
cb streams - gavin pickincb streams - gavin pickin
cb streams - gavin pickin
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Nebula Graph nMeetup in Shanghai - Meet with Graph Technology Enthusiasts
Nebula Graph nMeetup in Shanghai - Meet with Graph Technology EnthusiastsNebula Graph nMeetup in Shanghai - Meet with Graph Technology Enthusiasts
Nebula Graph nMeetup in Shanghai - Meet with Graph Technology Enthusiasts
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 

Streaming and Online Algorithms for GraphX

  • 1. Streaming and Online Algorithms for GraphX Graph Analytics Team Xia (Ivy) Zhu Intel Confidential — Do Not Forward
  • 2. Why Streaming Processing on Graph? Every day: • New stores join • New users join • New users browse, click, and buy items • Existing users browse, click, and buy items • New ads are added • … How to: • Recommend products based on users’ interests • Recommend products based on users’ shopping habits • Recommend products based on users’ purchasing capability • Place ads that users are most likely to click • … A huge number of relationships are created each day, and using them wisely is important.
  • 3. Alibaba Is Not Alone, Graphs Are Everywhere • 100B neurons, 100T relationships • 1.23B users, 160B friendships • 1 trillion pages, 100s T links • Millions of products and users • 50M users, 1B hours/month watched • Large biological cell networks
  • 4. … And Graphs Keep Evolving
  • 5. Streaming Processing Pipeline. Diagram labels: Data Stream, ETL, Graph Creation, ML, Distributed Messaging System • We use Kafka for distributed messaging • GraphX is the graph processing engine
  • 6. What is GraphX • Graph processing engine on Spark • Supports Pregel-type vertex programming • Unifies data-parallel and graph-parallel processing (Picture Source: GraphX team)
  • 7. Why GraphX • GraphLab performs well, but is standalone • Giraph is open source and scales well, but its performance is not good • GraphX supports both table and graph operations • On the same platform, Spark Streaming provides a basic streaming framework. Diagram (Spark stack): Spark (RDDs, Transformations, and Actions); Spark Streaming (real-time; DStreams: streams of RDDs); Spark SQL (SchemaRDDs); MLlib (machine learning; RDD-based matrices); GraphX (graph processing/machine learning; RDD-based graphs) (Picture Source: Databricks)
  • 8. Naïve Streaming Does Not Scale • Current GraphX is designed for static graphs • Current Spark Streaming provides limited types of state DStreams • Naïve approach: • Merge table data before it goes to the graph processing pipeline • Re-generate the whole graph and re-run ML at each window • Minimal changes to GraphX and Spark Streaming • Straightforward, but does not scale well. Figure: Throughput vs Latency of Naive Graph Streaming (y-axis: Latency (s), 0 to 180; x-axis: Sample Point, 1 to 9)
  • 9. Our Solution • Static algorithms -> online algorithms • Merge information at the graph phase • Efficient graph store for evolving graphs • Better partitioning algorithms to reduce replicas • Static index -> on-the-fly indexing method (ongoing)
  • 10. Static vs Online Algorithms • Static algorithms • Recompute the whole graph at each time instance and re-run ML • Become increasingly infeasible in the Big Data era, given the size and growth rate of graphs • Online algorithms • Incremental machine learning is triggered by changes in the graph • We designed online algorithms based on delta updates • PageRank as an example • The same idea is applicable to other machine learning algorithms
  • 11. Static vs Online PageRank
      Static_PageRank
        // Initial vertex value
        (0.0, 0.0)
        // First message
        initialMessage: msg = alpha / (1.0 - alpha)
        // Broadcast to neighbors
        sendMessage: if (edge.srcAttr._2 > tol) Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
        // Aggregate messages for each vertex
        messageCombiner(a, b): sum = a + b
        // Update vertex
        vertexProgram(sum): updates = (1.0 - alpha) * sum; (oldPR + updates, updates)
      Online_PageRank
        // Initialize vertex value
        base graph: (0.0, 0.0)
        incremental graph:
          old vertices: (lastWindowPR, lastWindowDelta)
          new vertices: (alpha, alpha)
        // First message
        initialMessage:
          base graph: msg = alpha / (1.0 - alpha)
          incremental graph: none
        // Broadcast to neighbors
        sendMessage:
          oldSrc -> newDst: Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
          newSrc -> newDst or not converged: Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
        // Aggregate messages for each vertex
        messageCombiner(a, b): sum = a + b
        // Update vertex
        vertexProgram(sum): updates = (1.0 - alpha) * sum; (oldPR + updates, updates)
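The delta-update idea behind slide 11 can be tried out in plain Python, without Spark. The following is a minimal sketch under stated assumptions, not the authors' GraphX code: `ALPHA` and `TOL` values, the `1/out_degree` edge weight standing in for `edge.attr`, and all function names are illustrative. It also swaps in a simpler correction rule than the slide's `oldSrc -> newDst` message: every source whose out-edges changed re-sends its settled rank under the new edge weights, and the resulting deltas (which may be negative, hence `abs`) propagate as in the static version.

```python
from collections import defaultdict

ALPHA = 0.15  # reset probability ("alpha" on the slide); value is an assumption
TOL = 1e-8    # delta threshold ("tol" on the slide); value is an assumption


def out_weights(edges):
    """Return {src: {dst: 1/out_degree(src)}}; the weight stands in for edge.attr."""
    targets = defaultdict(set)
    for src, dst in edges:
        targets[src].add(dst)
    return {s: {d: 1.0 / len(ds) for d in ds} for s, ds in targets.items()}


def propagate(rank, weights, inbox):
    """Run Pregel-style supersteps until no vertex has a delta above TOL."""
    supersteps = 0
    while inbox:
        outbox = defaultdict(float)
        for v, total in inbox.items():
            delta = (1.0 - ALPHA) * total            # vertexProgram
            rank[v] = rank.get(v, 0.0) + delta
            if abs(delta) > TOL:                     # sendMessage
                for dst, w in weights.get(v, {}).items():
                    outbox[dst] += delta * w         # messageCombiner: sum
        inbox = dict(outbox)
        supersteps += 1
    return supersteps


def static_pagerank(edges):
    """Recompute ranks from scratch: every vertex gets the initial message."""
    vertices = {v for e in edges for v in e}
    rank = {}
    first = {v: ALPHA / (1.0 - ALPHA) for v in vertices}
    steps = propagate(rank, out_weights(edges), first)
    return rank, steps


def online_pagerank(rank, old_edges, new_edges):
    """Update `rank` in place for old_edges + new_edges, reusing last window's ranks."""
    old_w = out_weights(old_edges)
    new_w = out_weights(list(old_edges) + list(new_edges))
    inbox = defaultdict(float)
    # New vertices start exactly like base-graph vertices.
    for v in {v for e in new_edges for v in e}:
        if v not in rank:
            inbox[v] += ALPHA / (1.0 - ALPHA)
    # Sources whose out-edges changed re-send their settled rank under new weights.
    for src in {s for s, _ in new_edges}:
        settled = rank.get(src, 0.0)
        before, after = old_w.get(src, {}), new_w[src]
        for dst in set(before) | set(after):
            inbox[dst] += settled * (after.get(dst, 0.0) - before.get(dst, 0.0))
    return propagate(rank, new_w, dict(inbox))
```

Calling `static_pagerank(base)` once and then `online_pagerank(rank, base, added)` for each window should give ranks close to a full recomputation on `base + added`, while only the vertices reachable from the change receive messages.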
  • 12. GraphX Data Loading and Data Structure. Diagram labels: Edge lists; SrcId; DstId; EdgeRDD; Data; Index; Re-HashPartition; RoutingTablePartition; VertexRDD; RoutingTableMessage; HasSrcId; HasDstId; Replicated Vertex View; GraphImpl; EdgePartition; VertexPartition (Vid, Data, Mask); Shippable Vertex Partition; VertexPartition (Vid, Data, Mask)
  • 13. GraphX Data Loading and Data Structure. Diagram labels: Edge lists; SrcId; DstId; EdgeRDD; Data; Index; Re-HashPartition; RoutingTablePartition; VertexRDD; RoutingTableMessage; HasSrcId; HasDstId; Replicated Vertex View; GraphImpl; EdgePartition; VertexPartition (Vid, Data, Mask); Shippable Vertex Partition; VertexPartition (Vid, Data, Mask). Annotations: Static Index; a partitioning algorithm can help reduce the replication factor
  • 14. Partitioning Algorithm • Torus-based partitioning • Divide the overall partitions into an A x B matrix • A vertex’s master partition is decided by a hash function • The replica set is in the same column as the master partition (full column) and in the same row as the master partition ( ⁄ + 1 elements starting from the master partition) • The intersection between the source replica set and the target replica set decides where an edge is placed
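The torus placement rule of slide 14 can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the slide's expression for the row span did not survive extraction, so the sketch assumes `cols // 2 + 1` wrap-around cells, which is enough to make any two replica sets intersect. Treating the matrix as A rows by B columns, using Python's built-in `hash` as the hash function, and the tie-break in `place_edge` are likewise hypothetical choices.

```python
def replica_set(vertex, rows, cols):
    """Cells (row, col) that may hold `vertex`: the master partition's full
    column plus a wrap-around segment of its row starting at the master."""
    master = hash(vertex) % (rows * cols)
    r, c = divmod(master, cols)
    span = cols // 2 + 1  # ASSUMPTION: the slide's span expression is garbled
    column = {(i, c) for i in range(rows)}
    row_segment = {(r, (c + k) % cols) for k in range(span)}
    return column | row_segment


def place_edge(src, dst, rows, cols):
    """Pick a partition id from the intersection of both replica sets."""
    candidates = replica_set(src, rows, cols) & replica_set(dst, rows, cols)
    cells = sorted(candidates)
    # Hypothetical tie-break: spread edges deterministically over the candidates.
    r, c = cells[hash((src, dst)) % len(cells)]
    return r * cols + c
```

With this layout each vertex is replicated to at most `rows + cols // 2` partitions instead of all `rows * cols`, which is how a partitioning scheme of this kind reduces the replication factor.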
  • 15. Index Structure for Graph Streaming • GraphX uses a CSR (Compressed Sparse Row)-based index • Originated from sparse matrix compression • Good for finding all out-edges of a source vertex • No support for finding all in-edges of a target vertex; that needs a full table scan • At a minimum, a CSC (Compressed Sparse Column) index must be added for indexing in-edges
      Raw Edge Lists (the Data column is empty on the slide)
        Src  Dst
        3    2
        3    5
        3    9
        5    2
        5    3
        7    3
        8    5
        8    6
        10   6
      CSR
        Idx  Unique Src
        0    3
        3    5
        5    7
        6    8
        8    10
        Dst column, in index order: 2 5 9 2 3 3 5 6 6
      CSC
        Unique Dst  Idx
        2           0
        3           2
        5           4
        6           6
        9           8
        Src column, in index order: 3 5 5 7 3 8 8 10 3
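The CSR and CSC arrays on this slide can be reproduced from the raw edge list with a short sketch. The function names and the tuple layout `(unique, offsets, data)` are illustrative assumptions rather than GraphX's internal structures; the edge list is the one shown on the slide.

```python
from bisect import bisect_left

# Raw edge list from the slide, as (src, dst) pairs.
SLIDE_EDGES = [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3),
               (7, 3), (8, 5), (8, 6), (10, 6)]


def build_index(edges, key):
    """Sort edges by endpoint `key` (0 = src gives CSR, 1 = dst gives CSC)
    and return (unique keys, start offset per key, other endpoint per edge)."""
    other = 1 - key
    ordered = sorted(edges, key=lambda e: (e[key], e[other]))
    unique, offsets, data = [], [], []
    for pos, edge in enumerate(ordered):
        if not unique or unique[-1] != edge[key]:
            unique.append(edge[key])   # first edge of a new key
            offsets.append(pos)
        data.append(edge[other])
    return unique, offsets, data


def neighbors(index, vertex):
    """Out-neighbors for a CSR index, in-neighbors for a CSC index."""
    unique, offsets, data = index
    i = bisect_left(unique, vertex)
    if i == len(unique) or unique[i] != vertex:
        return []
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[offsets[i]:end]


csr = build_index(SLIDE_EDGES, key=0)
csc = build_index(SLIDE_EDGES, key=1)
print(neighbors(csr, 3))  # out-edges of vertex 3: [2, 5, 9]
print(neighbors(csc, 3))  # in-edges of vertex 3: [5, 7]
```

Both indexes require the full edge list to be sorted first, which is the cost that motivates building the index on the fly for streaming input.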
  • 19. Index Structure for Graph Streaming • Both CSR and CSC must first sort the edge lists and then create the index • An even better way is to build the index on the fly • Graph streaming needs to support both fast insert/write and fast search/read • HashMap • Good for exact-match point search • Fast on insert and search • Good for graphs with a fixed/known size • Needs re-hashing when the size surpasses capacity • Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array) • Support both point search and range search • B-Tree is good for fast search, slow for insert • LSM-Tree is good for fast insert, slow for search • COLA achieves a good tradeoff: fast insert and good-enough search • COLA-based index for graph streaming
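To make the COLA tradeoff concrete, here is a minimal sketch of the basic structure: a list of sorted arrays in which level k is either empty or holds exactly 2**k keys, with inserts cascading merges like a binary counter. It is a simplified illustration, not the authors' index: the lookahead pointers that give COLA its name are omitted, so a search binary-searches every non-empty level. The class name, the `(src, dst)` key layout, and the method names are assumptions.

```python
from bisect import bisect_left
from heapq import merge


class BasicCola:
    """Levels of sorted arrays; level k is empty or holds exactly 2**k keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        """Insert by merging full levels upward, like carrying in binary addition."""
        carry = [key]
        for k, level in enumerate(self.levels):
            if not level:
                self.levels[k] = carry
                return
            carry = list(merge(level, carry))  # both inputs are already sorted
            self.levels[k] = []
        self.levels.append(carry)

    def contains(self, key):
        """Point search: binary search in every non-empty level."""
        for level in self.levels:
            i = bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False

    def out_edges(self, src):
        """Range search: all dst with an edge (src, dst), gathered across levels."""
        found = []
        for level in self.levels:
            i = bisect_left(level, (src,))  # (src,) sorts before every (src, dst)
            while i < len(level) and level[i][0] == src:
                found.append(level[i][1])
                i += 1
        return sorted(found)
```

Edges can be inserted one at a time as `(src, dst)` tuples while the stream arrives, with no global sort; a second instance keyed by `(dst, src)` would cover in-edge lookups in the same way.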
  • 20. Putting Things Together: Our Streaming Pipeline. Diagram: OML + OML + OML + OML + OML …
  • 21. Performance - Convergence Rate. Figure: Convergence Rate, Naive vs Incremental (y-axis: Normalized Number of Iterations, 0.0 to 1.2; x-axis: Graph Size (Num of Edges): Base, +20%, +40%, +60%, +80%, +100%, +150%, +200%)
  • 22. Performance - Communication Overhead. Figure: Communication Overhead, Naive vs Incremental (y-axis: Normalized Number of Messages Sent, 0% to 120%; x-axis: Graph Size (Num of Edges): Base, +20%, +40%, +60%, +80%, +100%, +150%, +200%)
  • 23. Ongoing and Future Work • Working on online versions of ML algorithms in different categories • Performance evaluation of various online algorithms • Complete the on-the-fly indexing work • Performance evaluation of different indexing methods