This document discusses streaming and online algorithms for graph processing using GraphX on Spark. It proposes moving from static algorithms that recompute the entire graph at each time interval to online algorithms that use incremental machine learning triggered by graph changes. The key aspects of the solution include efficient graph storage for evolving graphs, partitioning algorithms to reduce replication, and on-the-fly indexing methods instead of prebuilding indexes. Performance results show the online algorithms have better convergence rates and lower communication overhead compared to naive recomputation approaches as the graph sizes increase over time.
1. Streaming and Online Algorithms for GraphX
Graph Analytics Team
Xia (Ivy) Zhu
Intel Confidential — Do Not Forward
2. Why Streaming Processing on Graphs?
Every day:
• New stores join
• New users join
• New users browse, click, and buy items
• Old users browse, click, and buy items
• New ads are added
• …
How to:
• Recommend products based on users' interests
• Recommend products based on users' shopping habits
• Recommend products based on users' purchasing capability
• Place ads that users are most likely to click
• …
A huge number of relationships is created every day; utilizing them wisely is important.
3. Alibaba Is Not Alone: Graphs Are Everywhere
• 100B neurons, 100T relationships
• 1.23B users, 160B friendships
• 1 trillion pages, 100s of trillions of links
• Millions of products and users
• 50M users, 1B hours/month watched
• Large biological cell networks
5. Streaming Processing Pipeline
Data Stream -> ETL -> Graph Creation -> ML, fed by a distributed messaging system
• We use Kafka for distributed messaging
• GraphX as the graph processing engine
6. What is GraphX
• Graph processing engine on Spark
• Supports Pregel-style vertex programming
• Unifies data-parallel and graph-parallel processing
Picture Source: GraphX team
7. Why GraphX
• GraphLab performs well, but is a standalone system
• Giraph is open source and scales well, but its performance is lacking
• GraphX supports both table and graph operations
• On the same platform, Spark Streaming provides a basic streaming framework
[Diagram: the Spark stack. Spark core (RDDs, transformations, and actions) underpins Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs: RDD-based), MLlib (machine learning; RDD-based matrices), and GraphX (graph processing/machine learning; RDD-based graphs). Picture Source: Databricks]
8. Naïve Streaming Does Not Scale
• Current GraphX is designed for static graphs
• Current Spark Streaming provides only limited types of stateful DStreams
• Naïve approach:
  • Merge table data before entering the graph processing pipeline
  • Re-generate the whole graph and re-run ML in each window
  • Requires minimal changes to GraphX and Spark Streaming
  • Straightforward, but does not scale well
[Chart: throughput vs. latency of naïve graph streaming; latency (s), 0 to 180, plotted over sample points 1 through 9]
9. Our Solution
• Static algorithms -> online algorithms
• Merge information at the graph phase
• Efficient graph store for evolving graphs
• Better partitioning algorithms to reduce replicas
• Static index -> on-the-fly indexing method (ongoing)
10. Static vs Online Algorithms
• Static algorithms
  • Re-compute the whole graph and re-run ML at each time instance
  • Become increasingly infeasible in the Big Data era, given the size and growth rate of graphs
• Online algorithms
  • Incremental machine learning is triggered by changes in the graph
  • We designed delta-update-based online algorithms
  • PageRank as an example
  • The same idea applies to other machine learning algorithms
11. Static vs Online PageRank

Static_PageRank
// Initial vertex value
(0.0, 0.0)
// First message
initialMessage:
  msg = alpha / (1.0 - alpha)
// Broadcast to neighbors
sendMessage:
  if (edge.srcAttr._2 > tol)
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
// Aggregate messages for each vertex
messageCombiner(a, b):
  sum = a + b
// Update vertex
vertexProgram(sum):
  updates = (1.0 - alpha) * sum
  (oldPR + updates, updates)

Online_PageRank
// Initialize vertex value
base graph:
  (0.0, 0.0)
incremental graph:
  old vertices: (lastWindowPR, lastWindowDelta)
  new vertices: (alpha, alpha)
// First message
initialMessage:
  base graph: msg = alpha / (1.0 - alpha)
  incremental graph: none
// Broadcast to neighbors
sendMessage:
  oldSrc -> newDst:
    Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
  newSrc -> newDst, or not converged:
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
// Aggregate messages for each vertex
messageCombiner(a, b):
  sum = a + b
// Update vertex
vertexProgram(sum):
  updates = (1.0 - alpha) * sum
  (oldPR + updates, updates)
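The delta-update scheme on this slide can be sketched as a small self-contained simulation. This is a hypothetical Python toy (function and variable names are mine, not GraphX's): a vertex only broadcasts while its last change exceeds `tol`, and a grown graph is re-converged starting from the previous window's `(pr, delta)` state rather than from scratch. The slide's `oldSrc -> newDst` rule is omitted for brevity, so this sketch only covers windows whose new edges originate at new vertices.

```python
ALPHA = 0.15   # reset probability (the slide's `alpha`)
TOL = 1e-4     # convergence threshold (the slide's `tol`)

def delta_pagerank(edges, state=None, tol=TOL):
    """Run delta-based PageRank until no sender changes by more than `tol`.

    edges: dict src -> list of dst vertices
    state: dict vertex -> (pr, delta) carried over from the last window;
           None means a fresh (static) run over the whole graph.
    Returns (state, iterations).
    """
    vertices = set(edges) | {d for ds in edges.values() for d in ds}
    state = dict(state) if state else {}
    for v in vertices:
        # New vertices enter with (alpha, alpha), as on the slide;
        # old vertices keep (lastWindowPR, lastWindowDelta).
        state.setdefault(v, (ALPHA, ALPHA))
    iterations = 0
    while True:
        # sendMessage: only vertices whose delta still exceeds tol broadcast.
        msgs = {}
        for src, dsts in edges.items():
            pr, delta = state[src]
            if abs(delta) > tol and dsts:
                w = 1.0 / len(dsts)              # edge.attr = 1 / outDegree
                for dst in dsts:
                    msgs[dst] = msgs.get(dst, 0.0) + delta * w
        if not msgs:
            return state, iterations
        iterations += 1
        # vertexProgram: updates = (1 - alpha) * sum; newPR = oldPR + updates
        for v in vertices:
            if v in msgs:
                pr, _ = state[v]
                upd = (1.0 - ALPHA) * msgs[v]
                state[v] = (pr + upd, upd)
            else:
                state[v] = (state[v][0], 0.0)    # quiet vertices stop sending
```

On a grown graph, an incremental run seeded with the previous window's state reaches approximately the same ranks as a full recomputation, which is what keeps work proportional to the change rather than to the graph size.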
12. GraphX Data Loading and Data Structure
[Diagram: raw edge lists (SrcId, DstId, Data) are re-hash-partitioned into the EdgeRDD, whose EdgePartitions hold the edge data plus an index; RoutingTableMessages (HasSrcId, HasDstId) build the RoutingTablePartition of the VertexRDD, whose VertexPartitions (Vid, Data, Mask) are shipped as ShippableVertexPartitions to form the replicated vertex view of the GraphImpl]
13. GraphX Data Loading and Data Structure
[Same diagram as slide 12, with two annotations: the EdgePartition index is a static index, and the partitioning algorithm can help reduce the replication factor]
14. Partitioning Algorithm
• Torus-based partitioning
  • Divide the overall partitions into an A x B matrix
  • A vertex's master partition is decided by a hash function
  • The replica set is the full column of the master partition, plus B/2 + 1 elements of the master's row, starting from the master partition
  • The intersection between the source replica set and the target replica set decides where an edge is placed
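The replica-set construction above can be made concrete in a short Python sketch. Grid size, helper names, and the tie-breaking rule are my own choices, and I read the garbled row-segment length on the slide as B/2 + 1, which is the smallest length that guarantees any two replica sets intersect on the torus:

```python
A, B = 4, 6  # partition grid: A rows x B columns (sizes chosen for illustration)

def master(v):
    """Master partition of vertex v, chosen by a hash function."""
    p = hash(v) % (A * B)
    return p // B, p % B            # (row, col) in the A x B grid

def replicas(v):
    """Replica set: the master's full column plus a row segment of
    B // 2 + 1 partitions starting at the master (wrapping around)."""
    r, c = master(v)
    col = {(i, c) for i in range(A)}
    row = {(r, (c + j) % B) for j in range(B // 2 + 1)}
    return col | row

def place_edge(src, dst):
    """Place an edge on a partition that replicates both endpoints:
    the two replica sets always overlap, because either dst's column
    crosses src's row segment or src's column crosses dst's."""
    common = replicas(src) & replicas(dst)
    assert common, "torus replica sets must intersect"
    return min(common)              # any deterministic pick works
```

Each vertex is replicated on A + B/2 partitions, which is far fewer than random (1D hash) edge partitioning tends to produce on large clusters, which is how this scheme reduces the replication factor.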
15. Index Structure for Graph Streaming
• GraphX uses a CSR (Compressed Sparse Row)-based index
  • Originated from sparse matrix compression
  • Good for finding all out-edges of a source vertex
  • No support for finding all in-edges of a target vertex; that requires a full table scan
• At a minimum, we need to add CSC (Compressed Sparse Column) to index in-edges
Raw edge lists:
Src  Dst  Data
3    2
3    5
3    9
5    2
5    3
7    3
8    5
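To make the CSR/CSC point concrete, here is a hypothetical Python sketch (helper names `build_index` and `edges_for` are mine, not GraphX's; a real CSR/CSC stores a compact offsets array rather than a dict) over the slide's raw edge list. The same build routine yields a CSR-like view when edges are sorted by source and a CSC-like view when sorted by destination, and only the latter answers "all in-edges of vertex 3" without a scan:

```python
def build_index(edges, key):
    """Sort edges on `key` (0 = src, giving CSR; 1 = dst, giving CSC) and
    record, for each key value, the half-open slice [lo, hi) of the sorted
    edge array that holds its edges."""
    order = sorted(edges, key=lambda e: e[key])
    index = {}
    for i, e in enumerate(order):
        k = e[key]
        if k not in index:
            index[k] = [i, i + 1]   # first edge for this key
        else:
            index[k][1] = i + 1     # extend the slice
    return order, index

def edges_for(idx, vertex):
    """All edges whose indexed endpoint is `vertex`: one dict lookup plus
    one contiguous slice, instead of a full table scan."""
    order, index = idx
    lo, hi = index.get(vertex, (0, 0))
    return order[lo:hi]
```

For the slide's edges, the source-keyed index groups (3, 2), (3, 5), (3, 9) under source 3, while the destination-keyed index groups (5, 3), (7, 3) under destination 3.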
19. Index Structure for Graph Streaming
• Both CSR and CSC need to sort the edge lists first and then build the index
• An even better way is to build the index on the fly
• Graph streaming needs both fast insert/write and fast search/read
• HashMap
  • Good for exact-match, point search
  • Fast on insert and search
  • Good for graphs with a fixed/known size
  • Needs re-hashing when the size surpasses capacity
• Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array)
  • Support both point search and range search
  • B-Tree: fast search, slow insert
  • LSM-Tree: fast insert, slow search
  • COLA achieves a good trade-off: fast insert and good-enough search
=> COLA-based index for graph streaming
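The COLA trade-off can be shown with a minimal sketch (class and method names are mine; a real COLA also keeps fractional-cascading "lookahead" pointers between levels to speed up search, omitted here). Level k is either empty or a sorted array of 2^k keys; an insert merges downward into the first empty level in amortized O(log N) time, and a search binary-searches each non-empty level:

```python
import bisect
from heapq import merge

class Cola:
    def __init__(self):
        self.levels = []        # levels[k] is [] or a sorted list of 2**k keys

    def insert(self, key):
        carry = [key]
        k = 0
        # Merge the carry into successive full levels until an empty one.
        while k < len(self.levels) and self.levels[k]:
            carry = list(merge(carry, self.levels[k]))
            self.levels[k] = []
            k += 1
        if k == len(self.levels):
            self.levels.append([])
        self.levels[k] = carry

    def contains(self, key):
        # Binary-search every non-empty level: O(log^2 N) without lookahead.
        for level in self.levels:
            i = bisect.bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False
```

Because inserts only ever append a key and sequentially merge sorted runs, writes stay fast and cache-friendly, which is exactly the property a streaming edge index needs.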
21. Performance - Convergence Rate
[Chart: convergence rate. Normalized number of iterations vs. graph size (number of edges), from Base to +200%, for the naïve and incremental versions]
22. Performance - Communication Overhead
[Chart: communication overhead. Normalized number of messages sent vs. graph size (number of edges), from Base to +200%, for the naïve and incremental versions]
23. Ongoing and Future Work
• Working on online versions of ML algorithms in different categories
• Performance evaluation of various online algorithms
• Complete the on-the-fly indexing work
• Performance evaluation of different indexing methods