PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
RONG CHEN, JIAXIN SHI, YANZHE CHEN, HAIBO CHEN
Institute of Parallel and Distributed Systems
Shanghai Jiao Tong University, China
EuroSys 2015
Graphs are Everywhere1
Traffic network graphs
□ Monitor/react to road incidents
□ Analyze Internet security
Biological networks
□ Predict disease outbreaks
□ Model drug-protein interactions
Social networks
□ Recommend people/products
□ Track information flow
1 From PPoPP’15 Keynote: Computing in a Persistent (Memory) World, P. Faraboschi
Graph-parallel Processing
Coding graph algorithms as vertex-centric programs to
process vertices in parallel and communicate along edges
-- the "Think as a Vertex" philosophy
Example: PageRank, a centrality analysis algorithm that iteratively
ranks each vertex as a weighted sum of its neighbors' ranks:
R_i = 0.15 + 0.85 · Σ_{(j,i)∈E} ω_ij R_j
COMPUTE(v)
  foreach n in v.in_nbrs
    sum += n.rank / n.nedges
  v.rank = 0.15 + 0.85 * sum
  if !converged(v)
    foreach n in v.out_nbrs
      activate(n)
The three phases of the vertex program:
□ Gather data from neighbors via in-edges
□ Update the vertex with new data
□ Scatter data to neighbors via out-edges
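The vertex-centric pseudocode above can be sketched as a runnable single-machine program. This is a minimal illustration, not PowerLyra's API; the graph encoding, initial ranks of 1.0, and a fixed iteration count (instead of a convergence test) are assumptions:

```python
# Minimal single-machine sketch of the vertex-centric PageRank above.
# `out_nbrs` maps each vertex to a list of its out-neighbors.
def pagerank(out_nbrs, iters=10, d=0.85):
    verts = list(out_nbrs)
    # Build in-neighbor lists so every vertex can gather over in-edges.
    in_nbrs = {v: [] for v in verts}
    for u, nbrs in out_nbrs.items():
        for v in nbrs:
            in_nbrs[v].append(u)
    rank = {v: 1.0 for v in verts}
    for _ in range(iters):
        # Gather: weighted sum of in-neighbors' ranks (weight = 1/out-degree).
        # Apply: damped update, as in v.rank = 0.15 + 0.85 * sum.
        rank = {v: (1 - d) + d * sum(rank[u] / len(out_nbrs[u])
                                     for u in in_nbrs[v])
                for v in verts}
    return rank
```

With d = 0.85 this reproduces the 0.15 + 0.85 · sum update from the slide; the scatter/activate step is implicit here because every vertex recomputes on every iteration.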
Natural Graphs are Skewed
Real-world graphs
□ Power-law degree distribution (aka Zipf's law or Pareto)
"most vertices have relatively few neighbors while a few
have many neighbors" -- PowerGraph[OSDI'12]
Source: M. E. J. Newman, "Power laws, Pareto distributions
and Zipf's law", 2005
[Figure: canonical power-law distributions from Newman 2005: word
frequency, citations, web hits, books sold, telephone calls received,
earthquake magnitude, crater diameter, peak intensity, net worth,
name frequency, city population]
Case: SINA Weibo(微博)
□ 167 million active users1
□ Follower counts: #1: 77,971,093; #2: 76,946,782; ... #100: 14,723,818;
ordinary users: a few hundred (e.g., 397) or fewer (e.g., 69)
1 http://www.chinainternetwatch.com/10735/weibo-q3-2014/
Challenge: LOCALITY vs. PARALLELISM
LOCALITY: make resources locally accessible to hide network latency
For a high-degree vertex (e.g., 77,971,093 followers), pure locality causes:
1. Load imbalance
2. High contention
3. Heavy network traffic
PARALLELISM: evenly parallelize the workload to avoid load imbalance
For a low-degree vertex (e.g., 397 followers), pure parallelism causes:
1. Redundant communication, computation & synchronization (not worthwhile)
Skewed graphs have both kinds: 100 million users with about 100
followers, and 100 users with about 100 million followers, so
LOCALITY and PARALLELISM conflict.
Existing Efforts: One Size Fits All
Pregel[SIGMOD'10] and GraphLab[VLDB'12]
□ Focus on exploiting LOCALITY
□ Partitioning: use edge-cut to evenly assign vertices
along with all of their edges
□ Computation: aggregate all resources (i.e., msgs
or replicas) of a vertex on the local machine
[Figure: edge-cut example; a vertex's messages and replicas
(i.e., mirrors) are aggregated at its master]
Existing Efforts: One Size Fits All
PowerGraph[OSDI'12] and GraphX[OSDI'14]
□ Focus on exploiting PARALLELISM
□ Partitioning: use vertex-cut to evenly assign edges,
replicating vertices as needed
□ Computation: decompose the workload of a
vertex across multiple machines
[Figure: vertex-cut example; a vertex is replicated as mirrors
on every machine holding its edges (e.g., ×5 replicas)]
Contributions
PowerLyra: differentiated graph computation
and partitioning on skewed graphs
□ Hybrid computation and partitioning strategies
→ Embrace LOCALITY (low-degree vertices) and
PARALLELISM (high-degree vertices)
□ An optimization to improve locality and reduce
interference in communication
□ Outperforms PowerGraph (by up to 5.53X) and other
systems in both ingress and runtime
Agenda
Graph Partitioning
Graph Computation
Optimization
Evaluation
Graph Partitioning
Hybrid-cut: differentiate graph partitioning for
low-degree and high-degree vertices
□ Low-cut (inspired by edge-cut): reduce mirrors and
exploit locality for low-degree vertices
□ High-cut (inspired by vertex-cut): provide balance
and restrict the impact of high-degree vertices
□ Efficiently combine low-cut and high-cut
[Figure: hybrid-cut of a sample graph; legend: High-master,
Low-master, High-mirror, Low-mirror]
[Figure: sample graph with 6 vertices; θ=3 is the
user-defined degree threshold]
Hybrid-cut (Low)
Low-cut: evenly assign low-degree vertices along with their
edges in only one direction (e.g., in-edges) to
machines by hashing the (e.g., target) vertex
1. Unidirectional access locality
2. No duplicated edges
3. Efficiency at ingress
4. Load balance (edges & vertices)
5. Ready-made heuristics
[Figure: low-cut of the sample graph with hash(X) = X % 3, θ=3]
NOTE: A new heuristic, called Ginger,
is provided to further optimize
Random hybrid-cut. (see paper)
Hybrid-cut (High)
High-cut: distribute the one-direction edges (e.g., in-edges)
of high-degree vertices to machines by hashing the
opposite (e.g., source) vertex
1. Avoid adding any mirrors for low-degree vertices
(i.e., an upper bound on the mirrors added when
assigning a new high-degree vertex)
2. No duplicated edges
3. Efficiency at ingress
4. Load balance (edges & vertices)
[Figure: high-cut of the sample graph with hash(X) = X % 3, θ=3]
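The low-cut and high-cut rules above can be captured in a few lines. This is an illustrative sketch of the Random hybrid-cut; the function and argument names are assumptions, not PowerLyra's actual code:

```python
from collections import Counter

# Sketch of Random hybrid-cut: each edge (src, dst) is an in-edge of dst,
# so it is placed by hashing dst when dst is low-degree (Low-cut) and by
# hashing src when dst is high-degree (High-cut). hash(X) = X % machines,
# as in the slides.
def hybrid_cut(edges, machines, theta):
    indeg = Counter(dst for _, dst in edges)    # count in-degrees first
    place = {}
    for src, dst in edges:
        if indeg[dst] >= theta:
            place[(src, dst)] = src % machines  # High-cut: hash source
        else:
            place[(src, dst)] = dst % machines  # Low-cut: hash target
    return place
```

On the sample graph with θ = 3, vertex 1 (in-degree 4) is the only high-degree vertex, so its in-edges are spread by source; every other edge is hashed by its target.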
Constructing Hybrid-cut
Input format: edge-list, e.g., (1,6), (1,4), (2,1), (2,5),
(2,6), (3,1), (3,4), (4,1), (5,1); θ=3
1. Load files in parallel
2. Dispatch all edges (Low-cut) and count
the in-degree of vertices
3. Re-assign the in-edges of high-degree vertices
(High-cut, θ=3)
4. Create mirrors to construct a local graph
NOTE: some formats (e.g., adjacency list)
can skip counting and re-assignment
[Figure: load → dispatch → re-assign → construct across
3 machines for the sample graph]
Agenda
Graph Partitioning
Graph Computation
Optimization
Evaluation
Graph Computation
Hybrid-model: differentiate the graph computation
model for high-degree and low-degree vertices
□ High-degree model: decompose the workload of
high-degree vertices for load balance
□ Low-degree model: minimize the communication
cost for low-degree vertices
□ Seamlessly runs all existing graph algorithms
[Figure: high-degree model (GAS phases distributed across
master and mirrors) vs. low-degree model (gather and apply
local at the master, scatter via mirrors)]
Hybrid-model (High)
High-degree vertex
□ Exploit PARALLELISM
□ Follow the GAS model (inspired by POWERGRAPH)
□ Comm. cost: ≤ 4 x #mirrors
Example: PageRank in the GAS interface:
GATHER(v, n):
  return n.rank / n.nedges
SUM(a, b): return a + b
GATHER_EDGES = IN
APPLY(v, sum):
  v.rank = 0.15 + 0.85 * sum
SCATTER(v, n):
  if !converged(v)
    activate(n)
SCATTER_EDGES = OUT
GATHER and SCATTER are distributed; APPLY is local to the master.
[Figure: distributed G/A/S across the master and mirrors
of a high-degree vertex]
Hybrid-model (Low)
Low-degree vertex
□ Exploit LOCALITY
□ Preserve the GAS interface
□ Unidirectional access locality
□ Comm. cost: ≤ 1 x #mirrors
OBSERVATION: most algorithms only gather or scatter
in one direction (e.g., PageRank: G/IN and S/OUT)
All in-edges of a low-degree vertex are placed at its master,
so GATHER and APPLY run locally; only SCATTER is distributed.
[Figure: local gather/apply at the master of a low-degree
vertex; scatter via its mirrors]
Generalization
Seamlessly run all graph algorithms
□ Adaptively downgrade the model (+ N msgs, 1 ≤ N ≤ 3)
□ Detect at runtime w/o any changes to APPs

TYPE     GATHER_EDGES  SCATTER_EDGES  EX. ALGO.
NATURAL  IN or NONE    OUT or NONE    PR, SSSP
NATURAL  OUT or NONE   IN or NONE     DIA
OTHER    ANY           ANY            CC, ALS

One-direction locality limits the expressiveness for some
OTHER algorithms (i.e., G or S in both IN and OUT)
e.g., Connected Components (CC)
→ G/NONE and S/ALL
→ +1 msg in SCATTER (mirror → master)
→ Comm. cost: ≤ 2 x #mirrors
[Figure: low-degree model with one extra message from
mirror to master]
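The per-mirror message counts, including the downgraded cases, can be captured as a small lookup. The values are reconstructed from the downgrade table in the backup slides; the table orientation (rows = SCATTER direction) is my reading, inferred from the PR and CC bounds quoted above, so treat both the layout and the numbers as illustrative:

```python
# Messages per mirror per iteration in the low-degree model, keyed by
# (GATHER_EDGES, SCATTER_EDGES). Reconstructed from the talk's downgrade
# table; rows index the SCATTER direction, columns the GATHER direction.
DIRS = ('ALL', 'IN', 'OUT', 'NONE')
MSGS_BY_SCATTER = {
    'ALL':  dict(zip(DIRS, (4, 2, 2, 2))),
    'IN':   dict(zip(DIRS, (3, 2, 1, 1))),
    'OUT':  dict(zip(DIRS, (3, 1, 2, 1))),
    'NONE': dict(zip(DIRS, (3, 1, 1, 1))),
}

def msgs_per_mirror(gather_edges, scatter_edges):
    return MSGS_BY_SCATTER[scatter_edges][gather_edges]
```

PageRank (G/IN, S/OUT) hits the 1-msg sweet spot, while CC (G/NONE, S/ALL) downgrades to 2 msgs, matching the ≤ 2 x #mirrors bound above.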
Agenda
Graph Partitioning
Graph Computation
Optimization
Evaluation
Challenge: Locality & Interference
Poor locality and high interference
□ Messages are batched and sent periodically
□ Messages from different nodes are applied to
vertices (i.e., masters and mirrors) in parallel
[Figure: worker and RPC threads on machines M0/M1 updating
vertices: poor locality, heavy interference]
OPT: Locality-conscious graph layout
□ Focus on exploiting locality in communication
□ Extend hybrid-cut at ingress (w/o runtime overhead)
Example: Initial State
Both the masters and mirrors of all vertices (high- & low-degree)
are mixed and stored in a random order on each machine,
causing poor locality and interference
[Figure: machines M0-M2 with high/low masters and mirrors
interleaved; e.g., Hash(X) = (X-1) % 3; legend: High-master,
Low-master, high-mirror, low-mirror]
Example: Zoning
1. ZONE: divide the vertex space into 4 zones storing
High-masters (Z0), Low-masters (Z1), high-mirrors (Z2),
and low-mirrors (Z3)
[Figure: per-machine layout after zoning: H | L | h-mrr | l-mrr]
Example: Grouping
2. GROUP: the mirrors (i.e., Z2 and Z3) are further grouped
within each zone according to the location of their masters
[Figure: M0: H0 L0 h1 h2 l1 l2; M1: H1 L1 h0 h2 l0 l2;
M2: H2 L2 h0 h1 l0 l1]
Example: Sorting
3. SORT: both the masters and mirrors within a group are
sorted according to the global vertex ID
(contention remains: every machine's mirror groups start
from the same master group)
[Figure: sorted layouts; e.g., M0: H0 L0 h1 h2 l1 l2]
Example: Rolling
4. ROLL: place the mirror groups in a rolling order: the mirror
groups on machine n (of p partitions) start from (n + 1) % p
e.g., M1 starts from h2 and then h0 (n = 1, p = 3)
[Figure: rolled layouts: M0: H0 L0 h1 h2 l1 l2;
M1: H1 L1 h2 h0 l2 l0; M2: H2 L2 h0 h1 l0 l1]
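The four layout steps can be sketched as a single ordering function. This is illustrative only; the vertex tuples and names are assumptions, not PowerLyra's data structures:

```python
# Order the local vertex array of machine `my_id` by the four steps:
# ZONE (Z0 high-masters, Z1 low-masters, Z2 high-mirrors, Z3 low-mirrors),
# GROUP mirrors by their master's machine, SORT each group by vertex ID,
# and ROLL mirror groups starting from (my_id + 1) % machines.
def layout(vertices, my_id, machines):
    zones = {0: [], 1: [], 2: [], 3: []}
    for vid, is_high, is_master, owner in vertices:
        z = (0 if is_high else 1) if is_master else (2 if is_high else 3)
        zones[z].append((vid, owner))
    order = []
    for z in (0, 1):                        # master zones: just sort
        order += sorted(v for v, _ in zones[z])
    for z in (2, 3):                        # mirror zones: group + roll
        for k in range(1, machines):
            owner = (my_id + k) % machines
            order += sorted(v for v, o in zones[z] if o == owner)
    return order
```

The rolling start offset staggers which master group each machine touches first, which is what avoids the cross-machine contention left by sorting alone.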
Tradeoff: Ingress vs. Runtime
OPT: Locality-conscious graph layout
□ The increase of graph ingress time is modest
→ Independently executed on each machine
→ Less than 5% (power-law) & around 10% (real-world)
□ Usually more than 10% speedup for runtime
→ 21% for the Twitter follower graph
[Chart: normalized speedup of execution time vs. ingress-time
overhead on power-law (α = 1.8-2.2) and real-world (TW, UK,
WK, LJ, GW) graphs]
Agenda
Graph Partitioning
Graph Computation
Optimization
Evaluation
Implementation
POWERLYRA
□ Based on latest PowerGraph (v2.2, Mar. 2014)
□ A separate engine and partitioning algorithms
□ Support both sync and async execution engines
□ Seamlessly run all graph toolkits in GraphLab and
respect the fault tolerance model
□ Port the Random hybrid-cut to GraphX1
The source code and a brief instruction are available
at http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
1 We only port the Random hybrid-cut to preserve the graph
partitioning interface of GraphX.
Evaluation
Baseline: latest POWERGRAPH (2.2 Mar. 2014) with
various vertex-cuts (Grid/Oblivious/Coordinated)
Platforms
1. A 48-node VM-based cluster1 (Each: 4 Cores, 12GB RAM)
2. A 6-node physical cluster1 (Each: 24 Cores, 64GB RAM)
Algorithms and graphs
□ 3 graph-analytics algorithms (PR2, CC, DIA) and
2 MLDM algorithms (ALS, SGD)
□ 5 real-world graphs and 5 synthetic power-law graphs3

Real-world graphs:
Graph     |V|    |E|
Twitter   42M    1.47B
UK-2005   40M    936M
Wiki      5.7M   130M
LJournal  5.4M   79M
GWeb      0.9M   5.1M

Synthetic power-law graphs3:
α    |E|
1.8  673M
1.9  249M
2.0  105M
2.1  54M
2.2  39M

1 Clusters are connected via 1GigE
2 We run 10 iterations for PageRank.
3 Synthetic graphs have a fixed 10 million vertices
Performance
PageRank, 48-node cluster
[Chart: normalized speedup of PL+Hybrid and PL+Ginger over
PG+Grid (default), PG+Oblivious, and PG+Coordinated on power-law
(α = 1.8-2.2) and real-world graphs: 2.2X, 2.9X, 3.3X, 3.2X,
3.2X, 2.6X, 2.0X, 2.0X, 3.0X, and up to 5.5X]

Twitter Follower Graph (TW), time (sec):
Partitioning  Ingress  Exec
Random        263      823
Grid          123      373
Oblivious     289      660
Coordinated   391      298
Hybrid        138      155
Ginger        307      143
Scalability
[Charts: execution time as machines scale from 8 to 48 on the
Twitter follower graph (48-node cluster), and as vertices scale
from 0 to 400 million on a scale-free power-law graph (6-node
cluster), for PG+Grid (default), PG+Oblivious, PG+Coordinated,
PL+Hybrid, and PL+Ginger]
Breakdown
[Charts: per-iteration communication (MB) and execution time on
power-law graphs (α = 1.8-2.2, 48-node); hybrid-cut reduces
communication by about 75% vs. Grid and 50% vs. Coordinated,
and improves execution time by about 30%]
Graph Partitioning Comparison

Partitioning  Algorithm   Replication  Ingress  Load
              Type        Factor (λ)   Time     Balance
Random        Hashing     Bad          Modest   Good
Grid          2D Hashing  Modest       Great    Modest
Oblivious     Greedy      Modest       Modest   Great
Coordinated   Greedy      Great        Bad      Great
Hybrid(R)     Hashing     Good         Great    Great
Ginger        Greedy      Great        Bad      Good

[Chart: replication factor (λ) for 48 partitions on power-law
(α = 1.8-2.2) and real-world graphs; off-scale values: 28.4, 18.0]

Twitter Follower Graph (TW), 48 partitions:
Partitioning  λ     Ingress (sec)
Random        16.0  263
Grid          8.3   123
Oblivious     12.8  289
Coordinated   5.5   391
Hybrid        5.6   138
Ginger        4.8   307
Threshold
Impact of different thresholds (PageRank w/ Twitter, 48-node)
□ Using high-cut or low-cut for all vertices is poor
□ Execution time is stable over a large range of θ
→ the difference is below 1 sec for 100 ≤ θ ≤ 500
[Chart: replication factor (λ) and execution time as the
threshold θ varies from 0 to +∞; θ→0 approaches high-cut for
all vertices, θ→+∞ approaches low-cut for all]
vs. Other Systems1
PageRank w/ 10 iterations, 6-node cluster
[Charts: execution time on the Twitter follower graph and a
power-law (α=2.0) graph for Giraph, GPS, CombBLAS, GraphX,
GraphX/H, PowerGraph, and PowerLyra; PowerLyra is the fastest,
and GraphX/H (GraphX with Random hybrid-cut) improves GraphX
by about 25%]

Power-law graphs, PowerLyra (6-node / 1-node) vs. single-machine
systems (in-memory: Polymer, Galois; out-of-core: X-Stream, GraphChi):

Graph      PowerLyra (6/1)  Polymer  Galois  X-Stream  GraphChi
|V|=10M    14 / 45          6.3      9.8     9.0       115
|V|=400M   186 / --         --       --      710       1666

1 We use Giraph 1.1, CombBLAS 1.4, GraphX 1.1, Galois 2.2.1, X-Stream 1.0,
as well as the latest version of GPS, PowerGraph, Polymer and GraphChi.
Conclusion
Existing systems use a “ONE SIZE FITS ALL” design for
skewed graphs, resulting in suboptimal performance
POWERLYRA: a hybrid and adaptive design that
differentiates computation and partitioning
on low- and high-degree vertices
Embrace LOCALITY and PARALLELISM
□ Outperform PowerGraph by up to 5.53X
□ Also much faster than other systems like GraphX,
Giraph, and CombBLAS for ingress and runtime
□ Random hybrid-cut improves GraphX by 1.33X
Thanks
Questions?
PowerLyra: http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
Institute of Parallel and Distributed Systems
Related Work
Other graph processing systems:
GENERAL          Pregel[SIGMOD'10], GraphLab[VLDB'12],
                 PowerGraph[OSDI'12], GraphX[OSDI'14]
OPTIMIZATION     Mizan[EuroSys'13], Giraph++[VLDB'13],
                 PowerSwitch[PPoPP'15]
SINGLE MACHINE   GraphChi[OSDI'12], X-Stream[SOSP'13],
                 Galois[SOSP'13], Polymer[PPoPP'15]
OTHER PARADIGMS  CombBLAS[IJHPCA'11], SociaLite[VLDB'13]
Graph replication and partitioning:
EDGE-CUT         Stanton et al. [SIGKDD'12], Fennel[WSDM'14]
VERTEX-CUT       Greedy[OSDI'12], GraphBuilder[GRADES'13]
HEURISTICS       2D[SC'05], Degree-Based Hashing[NIPS'14]
Backup
Downgrade
#msgs per mirror by gather/scatter direction
(rows: SCATTER_EDGES; columns: GATHER_EDGES):

       ALL  IN  OUT  NONE
ALL     4    2   2    2
IN      3    2   1    1
OUT     3    1   2    1
NONE    3    1   1    1

OBSERVATION: most algorithms only gather or scatter in one
direction; the NATURAL types hit the sweet spot of 1 msg
[Figure: downgraded low-degree models with extra messages
between master and mirror]
Other Algorithms and Graphs
[Charts: normalized speedup for CC and DIA on power-law graphs
(α = 1.8-2.2, 48-node) for PG+Grid (default), PG+Oblivious,
PG+Coordinated, PL+Hybrid, and PL+Ginger]

MLDM algorithms (Netflix: |V|=0.5M, |E|=99M, 48-node;
d = latent dimension; cells: ingress | runtime in sec):

ALS (G/NONE & S/ALL)
     d=5     d=20    d=50    d=100
PG   10| 33  11|144  16|732  Failed
PL   13| 23  13| 51  14|177  15|614

SGD (G/OUT & S/NONE)
     d=5     d=20    d=50    d=100
PG   15| 35  17| 48  21| 73  28|115
PL   16| 26  19| 33  19| 43  20| 59

V_DATA = 8d + 13, E_DATA = 16; replication factor (λ):
Grid 12.3 vs. Hybrid 2.6

Non-skewed Graph (RoadUS: |V|=23.9M, |E|=58.3M, 48-node):
Partitioning  λ    Ingress  Exec
Grid          3.2  15.5     57.3
Oblivious     2.3  13.8     51.8
Coordinated   2.3  26.9     50.4
Hybrid        3.3  14.0     32.2
Ginger        2.8  28.8     31.3
Ingress Time vs. Execution Time

PageRank, Twitter Follower Graph:
Partitioning  λ     Ingress  Exec
Random        16.0  263      823
Grid          8.3   123      373
Oblivious     12.8  289      660
Coordinated   5.5   391      298
Hybrid        5.6   138      155

ALS (d=20), Netflix Movie Recommendation:
Partitioning  λ     Ingress  Exec
Random        36.9  21       547
Grid          12.3  12       174
Oblivious     31.5  25       476
Coordinated   5.3   31       105
Hybrid        2.6   14       67
Fast CPU & Networking
PageRank w/ 10 iterations
[Charts: execution time of PG-1Gb, PL-1Gb, PG-10Gb, and PL-10Gb
on the Twitter follower graph and a power-law (α=2.0) graph;
speedups of 1.77X, 1.51X, 3.68X, and 1.67X]
Testbed: 6-node physical clusters
1. SLOW: 24 cores (AMD 6168 1.9GHz, L3: 12MB), 64GB RAM, 1GbE
2. FAST: 20 cores (Intel E5-2650 2.3GHz, L3: 24MB), 64GB RAM, 10GbE
More Related Content

What's hot

2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
智啓 出川
 
2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例
2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例
2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例
智啓 出川
 
2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用 (支配方程式,CPUプログラム)
2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用(支配方程式,CPUプログラム)2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用(支配方程式,CPUプログラム)
2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用 (支配方程式,CPUプログラム)
智啓 出川
 
2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理
2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理
2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理
智啓 出川
 
Link prediction
Link predictionLink prediction
Link prediction
ybenjo
 

What's hot (20)

[DL輪読会]Continuous Adaptation via Meta-Learning in Nonstationary and Competiti...
[DL輪読会]Continuous Adaptation via Meta-Learning in Nonstationary and Competiti...[DL輪読会]Continuous Adaptation via Meta-Learning in Nonstationary and Competiti...
[DL輪読会]Continuous Adaptation via Meta-Learning in Nonstationary and Competiti...
 
カークマンの女学生問題と有限幾何
カークマンの女学生問題と有限幾何カークマンの女学生問題と有限幾何
カークマンの女学生問題と有限幾何
 
AVX-512(フォーマット)詳解
AVX-512(フォーマット)詳解AVX-512(フォーマット)詳解
AVX-512(フォーマット)詳解
 
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
 
マイクロソフトが考えるAI活用のロードマップ
マイクロソフトが考えるAI活用のロードマップマイクロソフトが考えるAI活用のロードマップ
マイクロソフトが考えるAI活用のロードマップ
 
2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例
2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例
2015年度GPGPU実践プログラミング 第1回 GPGPUの歴史と応用例
 
GANの簡単な理解から正しい理解まで
GANの簡単な理解から正しい理解までGANの簡単な理解から正しい理解まで
GANの簡単な理解から正しい理解まで
 
CUDAプログラミング入門
CUDAプログラミング入門CUDAプログラミング入門
CUDAプログラミング入門
 
2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用 (支配方程式,CPUプログラム)
2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用(支配方程式,CPUプログラム)2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用(支配方程式,CPUプログラム)
2015年度先端GPGPUシミュレーション工学特論 第11回 数値流体力学への応用 (支配方程式,CPUプログラム)
 
from old java to java8 - KanJava Edition
from old java to java8 - KanJava Editionfrom old java to java8 - KanJava Edition
from old java to java8 - KanJava Edition
 
2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理
2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理
2015年度GPGPU実践基礎工学 第12回 GPUによる画像処理
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
コンピュータに「最長しりとり」「最短距離でのJR線全線乗り尽くし」を解いてもらった方法
コンピュータに「最長しりとり」「最短距離でのJR線全線乗り尽くし」を解いてもらった方法コンピュータに「最長しりとり」「最短距離でのJR線全線乗り尽くし」を解いてもらった方法
コンピュータに「最長しりとり」「最短距離でのJR線全線乗り尽くし」を解いてもらった方法
 
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
 
2009/12/10 GPUコンピューティングの現状とスーパーコンピューティングの未来
2009/12/10 GPUコンピューティングの現状とスーパーコンピューティングの未来2009/12/10 GPUコンピューティングの現状とスーパーコンピューティングの未来
2009/12/10 GPUコンピューティングの現状とスーパーコンピューティングの未来
 
グラフデータ分析 入門編
グラフデータ分析 入門編グラフデータ分析 入門編
グラフデータ分析 入門編
 
グラフデータベースは如何に自然言語を理解するか?
グラフデータベースは如何に自然言語を理解するか?グラフデータベースは如何に自然言語を理解するか?
グラフデータベースは如何に自然言語を理解するか?
 
Link prediction
Link predictionLink prediction
Link prediction
 
OpenMPI入門
OpenMPI入門OpenMPI入門
OpenMPI入門
 
[DL輪読会]Attention is not Explanation (NAACL2019)
[DL輪読会]Attention is not Explanation (NAACL2019)[DL輪読会]Attention is not Explanation (NAACL2019)
[DL輪読会]Attention is not Explanation (NAACL2019)
 

Similar to PowerLyra@EuroSys2015

Pagerank (from Google)
Pagerank (from Google)Pagerank (from Google)
Pagerank (from Google)
Sri Prasanna
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
Fact checking using Knowledge graphs (course: CS584 @Emory U)

Fact checking using Knowledge graphs (course: CS584 @Emory U)
Fact checking using Knowledge graphs (course: CS584 @Emory U)

Fact checking using Knowledge graphs (course: CS584 @Emory U)

Ramraj Chandradevan
 

Similar to PowerLyra@EuroSys2015 (20)

Pagerank (from Google)
Pagerank (from Google)Pagerank (from Google)
Pagerank (from Google)
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Lec5 pagerank
Lec5 pagerankLec5 pagerank
Lec5 pagerank
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Dijkstra
DijkstraDijkstra
Dijkstra
 
d
dd
d
 
Hot-Spot Analysis Tips & Tools
Hot-Spot Analysis Tips & ToolsHot-Spot Analysis Tips & Tools
Hot-Spot Analysis Tips & Tools
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY...
 
Distributed graph summarization
Distributed graph summarizationDistributed graph summarization
Distributed graph summarization
 
Rank adjustment strategies for Dynamic PageRank : REPORT
Rank adjustment strategies for Dynamic PageRank : REPORTRank adjustment strategies for Dynamic PageRank : REPORT
Rank adjustment strategies for Dynamic PageRank : REPORT
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache Spark
 
Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technical
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
Fact checking using Knowledge graphs (course: CS584 @Emory U)

Fact checking using Knowledge graphs (course: CS584 @Emory U)
Fact checking using Knowledge graphs (course: CS584 @Emory U)

Fact checking using Knowledge graphs (course: CS584 @Emory U)

 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 

Recently uploaded

Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
Kayode Fayemi
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
raffaeleoman
 

Recently uploaded (20)

Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service

PowerLyra@EuroSys2015

  • 1. PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
    RONG CHEN, JIAXIN SHI, YANZHE CHEN, HAIBO CHEN
    Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China
    EuroSys 2015
  • 2. Graphs are Everywhere¹
    Traffic network graphs
    □ Monitor/react to road incidents
    □ Analyze Internet security
    Biological networks
    □ Predict disease outbreaks
    □ Model drug-protein interactions
    Social networks
    □ Recommend people/products
    □ Track information flow
    ¹ From PPoPP’15 Keynote: Computing in a Persistent (Memory) World, P. Faraboschi
  • 3. Graph-parallel Processing
    Coding graph algorithms as vertex-centric programs to process vertices in parallel and communicate along edges -- the "Think as a Vertex" philosophy
    Example: PageRank, a centrality analysis algorithm that iteratively ranks each vertex as a weighted sum of its neighbors’ ranks:
      R_i = 0.15 + 0.85 * Σ_{(j,i)∈E} ω_ij · R_j
    COMPUTE(v)
      foreach n in v.in_nbrs
        sum += n.rank / n.nedges
      v.rank = 0.15 + 0.85 * sum
      if !converged(v)
        foreach n in v.out_nbrs
          activate(n)
  • 4. Graph-parallel Processing (cont.): Gather data from neighbors via in-edges
  • 5. Graph-parallel Processing (cont.): Update the vertex with the new data
  • 6. Graph-parallel Processing (cont.): Scatter data to neighbors via out-edges
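The compute/gather/apply/scatter loop above can be sketched as a minimal synchronous PageRank over an in-memory graph. This is an illustrative Python sketch, not PowerLyra's C++ API; the `pagerank` function and its dictionary-based graph representation are assumptions for the example.

```python
# Minimal sketch of the vertex-centric PageRank above, executed
# synchronously over an in-memory graph (illustrative, not PowerLyra's API).

def pagerank(in_nbrs, out_degree, iters=10):
    """in_nbrs: vertex -> list of in-neighbors; out_degree: vertex -> out-degree."""
    rank = {v: 1.0 for v in in_nbrs}
    for _ in range(iters):
        new_rank = {}
        for v, nbrs in in_nbrs.items():
            # Gather: weighted sum of neighbors' ranks via in-edges
            s = sum(rank[n] / out_degree[n] for n in nbrs)
            # Apply: R_i = 0.15 + 0.85 * sum
            new_rank[v] = 0.15 + 0.85 * s
        rank = new_rank  # Scatter/activation is implicit in sync execution
    return rank

# Tiny 3-vertex cycle: at the fixed point R = 0.15 + 0.85 * R, so R = 1.0
edges = {1: [3], 2: [1], 3: [2]}   # vertex -> in-neighbors
deg = {1: 1, 2: 1, 3: 1}           # each vertex has one out-edge
r = pagerank(edges, deg)
```

In a real engine the scatter phase would activate neighbors for the next superstep; here the synchronous loop simply recomputes every vertex.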
  • 7. Natural Graphs are Skewed
    Real-world graphs follow a power-law degree distribution (aka Zipf’s law or Pareto):
    “most vertices have relatively few neighbors while a few have many neighbors” -- PowerGraph [OSDI’12]
    Source: M. E. J. Newman, “Power laws, Pareto distributions and Zipf’s law”, 2005
    [Figure: power-law distributions of word frequency, citations, web hits, books sold, telephone calls received, earthquake magnitude, crater diameter, peak intensity, net worth, name frequency, and city population]
    Case: SINA Weibo (微博), 167 million active users¹
    #Followers: #1 77,971,093; #2 76,946,782; #100 14,723,818; ... 397 ... 69
    ¹ http://www.chinainternetwatch.com/10735/weibo-q3-2014/
  • 8. Challenge: LOCALITY vs. PARALLELISM
    LOCALITY: make resources locally accessible to hide network latency
    (H: high-degree vertex, e.g., 77,971,093 followers; L: low-degree vertex, e.g., 397)
    Problems for high-degree vertices:
    1. Imbalance
    2. High contention
    3. Heavy network traffic
  • 9. Challenge: LOCALITY vs. PARALLELISM
    PARALLELISM: evenly parallelize the workload to avoid load imbalance
    Problem for low-degree vertices:
    1. Redundant communication, computation & synchronization (not worthwhile)
  • 10. Challenge: LOCALITY vs. PARALLELISM
    The two goals conflict on skewed graphs:
    100 million users w/ 100 followers vs. 100 users w/ 100 million followers
  • 11. Existing Efforts: One Size Fits All
    Pregel [SIGMOD’10] and GraphLab [VLDB’12]
    □ Focus on exploiting LOCALITY
    □ Partitioning: use edge-cut to evenly assign vertices along with all of their edges
    □ Computation: aggregate all resources (i.e., msgs or replicas) of a vertex on the local machine
  • 12. Existing Efforts: One Size Fits All
    PowerGraph [OSDI’12] and GraphX [OSDI’14]
    □ Focus on exploiting PARALLELISM
    □ Partitioning: use vertex-cut to evenly assign edges, replicating vertices
    □ Computation: decompose the workload of a vertex across multiple machines
  • 13. Contributions
    PowerLyra: differentiated graph computation and partitioning on skewed graphs
    □ Hybrid computation and partitioning strategies → embrace locality (low-degree vertices) and parallelism (high-degree vertices)
    □ An optimization to improve locality and reduce interference in communication
    □ Outperforms PowerGraph (by up to 5.53X) and other systems for both ingress and runtime
  • 15. Graph Partitioning
    Hybrid-cut: differentiate graph partitioning for low-degree and high-degree vertices
    □ Low-cut (inspired by edge-cut): reduce mirrors and exploit locality for low-degree vertices
    □ High-cut (inspired by vertex-cut): provide balance and restrict the impact of high-degree vertices
    □ Efficiently combine low-cut and high-cut
    (Sample graph with user-defined threshold θ=3; vertices are split into high-masters, low-masters, high-mirrors and low-mirrors)
  • 16. Hybrid-cut (Low)
    Low-cut: evenly assign low-degree vertices, along with their edges in only one direction (e.g., in-edges), to machines by hashing the (e.g., target) vertex
    1. Unidirectional access locality
    2. No duplicated edges
    3. Efficiency at ingress
    4. Load balance (E & V)
    5. Ready-made heuristics
    Example: hash(X) = X % 3, θ=3
    NOTE: A new heuristic, called Ginger, is provided to further optimize Random hybrid-cut (see paper)
  • 17. Hybrid-cut (High)
    High-cut: distribute the one-direction edges (e.g., in-edges) of high-degree vertices to machines by hashing the opposite (e.g., source) vertex
    1. Avoids adding any mirrors for low-degree vertices (i.e., an upper bound on the mirrors added when assigning a new high-degree vertex)
    2. No duplicated edges
    3. Efficiency at ingress
    4. Load balance (E & V)
    Example: hash(X) = X % 3, θ=3
  • 18-21. Constructing Hybrid-cut (sample graph, θ=3, edge-list format)
    1. Load files in parallel
    2. Dispatch all edges (low-cut) and count the in-degree of vertices
    3. Re-assign the in-edges of high-degree vertices (high-cut, θ=3)
    4. Create mirrors to construct a local graph
    NOTE: some formats (e.g., adjacency list) can skip counting and re-assignment
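The four construction steps can be condensed into a short sketch of Random hybrid-cut over an edge list. This is an illustrative Python sketch under the slide's description, not PowerLyra's implementation; the `hybrid_cut` function and its return shape are assumptions.

```python
# Sketch of Random hybrid-cut (illustrative, not PowerLyra's code):
# low-cut hashes each edge by its target vertex; in-edges of
# high-degree targets (in-degree >= theta) are re-assigned by source.
from collections import defaultdict

def hybrid_cut(edges, num_machines, theta):
    """edges: list of (src, dst). Returns machine id -> list of (src, dst)."""
    # Step 2: count in-degrees while dispatching (done in two passes here)
    in_deg = defaultdict(int)
    for _, dst in edges:
        in_deg[dst] += 1
    parts = defaultdict(list)
    for src, dst in edges:
        if in_deg[dst] >= theta:
            # Step 3 (high-cut): hash the opposite (source) vertex
            parts[src % num_machines].append((src, dst))
        else:
            # Low-cut: hash the target vertex, so all in-edges of a
            # low-degree vertex land on a single machine
            parts[dst % num_machines].append((src, dst))
    return parts

# Sample graph from the slides, θ=3: vertex 1 (in-degree 4) is high-degree
sample = [(1, 6), (1, 4), (2, 1), (2, 5), (2, 6), (3, 1), (3, 4), (4, 1), (5, 1)]
p = hybrid_cut(sample, 3, 3)
```

Note how vertex 1's in-edges spread across machines by source hash, while each low-degree vertex keeps all its in-edges on one machine.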
  • 23. Graph Computation
    Hybrid-model: differentiate the graph computation model for high-degree and low-degree vertices
    □ High-degree model: decompose the workload of high-degree vertices for load balance
    □ Low-degree model: minimize the communication cost for low-degree vertices
    □ Seamlessly runs all existing graph algorithms
    (High-degree model: mirrors gather and scatter locally around a distributed master; low-degree model: the master runs gather/apply locally and mirrors only scatter)
  • 24. Hybrid-model (High)
    High-degree vertices
    □ Exploit PARALLELISM
    □ Follow the GAS model (inspired by PowerGraph): GATHER and SCATTER are distributed, APPLY is local
    □ Communication cost: ≤ 4 x #mirrors
    Example: PageRank in GAS form
      GATHER(v, n): return n.rank / n.nedges
      SUM(a, b): return a + b
      GATHER_EDGES = IN
      APPLY(v, sum): v.rank = 0.15 + 0.85 * sum
      SCATTER(v, n): if !converged(v) activate(n)
      SCATTER_EDGES = OUT
  • 25. Hybrid-model (Low)
    Low-degree vertices
    □ Exploit LOCALITY
    □ Preserve the GAS interface
    □ Unidirectional access locality: the machine owning a low-degree vertex holds all of its in-edges, so GATHER runs locally
    □ Communication cost: ≤ 1 x #mirrors
    OBSERVATION: most algorithms only gather or scatter in one direction (e.g., PageRank: GATHER/IN and SCATTER/OUT)
  • 26. Generalization
    One-direction locality limits expressiveness for some OTHER algorithms (i.e., gather or scatter in both IN and OUT)
    Seamlessly run all graph algorithms:
    □ Adaptively downgrade the model (+ N msgs, 1 ≤ N ≤ 3)
    □ Detect at runtime w/o any changes to applications
    TYPE    | GATHER_EDGES | SCATTER_EDGES | EX. ALGO.
    NATURAL | IN or NONE   | OUT or NONE   | PR, SSSP
    NATURAL | OUT or NONE  | IN or NONE    | DIA
    OTHER   | ANY          | ANY           | CC, ALS
    e.g., Connected Components (CC) → G/NONE and S/ALL
    → +1 msg in SCATTER (mirror → master)
    → Communication cost: ≤ 2 x #mirrors
  • 28. Challenge: Locality & Interference
    Poor locality and high interference in communication:
    □ Messages are batched and sent periodically
    □ Messages from different nodes are applied to vertices (i.e., masters and mirrors) in parallel
    OPT: Locality-conscious graph layout
    □ Focus on exploiting locality in communication
    □ Extend hybrid-cut at ingress (w/o runtime overhead)
  • 29. Example: Initial State (e.g., Hash(X) = (X-1) % 3)
    Both masters and mirrors of all vertices (high- & low-degree) are mixed and stored in a random order on each machine → poor locality and interference
  • 30. Example: Zoning
    1. ZONE: divide the vertex space into 4 zones, storing high-masters (Z0), low-masters (Z1), high-mirrors (Z2) and low-mirrors (Z3)
  • 31. Example: Grouping
    2. GROUP: the mirrors (i.e., Z2 and Z3) are further grouped within each zone according to the location of their masters
  • 32-33. Example: Sorting
    3. SORT: both the masters and mirrors within a group are sorted according to the global vertex ID (contention across machines still remains)
  • 34. Example: Rolling
    4. ROLL: place mirror groups in a rolling order: the mirror groups on machine n (of p partitions) start from (n + 1) % p
    e.g., M1’s mirror groups start from h2 then h0, since (n + 1) % p = 2 for n = 1 and p = 3
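The zone/group/sort/roll steps can be sketched as a single local-ordering function. This is an illustrative Python sketch of the layout described on the slides; the `layout` function and its input shapes are assumptions, not PowerLyra's data structures.

```python
# Sketch of the locality-conscious layout (illustrative): order local
# vertices into zones Z0-Z3, group mirrors by their master's machine,
# sort each group by vertex id, and roll the group order from (n+1) % p.

def layout(n, p, high_masters, low_masters, high_mirrors, low_mirrors):
    """n: this machine's id; p: #partitions.
    high_mirrors/low_mirrors: master machine id -> list of mirrored vertex ids.
    Returns the local vertex storage order."""
    order = sorted(high_masters) + sorted(low_masters)  # Z0, then Z1
    for mirrors in (high_mirrors, low_mirrors):         # Z2, then Z3
        for i in range(1, p):                           # ROLL: skip self,
            m = (n + i) % p                             # start at (n+1) % p
            order.extend(sorted(mirrors.get(m, [])))    # GROUP + SORT
    return order

# Machine M1 (n=1, p=3): mirror groups start from machine 2, then 0
o = layout(1, 3,
           high_masters=[2], low_masters=[5, 8],
           high_mirrors={0: [1], 2: [7]},
           low_mirrors={0: [4, 6], 2: [9]})
```

Rolling staggers which remote groups each machine touches first, which is what reduces contention when all machines exchange messages simultaneously.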
  • 35. Tradeoff: Ingress vs. Runtime
    OPT: Locality-conscious graph layout
    □ The increase in graph ingress time is modest
    → Independently executed on each machine
    → Less than 5% (power-law graphs) & around 10% (real-world graphs)
    □ Usually more than 10% speedup at runtime
    → 21% for the Twitter follower graph
    [Figure: normalized speedup of exec time vs. ingress time on power-law (α = 1.8-2.2) and real-world (TW, UK, WK, LJ, GW) graphs]
  • 37. Implementation
    PowerLyra
    □ Based on the latest PowerGraph (v2.2, Mar. 2014)
    □ A separate engine and partitioning algorithms
    □ Supports both sync and async execution engines
    □ Seamlessly runs all graph toolkits in GraphLab and respects its fault-tolerance model
    □ The Random hybrid-cut is also ported to GraphX¹
    The source code and a brief instruction are available at http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
    ¹ We only port Random hybrid-cut to preserve the graph partitioning interface of GraphX.
  • 38. Evaluation
    Baseline: latest PowerGraph (v2.2, Mar. 2014) with various vertex-cuts (Grid/Oblivious/Coordinated)
    Platforms¹
    1. A 48-node VM-based cluster (each: 4 cores, 12GB RAM)
    2. A 6-node physical cluster (each: 24 cores, 64GB RAM)
    Algorithms and graphs
    □ 3 graph-analytics algorithms (PR², CC, DIA) and 2 MLDM algorithms (ALS, SGD)
    □ 5 real-world graphs and 5 synthetic graphs³
    Real-world: Twitter |V|=42M |E|=1.47B; UK-2005 |V|=40M |E|=936M; Wiki |V|=5.7M |E|=130M; LJournal |V|=5.4M |E|=79M; GWeb |V|=0.9M |E|=5.1M
    Synthetic (power-law α → |E|): 1.8 → 673M; 1.9 → 249M; 2.0 → 105M; 2.1 → 54M; 2.2 → 39M
    ¹ Clusters are connected via 1GigE. ² We run 10 iterations for PageRank. ³ Synthetic graphs have a fixed 10 million vertices.
  • 39. Performance (48-node, PageRank)
    Speedups of PL+Hybrid/Ginger over the PowerGraph baseline: 2.2X, 2.9X, 3.3X, 3.2X, 3.2X on power-law graphs (α = 1.8-2.2) and 2.6X, 2.0X, 2.0X, 3.0X, 5.5X on TW, UK, WK, LJ, GW
    Twitter follower graph (TW), time (sec):
    Partitioning | Ingress | Exec
    Random       | 263     | 823
    Grid         | 123     | 373
    Oblivious    | 289     | 660
    Coordinated  | 391     | 298
    Hybrid       | 138     | 155
    Ginger       | 307     | 143
  • 40. Scalability
    [Figure: execution time as #machines scales from 8 to 48 on the Twitter follower graph (48-node cluster) and as #vertices (millions) scales on a scale-free power-law graph (6-node cluster); PL+Hybrid/Ginger scale best, e.g., 47.4 → 28.5 → 23.1 → 16.8 sec]
  • 41. Breakdown (48-node, power-law graphs, α = 1.8-2.2)
    □ One-iteration communication: hybrid-cut reduces traffic by about 75% vs. Grid and 50% vs. Coordinated
    □ Comparing the hybrid-cuts on PowerGraph’s engine (PG+Hybrid/Ginger) against PowerLyra’s hybrid engine (PL+Hybrid/Ginger) shows the engine contributes about 30% of the improvement
  • 42. Graph Partitioning Comparison
    Partitioning | Type       | Replication factor (λ) | Ingress time | Load balance
    Random       | Hashing    | Bad                    | Modest       | Good
    Grid         | 2D Hashing | Modest                 | Great        | Modest
    Oblivious    | Greedy     | Modest                 | Modest       | Great
    Coordinated  | Greedy     | Great                  | Bad          | Great
    Hybrid(R)    | Hashing    | Good                   | Great        | Great
    Ginger       | Greedy     | Great                  | Bad          | Good
    Twitter follower graph (TW), 48 partitions (λ | ingress sec): Random 16.0 | 263; Grid 8.3 | 123; Oblivious 12.8 | 289; Coordinated 5.5 | 391; Hybrid 5.6 | 138; Ginger 4.8 | 307
    [Figure: replication factor (λ) across power-law (α = 1.8-2.2) and real-world graphs]
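The replication factor (λ) compared above is the average number of replicas (master plus mirrors) per vertex. A minimal sketch of computing it from an edge partition, assuming a vertex is replicated on every machine that holds one of its edges (the function name and input shape are illustrative):

```python
# Sketch: compute the replication factor lambda from an edge partition.

def replication_factor(parts):
    """parts: machine id -> list of (src, dst) edges.
    A vertex gets one replica on each machine holding any of its edges."""
    machines = {}  # vertex -> set of machines holding it
    for m, edges in parts.items():
        for src, dst in edges:
            machines.setdefault(src, set()).add(m)
            machines.setdefault(dst, set()).add(m)
    # lambda = total replicas / #vertices
    return sum(len(s) for s in machines.values()) / len(machines)

# Vertex 1 is cut across machines 0 and 1; vertices 2 and 3 are not:
# lambda = (2 + 1 + 1) / 3 = 4/3
lam = replication_factor({0: [(1, 2)], 1: [(1, 3)]})
```

Lower λ means less replicated state to synchronize, which is why Hybrid's 5.6 vs. Random's 16.0 on Twitter translates directly into less communication.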
  • 43. Threshold (PageRank w/ Twitter, 48-node)
    Impact of different thresholds θ:
    □ Using high-cut for all vertices (small θ) or low-cut for all vertices (θ = +∞) is poor
    □ Execution time is stable over a large range of θ → the difference is lower than 1 sec for 100 ≤ θ ≤ 500
    [Figure: replication factor (λ) and execution time as θ sweeps through 10³, 10⁴, 10⁵ and +∞]
  • 44. vs. Other Systems¹ (6-node, PageRank with 10 iterations)
    In-memory systems:
    [Figure: execution times of Giraph, GPS, CombBLAS, GraphX, GraphX/H, PowerGraph and PowerLyra on the Twitter follower graph and a power-law (α=2.0) graph; PowerLyra is about 25% faster than PowerGraph, including ingress]
    Out-of-core comparison on power-law graphs (exec time, sec):
    Graph     | PowerLyra (6/1 node) | Polymer | Galois | X-Stream | GraphChi
    |V|=10M   | 14 / 45              | 6.3     | 9.8    | 9.0      | 115
    |V|=400M  | 186 / --             | --      | --     | 710      | 1666
    ¹ We use Giraph 1.1, CombBLAS 1.4, GraphX 1.1, Galois 2.2.1, X-Stream 1.0, as well as the latest versions of GPS, PowerGraph, Polymer and GraphChi.
  • 45. Conclusion
    Existing systems use a “ONE SIZE FITS ALL” design for skewed graphs, resulting in suboptimal performance
    PowerLyra: a hybrid and adaptive design that differentiates computation and partitioning for low- and high-degree vertices
    Embrace LOCALITY and PARALLELISM
    □ Outperforms PowerGraph by up to 5.53X
    □ Also much faster than other systems like GraphX, Giraph, and CombBLAS for ingress and runtime
    □ Random hybrid-cut improves GraphX by 1.33X
  • 47. Related Work
    Other graph processing systems
    □ GENERAL: Pregel [SIGMOD’10], GraphLab [VLDB’12], PowerGraph [OSDI’12], GraphX [OSDI’14]
    □ OPTIMIZATION: Mizan [EuroSys’13], Giraph++ [VLDB’13], PowerSwitch [PPoPP’15]
    □ SINGLE MACHINE: GraphChi [OSDI’12], X-Stream [SOSP’13], Galois [SOSP’13], Polymer [PPoPP’15]
    □ OTHER PARADIGMS: CombBLAS [IJHPCA’11], SociaLite [VLDB’13]
    Graph replication and partitioning
    □ EDGE-CUT: Stanton et al. [SIGKDD’12], Fennel [WSDM’14]
    □ VERTEX-CUT: Greedy [OSDI’12], GraphBuilder [GRADES’13]
    □ HEURISTICS: 2D [SC’05], Degree-Based Hashing [NIPS’14]
  • 49. Downgrade
    OBSERVATION: most algorithms only gather or scatter in one direction
    #msgs per mirror, by direction (rows: SCATTER, columns: GATHER):
    S \ G | ALL | IN | OUT | NONE
    ALL   |  4  |  2 |  2  |  2
    IN    |  3  |  2 |  1  |  1
    OUT   |  3  |  1 |  2  |  1
    NONE  |  3  |  1 |  1  |  1
    Sweet spot: gather and scatter in opposite single directions (e.g., G/IN & S/OUT) needs only 1 msg
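The runtime downgrade decision is essentially a lookup in the table above. A minimal sketch encoding that table (the function name and dictionary encoding are illustrative, not PowerLyra's code):

```python
# Sketch: messages per mirror for each (scatter, gather) direction pair,
# encoding the downgrade table from the slide above.

MSGS = {  # (scatter_edges, gather_edges) -> msgs per mirror per iteration
    ('ALL', 'ALL'): 4, ('ALL', 'IN'): 2, ('ALL', 'OUT'): 2, ('ALL', 'NONE'): 2,
    ('IN', 'ALL'): 3,  ('IN', 'IN'): 2,  ('IN', 'OUT'): 1,  ('IN', 'NONE'): 1,
    ('OUT', 'ALL'): 3, ('OUT', 'IN'): 1, ('OUT', 'OUT'): 2, ('OUT', 'NONE'): 1,
    ('NONE', 'ALL'): 3, ('NONE', 'IN'): 1, ('NONE', 'OUT'): 1, ('NONE', 'NONE'): 1,
}

def msgs_per_mirror(gather_edges, scatter_edges):
    """Look up the per-mirror message count for an algorithm's declared
    gather/scatter directions."""
    return MSGS[(scatter_edges, gather_edges)]

# PageRank sits at the sweet spot (G/IN, S/OUT): 1 msg per mirror.
# Connected Components (G/NONE, S/ALL) downgrades to 2 msgs per mirror.
```

The engine can detect the declared directions at runtime, so applications need no changes when an algorithm falls outside the sweet spot.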
  • 50. Other Algorithms and Graphs (48-node)
    MLDM algorithms (Netflix: |V|=0.5M, |E|=99M; ingress | runtime in sec, by latent dimension d):
    ALS: PG 10|33 (d=5), 11|144 (d=20), 16|732 (d=50), failed (d=100); PL 13|23, 13|51, 14|177, 15|614
    SGD: PG 15|35 (d=5), 17|48 (d=20), 21|73 (d=50), 28|115 (d=100); PL 16|26, 19|33, 19|43, 20|59
    Data sizes: V_DATA = 8d + 13, E_DATA = 16; λ: Grid 12.3, Hybrid 2.6
    [Figure: normalized speedup of DIA (G/OUT & S/NONE) and CC (G/NONE & S/ALL) on power-law graphs, α = 1.8-2.2]
  • 51. Other Algorithms and Graphs (cont.)
    Non-skewed graph (RoadUS: |V|=23.9M, |E|=58.3M; 48-node):
    Partitioning | λ   | Ingress (sec) | Exec (sec)
    Grid         | 3.2 | 15.5          | 57.3
    Oblivious    | 2.3 | 13.8          | 51.8
    Coordinated  | 2.3 | 26.9          | 50.4
    Hybrid       | 3.3 | 14.0          | 32.2
    Ginger       | 2.8 | 28.8          | 31.3
  • 52. Ingress Time vs. Execution Time
    PageRank, Twitter follower graph (λ | ingress | exec sec):
    Random 16.0 | 263 | 823; Grid 8.3 | 123 | 373; Oblivious 12.8 | 289 | 660; Coordinated 5.5 | 391 | 298; Hybrid 5.6 | 138 | 155
    ALS (d=20), Netflix movie recommendation graph (λ | ingress | exec sec):
    Random 36.9 | 21 | 547; Grid 12.3 | 12 | 174; Oblivious 31.5 | 25 | 476; Coordinated 5.3 | 31 | 105; Hybrid 2.6 | 14 | 67
  • 53. Fast CPU & Networking
    Testbed: 6-node physical clusters
    1. SLOW: 24 cores (AMD 6168 1.9GHz, L3: 12MB), 64GB RAM, 1GbE
    2. FAST: 20 cores (Intel E5-2650 2.3GHz, L3: 24MB), 64GB RAM, 10GbE
    [Figure: PageRank (10 iterations) execution time of PG vs. PL under 1GbE and 10GbE on the Twitter follower graph and a power-law (α=2.0) graph; PowerLyra’s speedups are 1.77X, 1.51X, 3.68X and 1.67X]