This presentation is for the published paper "Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing" which was presented in EuroSys 2013
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing"
1. Mizan: A System for Dynamic Load
Balancing in Large-scale Graph Processing
Zuhair Khayyat1 Karim Awara1 Amani Alonazi1
Hani Jamjoom2 Dan Williams2 Panos Kalnis1
1
King Abdullah University of Science and Technology
Thuwal, Saudi Arabia
2
IBM Watson Research Center
Yorktown Heights, NY
1/34
2. Importance of Graphs
Graphs abstract application-specific algorithms into generic
problems represented as interactions using vertices and edges
Max flow in road network Diameter of WWW
Ranking in social networks Simulating protein interactions
Algorithms vary in their computational requirements
2/34
3. How to Scale Graph Processing
Pregel was introduced by Google as a scalable abstraction for
graph processing
Overcomes the limitations of processing graphs on MapReduce
Based on vertex-centric computation
Utilizes bulk synchronous parallel (BSP)
System Programming Data Computations
Abstraction Exchange
map() Key based
MapReduce combine() grouping Disk based
reduce()
compute() Message
Pregel combine() passing In memory
aggregate()
3/34
4. Pregel’s BSP Graph Processing
Superstep 1 Superstep 2 Superstep 3
Worker 1 Worker 1 Worker 1
Worker 2 Worker 2 Worker 2
Worker 3 Worker 3 Worker 3
BSP Barrier
Balanced computation and communication is fundamental to
Pregel’s efficiency
4/34
5. Optimizing Graph Processing
Existing work focus on optimizing for graph structure (static
optimization):
Optimize graph partitioning:
Simple graph partitioning schemes (e.g., hash or range): Giraph
User-defined partitioning function: Pregel
Sophisticated partitioning techniques (e.g., min-cuts): GraphLab,
PowerGraph, GoldenOrb and Surfer
Optimize graph data access:
Distributed data stores and graph indexing: GoldenOrb, Hama
and Trinity
5/34
6. Optimizing Graph Processing
Existing work focus on optimizing for graph structure (static
optimization):
Optimize graph partitioning:
Simple graph partitioning schemes (e.g., hash or range): Giraph
User-defined partitioning function: Pregel
Sophisticated partitioning techniques (e.g., min-cuts): GraphLab,
PowerGraph, GoldenOrb and Surfer
Optimize graph data access:
Distributed data stores and graph indexing: GoldenOrb, Hama
and Trinity
What about algorithm behavior?
Pregel provides coarse-grained load balancing, is it enough?
5/34
7. Types of Algorithms
Algorithms behaves differently, we classify algorithms in two
categories depending on their behavior
Algorithm Type In/Out Vertices Examples
Messages State
PageRank,
Stationary Predictable Fixed Diameter estimation,
WCC
Distributed minimum
spanning tree,
Non-stationary Variable Variable Belief propagation,
Graph queries,
Ad propagation
6/34
10. Types of Algorithms – Superstep k + m
V2 V8 V2 V8
V4 V1 V6 V4 V1 V6
V7 V3 V5 V7 V3 V5
PageRank DMST
9/34
11. What Causes Computation Imbalance in
Non-stationary Algorithms?
Superstep 1 Superstep 2 Superstep 3
Worker 1 Worker 1 Worker 1
Worker 2 Worker 2 Worker 2
Worker 3 Worker 3 Worker 3
BSP Barrier
-Vertex response time -Time to receive in messages
-Time to send out messages
10/34
12. Why to Optimize for Algorithm Behavior?
Difference between stationary and non-stationary algorithms
1000
In Messages (Millions)
100
10
1
0.1
0.01
0.001
0 10 20 30 40 50 60
SuperSteps
PageRank - Total DMST - Total
PageRank - Max/W DMST - Max/W
11/34
13. What is Mizan?
BSP-based graph processing framework
Uses runtime fine-grained vertex migrations to balance
computation and communication
Follows the Pregel programming model
Open source, written in C++
12/34
14. PageRank’s compute() in Mizan
void compute(messageIterator * messages, userVertexObject * data,
messageManager * comm) {
double currVal = data->getVertexValue().getValue();
double newVal = 0; double c = 0.85;
if (data->getCurrentSS() > 1) {
while (messages->hasNext()) {
newVal = newVal + messages->getNext().getValue();
}
newVal = newVal * c + (1.0 - c) / ((double) vertexTotal);
data->setVertexValue(mDouble(newVal));
} else {newVal = currVal;}
if (data->getCurrentSS() <= maxSuperStep) {
mDouble outVal(newVal / ((double) data->getOutEdgeCount()));
for (int i = 0; i < data->getOutEdgeCount(); i++) {
comm->sendMessage(data->getOutEdgeID(i), outVal);
}
} else {data->voteToHalt();}
}
13/34
15. Migration Planning Objectives
Decentralized
Simple
Transparent
No need to change Pregel’s API
Does not assume any a priori knowledge to graph structure or
algorithm
14/34
16. Mizan’s Migration Barrier
Mizan performs both planning and migrations after all workers
reach the BSP barrier
Superstep 1 Superstep 2 Superstep 3
Worker 1 Worker 1 Worker 1
Worker 2 Worker 2 Worker 2
Worker 3 Worker 3 Worker 3
Migration Planner
Migration Barrier
BSP Barrier
15/34
17. Mizan’s Migration Planning Steps
1 Identify the source of imbalance: By comparing the worker’s
execution time against a normal distribution and flagging outliers
16/34
18. Mizan’s Migration Planning Steps
1 Identify the source of imbalance: By comparing the worker’s
execution time against a normal distribution and flagging outliers
Mizan monitors for each vertex:
Remote outgoing messages
All incoming messages
Response time
High level summaries are broadcast to each worker
Vertex
Response Time
Remote Outgoing Messages
V6 V2 V1
V4 V3
V5
Mizan Mizan
Worker 1 Worker 2
Local Incoming Messages Remote Incoming Messages
16/34
19. Mizan’s Migration Planning Steps
1 Identify the source of imbalance
2 Select the migration objective:
Mizan finds the strongest cause of workload imbalance by
comparing statistics for outgoing messages and incoming
messages of all workers with the worker’s execution time
17/34
20. Mizan’s Migration Planning Steps
1 Identify the source of imbalance
2 Select the migration objective:
Mizan finds the strongest cause of workload imbalance by
comparing statistics for outgoing messages and incoming
messages of all workers with the worker’s execution time
The migration objective is either:
Optimize for outgoing messages, or
Optimize for incoming messages, or
Optimize for response time
17/34
21. Mizan’s Migration Planning Steps
1 Identify the source of imbalance
2 Select the migration objective
3 Pair over-utilized workers with under-utilized ones:
All workers create and execute the migration plan in parallel
without centralized coordination
Each worker is paired with one other worker at most
W7 W2 W1 W5 W8 W4 W0 W6 W9 W3
0 1 2 3 4 5 6 7 8
18/34
22. Mizan’s Migration Planning Steps
1 Identify the source of imbalance
2 Select the migration objective
3 Pair over-utilized workers with under-utilized ones
4 Select vertices to migrate: Depending on the migration
objective
19/34
23. Mizan’s Migration Planning Steps
1 Identify the source of imbalance
2 Select the migration objective
3 Pair over-utilized workers with under-utilized ones
4 Select vertices to migrate
5 Migrate vertices:
How to migrate vertices freely across workers while maintaining
vertex ownership and fast updates?
How to minimize migration costs for large vertices?
20/34
24. Vertex Ownership
Mizan uses distributed hash table (DHT) to implement a
distributed lookup service:
V can execute at any worker
V ’s home worker ID = (hash(ID) mod N )
Workers ask the home worker of V for its current location
The home worker is notified with the new location as V migrates
Where is V5 & V8
V1 V2 V8 V4 V6 V7 V8
Compute
V3 V5 V2 V5 V9 V1 V3 V9
DHT V1:W1 V4:W2 V7:W3 V2:W1 V5:W2 V8:W3 V3:W1 V6:W2 V9:W3
Worker 1 Worker 2 Worker 3
21/34
25. Migrating Vertices with Large Message Size
Introduce delayed migration for very large vertices:
At SS t: only ownership of vertex v is moved to workernew
Superstep t Superstep t+1 Superstep t+2
Worker_old Worker_old Worker_old
1 3 1 3 1 3
2 7 2 7 2
4 4 '7 4 7
5 6 5 6 5 6
Worker_new Worker_new Worker_new
22/34
26. Migrating Vertices with Large Message Size
Introduce delayed migration for very large vertices:
At SS t: only ownership of vertex v is moved to workernew
At SS t + 1:
workernew receives vertex 7 messages
workerold process vertex 7
Superstep t Superstep t+1
Worker_old Worker_old Worker_old
1 3 1 3 1 3
2 7 2 7 2
4 4 '7 4 7
5 6 5 6 5 6
Worker_new Worker_new Worker_new
22/34
27. Migrating Vertices with Large Message Size
Introduce delayed migration for very large vertices:
At SS t: only ownership of vertex v is moved to workernew
At SS t + 1:
workernew receives vertex 7 messages
workerold process vertex 7
After SS t + 1: workerold moves state of vertex 7 to workernew
Superstep t Superstep t+1 Superstep t+2
Worker_old Worker_old Worker_old
1 3 1 3 1 3
2 7 2 7 2
4 4 '7 4 7
5 6 5 6 5 6
Worker_new Worker_new Worker_new
22/34
28. Experimental Setup
Mizan is implemented on C++ with MPI with three variations:
Static Mizan: Emulates Giraph, disables dynamic migration
Work stealing (WS): Emulates Pregel’s coarse-grained
dynamic load balancing
Mizan: Activates dynamic migration
Local cluster of 21 machines:
Mix of i5 and i7 processors with 16GB RAM Each
IBM Blue Gene/P supercomputer with 1024 compute nodes:
Each is a 4 core PowerPC450 CPU at 850MHz with 4GB RAM
23/34
29. Datasets
Experiments on public datasets:
Stanford Network Analysis Project (SNAP)
The Laboratory for Web Algorithmics (LAW)
Kronecker generator
name |nodes| |edges|
kg1 (synthetic) 1,048,576 5,360,368
kg4m68m (synthetic) 4,194,304 68,671,566
web-Google 875,713 5,105,039
LiveJournal1 4,847,571 68,993,773
hollywood-2011 2,180,759 228,985,632
arabic-2005 22,744,080 639,999,458
24/34
30. Giraph vs. Static Mizan
40
Giraph 12 Giraph
35 Static Mizan Static Mizan
30 10
Run Time (Min)
Run Time (Min)
25 8
20 6
15
4
10
2
5
0
0
web k Live k 1 2 4 8 16
- Goo g1 Jour g4m68
nal1
gle m
Vertex count (Millions)
Figure: PageRank on social network Figure: PageRank on regular random
and random graphs graphs, each has around 17M edge
25/34
31. Effectiveness of Dynamic Vertex Migration
PageRank on a social graph (LiveJournal1)
The shaded columns: algorithm runtime
The unshaded columns: initial partitioning cost
30
25
Run Time (Min)
20
15
10
5
0
Mizan
Mizan
Mizan
Static
Static
Static
WS
WS
WS
Hash Range Metis
26/34
32. Effectiveness of Dynamic Vertex Migration
Comparing the performance on highly variable messaging pattern
algorithms, which are DMST and ad propagation simulation, on
a metis partitioned social graph (LiveJournal1)
300
Static
250 Work Stealing
Mizan
Run Time (Min)
200
150
100
50
0
Adve DMS
r tism T
ent
27/34
33. Mizan’s Overhead with Scaling
8 10
Mizan - hollywood-2011 Mizan - Arabic-2005
7
8
6
Speedup
5
Speedup
6
4
4
3
2 2
1
0 0
64
128
256
512
1024
0 2 4 6 8 10 12 14 16
Compute Nodes
Compute Nodes
Figure: Speedup on Linux Cluster
Figure: Speedup on IBM
Blue Gene/P supercomputer
28/34
34. Future Work
Work with skewed graphs
Improving Pregel’s fault tolerance
29/34
35. Conclusion
Mizan is a Pregel system that uses fine-grained vertex migration
to load balance computation and communication across
supersteps
Mizan is an open source project developed within InfoCloud
group in KAUST in collaboration with IBM, programmed in
C++ with MPI
Mizan scales up to thousands of workers
Mizan improves the overall computation cost between 40% up to
two order of magnitudes with less than 10% migration overhead
Download it at: http://cloud.kaust.edu.sa
Try it on EC2 (ami-52ed743b)
30/34
36. Extra: Migration Details and Costs
4
Superstep Runtime
3.5 Average Workers Runtime
Run Time (Minutes)
3 Migration Cost
Balance Outgoing Messages
2.5 Balance Incoming Messages
2 Balance Response Time
1.5
1
0.5
0
5 10 15 20 25 30
Supersteps
31/34
37. Extra: Migrated Vertices per Superstep
4
Migrated Vertices (x1000)
500 Migration Cost
Migration Cost (Minutes)
Total Migrated Vertices
400 Max Vertices Migrated 3
by Single Worker
300
2
200
1
100
0 0
5 10 15 20 25 30
Supersteps
32/34
38. Extra: Mizan’s Migration Planning
Compute()
BSP
Barrier
Pair with
Underutilized
Worker
Migration No Superstep
Barrier Imbalanced?
Select
Yes Vertices
and Migrate
Identify Source
of Imbalance
No Worker Yes
Overutilized?
33/34