1) The document discusses scaling Apache Giraph, an open source graph computation engine. It outlines several problems that arise when scaling Giraph to large graphs, such as worker crashes and master crashes.
2) Solutions proposed to address these problems include checkpointing to handle worker crashes, using ZooKeeper for master queue handling to address master crashes, and using byte arrays and unsafe serialization to reduce object overhead.
3) Test results show Giraph can scale to graphs with billions of vertices and edges on a cluster of 50 workers, achieving speedups of 20x CPU and 100x elapsed time compared to Hive for similar graph computations.
4. What is Giraph?
• Apache open source graph computation engine based on Google’s Pregel.
• Support for Hadoop, Hive, HBase, and Accumulo.
• BSP model with a simple "think like a vertex" API.
• Combiners, Aggregators, Mutability, and more.
• Configurable Graph<I,V,E,M> (each type implements Hadoop's Writable):
– I: Vertex ID
– V: Vertex Value
– E: Edge Value
– M: Message data
What is Giraph NOT?
• A graph database. See Neo4j.
• A completely asynchronous generic MPI system.
• A slow tool.
6. Giraph components
Master – Application coordinator
• Synchronizes supersteps
• Assigns partitions to workers before superstep begins
Workers – Computation & messaging
• Handle I/O – reading and writing the graph
• Computation/messaging of assigned partitions
ZooKeeper
• Maintains global application state
7. Giraph Dataflow
Split 0
Split 1
Split 2
Split 3
Worker
1
Master
Worker
0Input format
Load /
Send
Graph
Load /
Send
Graph
Loading the graph
1
Part 0
Part 1
Part 2
Part 3
Compute /
Send
Messages
Worker
1
Compute /
Send
Messages
Master
Worker
0
In-memory
graph
Send stats / iterate!
Compute/Iterate
2
Worker
1
Worker
0
Part 0
Part 1
Part 2
Part 3
Output format
Part 0
Part 1
Part 2
Part 3
Storing the graph
3
Split 4
Split
8. Giraph Job Lifetime
(figure: Input → loop of Compute/Superstep → Output)
• Vertex lifecycle: a vertex starts Active, becomes Inactive when it votes to halt, and is reactivated by a received message.
• After each superstep the master checks whether all vertices have halted; if not, another superstep runs; otherwise the job writes its output.
9. Simple Example – Compute the maximum value
(figure: a small graph with initial values 5, 1, and 2, split across two processors; over successive supersteps every vertex adopts the maximum value, 5)
Connected Components
e.g. Finding Communities
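As a plain-Java sketch of the superstep loop in this example (illustrative only — the real per-vertex code is checked into Giraph, and `MaxValueSketch`/`computeMax` are hypothetical names): every vertex sends its value to its neighbors each superstep; a vertex that receives a larger value updates and stays active, otherwise it effectively votes to halt, and the job ends when no vertex changes.

```java
import java.util.Arrays;

// Hypothetical plain-Java simulation of the slide's max-value example.
public class MaxValueSketch {
    // neighbors[v] lists the out-neighbors of vertex v
    static long[] computeMax(long[] values, int[][] neighbors) {
        long[] value = values.clone();
        boolean anyActive = true;
        while (anyActive) {                        // one iteration = one superstep
            anyActive = false;
            long[] incomingMax = new long[value.length];
            Arrays.fill(incomingMax, Long.MIN_VALUE);
            for (int v = 0; v < value.length; v++)
                for (int n : neighbors[v])         // "send" value to neighbors
                    incomingMax[n] = Math.max(incomingMax[n], value[v]);
            for (int v = 0; v < value.length; v++)
                if (incomingMax[v] > value[v]) {   // larger value received: update, stay active
                    value[v] = incomingMax[v];
                    anyActive = true;
                }                                  // otherwise: vote to halt
        }
        return value;
    }
}
```

On a strongly connected graph this converges with every vertex holding the global maximum, matching the slide.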
10. PageRank – ranking websites
Mahout (Hadoop): 854 lines. Giraph: < 30 lines.
• Send neighbors an equal fraction of your PageRank.
• New PageRank = 0.15 / (# of vertices) + 0.85 * (sum of incoming messages)
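The two bullets above translate almost directly into code. Below is a standalone plain-Java sketch of the update rule (not Giraph's actual sub-30-line implementation, which runs per-vertex with real messages; all names here are illustrative):

```java
import java.util.Arrays;

// Hypothetical standalone sketch of the vertex-centric PageRank update.
public class PageRankSketch {
    // adjacency[i] lists the out-neighbors of vertex i (no dangling vertices)
    static double[] pageRank(int[][] adjacency, int supersteps) {
        int n = adjacency.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                // uniform initial rank
        for (int step = 0; step < supersteps; step++) {
            double[] incoming = new double[n];     // "messages" received this superstep
            for (int v = 0; v < n; v++)
                // send neighbors an equal fraction of your rank
                for (int nbr : adjacency[v])
                    incoming[nbr] += rank[v] / adjacency[v].length;
            for (int v = 0; v < n; v++)
                // new rank = 0.15 / |V| + 0.85 * (sum of messages)
                rank[v] = 0.15 / n + 0.85 * incoming[v];
        }
        return rank;
    }
}
```

With no dangling vertices the ranks stay normalized (they sum to 1), so a handful of supersteps is enough to see convergence on a small graph.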
12. Problem: Worker Crash.
(figure: a timeline of supersteps i through i+3, with checkpoints taken at supersteps i+1 and i+3)
• On a worker failure in superstep i+2, the application reverts to the last checkpoint (superstep i+1) and re-executes from there.
• On a worker failure after the superstep i+3 checkpoint completes, the application restarts from superstep i+3 and runs to completion.
Solution: Checkpointing.
13. Problem: Master Crash.
(figure: three masters with master state held in ZooKeeper)
• Before failure of active master 0: master 0 is "Active", masters 1 and 2 are "Spare".
• After failure of active master 0: master 1 becomes "Active", master 2 remains "Spare".
Solution: ZooKeeper Master Queue.
14. Problem: Primitive Collections.
• Graphs are often parameterized with primitive types (e.g. long IDs, double values).
• Boxing/unboxing is costly, and every object carries internal overhead.
Solution: Use fastutil, e.g. Long2DoubleOpenHashMap.
"fastutil extends the Java Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion."
(figures: small weighted example graphs for Single Source Shortest Path, Network Flow, and Count In-Degree)
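The following is a minimal plain-Java sketch of the idea behind fastutil's type-specific maps — keys and values live in parallel primitive arrays with open addressing, so no Long/Double box is allocated per entry. It is illustrative only; fastutil's real Long2DoubleOpenHashMap additionally handles resizing, removal, load factors, and more:

```java
// Illustrative mini version of a type-specific long -> double map.
public class TinyLong2DoubleMap {
    private final long[] keys;       // primitive keys, no Long boxes
    private final double[] values;   // primitive values, no Double boxes
    private final boolean[] used;    // slot-occupied flags

    TinyLong2DoubleMap(int capacity) {
        keys = new long[capacity];
        values = new double[capacity];
        used = new boolean[capacity];
    }

    // Linear probing: find the slot holding this key, or the first free slot.
    private int slot(long key) {
        int i = (Long.hashCode(key) & 0x7fffffff) % keys.length;
        while (used[i] && keys[i] != key) i = (i + 1) % keys.length;
        return i;
    }

    void put(long key, double value) {
        int i = slot(key);
        used[i] = true;
        keys[i] = key;
        values[i] = value;
    }

    double get(long key, double defaultValue) {
        int i = slot(key);
        return used[i] ? values[i] : defaultValue;
    }
}
```

With fastutil itself the call shape is the same but primitive-specialized, e.g. `map.put(42L, 3.5)` and `map.get(42L)` without autoboxing a Long or Double.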
15. Problem: Too many objects.
Lots of time spent in GC.
Graph: 1B vertices, 200B edges, 200 workers.
• 1B edges per worker, 1 object per edge value: List<Edge<I, E>> ~ 10B objects.
• 5M vertices per worker, 10 objects per vertex value: Map<I, Vertex<I, V, E>> ~ 50M objects.
• 1 message per edge, 10 objects per message data: Map<I, List<M>> ~ 10B objects.
• Objects used ~= O(E*e + V*v + M*m) => O(E*e)
(figure: Label Propagation example, e.g. "Who's sleeping?" – labels such as Boring, Amazing, and Confusing propagate through a small weighted graph to answer "What did he think?")
16. Problem: Too many objects.
Lots of time spent in GC.
Solution: byte[]
• Serialize messages, edges, and vertices.
• Iterable interface with a representative object: each next() deserializes the next element into the same reused object.
• Objects per worker ~= O(V)
(figure: the same Label Propagation example as slide 15)
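The byte[] plus representative-object technique can be sketched in plain Java as follows. Class and method names here are hypothetical, not Giraph's actual API: messages stay serialized in one growing byte buffer, and iteration reuses a single object, so the object count is constant per list rather than proportional to the number of messages.

```java
import java.io.*;
import java.util.Iterator;

// Hedged sketch: a message list backed by a byte[] with a reusable
// "representative" object for iteration.
public class ByteArrayMessages implements Iterable<ByteArrayMessages.Msg> {
    // A fixed-layout message: (long sourceId, double value)
    public static class Msg { public long sourceId; public double value; }

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);
    private int count = 0;

    public void add(long sourceId, double value) throws IOException {
        out.writeLong(sourceId);    // append serialized fields; no Msg allocated
        out.writeDouble(value);
        count++;
    }

    @Override public Iterator<Msg> iterator() {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        Msg representative = new Msg();  // the ONE object reused for every element
        return new Iterator<Msg>() {
            int remaining = count;
            @Override public boolean hasNext() { return remaining > 0; }
            @Override public Msg next() {
                try {
                    representative.sourceId = in.readLong();
                    representative.value = in.readDouble();
                } catch (IOException e) { throw new RuntimeException(e); }
                remaining--;
                return representative;  // callers must not keep a reference across next()
            }
        };
    }
}
```

The trade-off is that the representative object is only valid until the next call to next() — callers that need to retain an element must copy it.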
17. Problem: Serialization to and from byte[]
• DataInput? Kryo? Custom?
Solution: Unsafe
• Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only).
• AWESOME. As fast as it gets.
• True native. Essentially C: *(long*)(data+offset);
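A hedged sketch of that "pointer cast": reading and writing a long directly inside a byte[] with sun.misc.Unsafe. This is an unsupported API — it works on HotSpot/OpenJDK via reflection on the `theUnsafe` field, but it may break on other JVMs or future JDKs, exactly the non-portability the slide warns about. The class name here is illustrative.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative long <-> byte[] codec using sun.misc.Unsafe.
public class UnsafeLongCodec {
    private static final Unsafe UNSAFE;
    static {
        try {
            // No public accessor: grab the singleton via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    // Offset of element 0 within a byte[] object header.
    private static final long BASE = Unsafe.ARRAY_BYTE_BASE_OFFSET;

    // Equivalent of: *(long*)(data + offset) = value;
    public static void putLong(byte[] data, int offset, long value) {
        UNSAFE.putLong(data, BASE + offset, value);
    }

    // Equivalent of: return *(long*)(data + offset);
    public static long getLong(byte[] data, int offset) {
        return UNSAFE.getLong(data, BASE + offset);
    }
}
```

Note the byte order is the platform's native order and there is no bounds checking — an out-of-range offset corrupts memory instead of throwing, which is the price of "as fast as it gets".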
18. Problem: Large Aggregations.
(figure: five workers and a master; with the master owning every aggregator, all aggregation traffic funnels through it)
Solution: Sharded Aggregators.
• Workers own aggregators: each aggregator is assigned to one worker.
• Aggregator owners communicate the reduced values to the master.
• Aggregator owners distribute the final values back to the workers.
K-Means Clustering
e.g. Similar Emails
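The ownership scheme can be sketched as follows (hypothetical names, not Giraph's API): each aggregator name is deterministically hashed to an owning worker, every worker sends its partial value to that owner, and the owner reduces the partials before reporting to the master and broadcasting the result back.

```java
import java.util.*;

// Illustrative sketch of sharded aggregation ownership and reduction.
public class ShardedAggregators {
    // Deterministically pick which worker owns a given aggregator.
    static int owner(String aggregatorName, int numWorkers) {
        return Math.floorMod(aggregatorName.hashCode(), numWorkers);
    }

    // partials.get(w) holds worker w's partial sums. Returns, per owning
    // worker, the fully reduced value of each aggregator it owns.
    static Map<Integer, Map<String, Double>> reduceByOwner(
            List<Map<String, Double>> partials, int numWorkers) {
        Map<Integer, Map<String, Double>> byOwner = new HashMap<>();
        for (Map<String, Double> workerPartials : partials) {
            for (Map.Entry<String, Double> e : workerPartials.entrySet()) {
                int own = owner(e.getKey(), numWorkers);  // route to the owner
                byOwner.computeIfAbsent(own, k -> new HashMap<>())
                       .merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return byOwner;  // each owner then reports its totals to the master
    }
}
```

Because ownership is spread by hash, no single node (master or worker) has to reduce or distribute every aggregator, which is what removes the bottleneck.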
19. Problem: Network Wait.
• RPC doesn't fit the model.
• Synchronous calls are no good.
Solution: Netty
• Tune queue sizes & threads.
(figure: superstep timelines before and after. Before: the worker finishes compute, then waits at the barrier while the network drains. After: messages are sent concurrently with computation – the time to first message comes early, shrinking the wait between end of compute and end of superstep.)
22. Lessons Learned
• Coordinating is a zoo. Be resilient with ZooKeeper.
• Efficient networking is hard. Let Netty help.
• Primitive collections, primitive performance. Use fastutil.
• byte[] is simple yet powerful.
• Being Unsafe can be a good thing.
• Have a graph? Use Giraph.
23. What’s the final result?
Comparison with Hive:
• 20x CPU speedup
• 100x Elapsed time speedup. 15 hours => 9 minutes.
Computations on entire Facebook graph no longer “weekend jobs”.
Now they’re coffee breaks.
25. Problem: Measurements.
• Need tools to gain visibility into the system.
• Problems with connecting to Hadoop sub-processes.
Solution: Do it all.
• YourKit – see YourKitProfiler
• jmap – see JMapHistoDumper
• VisualVM – with jstatd & ssh socks proxy
• Yammer Metrics
• Hadoop Counters
• Logging & GC prints
26. Problem: Mutations
• Synchronization.
• Load balancing.
Solution: Reshuffle resources
• Mutations are handled at the barrier between supersteps.
• The master rebalances vertex assignments to optimize distribution.
• Handle mutations in batches.
• Avoid mutations if using byte[] (each mutation forces a full deserialization and re-serialization).
• Favor algorithms which don't mutate the graph.
Editor's notes
• No internal FB repo; everyone is a committer. A global epoch is followed by a global barrier, where components do concurrent computation and send messages. Graphs are sparse.
• Giraph is a map-only job.
• The code is real, checked into Giraph. All vertices find the maximum value in a strongly connected graph.
• One active master, with spare masters taking over in the event of an active master failure. All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails; the "active" master is implemented as a queue in ZooKeeper.
• A single worker failure causes the superstep to fail, and the application reverts to the last committed superstep automatically. The master detects worker failure during any superstep with a ZooKeeper "health" znode, chooses the last committed superstep, and sends a command through ZooKeeper for all workers to restart from that superstep.
• Primitive collections are primitive: lots of boxing/unboxing of types, plus an object and a reference for each instance.
• There are also other implementations, like Map<I,E> for edges, which use more space but are better for lots of mutations. Realistically, FB-sized graphs need even bigger. Edges are not uniform in reality; some vertices are much larger.
• Unsafe is dangerous, non-portable, and volatile: Oracle JVM only, no formal API. It can allocate non-GC memory, inherit from String (a final class), and access memory directly (C pointer casts).
• Cluster open source projects. Histograms. Job metrics.
• Start sending messages early and send with computation. Tune message buffer sizes to reduce wait time.
• First things first – what's going on with the system? We want a debugger but don't have one. Use YourKit's API to create granular snapshots within the application. JMap binding errors – spawn from within the process.
• With byte[], any mutation requires full deserialization / re-serialization.