https://github.com/aching/Giraph
Giraph : Large-scale graph processing on Hadoop
Web and online social graphs have been rapidly growing in size and
scale during the past decade. In 2008, Google estimated that the
number of web pages reached over a trillion. Online social networking
and email sites, including Yahoo!, Google, Microsoft, Facebook,
LinkedIn, and Twitter, have hundreds of millions of users and are
expected to grow much more in the future. Processing these graphs
plays a big role in delivering relevant and personalized information
to users, such as results from a search engine or news on an online
social networking site.
Graph processing platforms to run large-scale algorithms (such as page
rank, shared connections, personalization-based popularity, etc.) have
become quite popular. Some recent examples include Pregel and HaLoop.
For general-purpose big data computation, the map-reduce computing
model has been well adopted and the most deployed map-reduce
infrastructure is Apache Hadoop. We have implemented a
graph-processing framework that is launched as a typical Hadoop job to
leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph
builds upon the graph-oriented nature of Pregel and additionally adds
fault tolerance to the coordinator process by using ZooKeeper as its
centralized coordination service. It is in the process of being
open-sourced.
Giraph applies the bulk synchronous parallel (BSP) model to graphs:
vertices can send messages to other vertices during a given
superstep. Checkpoints are initiated by the Giraph infrastructure at
user-defined intervals and are used for automatic application restarts
when any worker in the application fails. Any worker in the
application can act as the application coordinator and one will
automatically take over if the current application coordinator fails.
1. Giraph: Large-scale graph processing infrastructure on Hadoop
Avery Ching, Yahoo!
Christian Kunz, Jybe
6/29/2011
2. Why scalable graph processing?
- Real-world web and social graphs are at immense scale and continuing to grow
  - In 2008, Google estimated the number of web pages at 1 trillion
  - In 2011, Goldman Sachs disclosed Facebook has over 600 million active users
  - In March 2011, LinkedIn said it had over 100 million users
  - In October 2010, Twitter claimed to have over 175 million users
- Relevant and personalized information for users relies strongly on iterative graph ranking algorithms (search results, social news, ads, etc.)
  - In web graphs, page rank and its variants
  - In social graphs, popularity rank, personalized rankings, shared connections, shortest paths, etc.
3. Existing solutions
- Sequence of map-reduce jobs in Hadoop
  - Classic map-reduce overheads (job startup/shutdown, reloading data from HDFS, shuffling)
  - Map-reduce programming model not a good fit for graph algorithms
- Google’s Pregel
  - Requires its own computing infrastructure
  - Not available (unless you work at Google)
  - Master is a single point of failure (SPOF)
- Message passing interface (MPI)
  - Not fault-tolerant
  - Too generic
4. Giraph
- Leverage Hadoop installations around the world for iterative graph processing
  - Big data today is processed on Hadoop with the Map-Reduce computing model
  - Map-Reduce with Hadoop is widely deployed outside of Yahoo! as well (e.g. EC2, Cloudera, etc.)
- Bulk synchronous parallel (BSP) computing model
  - Input data loaded once during the application, all messaging in memory
- Fault-tolerant/dynamic graph processing infrastructure
  - Automatically adjusts to available resources on the Hadoop grid
  - No single point of failure except Hadoop namenode and jobtracker
  - Relies on ZooKeeper as a fault-tolerant coordination service
- Vertex-centric API to do graph processing (similar to the map() method in Hadoop) in a bulk synchronous parallel computing model
  - Inspired by Pregel
- Open-source
  - https://github.com/aching/Giraph
5. Bulk synchronous parallel model
- Computation consists of a series of “supersteps”
  - Supersteps are an atomic unit of computation where operations can happen in parallel
  - During a superstep, components are assigned tasks and receive unordered messages from the previous superstep
- Components communicate through point-to-point messaging
- All (or a subset of) components can be synchronized through the superstep concept
[Figure: application progress across Superstep i, Superstep i+1, and Superstep i+2, each executed by Component 1, Component 2, and Component 3]
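The superstep rules above can be illustrated with a small, single-process sketch in plain Java. It is not Giraph code: the class `BspSketch`, the ring topology, and the "keep the largest value seen" task are all assumptions made up for illustration. The point it tries to show is that a message sent during one superstep only becomes visible in the next one, which is modeled here by swapping an outbox into the inbox at the end of each round.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of BSP supersteps; not part of Giraph.
class BspSketch {
    /** Creates one empty message list per component. */
    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    /**
     * Runs the given number of supersteps over components arranged in a ring.
     * Each component keeps the largest value it has seen and forwards it to
     * the next component in the ring.
     */
    static int[] run(int[] initial, int supersteps) {
        int n = initial.length;
        int[] value = initial.clone();
        List<List<Integer>> inbox = emptyBoxes(n);
        for (int step = 0; step < supersteps; step++) {
            List<List<Integer>> outbox = emptyBoxes(n);
            for (int c = 0; c < n; c++) {
                // Only messages from the previous superstep are readable here.
                for (int msg : inbox.get(c)) {
                    value[c] = Math.max(value[c], msg);
                }
                // Point-to-point message, delivered in the next superstep.
                outbox.get((c + 1) % n).add(value[c]);
            }
            // Synchronization point: everything sent becomes next round's input.
            inbox = outbox;
        }
        return value;
    }
}
```

Because components read only the inbox filled in the previous round, the order in which they are processed inside a superstep does not affect the result, which is what allows the work within a superstep to run in parallel.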
6. Writing a Giraph application
- Every vertex calls its compute() method once during a superstep
  - Analogous to the map() method in Hadoop for a <key, value> tuple
- Users choose 4 types for their implementation of Vertex (I VertexId, V VertexValue, E EdgeValue, M MsgValue)
7. Basic Giraph API
- Local vertex mutations happen immediately
- Graph mutations are processed just prior to the next superstep
- Sent messages are received at the next superstep during compute
- Vertices “vote” to end the computation
  - Once all vertices have voted to end the computation, the application is finished
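The voting rule can be sketched with a toy simulation in plain Java that spreads the largest vertex value through a graph. This is not the Giraph API: `VoteToHaltSketch`, `Result`, and `propagateMax` are hypothetical names. It also assumes the Pregel convention that a vertex which has voted to halt is woken up again when a message arrives, which the slide does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "vote to halt"; not the Giraph API.
class VoteToHaltSketch {
    /** Final vertex values plus the number of supersteps that were executed. */
    static final class Result {
        final int[] values;
        final int supersteps;

        Result(int[] values, int supersteps) {
            this.values = values;
            this.supersteps = supersteps;
        }
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    static Result propagateMax(int[] initial, int[][] neighbors) {
        int n = initial.length;
        int[] value = initial.clone();
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyBoxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean messagesSent = false;
            for (int v = 0; v < n; v++) {
                List<Integer> msgs = inbox.get(v);
                if (halted[v] && msgs.isEmpty()) {
                    continue; // voted to halt and nothing woke it up
                }
                halted[v] = false; // assumption: an incoming message reactivates
                boolean changed = (superstep == 0); // everyone announces itself once
                for (int m : msgs) {
                    if (m > value[v]) {
                        value[v] = m;
                        changed = true;
                    }
                }
                if (changed) {
                    for (int nb : neighbors[v]) {
                        outbox.get(nb).add(value[v]);
                        messagesSent = true;
                    }
                } else {
                    halted[v] = true; // nothing new to report: vote to halt
                }
            }
            superstep++;
            boolean allHalted = true;
            for (boolean h : halted) {
                allHalted &= h;
            }
            if (allHalted && !messagesSent) {
                return new Result(value, superstep);
            }
            inbox = outbox;
        }
    }
}
```

On a three-vertex path with initial values 3, 6, and 2, every vertex ends with 6 and the run stops once all vertices have voted and no messages remain in flight.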
8. Page rank example
    public class SimplePageRankVertex extends
            HadoopVertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
        public void compute(Iterator<DoubleWritable> msgIterator) {
            // Sum the rank contributions sent by neighbors in the previous superstep.
            double sum = 0;
            while (msgIterator.hasNext()) {
                sum += msgIterator.next().get();
            }
            setVertexValue(new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
            if (getSuperstep() < 30) {
                // Split this vertex's rank evenly across its out-edges.
                long edges = getOutEdgeIterator().size();
                sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
            } else {
                voteToHalt();
            }
        }
    }
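To see what the update rule in this example computes without a Hadoop cluster, the following plain-Java sketch replays the same formula (0.15 / N plus 0.85 times the sum of incoming messages) over an adjacency list. `PageRankSketch` and `run` are hypothetical names, the graph is held in a simple array rather than in Giraph vertices, and vertices without out-edges simply send nothing, which is a choice made here rather than something the slide specifies.

```java
// Hypothetical stand-alone replay of the slide's page rank update; not Giraph code.
class PageRankSketch {
    /**
     * outEdges[v] lists the targets of vertex v. Vertices compute in
     * supersteps 0..lastSuperstep and send messages in all but the last one,
     * mirroring the "getSuperstep() < 30" check when lastSuperstep is 30.
     */
    static double[] run(int[][] outEdges, int lastSuperstep) {
        int n = outEdges.length;
        double[] value = new double[n];
        double[] inbox = new double[n]; // per-vertex sum of incoming messages
        for (int superstep = 0; superstep <= lastSuperstep; superstep++) {
            double[] outbox = new double[n];
            for (int v = 0; v < n; v++) {
                value[v] = 0.15 / n + 0.85 * inbox[v];
                if (superstep < lastSuperstep && outEdges[v].length > 0) {
                    // Split the rank evenly across out-edges.
                    double share = value[v] / outEdges[v].length;
                    for (int target : outEdges[v]) {
                        outbox[target] += share;
                    }
                }
            }
            inbox = outbox; // messages become visible in the next superstep
        }
        return value;
    }
}
```

On a three-vertex cycle, where every vertex has exactly one in-edge and one out-edge, the values approach 1/3 each, the fixed point of x = 0.15/3 + 0.85x.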
10. Thread responsibilities
- Master
  - Only one active master at a time
  - Runs the VertexInputFormat getSplits() to get the appropriate number of VertexSplit objects for the application and writes them to ZooKeeper
  - Coordinates the application
    - Synchronizes supersteps and the end of the application
    - Handles changes that can occur within supersteps (e.g. vertex movement, change in number of workers, etc.)
- Worker
  - Reads vertices from one or more VertexSplit objects, splitting them into VertexRanges
  - Executes the compute() method for every Vertex it is assigned once per superstep
  - Buffers the incoming messages to every Vertex it is assigned for the next superstep
- ZooKeeper
  - Manages a server that is part of the ZooKeeper quorum (maintains global application state)
11. Vertex distribution
- Master uses the VertexInputFormat to divide the input data into VertexSplit objects and serializes them to ZooKeeper.
- Workers process the VertexSplit objects and divide them further to create VertexRange objects.
- Prior to every superstep, the master assigns every worker 0 or more VertexRange objects. They are responsible for the vertices that fall in those VertexRange objects.
[Figure: the master produces VertexSplit 0 and VertexSplit 1; Worker 0 divides VertexSplit 0 into VertexRange 0 and VertexRange 1, and Worker 1 divides VertexSplit 1 into VertexRange 2 and VertexRange 3]
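One way to picture the range-to-worker assignment is a sorted lookup table. The sketch below is an assumption, not Giraph's implementation: it identifies each VertexRange by its largest vertex id and finds the owner of a vertex by looking up the first range whose maximum is at least that id. `VertexRangeSketch`, `assignRange`, and `ownerOf` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lookup from vertex id to owning worker; not Giraph's implementation.
class VertexRangeSketch {
    // Key: largest vertex id in a range (assumed convention). Value: worker index.
    private final TreeMap<Long, Integer> workerByRangeMax = new TreeMap<>();

    /** Records that the range ending at maxVertexId belongs to the given worker. */
    void assignRange(long maxVertexId, int worker) {
        workerByRangeMax.put(maxVertexId, worker);
    }

    /** Returns the worker responsible for vertexId. */
    int ownerOf(long vertexId) {
        Map.Entry<Long, Integer> range = workerByRangeMax.ceilingEntry(vertexId);
        if (range == null) {
            throw new IllegalArgumentException("No VertexRange covers vertex " + vertexId);
        }
        return range.getValue();
    }
}
```

With ranges ending at 9 and 19 assigned to worker 0 and ranges ending at 29 and 39 assigned to worker 1, a message for vertex 20 would be routed to worker 1. Reassigning a range before a superstep is then a single `assignRange` call with a different worker.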
13. Fault tolerance
- No single point of failure from BSP threads
  - With multiple master threads, if the current master dies, a new one will automatically take over.
  - If a worker thread dies, the application is rolled back to a previously checkpointed superstep. The next superstep will begin with the new number of workers.
  - If a ZooKeeper server dies, as long as a quorum remains, the application can proceed
- Hadoop single points of failure still exist
  - Namenode, jobtracker
  - Restarting manually from a checkpoint is always possible
14. Master thread fault tolerance
- One active master, with spare masters taking over in the event of an active master failure
- All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails
- “Active” master implemented as a queue in ZooKeeper
[Figure: before failure of active master 0, Master 0 is “Active” and Masters 1 and 2 are “Spare”; after the failure, Master 1 is “Active” and Master 2 remains “Spare”. The active master state is available in both cases.]
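The queue behavior can be modeled in a few lines of plain Java. This sketch does not use the ZooKeeper client API and does not claim to match Giraph's znode layout. It only assumes that each master candidate receives an increasing sequence number when it registers and that the candidate with the lowest surviving number is the active master. `MasterQueueSketch`, `register`, `fail`, and `activeMaster` are hypothetical names.

```java
import java.util.TreeMap;

// Hypothetical in-memory model of an "active master" queue; no ZooKeeper involved.
class MasterQueueSketch {
    private final TreeMap<Long, String> candidates = new TreeMap<>();
    private long nextSequence = 0;

    /** Adds a master candidate to the back of the queue and returns its sequence number. */
    long register(String name) {
        long sequence = nextSequence++;
        candidates.put(sequence, name);
        return sequence;
    }

    /** Removes a candidate, as would happen when its registration disappears. */
    void fail(long sequence) {
        candidates.remove(sequence);
    }

    /** The head of the queue is the active master; everyone else is a spare. */
    String activeMaster() {
        return candidates.isEmpty() ? null : candidates.firstEntry().getValue();
    }
}
```

Registering masters 0, 1, and 2 in order makes master 0 active. When master 0 fails, master 1 reaches the head of the queue and master 2 stays a spare, matching the before and after states described on the slide.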
15. Worker thread fault tolerance
- A single worker failure causes the superstep to fail
- In order to disrupt the application, a worker must be registered and chosen for the current superstep
- Application reverts to the last committed superstep automatically
  - Master detects worker failure during any superstep through the loss of a ZooKeeper “health” znode
  - Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
[Figure: timelines of supersteps i through i+3, some with and some without checkpoints; after a worker failure the application resumes from the last checkpointed superstep and runs until the application completes]
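The rollback rule can be traced with a small simulation. The sketch below is not Giraph code and rests on two assumptions: a checkpoint is committed at the start of every superstep whose number is a multiple of the checkpoint interval, and restarting "from" a checkpointed superstep means re-executing that superstep. `CheckpointSketch` and `run` are hypothetical names, and only a single injected failure is modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of checkpoint and rollback behavior; not Giraph code.
class CheckpointSketch {
    /**
     * Returns the supersteps in the order they were started, given one worker
     * failure injected during superstep failAt. checkpointInterval must be >= 1.
     */
    static List<Integer> run(int totalSupersteps, int checkpointInterval, int failAt) {
        List<Integer> started = new ArrayList<>();
        int lastCommitted = 0;
        boolean failureInjected = false;
        int superstep = 0;
        while (superstep < totalSupersteps) {
            started.add(superstep);
            if (superstep % checkpointInterval == 0) {
                lastCommitted = superstep; // state saved before computing
            }
            if (superstep == failAt && !failureInjected) {
                failureInjected = true;
                superstep = lastCommitted; // all workers restart from here
                continue;
            }
            superstep++;
        }
        return started;
    }
}
```

With five supersteps, a checkpoint every two supersteps, and a failure during superstep 3, the trace is 0, 1, 2, 3, 2, 3, 4: the work done in superstep 2 is repeated, which is the cost of choosing a longer checkpoint interval.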
16. Optional features
- Combiners
  - Similar to Map-Reduce combiners
  - Users implement a combine() method that can reduce the number of messages sent and received
  - Run on both the client side and the server side
    - Client side saves memory and message traffic
    - Server side saves memory
- Aggregators
  - Similar to MPI aggregation routines (e.g. max, min, sum, etc.)
  - Users can write their own aggregators
  - Commutative and associative operations that are performed globally
  - Examples include global communication, monitoring, and statistics
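Both features can be illustrated with plain Java. The sketch below does not use Giraph's combiner or aggregator interfaces, whose signatures are not shown in the slides. `CombinerSketch`, `combineSums`, and `aggregateMax` are hypothetical names, with sum standing in for a combiner and max for an aggregator.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of combining and aggregating; not Giraph's interfaces.
class CombinerSketch {
    /**
     * Collapses all messages addressed to the same destination vertex into a
     * single summed message, so fewer messages are buffered and transferred.
     */
    static Map<Long, Double> combineSums(long[] destinations, double[] values) {
        Map<Long, Double> combined = new TreeMap<>();
        for (int i = 0; i < destinations.length; i++) {
            combined.merge(destinations[i], values[i], Double::sum);
        }
        return combined;
    }

    /**
     * Computes a global maximum from per-worker values. Because max is
     * commutative and associative, each worker can reduce its own values first
     * and the partial results can be merged in any order.
     */
    static double aggregateMax(double[][] perWorkerValues) {
        double globalMax = Double.NEGATIVE_INFINITY;
        for (double[] workerValues : perWorkerValues) {
            double localMax = Double.NEGATIVE_INFINITY;
            for (double v : workerValues) {
                localMax = Math.max(localMax, v);
            }
            globalMax = Math.max(globalMax, localMax);
        }
        return globalMax;
    }
}
```

A sum combiner fits algorithms such as page rank, where a vertex only needs the total of its incoming messages and not each message individually.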
17. Early Yahoo! customers
- Web of Objects
  - Currently used for the movie database (tens of millions of records, run with 400 workers)
  - Popularity rank, shared connections, personalized page rank
- Web map
  - Next generation page-rank related algorithms will use this framework (250 billion web pages)
  - Current graph processing solution uses MPI (no fault-tolerance, customized code)
18. Future work
- Improved fault tolerance testing
  - Unit tests exist, but test a limited subset of failures
- Performance testing
  - Tested with 400 workers, but thousands desired
- Importing some of the currently used graph algorithms into an included library
- Storing messages that surpass available memory on disk
- Leverage the programming model, maybe even convert to a native Map-Reduce application
19. Conclusion
- Giraph is a graph processing infrastructure that runs on existing Hadoop infrastructure
  - Already being used at Yahoo!
  - Open source: available on GitHub at https://github.com/aching/Giraph
  - In the process of submitting an Apache Incubator proposal
- Questions/Comments?
  - aching@yahoo-inc.com