Processing large-scale graphs with Google(TM) Pregel
November 22, 2014 
Frank Celler 
@fceller 
www.arangodb.com
Frank Celler – Processing large-scale graphs with Google(TM) Pregel - NoSQL matters Barcelona 2014

Many popular graph databases are optimized to run on a single machine, using efficient traversals to query the stored graphs. This boosts the performance of algorithms that originate at a single vertex and iterate through the graph, e.g. finding shortest paths or neighbors. However, graphs are getting bigger, and traversals perform poorly if they require a large depth. If you need to distribute a large-scale graph across several machines, traversals won't be the best choice, in terms of performance, for processing the graph. Therefore Google has released its Pregel framework, which offers an environment to query distributed graphs; Pregel is also known as the map-reduce for graphs. In this talk I want to present the architecture and requirements of the Pregel framework and introduce you to the different mind-set required to write a Pregel algorithm. Furthermore, I will give a short introduction to three implementations of Pregel: Giraph, TinkerPop3 and ArangoDB.


  1. Processing large-scale graphs with Google(TM) Pregel
     November 22, 2014
     Frank Celler (@fceller)
     www.arangodb.com
  3. About
     About us
       • Frank Celler (@fceller), working on the ArangoDB core
       • Michael Hackstein (@mchacki), started an experimental implementation of Pregel
     About the talk
       • Different kinds of graph algorithms
       • Pregel example
       • Pregel mind set, aka framework
       • More examples
  4. Pregel at ArangoDB
       • Started as a side project in free hack time
       • Experimental on operational database
       • Implemented as an alternative to traversals
       • Makes use of the flexibility of JavaScript:
           • No strict type system
           • No pre-compilation, on-the-fly queries
           • Native JSON documents
           • Really fast development
  7. Graph Algorithms
     Pattern matching
       • Search through the entire graph
       • Identify similar components
       ⇒ Touch all vertices and their neighbourhoods
     Traversals
       • Define a specific start point
       • Iteratively explore the graph
       ⇒ History of steps is known
     Global measurements
       • Compute one value for the graph, based on all its vertices or edges
       • Compute one value for each vertex or edge
       ⇒ Often require a global view on the graph
  8. Pregel
       • A framework to query distributed, directed graphs
       • Known as “Map-Reduce” for graphs
           • Uses the same phases
           • Has several iterations
       • Aims at:
           • Operating all servers at full capacity
           • Reducing network traffic
       • Good at calculations touching all vertices
       • Bad at calculations touching a very small number of vertices
  9-18. Example – Connected Components
     [Figure sequence: a graph with vertices numbered 1 to 7, animated over several supersteps. Legend: active vertex, inactive vertex, forward message, backward message.]
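A common way to compute connected components in the Pregel model is minimum-label propagation: every vertex starts with its own id as component label, sends that label along its edges in both directions (the forward and backward messages of the legend), and adopts any smaller label it receives. The following is a minimal in-memory sketch under stated assumptions, not the ArangoDB implementation: the edge list, the function name `connectedComponents` and the single-process loop are all made up for illustration.

```javascript
// Hypothetical directed graph with two weakly connected components.
const ids = [1, 2, 3, 4, 5, 6, 7];
const edges = [[1, 2], [2, 3], [4, 3], [5, 6], [7, 6], [7, 5]];

function connectedComponents(ids, edges) {
  // Messages travel forward and backward, so collect neighbours in both directions.
  const neighbours = new Map(ids.map((id) => [id, []]));
  for (const [from, to] of edges) {
    neighbours.get(from).push(to); // target of a forward message
    neighbours.get(to).push(from); // target of a backward message
  }

  const component = new Map(ids.map((id) => [id, id])); // start with own id
  let inbox = new Map();     // vertex id -> smallest label received
  let active = new Set(ids); // every vertex is active in superstep 0
  let supersteps = 0;

  // Keep going while a vertex is active or a message is waiting.
  while (active.size > 0 || inbox.size > 0) {
    const outbox = new Map();
    for (const id of new Set([...active, ...inbox.keys()])) {
      const received = inbox.has(id) ? inbox.get(id) : Infinity;
      const improved = received < component.get(id);
      if (improved) component.set(id, received);
      // Send in superstep 0, and afterwards only when the label got smaller.
      if (improved || supersteps === 0) {
        for (const n of neighbours.get(id)) {
          const current = outbox.has(n) ? outbox.get(n) : Infinity;
          outbox.set(n, Math.min(current, component.get(id))); // MIN combiner
        }
      }
    }
    active = new Set(); // all vertices deactivate; only messages wake them up
    inbox = outbox;
    supersteps += 1;
  }
  return { component, supersteps };
}

const result = connectedComponents(ids, edges);
console.log([...result.component.entries()]);
```

In this sketch vertices 1 to 4 end up with label 1 and vertices 5 to 7 with label 5; the run stops on its own once a superstep produces no messages.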
  19-23. Pregel – Sequence
     [Sequence diagram built up over five slides; not captured in the transcript.]
  24. Worker ≙ Map
       • “Map” a user-defined algorithm over all vertices
       • Output: set of messages to other vertices
       • Available parameters:
           • The current vertex and its outbound edges
           • All incoming messages
           • Global values
       • Allows modifications on the vertex:
           • Attach a result to this vertex and its outgoing edges
           • Delete the vertex and its outgoing edges
           • Deactivate the vertex
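To make the worker contract concrete, here is a small sketch of a worker that computes the in-degree of every vertex in two supersteps. It receives the three kinds of parameters listed above (the vertex with its outbound edges, the incoming messages, and global values), returns messages for other vertices, attaches a result and deactivates the vertex. The object shapes (`outEdges`, `result`, `active`) and the name `inDegreeWorker` are assumptions for illustration only and are not the ArangoDB API; deleting a vertex is not shown.

```javascript
// Hypothetical worker: called once per vertex and superstep.
function inDegreeWorker(vertex, messages, global) {
  const outgoing = [];
  if (global.step === 0) {
    // Superstep 0: tell every out-neighbour that an edge points at it.
    for (const target of vertex.outEdges) {
      outgoing.push({ to: target, data: 1 });
    }
  } else {
    // Superstep 1: all messages have arrived, so their sum is the in-degree.
    vertex.result = messages.reduce((sum, m) => sum + m, 0);
    vertex.active = false; // deactivate: nothing left to do
  }
  return outgoing;
}

// Driving the worker by hand for the graph a -> b, a -> c, b -> c.
const vertices = {
  a: { id: "a", outEdges: ["b", "c"], active: true, result: null },
  b: { id: "b", outEdges: ["c"], active: true, result: null },
  c: { id: "c", outEdges: [], active: true, result: null },
};

let inbox = { a: [], b: [], c: [] };
for (let step = 0; step < 2; step++) {
  const next = { a: [], b: [], c: [] };
  for (const vertex of Object.values(vertices)) {
    const sent = inDegreeWorker(vertex, inbox[vertex.id], { step });
    for (const { to, data } of sent) {
      next[to].push(data); // deliver in the following superstep
    }
  }
  inbox = next;
}

console.log(vertices.a.result, vertices.b.result, vertices.c.result);
```

The worker never looks at any vertex other than its own; everything it learns about the rest of the graph arrives as a message.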
  25. Combine ≙ Reduce
       • “Reduce” all generated messages
       • Output: an aggregated message for each vertex
       • Executed on sender as well as receiver
       • Available parameters:
           • One new message for a vertex
           • The stored aggregate for this vertex
       • Typical combiners are SUM, MIN or MAX
       • Reduces network traffic
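The effect of a combiner on network traffic can be shown with a few lines. The sketch below assumes a MIN combiner that takes one new message and the stored aggregate, as described above; the helper `combine`, the message shape `{ to, data }` and the two "servers" are hypothetical and only simulate the sender and receiver sides in one process.

```javascript
// Hypothetical MIN combiner: (new message, stored aggregate) -> new aggregate.
function minCombiner(message, aggregate) {
  return aggregate === undefined ? message : Math.min(message, aggregate);
}

// Folds a list of messages into one aggregate per target vertex.
function combine(messages, combiner) {
  const aggregates = new Map();
  for (const { to, data } of messages) {
    aggregates.set(to, combiner(data, aggregates.get(to)));
  }
  return aggregates;
}

// Sender side: each server pre-aggregates its own outgoing messages.
const fromServerA = combine(
  [{ to: "v1", data: 7 }, { to: "v1", data: 3 }, { to: "v2", data: 5 }],
  minCombiner
);
const fromServerB = combine(
  [{ to: "v1", data: 4 }, { to: "v2", data: 9 }, { to: "v2", data: 6 }],
  minCombiner
);

// Only one message per (server, target) pair crosses the network: 4 instead of 6.
const onTheWire = [...fromServerA, ...fromServerB].map(([to, data]) => ({ to, data }));

// Receiver side: the same combiner merges what arrives from both servers.
const delivered = combine(onTheWire, minCombiner);

console.log(onTheWire.length, delivered.get("v1"), delivered.get("v2"));
```

This only works because MIN, like SUM and MAX, gives the same answer no matter how the messages are grouped before they are merged.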
  26. Activity ≙ Termination
       • Execute several rounds of Map/Reduce
       • Count active vertices and messages
       • Start next round if one of the following is true:
           • At least one vertex is active
           • At least one message is sent
       • Terminate if neither a vertex is active nor messages were sent
       • Store all non-deleted vertices and edges as resulting graph
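The termination rule can be sketched as a small driver loop: before each round it counts active vertices and the messages sent in the previous round, and it stops once both counts are zero. The names `runSupersteps` and `relayWorker`, the `send` callback and the chain graph are assumptions for illustration; a real implementation would distribute this loop over several servers.

```javascript
// Hypothetical driver: repeats supersteps until the termination rule holds.
function runSupersteps(vertices, worker, maxSteps = 1000) {
  let inbox = new Map(); // vertex id -> messages delivered in this superstep
  let pending = 0;       // number of messages sent in the previous superstep
  let step = 0;

  while (step < maxSteps) {
    const activeCount = vertices.filter((v) => v.active).length;
    // Terminate if no vertex is active and no message was sent.
    if (activeCount === 0 && pending === 0) break;

    const outbox = new Map();
    let sent = 0;
    const send = (to, data) => {
      if (!outbox.has(to)) outbox.set(to, []);
      outbox.get(to).push(data);
      sent += 1;
    };

    for (const vertex of vertices) {
      const messages = inbox.get(vertex.id) || [];
      if (!vertex.active && messages.length === 0) continue; // stays inactive
      vertex.active = true; // an incoming message reactivates the vertex
      worker(vertex, messages, { step, send });
    }

    inbox = outbox;
    pending = sent;
    step += 1;
  }
  return step; // number of supersteps that were executed
}

// Hypothetical worker: vertex 0 starts a token that is relayed down a chain.
function relayWorker(vertex, messages, global) {
  if (global.step === 0 && vertex.id === 0) {
    global.send(vertex.next, 1);
  }
  for (const hops of messages) {
    vertex.result = hops; // remember how far the token has travelled
    if (vertex.next !== null) global.send(vertex.next, hops + 1);
  }
  vertex.active = false; // every vertex deactivates after its turn
}

// Chain 0 -> 1 -> 2 -> 3.
const chain = [0, 1, 2, 3].map((id) => ({
  id,
  next: id < 3 ? id + 1 : null,
  active: true,
  result: null,
}));

const supersteps = runSupersteps(chain, relayWorker);
console.log(supersteps, chain.map((v) => v.result));
```

After superstep 0 every vertex is inactive, yet the run continues for three more rounds because a message is still in flight each time; it ends only when the last vertex has nothing left to forward.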
  27. Pregel Questions
       • connected components
       • page rank
       • bipartite matching
       • semi-clustering
       • minimum spanning forest
       • graph coloring
       • shortest paths
  28-31. Pagerank
     [Figure sequence; not captured in the transcript.]
  32. Pagerank for Giraph

      public class SimplePageRankComputation extends BasicComputation<
          LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
        public static final int MAX_SUPERSTEPS = 30;

        @Override
        public void compute(
            Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
            Iterable<DoubleWritable> messages) throws IOException {
          if (getSuperstep() >= 1) {
            double sum = 0;
            for (DoubleWritable message : messages) {
              sum += message.get();
            }
            DoubleWritable vertexValue = new DoubleWritable(
                (0.15f / getTotalNumVertices()) + 0.85f * sum);
            vertex.setValue(vertexValue);
          }
          if (getSuperstep() < MAX_SUPERSTEPS) {
            long edges = vertex.getNumEdges();
            sendMessageToAllEdges(vertex,
                new DoubleWritable(vertex.getValue().get() / edges));
          } else {
            vertex.voteToHalt();
          }
        }

        public static class SimplePageRankWorkerContext extends WorkerContext {
          @Override
          public void preApplication()
              throws InstantiationException, IllegalAccessException { }
          @Override
          public void postApplication() { }
          @Override
          public void preSuperstep() { }
          @Override
          public void postSuperstep() { }
        }

        public static class SimplePageRankMasterCompute extends DefaultMasterCompute {
          @Override
          public void initialize()
              throws InstantiationException, IllegalAccessException {
          }
        }

        public static class SimplePageRankVertexReader extends
            GeneratedVertexReader<LongWritable, DoubleWritable, FloatWritable> {
          @Override
          public boolean nextVertex() {
            return totalRecords > recordsRead;
          }

          @Override
          public Vertex<LongWritable, DoubleWritable, FloatWritable>
              getCurrentVertex() throws IOException {
            Vertex<LongWritable, DoubleWritable, FloatWritable> vertex =
                getConf().createVertex();
            LongWritable vertexId = new LongWritable(
                (inputSplit.getSplitIndex() * totalRecords) + recordsRead);
            DoubleWritable vertexValue = new DoubleWritable(vertexId.get() * 10d);
            long targetVertexId = (vertexId.get() + 1)
                % (inputSplit.getNumSplits() * totalRecords);
            float edgeValue = vertexId.get() * 100f;
            List<Edge<LongWritable, FloatWritable>> edges = Lists.newLinkedList();
            edges.add(EdgeFactory.create(new LongWritable(targetVertexId),
                new FloatWritable(edgeValue)));
            vertex.initialize(vertexId, vertexValue, edges);
            ++recordsRead;
            return vertex;
          }
        }

        public static class SimplePageRankVertexInputFormat extends
            GeneratedVertexInputFormat<LongWritable, DoubleWritable, FloatWritable> {
          @Override
          public VertexReader<LongWritable, DoubleWritable, FloatWritable>
              createVertexReader(InputSplit split, TaskAttemptContext context)
              throws IOException {
            return new SimplePageRankVertexReader();
          }
        }

        public static class SimplePageRankVertexOutputFormat extends
            TextVertexOutputFormat<LongWritable, DoubleWritable, FloatWritable> {
          @Override
          public TextVertexWriter createVertexWriter(TaskAttemptContext context)
              throws IOException, InterruptedException {
            return new SimplePageRankVertexWriter();
          }

          public class SimplePageRankVertexWriter extends TextVertexWriter {
            @Override
            public void writeVertex(
                Vertex<LongWritable, DoubleWritable, FloatWritable> vertex)
                throws IOException, InterruptedException {
              getRecordWriter().write(
                  new Text(vertex.getId().toString()),
                  new Text(vertex.getValue().toString()));
            }
          }
        }
      }
  33. Pagerank for TinkerPop3

      public class PageRankVertexProgram implements VertexProgram<Double> {
        private MessageType.Local messageType =
            MessageType.Local.of(() -> GraphTraversal.<Vertex>of().outE());
        public static final String PAGE_RANK = Graph.Key.hide("gremlin.pageRank");
        public static final String EDGE_COUNT = Graph.Key.hide("gremlin.edgeCount");
        private static final String VERTEX_COUNT =
            "gremlin.pageRankVertexProgram.vertexCount";
        private static final String ALPHA = "gremlin.pageRankVertexProgram.alpha";
        private static final String TOTAL_ITERATIONS =
            "gremlin.pageRankVertexProgram.totalIterations";
        private static final String INCIDENT_TRAVERSAL =
            "gremlin.pageRankVertexProgram.incidentTraversal";
        private double vertexCountAsDouble = 1;
        private double alpha = 0.85d;
        private int totalIterations = 30;
        private static final Set<String> COMPUTE_KEYS =
            new HashSet<>(Arrays.asList(PAGE_RANK, EDGE_COUNT));

        private PageRankVertexProgram() {}

        @Override
        public void loadState(final Configuration configuration) {
          this.vertexCountAsDouble = configuration.getDouble(VERTEX_COUNT, 1.0d);
          this.alpha = configuration.getDouble(ALPHA, 0.85d);
          this.totalIterations = configuration.getInt(TOTAL_ITERATIONS, 30);
          try {
            if (configuration.containsKey(INCIDENT_TRAVERSAL)) {
              final SSupplier<Traversal> traversalSupplier =
                  VertexProgramHelper.deserialize(configuration, INCIDENT_TRAVERSAL);
              VertexProgramHelper.verifyReversibility(traversalSupplier.get());
              this.messageType = MessageType.Local.of((SSupplier) traversalSupplier);
            }
          } catch (final Exception e) {
            throw new IllegalStateException(e.getMessage(), e);
          }
        }

        @Override
        public void storeState(final Configuration configuration) {
          configuration.setProperty(GraphComputer.VERTEX_PROGRAM,
              PageRankVertexProgram.class.getName());
          configuration.setProperty(VERTEX_COUNT, this.vertexCountAsDouble);
          configuration.setProperty(ALPHA, this.alpha);
          configuration.setProperty(TOTAL_ITERATIONS, this.totalIterations);
          try {
            VertexProgramHelper.serialize(this.messageType.getIncidentTraversal(),
                configuration, INCIDENT_TRAVERSAL);
          } catch (final Exception e) {
            throw new IllegalStateException(e.getMessage(), e);
          }
        }

        @Override
        public Set<String> getElementComputeKeys() {
          return COMPUTE_KEYS;
        }

        @Override
        public void setup(final Memory memory) {

        }

        @Override
        public void execute(final Vertex vertex, Messenger<Double> messenger,
            final Memory memory) {
          if (memory.isInitialIteration()) {
            double initialPageRank = 1.0d / this.vertexCountAsDouble;
            double edgeCount = Double.valueOf(
                (Long) this.messageType.edges(vertex).count().next());
            vertex.singleProperty(PAGE_RANK, initialPageRank);
            vertex.singleProperty(EDGE_COUNT, edgeCount);
            messenger.sendMessage(this.messageType, initialPageRank / edgeCount);
          } else {
            double newPageRank = StreamFactory
                .stream(messenger.receiveMessages(this.messageType))
                .reduce(0.0d, (a, b) -> a + b);
            newPageRank = (this.alpha * newPageRank)
                + ((1.0d - this.alpha) / this.vertexCountAsDouble);
            vertex.singleProperty(PAGE_RANK, newPageRank);
            messenger.sendMessage(this.messageType,
                newPageRank / vertex.<Double>property(EDGE_COUNT).orElse(0.0d));
          }
        }

        @Override
        public boolean terminate(final Memory memory) {
          return memory.getIteration() >= this.totalIterations;
        }
      }
  34. Pagerank for ArangoDB

      var pageRank = function (vertex, message, global) {
        var total = global.vertexCount;
        var edgeCount = vertex._outEdges.length;
        var alpha = global.alpha;
        var sum = 0, rank = 0;
        if (global.step > 0) {
          while (message.hasNext()) {
            sum += message.next().data;
          }
          rank = alpha * sum + (1 - alpha) / total;
        } else {
          rank = 1 / total;
        }
        vertex._setResult(rank);
        if (global.step < global.MAX_STEPS) {
          var send = rank / edgeCount;
          while (vertex._outEdges.hasNext()) {
            message.sendTo(vertex._outEdges.next().edge._getTarget(), send);
          }
        } else {
          vertex._deactivate();
        }
      };
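The same update rule can be tried without a database. The sketch below applies the formula from the ArangoDB worker, rank = alpha * sum + (1 - alpha) / total, to a plain adjacency object in a single process. The function name `pageRank`, its options and the example graph are assumptions for illustration; summing the shares in `outbox` plays the role of a SUM combiner. Vertices without outgoing edges simply send nothing here, so their rank mass is lost, which a production implementation would have to handle.

```javascript
// Minimal single-process PageRank using the superstep scheme of the worker above.
// outEdges maps every vertex id to the list of ids it points to.
function pageRank(outEdges, { alpha = 0.85, maxSteps = 30 } = {}) {
  const ids = Object.keys(outEdges);
  const total = ids.length;
  const rank = {};
  let inbox = {}; // vertex id -> sum of the shares received

  for (let step = 0; step <= maxSteps; step++) {
    const outbox = {};
    for (const id of ids) {
      // Step 0 starts uniform; later steps use the received shares.
      rank[id] = step === 0
        ? 1 / total
        : alpha * (inbox[id] || 0) + (1 - alpha) / total;

      // Send an equal share along every outgoing edge until the last step.
      if (step < maxSteps) {
        const share = rank[id] / outEdges[id].length;
        for (const target of outEdges[id]) {
          outbox[target] = (outbox[target] || 0) + share; // SUM combiner
        }
      }
    }
    inbox = outbox;
  }
  return rank;
}

// Hypothetical example graph: a -> b, a -> c, b -> c, c -> a.
const ranks = pageRank({ a: ["b", "c"], b: ["c"], c: ["a"] });
console.log(ranks);
```

Because every vertex in the example has at least one outgoing edge, the ranks keep summing to 1 in every superstep, and `b`, which only receives half of `a`'s rank, ends up with the smallest value.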
  36-37. Bipartite Matching
     [Figure sequence; not captured in the transcript.]
  39. Thank You
       • Twitter: @arangodb
       • Github: triagens/ArangoDB
       • Google Group: arangodb
       • IRC: arangodb
