Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
Introducing Apache Giraph for
 Large Scale Graph Processing

 Sebastian Schelter

 PhD student at the Database Systems and...
Graph recap
graph: abstract representation of a set of objects
(vertices), where some pairs of these objects are
connected...
Real world graphs are really large!
• the World Wide Web has several billion pages
  with several billion links
• Facebook...
Why not use MapReduce/Hadoop?
• Example: PageRank, Google‘s famous algorithm for
  measuring the authority of a webpage ba...
Textbook approach to
           PageRank in MapReduce
• PageRank p is the principal eigenvector of the Markov matrix
  M d...
Drawbacks
• Not intuitive: only crazy scientists
  think in matrices and eigenvectors
• Unnecessarily slow: Each iteration...
Google Pregel
• distributed system especially developed for
  large scale graph processing
• intuitive API that let‘s you ...
Bulk Synchronous Parallel (BSP)
                    processors




local computation


                                 su...
Vertex-centric BSP
• each vertex has an id, a value, a list of its adjacent vertex ids and the
  corresponding edge values...
Master-slave architecture
• vertices are partitioned and
  assigned to workers
   – default: hash-partitioning
   – custom...
PageRank in Pregel
class PageRankVertex {
 void compute(Iterator messages) {
   if (getSuperstep() > 0) {
     // recomput...
PageRank toy example
      .17         .33
.33         .33         .33   Superstep 0
      .17         .17
            .17...
Cool, where can I download it?
• Pregel is proprietary, but:
  – Apache Giraph is an open source
    implementation of Pre...
Giraph‘s Hadoop usage

   TaskTracker        TaskTracker          TaskTracker

worker    worker   worker      worker   wor...
Anatomy of an execution
Setup                               Teardown
• load the graph from disk          • write back resu...
Who is doing what?
• ZooKeeper: responsible for computation state
    – partition/worker mapping
    – global state: #supe...
What do you have to implement?
• your algorithm as a Vertex
  – Subclass one of the many existing implementations:
    Bas...
Starting a Giraph job
• no difference to starting a Hadoop job:

  $ hadoop jar giraph-0.1-jar-with-dependencies.jar
  o.a...
What‘s to come?
• Current and future work in Giraph
  – graduation from the incubator
  – out-of-core messaging
  – algori...
Everything is a network!
Further resources
• Apache Giraph homepage
  http://incubator.apache.org/giraph


• Claudio Martella: “Apache Giraph: Dist...
Thank you.
Questions?
Introducing Apache Giraph for Large Scale Graph Processing
Nächste SlideShare
Wird geladen in …5
×

von

Introducing Apache Giraph for Large Scale Graph Processing Slide 1 Introducing Apache Giraph for Large Scale Graph Processing Slide 2 Introducing Apache Giraph for Large Scale Graph Processing Slide 3 Introducing Apache Giraph for Large Scale Graph Processing Slide 4 Introducing Apache Giraph for Large Scale Graph Processing Slide 5 Introducing Apache Giraph for Large Scale Graph Processing Slide 6 Introducing Apache Giraph for Large Scale Graph Processing Slide 7 Introducing Apache Giraph for Large Scale Graph Processing Slide 8 Introducing Apache Giraph for Large Scale Graph Processing Slide 9 Introducing Apache Giraph for Large Scale Graph Processing Slide 10 Introducing Apache Giraph for Large Scale Graph Processing Slide 11 Introducing Apache Giraph for Large Scale Graph Processing Slide 12 Introducing Apache Giraph for Large Scale Graph Processing Slide 13 Introducing Apache Giraph for Large Scale Graph Processing Slide 14 Introducing Apache Giraph for Large Scale Graph Processing Slide 15 Introducing Apache Giraph for Large Scale Graph Processing Slide 16 Introducing Apache Giraph for Large Scale Graph Processing Slide 17 Introducing Apache Giraph for Large Scale Graph Processing Slide 18 Introducing Apache Giraph for Large Scale Graph Processing Slide 19 Introducing Apache Giraph for Large Scale Graph Processing Slide 20 Introducing Apache Giraph for Large Scale Graph Processing Slide 21 Introducing Apache Giraph for Large Scale Graph Processing Slide 22 Introducing Apache Giraph for Large Scale Graph Processing Slide 23
Nächste SlideShare
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Weiter
Herunterladen, um offline zu lesen und im Vollbildmodus anzuzeigen.

46 Gefällt mir

Teilen

Herunterladen, um offline zu lesen

Introducing Apache Giraph for Large Scale Graph Processing

Herunterladen, um offline zu lesen

Ähnliche Bücher

Kostenlos mit einer 30-tägigen Testversion von Scribd

Alle anzeigen

Ähnliche Hörbücher

Kostenlos mit einer 30-tägigen Testversion von Scribd

Alle anzeigen

Introducing Apache Giraph for Large Scale Graph Processing

  1. 1. Introducing Apache Giraph for Large Scale Graph Processing Sebastian Schelter PhD student at the Database Systems and Information Management Group of TU Berlin Committer and PMC member at Apache Mahout and Apache Giraph mail ssc@apache.org blog http://ssc.io
  2. 2. Graph recap graph: abstract representation of a set of objects (vertices), where some pairs of these objects are connected by links (edges), which can be directed or undirected Graphs can be used to model arbitrary things like road networks, social networks, flows of goods, etc. Majority of graph algorithms B are iterative and traverse the graph in some way A D C
  3. 3. Real world graphs are really large! • the World Wide Web has several billion pages with several billion links • Facebook‘s social graph had more than 700 million users and more than 68 billion friendships in 2011 • twitter‘s social graph has billions of follower relationships
  4. 4. Why not use MapReduce/Hadoop? • Example: PageRank, Google‘s famous algorithm for measuring the authority of a webpage based on the underlying network of hyperlinks • defined recursively: each vertex distributes its authority to its neighbors in equal proportions pj pi   dj j  ( j , i ) 
  5. 5. Textbook approach to PageRank in MapReduce • PageRank p is the principal eigenvector of the Markov matrix M defined by the transition probabilities between web pages • it can be obtained by iteratively multiplying an initial PageRank vector by M (power iteration) p  M p0 k row 1 of M ∙ row 2 of M ∙ pi pi+1 row n of M ∙
  6. 6. Drawbacks • Not intuitive: only crazy scientists think in matrices and eigenvectors • Unnecessarily slow: Each iteration is scheduled as separate MapReduce job with lots of overhead – the graph structure is read from disk – the map output is spilled to disk – the intermediary result is written to HDFS • Hard to implement: a join has to be implemented by hand, lots of work, best strategy is data dependent
  7. 7. Google Pregel • distributed system especially developed for large scale graph processing • intuitive API that let‘s you ‚think like a vertex‘ • Bulk Synchronous Parallel (BSP) as execution model • fault tolerance by checkpointing
  8. 8. Bulk Synchronous Parallel (BSP) processors local computation superstep communication barrier synchronization
  9. 9. Vertex-centric BSP • each vertex has an id, a value, a list of its adjacent vertex ids and the corresponding edge values • each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers • advanced features : termination votes, combiners, aggregators, topology mutations vertex1 vertex1 vertex1 vertex2 vertex2 vertex2 vertex3 vertex3 vertex3 superstep i superstep i + 1 superstep i + 2
  10. 10. Master-slave architecture • vertices are partitioned and assigned to workers – default: hash-partitioning – custom partitioning possible • master assigns and coordinates, while workers execute vertices Master and communicate with each other Worker 1 Worker 2 Worker 3
  11. 11. PageRank in Pregel class PageRankVertex { void compute(Iterator messages) { if (getSuperstep() > 0) { // recompute own PageRank from the neighbors messages pageRank = sum(messages); pj  setVertexValue(pageRank); } pi  j  ( j , i )  dj if (getSuperstep() < k) { // send updated PageRank to each neighbor sendMessageToAllNeighbors(pageRank / getNumOutEdges()); } else { voteToHalt(); // terminate } }}
  12. 12. PageRank toy example .17 .33 .33 .33 .33 Superstep 0 .17 .17 .17 Input graph .25 .34 .17 .50 .34 Superstep 1 A B C .09 .25 .09 .22 .34 .25 .43 .34 Superstep 2 .13 .22 .13
  13. 13. Cool, where can I download it? • Pregel is proprietary, but: – Apache Giraph is an open source implementation of Pregel – runs on standard Hadoop infrastructure – computation is executed in memory – can be a job in a pipeline (MapReduce, Hive) – uses Apache ZooKeeper for synchronization
  14. 14. Giraph‘s Hadoop usage TaskTracker TaskTracker TaskTracker worker worker worker worker worker worker TaskTracker ZooKeeper master worker JobTracker NameNode
  15. 15. Anatomy of an execution Setup Teardown • load the graph from disk • write back result • assign vertices to workers • write back aggregators • validate workers health Compute Synchronize • assign messages to workers • send messages to workers • iterate on active vertices • compute aggregators • call vertices compute() • checkpoint
  16. 16. Who is doing what? • ZooKeeper: responsible for computation state – partition/worker mapping – global state: #superstep – checkpoint paths, aggregator values, statistics • Master: responsible for coordination – assigns partitions to workers – coordinates synchronization – requests checkpoints – aggregates aggregator values – collects health statuses • Worker: responsible for vertices – invokes active vertices compute() function – sends, receives and assigns messages – computes local aggregation values
  17. 17. What do you have to implement? • your algorithm as a Vertex – Subclass one of the many existing implementations: BasicVertex, MutableVertex, EdgeListVertex, HashMapVertex, LongDoubleFloatDoubleVertex,... • a VertexInputFormat to read your graph – e.g. from a text file with adjacency lists like <vertex> <neighbor1> <neighbor2> ... • a VertexOutputFormat to write back the result – e.g. <vertex> <pageRank>
  18. 18. Starting a Giraph job • no difference to starting a Hadoop job: $ hadoop jar giraph-0.1-jar-with-dependencies.jar o.a.g.GiraphRunner o.a.g.examples.ConnectedComponentsVertex --inputFormat o.a.g.examples.IntIntNullIntTextInputFormat --inputPath hdfs:///wikipedia/pagelinks.txt --outputFormat o.a.g.examples.ComponentOutputFormat --outputPath hdfs:///wikipedia/results/ --workers 89 --combiner o.a.g.examples.MinimumIntCombiner
  19. 19. What‘s to come? • Current and future work in Giraph – graduation from the incubator – out-of-core messaging – algorithms library • 2-day workshop after Berlin Buzzwords – topic: ‚Parallel Processing beyond MapReduce‘ – meet the developers of Giraph and Stratosphere http://berlinbuzzwords.de/content/workshops-berlin-buzzwords
  20. 20. Everything is a network!
  21. 21. Further resources • Apache Giraph homepage http://incubator.apache.org/giraph • Claudio Martella: “Apache Giraph: Distributed Graph Processing in the Cloud” http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph- processing-in-the-cloud/ • Malewicz et al.: „Pregel – a system for large scale graph processing“, PODC 09 http://dl.acm.org/citation.cfm?id=1582723
  22. 22. Thank you. Questions?
  • lynncheng

    Jan. 21, 2018
  • majidhazari

    Oct. 28, 2016
  • dhjnavy

    Jul. 31, 2016
  • trucvy79

    Feb. 27, 2016
  • ssuser52d465

    Oct. 26, 2015
  • LuizHenriqueZambomSa

    May. 20, 2015
  • felipemcarneiro

    Apr. 10, 2015
  • AvinashKumar197

    Apr. 4, 2015
  • MircoZingaretti

    Mar. 12, 2015
  • LinJiyuan90

    Jan. 10, 2015
  • morenovincent7

    Dec. 20, 2014
  • RichardDenman

    Sep. 24, 2014
  • adoshi123

    Aug. 23, 2014
  • erlonboechat

    Jul. 9, 2014
  • Wallut123

    Jun. 11, 2014
  • NELLAIVIJAY1

    May. 2, 2014
  • Navneetk

    Apr. 29, 2014
  • Quasi_quant2010

    Apr. 22, 2014
  • NicolasBrousse

    Apr. 10, 2014
  • YeshwanthKumar3743

    Apr. 9, 2014

Aufrufe

Aufrufe insgesamt

30.680

Auf Slideshare

0

Aus Einbettungen

0

Anzahl der Einbettungen

6.626

Befehle

Downloads

628

Geteilt

0

Kommentare

0

Likes

46

×