WOOster: A Map-Reduce based
Platform for Graph Mining
  Aravindan Raghuveer
  Yahoo! Inc, Bangalore.
Introduction

           “If you squint the right way, graphs
             are everywhere” [1]
           @ Yahoo! :
                      • The WOO Graph: All knowledge
                        assimilated from the web.
                      - http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
     [1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Yahoo! Confidential
The What and Why?
  What?
                      • Family of graph query algorithms.
                      • Framework:
                          • For graph storage and for invoking the query algorithms
                          • Hosted solution on Hadoop

  Why?
                      • Family of graph query algorithms: present-day
                        algorithms do not scale to graphs with billions of
                        vertices and edges.
                      • Framework:
                          • Optimizes the storage layout to suit the graph query
                            algorithms
                          • Improves query throughput.
Outline of the talk

      •     MapReduce 101
      •     Graph Mining Approaches
      •     Brief overview of WOOster architecture
      •     Graph query algorithms in WOOster:
             • Sub Graph Matching
             • Reachability Query
      •     Experiments
      •     Conclusion


MapReduce 101

             • Switch to slides from "Cloud Computing
               with MapReduce and Hadoop":
               www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt

MapReduce Programming Model

• Data type: key-value records

• Map function:
            (K_in, V_in) → list(K_inter, V_inter)

• Reduce function:
         (K_inter, list(V_inter)) → list(K_out, V_out)
Example: Word Count

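def mapper(line):
    # Emit (word, 1) for every word in the input line.
    # output() is the emit call provided by the framework.
    for word in line.split():
        output(word, 1)


def reducer(key, values):
    # Sum all the counts collected for one word.
    output(key, sum(values))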
Word Count Execution

  Input splits and Map output:
    "the quick brown fox"     ->  (the, 1) (quick, 1) (brown, 1) (fox, 1)
    "the fox ate the mouse"   ->  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
    "how now brown cow"       ->  (how, 1) (now, 1) (brown, 1) (cow, 1)

  Shuffle & Sort: pairs are grouped by word and each word is routed to one reducer.

  Reduce output:
    Reducer 1:  brown, 2   fox, 2   how, 1   now, 1   the, 3
    Reducer 2:  ate, 1   cow, 1   mouse, 1   quick, 1
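
The execution above can be reproduced on one machine. The sketch below is only a single-process stand-in for the three phases, not Hadoop itself; `run_mapreduce` is a hypothetical helper name, and the mapper and reducer use `yield` in place of the framework's `output` call.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """Reduce: sum all counts collected for one word."""
    yield key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle & sort, and reduce in a single process."""
    groups = defaultdict(list)
    # Map phase, with the shuffle folded in: group every value under its key.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Sort keys, then call the reducer once per key.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(run_mapreduce(lines, mapper, reducer))
print(counts)
```

A real cluster runs many mappers and reducers in parallel and partitions the keys across reducers; the grouping-by-key contract is the same.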
Graph Mining Approaches : Two Schools
           • School-1: Invent a new platform:
             - Map-Reduce is not best suited for graph mining.
             - BSP, PRAM models: circa 1980s
             - Pregel from Google [1]; HaLoop
           • School-2: Ride on Map-Reduce
             -    MR has wide adoption, open source tools, industry support.
             -    No need to invest in one more computing infrastructure.
             -    Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
             -    Efforts in open source / academia on the same lines:
                    • Pegasus, CMU [2]
                    • Graph mining in Apache Mahout [3]
                    • Raytheon's graph mining [4]
    [1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
    [2] http://www.cs.cmu.edu/~pegasus/
    [3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
    [4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
WOOster Architecture
     Components (from the architecture diagram): WOOster Web UI & WebService APIs,
     Planner, Executor, Graph Indices, Jobs D/B, WOO Graph, Grid.

     •   User submits a query.
     •   Planner periodically scans for newly arrived queries.
     •   Planner creates an M-R plan that re-uses computation and I/O
         across queries (batching).
     •   Executor executes the M-R plan.
     •   Result is notified to the user (hosted solution).
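
One way to picture batching is a single scan of the graph that serves several pending queries at once. The sketch below is an illustration under that assumption only; `PENDING_QUERIES` and `batched_map` are hypothetical names and do not describe the actual WOOster planner.

```python
# Hypothetical pending queries: query id -> vertex predicate (assumption for illustration).
PENDING_QUERIES = {
    "query-a": lambda attrs: attrs.get("age", 0) < 40,
    "query-b": lambda attrs: attrs.get("city") == "Bangalore",
}


def batched_map(vertex_id, attrs, queries):
    """Read one vertex row once and evaluate every pending query against it."""
    for query_id, predicate in queries.items():
        if predicate(attrs):
            # Tag the output with the query id so results can be kept apart later.
            yield query_id, vertex_id


rows = [
    ("g1", {"age": 31, "city": "Bangalore"}),
    ("g2", {"age": 52, "city": "Bangalore"}),
]
emitted = [pair for vid, attrs in rows for pair in batched_map(vid, attrs, PENDING_QUERIES)]
print(emitted)
```

The I/O saving comes from reading each row once for the whole batch instead of once per query.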


The Sub-Graph Match Query

     Find all instances of query Q in graph G.

     • Vertices have attributes (ex: age:31).
     • Edges have relationship labels.
     • Vertices and edges have constraints (ex: age<40).

     Notation (diagram legend): query vertex, graph vertex, matched graph vertex.

       Why Sub-Graph Match (Exact Graph Isomorphism)?
       • A popular and expressive graph query, useful for mining patterns.
       • To our knowledge, no large-scale algorithm exists that operates on a
         billion-vertex graph.
Overview of the Solution

    Step-0. Data Layout on HDFS


    Step-1. Query Graph Partitioning


   Step-2. Edge Selection


   Step-3. Query Partition Matching


   Step-4. Query Partition Merging
Data Layout on HDFS

        •      How to store a large-scale graph?
        •      Adjacency-list-like solution:
                • Each row/line holds the information about one vertex:
                      • Vertex attributes
                      • Vertex neighbors and the label associated with each edge.


        Implications:
        • Enables early pruning of non-matching edges and vertices.
        • Each vertex has information about itself and its immediate
          neighbors only.
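
A row in such a layout might look like the sketch below. The concrete encoding (tab-separated fields with JSON payloads) and the names `encode_vertex`, `decode_vertex`, and `may_match` are assumptions for illustration; the slides do not specify the on-disk format.

```python
import json


def encode_vertex(vertex_id, attributes, neighbors):
    """Serialize one vertex as one line: id, attributes, labeled neighbors."""
    return "\t".join([vertex_id, json.dumps(attributes, sort_keys=True), json.dumps(neighbors)])


def decode_vertex(line):
    """Parse one line back into (id, attributes, [(neighbor, label), ...])."""
    vertex_id, attrs_json, nbrs_json = line.rstrip("\n").split("\t")
    return vertex_id, json.loads(attrs_json), [tuple(n) for n in json.loads(nbrs_json)]


def may_match(attributes, neighbors, wanted_label):
    """Early pruning: decide from this row alone whether the vertex can match."""
    has_label = any(label == wanted_label for _, label in neighbors)
    return attributes.get("age", 0) < 40 and has_label


row = encode_vertex("g1", {"age": 31}, [("g2", "friend"), ("g3", "colleague")])
vertex_id, attributes, neighbors = decode_vertex(row)
print(vertex_id, attributes, neighbors, may_match(attributes, neighbors, "friend"))
```

Because the row carries only neighbor ids and edge labels, a mapper can test the vertex's own constraints but not those of its neighbors.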

Step-1: Query Graph Partitioning

        Why? Independent sub-problems can be solved in parallel.
        How? Find the minimum number of partitions such that the
        diameter of each partition = 2.
        Intuition:
        • In a spanning tree of diameter 2, there is one vertex that is
          connected to all other vertices → the pivot vertex
          (pivot vertices are marked in the diagram).
        • This property is used in steps 2 and 3.
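
A simple way to obtain star-shaped partitions is a greedy pass over the query edges. This is a heuristic sketch, not the partitioning method from the paper: it does not guarantee the minimum number of partitions, it assumes an undirected query graph without self-loops, and `partition_query` is a hypothetical name.

```python
def partition_query(edges):
    """Greedily split query edges into stars, each a (pivot, leaves) pair."""
    remaining = {frozenset(edge) for edge in edges}
    partitions = []
    while remaining:
        # Count how many remaining edges touch each vertex.
        degree = {}
        for edge in remaining:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        # Pick the vertex covering the most remaining edges; ties go to the smallest name.
        pivot = max(sorted(degree), key=degree.get)
        leaves = sorted(next(iter(edge - {pivot})) for edge in remaining if pivot in edge)
        partitions.append((pivot, leaves))
        remaining = {edge for edge in remaining if pivot not in edge}
    return partitions


query_edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q4", "q5"), ("q4", "q6")]
stars = partition_query(query_edges)
print(stars)
```

Every partition is a star around its pivot, so any two vertices in it are at most two hops apart.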


Step-2: Edge Selection
        •     What: Select the subset of edges from G that match at least one
              edge in Q.
        •     How (g1 is a graph vertex with neighbors g2, g3, g4; q1 is a query vertex):
              1.  g1 is the current vertex in the mapper.
              2a. For every neighbor of q1, there exists a corresponding neighbor for g1.
              2b. Mapper emits all edges if the vertex and edge constraints are met.
              3.  g1-g2 is emitted with g1 mapped to a query vertex.
              4.  g1-g2 is also emitted from g2's mapper.
              5.  Reducer emits an edge if a pair is found.
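
The steps above can be sketched as a small single-process simulation. It is a simplified reading of the slide, not the algorithm from the paper: the toy graph, the query, and all function names are hypothetical, edges are assumed undirected and stored at both endpoints, and check 2a is reduced to a test on edge labels.

```python
from collections import defaultdict

# Toy graph rows: vertex -> (attributes, [(neighbor, edge label), ...]).
GRAPH = {
    "g1": ({"age": 31}, [("g2", "friend"), ("g3", "friend")]),
    "g2": ({"age": 35}, [("g1", "friend")]),
    "g3": ({"age": 52}, [("g1", "friend")]),
}
# Toy query: q1 -friend- q2, both with the constraint age < 40.
QUERY_VERTICES = {"q1": lambda a: a["age"] < 40, "q2": lambda a: a["age"] < 40}
QUERY_EDGES = {("q1", "q2"): "friend"}


def query_neighbors(q):
    """Yield (neighboring query vertex, edge label) for query vertex q."""
    for (a, b), label in QUERY_EDGES.items():
        if a == q:
            yield b, label
        elif b == q:
            yield a, label


def edge_selection_map(g1, attributes, neighbors):
    """Emit (edge, (g1, q, q_other)): g1 could play q, its neighbor could play q_other."""
    labels = {label for _, label in neighbors}
    for q, constraint in QUERY_VERTICES.items():
        if not constraint(attributes):
            continue  # vertex constraint fails: prune early
        wanted = list(query_neighbors(q))
        if not {label for _, label in wanted} <= labels:
            continue  # step 2a: some neighbor of q has no counterpart around g1
        for q_other, q_label in wanted:
            for g2, label in neighbors:
                if label == q_label:  # step 2b: edge constraint met
                    yield tuple(sorted((g1, g2))), (g1, q, q_other)


def edge_selection_reduce(edge, values):
    """Emit the edge when both endpoints made complementary claims (a pair)."""
    ga, gb = edge
    claims = set(values)
    for g, q, q_other in sorted(claims):
        if g == ga and (gb, q_other, q) in claims:
            yield edge, {ga: q, gb: q_other}


groups = defaultdict(list)
for vertex, (attributes, neighbors) in GRAPH.items():
    for key, value in edge_selection_map(vertex, attributes, neighbors):
        groups[key].append(value)
selected = [out for key in sorted(groups) for out in edge_selection_reduce(key, groups[key])]
print(selected)
```

The edge g1-g3 is emitted by g1's mapper but never by g3's (age 52 fails the constraint), so the reducer finds no pair and drops it.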
Step-3: Query Partition Matching
   Edge Selection:
           • Associates a graph vertex with the possible query vertices it could map to.
           • Associates the graph vertex with its "pivot" graph vertex.
           • A pivot graph vertex is a graph vertex that is mapped to a pivot query vertex: g1 in this example.

   Flow (edge selection output → map → reduce):
           1. Mapper emits the pivot graph vertex as key and the edge as value
              (here: g1-g2, g1-g3, g1-g4, all keyed by g1).
           2. Reducer receives all edges with the same pivot graph vertex.
           3. Reducer forms the partition (g1 with g2, g3, g4).
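
The flow above can be sketched as follows. The input data, the query partition (pivot q1 with leaves q2, q3, q4), and the function names are hypothetical; the reducer here only collects candidate graph vertices per leaf and checks that every leaf is covered, which is a simplification of forming concrete matches.

```python
from collections import defaultdict

# Hypothetical edge-selection output: (graph edge, {graph vertex: query vertex}).
PIVOT_QUERY_VERTEX = "q1"
LEAVES = {"q2", "q3", "q4"}
SELECTED_EDGES = [
    (("g1", "g2"), {"g1": "q1", "g2": "q2"}),
    (("g1", "g3"), {"g1": "q1", "g3": "q3"}),
    (("g1", "g4"), {"g1": "q1", "g4": "q4"}),
    (("g5", "g6"), {"g5": "q1", "g6": "q2"}),
]


def matching_map(edge, mapping):
    """Emit (pivot graph vertex, (leaf graph vertex, leaf query vertex))."""
    for g, q in mapping.items():
        if q == PIVOT_QUERY_VERTEX:
            (other,) = [v for v in edge if v != g]
            yield g, (other, mapping[other])


def matching_reduce(pivot, values):
    """Form the partition around one pivot if every leaf query vertex is covered."""
    candidates = defaultdict(list)
    for g, q in values:
        candidates[q].append(g)
    if set(candidates) >= LEAVES:
        yield pivot, {q: sorted(gs) for q, gs in sorted(candidates.items())}


groups = defaultdict(list)
for edge, mapping in SELECTED_EDGES:
    for key, value in matching_map(edge, mapping):
        groups[key].append(value)
partitions = dict(
    item for pivot in sorted(groups) for item in matching_reduce(pivot, groups[pivot])
)
print(partitions)
```

The pivot g5 is dropped because only one of the three leaves is covered around it.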
Step-4: Query Partition Merging
        •     Merges partitions one after another to form a query match.
        •     More details in the paper.




         Take-away from Steps 1-4 (also for any scalable Map-Reduce
           program):
        • The mapper/reducer keys are chosen such that the number of keys is
          proportional to the number of matches of query Q in the graph.
        • Hence the algorithm scales well for large graphs and complex
          queries.
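
The merge step is detailed in the paper and not on the slides. As a generic illustration only, merging two partial matches can be treated as a join on the query vertices they share; `merge_matches` and `merge_partitions` are hypothetical names, and the rule that two query vertices may not share one graph vertex is an assumption of this sketch.

```python
def merge_matches(left, right):
    """Merge two partial matches (query vertex -> graph vertex), or return None."""
    # Shared query vertices must be bound to the same graph vertex.
    for q in left.keys() & right.keys():
        if left[q] != right[q]:
            return None
    merged = {**left, **right}
    # Assumption: distinct query vertices map to distinct graph vertices.
    if len(set(merged.values())) != len(merged):
        return None
    return merged


def merge_partitions(left_matches, right_matches):
    """Join every compatible pair of partial matches from two partitions."""
    merged = []
    for left in left_matches:
        for right in right_matches:
            combined = merge_matches(left, right)
            if combined is not None:
                merged.append(combined)
    return merged


left = [{"q1": "g1", "q2": "g2", "q4": "g4"}]
right = [
    {"q4": "g4", "q5": "g5"},  # agrees on q4: merges
    {"q4": "g9", "q5": "g5"},  # disagrees on q4: rejected
    {"q4": "g4", "q5": "g2"},  # would reuse g2 for q5: rejected
]
full = merge_partitions(left, right)
print(full)
```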
Results

     [Chart: time (sec, axis 0 to 160) versus number of reducers (100, 150, 200, 250);
      series: Edge Selection, Query Partition Matching, Query Partition Merging]

             • Graph of 10 million vertices and 50 million edges
             • Complex query of 24 vertices
             • Note that the edge selection time reduces with an
               increasing number of reducers.
In the paper…

             • Detailed map-reduce algorithms for sub-graph match and
               reachability
             • Theoretical analysis for scalability
             • Construction of the synthetic dataset
             • Methodology and more experiments
             • Reachability query: examples, map-reduce algorithm
             • Related work




Future Work

        •     Indexing structure for graphs suited for M-R jobs
                • Compare with a Giraph-based approach.
        •     Better batching strategies.
        •     The right interface for plugging in custom graph algorithms
              while WOOster provides automatic batching.
        •     Implement more graph mining algorithms.



Questions / Comments
                                             21
Yahoo! Confidential

Weitere ähnliche Inhalte

Ähnlich wie WOOster: A Map-Reduce based Platform for Graph Mining

L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Gopher A Sub-graph centric framework for large scale graphs
Gopher A Sub-graph centric framework for large scale graphsGopher A Sub-graph centric framework for large scale graphs
Gopher A Sub-graph centric framework for large scale graphscharithwiki
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processingjins0618
 
Open GeoSocial API
Open GeoSocial APIOpen GeoSocial API
Open GeoSocial APIPat Cappelaere
 
Google_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_DeanGoogle_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_DeanHiroshi Ono
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processingjins0618
 
1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real World1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real WorldAchim Friedland
 
Cloud is such stuff as dreams are made on
Cloud is such stuff as dreams are made onCloud is such stuff as dreams are made on
Cloud is such stuff as dreams are made onPatrick Chanezon
 
Bigdata roundtable-storm
Bigdata roundtable-stormBigdata roundtable-storm
Bigdata roundtable-stormTobias Schlottke
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011Milind Bhandarkar
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 
Le projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT Raster
Le projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT RasterLe projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT Raster
Le projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT RasterACSG Section Montréal
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 

Ähnlich wie WOOster: A Map-Reduce based Platform for Graph Mining (20)

L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Gopher A Sub-graph centric framework for large scale graphs
Gopher A Sub-graph centric framework for large scale graphsGopher A Sub-graph centric framework for large scale graphs
Gopher A Sub-graph centric framework for large scale graphs
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Open GeoSocial API
Open GeoSocial APIOpen GeoSocial API
Open GeoSocial API
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
Google_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_DeanGoogle_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_Dean
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
 
1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real World1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real World
 
Cloud is such stuff as dreams are made on
Cloud is such stuff as dreams are made onCloud is such stuff as dreams are made on
Cloud is such stuff as dreams are made on
 
Bigdata roundtable-storm
Bigdata roundtable-stormBigdata roundtable-storm
Bigdata roundtable-storm
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 
Le projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT Raster
Le projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT RasterLe projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT Raster
Le projet “Canadian Spatial Data Foundry”: Introduction à PostGIS WKT Raster
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 

KĂźrzlich hochgeladen

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂşjo
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

KĂźrzlich hochgeladen (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

WOOster: A Map-Reduce based Platform for Graph Mining

  • 1. WOOster: A Map-Reduce based Platform for Graph Mining Aravindan Raghuveer Yahoo! Inc, Bangalore.
  • 2. Introduction “If you squint the right way, graphs are everywhere” [1] @ Yahoo! : • The WOO Graph: All knowledge assimilated from the web. - http://iswc2011.semanticweb.org/fileadmin/iswc/Pa pers/Industry/WOO_ISWC.pptx [1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html 2 Yahoo! Confidential
  • 3. The What and Why? What? Family of Graph Query Algorithms. • Framework: • For graph storage and invoking the query algorithms • Hosted Solution on Hadoop Why? • Family of Graph Query Algorithms: Present day algorithms do not scale to billion edge, vertex graphs. • Framework: •Optimizes storage layout to suit graph query algorithms •Improves throughput of the queries. 3 Yahoo! Confidential
  • 4. Outline of the talk • MapReduce 101 • Graph Mining Approaches • Brief overview of WOOster architecture • Graph query algorithms in WOOster: • Sub Graph Matching • Reachability Query • Experiments • Conclusion Yahoo! Confidential
  • 5. Map Reduce 101: Switch to slides from “Cloud Computing with MapReduce and Hadoop” (www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt).
  • 6. MapReduce Programming Model
      • Data type: key-value records
      • Map function: (K_in, V_in) → list(K_inter, V_inter)
      • Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
  • 7. Example: Word Count
      # output() stands for the framework's emit call.
      def mapper(line):
          for word in line.split():
              output(word, 1)

      def reducer(key, values):
          output(key, sum(values))
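  • 8. Word Count Execution (figure: Input → Map → Shuffle & Sort → Reduce → Output)
      • Input, one line per mapper: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”.
      • Map output: (the, 1), (quick, 1), (brown, 1), (fox, 1); (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1); (how, 1), (now, 1), (brown, 1), (cow, 1).
      • Reduce output: one reducer emits (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3); the other emits (ate, 1), (cow, 1), (mouse, 1), (quick, 1).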
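The word-count walk-through on slides 6 to 8 can be replayed in a single process. The sketch below is a local simulation, not Hadoop code: `run_mapreduce` is a hypothetical helper that stands in for the framework's shuffle and sort, and the three input lines are the ones shown on the execution slide.

```python
from collections import defaultdict

def mapper(line):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Reduce: sum all counts that were grouped under one key.
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Hypothetical stand-in for the framework: group intermediate
    # values by key (shuffle), then call the reducer per sorted key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(run_mapreduce(lines, mapper, reducer))
print(counts)
```

On a real cluster the grouping step is distributed across reducers; the per-key results are the same.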
  • 9. Graph Mining Approaches: Two Schools
      School-1: Invent a new platform
      • Map-Reduce is not best suited for graph mining.
      • BSP and PRAM models: circa 1980s.
      • Pregel from Google [1], HaLoop.
      School-2: Ride on Map-Reduce
      • MR has wide adoption, open source tools and industry support.
      • Avoids the need to invest in one more computing infrastructure.
      • Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
      • Efforts in open source / academia on the same lines: Pegasus, CMU [2]; Graph Mining in Apache Mahout [3]; Raytheon’s Graph Mining [4].
      [1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
      [2] http://www.cs.cmu.edu/~pegasus/
      [3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
      [4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
  • 10. WOOster Architecture
      • User submits a query.
      • Planner periodically scans for newly arrived queries.
      • Planner creates an M-R plan that re-uses computation and I/O across queries (batching).
      • Executor executes the M-R plan.
      • Result is notified to the user.
      Figure components: WOOster Web UI & WebService APIs; Planner; Executor; Graph Indices; Jobs D/B; WOO Graph; Grid (hosted solution).
  • 11. The Sub-Graph Match Query
      • Find all instances in graph G of query Q.
      • Vertices have attributes (e.g. age:31).
      • Vertices and edges have constraints (e.g. age<40).
      • Edges have relationship labels.
      Notation (figure legend): query vertex, graph vertex, a matched graph vertex.
      Why Sub-Graph Match (Exact Graph Isomorphism)?
      • A popular and expressive graph query, useful for mining patterns.
      • To our knowledge, no large-scale algorithm exists that operates on a billion-vertex graph.
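To pin down what the query asks for, here is a brute-force reference matcher. It is not the WOOster algorithm and is exponential in the query size; it only illustrates the semantics on a toy graph. The data structures, the directed-edge convention and the names `subgraph_matches`, `q_vertices` and `g_edges` are assumptions made for this sketch.

```python
from itertools import permutations

def subgraph_matches(q_vertices, q_edges, g_attrs, g_edges):
    # q_vertices: {query vertex: predicate over an attribute dict}
    # q_edges:    set of (query source, label, query target)
    # g_attrs:    {graph vertex: attribute dict}
    # g_edges:    set of (graph source, label, graph target)
    names = list(q_vertices)
    # Try every injective assignment of graph vertices to query vertices.
    for combo in permutations(g_attrs, len(names)):
        mapping = dict(zip(names, combo))
        vertices_ok = all(q_vertices[q](g_attrs[mapping[q]]) for q in names)
        edges_ok = all((mapping[a], label, mapping[b]) in g_edges
                       for a, label, b in q_edges)
        if vertices_ok and edges_ok:
            yield mapping

# Hypothetical query: a vertex with age < 40 that has a "friend" edge to any vertex.
q_vertices = {
    "q1": lambda attrs: "age" in attrs and attrs["age"] < 40,
    "q2": lambda attrs: True,
}
q_edges = {("q1", "friend", "q2")}

# Hypothetical data graph.
g_attrs = {"g1": {"age": 31}, "g2": {"age": 55}, "g3": {"age": 25}}
g_edges = {("g1", "friend", "g2"), ("g3", "colleague", "g1")}

matches = list(subgraph_matches(q_vertices, q_edges, g_attrs, g_edges))
print(matches)
```

Trying every assignment is what makes this approach infeasible at billion-vertex scale, which is the gap the Map-Reduce formulation targets.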
  • 12. Overview of the Solution
      Step-0. Data Layout on HDFS
      Step-1. Query Graph Partitioning
      Step-2. Edge Selection
      Step-3. Query Partition Matching
      Step-4. Query Partition Merging
  • 13. Data Layout on HDFS
      • How to store a large-scale graph?
      • An adjacency-list-like solution: each row/line has information about one vertex, namely the vertex attributes, the vertex neighbors, and the labels associated with each edge.
      Implications:
      • Enables early pruning of non-matching edges and vertices.
      • Each vertex has information about itself and its immediate neighbors only.
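A minimal sketch of such a one-line-per-vertex record follows. The slides do not give the on-disk format, so the encoding used here (vertex id, a tab, then a JSON payload with `attrs` and `nbrs` fields) is purely an assumption, as are the helper names `encode_vertex` and `decode_vertex`.

```python
import json

def encode_vertex(vertex_id, attrs, neighbors):
    # neighbors: list of (neighbor id, edge label) pairs.
    # Assumed format: "<id>\t<json payload>", one vertex per line.
    return vertex_id + "\t" + json.dumps({"attrs": attrs, "nbrs": neighbors})

def decode_vertex(line):
    vertex_id, payload = line.split("\t", 1)
    record = json.loads(payload)
    # JSON turns tuples into lists; convert the pairs back.
    return vertex_id, record["attrs"], [tuple(n) for n in record["nbrs"]]

line = encode_vertex("g1", {"age": 31}, [("g2", "friend"), ("g3", "colleague")])
vertex_id, attrs, nbrs = decode_vertex(line)

# Early pruning: a mapper can test a vertex constraint such as age < 40
# from this one line alone, without reading any other vertex.
keep = "age" in attrs and attrs["age"] < 40
print(vertex_id, attrs, nbrs, keep)
```

Because each line is self-contained, a mapper can discard a non-matching vertex before anything is shuffled. Constraints on a neighbor's attributes, however, can only be checked by the mapper that reads the neighbor's own line.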
  • 14. Step-1: Query Graph Partitioning
      • Why? Parallelized solving of independent sub-problems.
      • How? Find the minimum number of partitions such that the diameter of each partition = 2.
      Intuition (figure: pivot vertices):
      • In a spanning tree of diameter 2, there is one vertex that is connected to all other vertices: the pivot vertex.
      • This property is used in steps 2 and 3.
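One way to picture the partitioning is as a cover of the query edges by stars, each star being a pivot plus its leaves. The sketch below uses a simple greedy heuristic; it is not taken from the talk or the paper and does not guarantee the minimum number of partitions. It assumes an undirected query without self-loops, and `star_partitions` is a hypothetical name.

```python
def star_partitions(edges):
    # edges: iterable of (u, v) undirected query edges, no self-loops.
    uncovered = {frozenset(edge) for edge in edges}
    partitions = []
    while uncovered:
        # Count how many uncovered edges touch each vertex.
        degree = {}
        for edge in uncovered:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        # Greedy choice: the vertex covering the most uncovered edges
        # becomes the next pivot (ties broken by name for determinism).
        pivot = max(sorted(degree), key=lambda vertex: degree[vertex])
        leaves = sorted(next(iter(edge - {pivot}))
                        for edge in uncovered if pivot in edge)
        partitions.append((pivot, leaves))
        uncovered = {edge for edge in uncovered if pivot not in edge}
    return partitions

# Hypothetical query graph with two natural pivots, q1 and q4.
query_edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q4", "q5"), ("q4", "q6")]
parts = star_partitions(query_edges)
print(parts)
```

Each resulting partition is a tree in which every leaf is one hop from the pivot, so any two vertices are at most two hops apart.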
  • 15. Step-2: Edge Selection
      • What: Select a subset of edges from G that match at least one edge in Q.
      • How (figure: Map Logic → Reduce Logic, over graph vertices g1 to g4):
      1. g1: current vertex in the mapper.
      2a. For every neighbor of q1, there exists a corresponding neighbor for g1.
      2b. Mapper emits all edges if the vertex and edge constraints are met.
      3. g1-g2 emitted: g1 mapped to a query vertex.
      4. g1-g2 emitted from g2’s mapper.
      5. Reducer emits an edge if a pair is found.
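The pairing idea in this step can be sketched as a local simulation. The details are assumptions, not the paper's algorithm: query edges are treated as undirected, the "corresponding neighbor" test looks at edge labels only, and the names `QUERY_VERTICES`, `QUERY_EDGES`, `GRAPH` and `incident` are hypothetical.

```python
from collections import defaultdict

# Hypothetical query: q1 (age < 40) --friend-- q2 (no constraint).
QUERY_VERTICES = {
    "q1": lambda attrs: "age" in attrs and attrs["age"] < 40,
    "q2": lambda attrs: True,
}
QUERY_EDGES = [("q1", "friend", "q2")]  # treated as undirected in this sketch

# Hypothetical graph: vertex -> (attributes, [(neighbor, edge label)]).
GRAPH = {
    "g1": ({"age": 31}, [("g2", "friend"), ("g3", "colleague")]),
    "g2": ({"age": 55}, [("g1", "friend")]),
    "g3": ({"age": 25}, [("g1", "colleague")]),
}

def incident(q):
    # Query edges touching q, as (label, other query vertex) pairs.
    pairs = []
    for a, label, b in QUERY_EDGES:
        if a == q:
            pairs.append((label, b))
        if b == q:
            pairs.append((label, a))
    return pairs

def mapper(g, attrs, nbrs):
    for q, predicate in QUERY_VERTICES.items():
        if not predicate(attrs):
            continue  # vertex constraint not met
        needed = incident(q)
        labels_here = {label for _, label in nbrs}
        if any(label not in labels_here for label, _ in needed):
            continue  # some query neighbor has no counterpart around g
        for label, q_other in needed:
            for n, n_label in nbrs:
                if n_label == label:
                    # Key on the undirected edge so both endpoints meet.
                    yield (min(g, n), max(g, n), label), (g, q, q_other)

def reducer(key, values):
    a, b, label = key
    claims = set(values)
    for g, q, q_other in sorted(claims):
        other = b if g == a else a
        # Keep the edge only if the other endpoint claimed the mirrored role.
        if g < other and (other, q_other, q) in claims:
            yield (g, q), label, (other, q_other)

groups = defaultdict(list)
for g, (attrs, nbrs) in GRAPH.items():
    for key, value in mapper(g, attrs, nbrs):
        groups[key].append(value)

selected = [edge for key in sorted(groups) for edge in reducer(key, groups[key])]
print(selected)
```

Each mapper sees only its own vertex line, so the constraint on the far endpoint can only be confirmed when both endpoints' claims meet in the same reducer. That is why an edge survives only when a pair is found.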
  • 16. Step-3: Query Partition Matching
      Edge Selection output:
      • Associates a graph vertex with the possible query vertices it could map to.
      • Associates the graph vertex with its “pivot” graph vertex.
      • A pivot graph vertex is a graph vertex that is mapped to a pivot query vertex: g1 in this example (figure edges: g1-g2, g1-g3, g1-g4).
      Map and reduce logic:
      1. Mapper emits the pivot graph vertex as key and the edge as value.
      2. Reducer receives all edges with the same pivot graph vertex.
      3. Reducer forms the partition.
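The grouping by pivot can be sketched as follows, again as a local simulation under assumptions. The input format (pairs of `(graph vertex, query vertex)` endpoints, with edge labels omitted for brevity), the `PARTITION` constant and the rule that a match must use distinct graph vertices are choices made for this example rather than details from the talk.

```python
from collections import defaultdict
from itertools import product

# Hypothetical query partition: pivot q1 with leaves q2, q3, q4.
PARTITION = ("q1", ["q2", "q3", "q4"])

# Hypothetical edge-selection output: ((graph vertex, query vertex), (graph vertex, query vertex)).
SELECTED_EDGES = [
    (("g1", "q1"), ("g2", "q2")),
    (("g1", "q1"), ("g3", "q3")),
    (("g1", "q1"), ("g4", "q4")),
    (("g5", "q1"), ("g2", "q2")),  # g5 lacks q3 and q4 neighbors
]

def mapper(edge):
    (g_a, q_a), (g_b, q_b) = edge
    pivot, leaves = PARTITION
    # Key: the graph vertex playing the pivot role. Value: the leaf it reaches.
    if q_a == pivot and q_b in leaves:
        yield g_a, (q_b, g_b)
    if q_b == pivot and q_a in leaves:
        yield g_b, (q_a, g_a)

def reducer(pivot_vertex, values):
    pivot, leaves = PARTITION
    candidates = defaultdict(set)
    for q_leaf, g_leaf in values:
        candidates[q_leaf].add(g_leaf)
    if any(not candidates[q] for q in leaves):
        return  # some leaf has no candidate: no partition match here
    for combo in product(*(sorted(candidates[q]) for q in leaves)):
        # Require distinct graph vertices within one match.
        if len(set(combo)) == len(combo) and pivot_vertex not in combo:
            yield {pivot: pivot_vertex, **dict(zip(leaves, combo))}

groups = defaultdict(list)
for edge in SELECTED_EDGES:
    for key, value in mapper(edge):
        groups[key].append(value)

partition_matches = [m for key in sorted(groups) for m in reducer(key, groups[key])]
print(partition_matches)
```

Since all edges of a diameter-2 partition touch the pivot, one reducer call per pivot graph vertex sees everything needed to assemble that partition's matches.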
  • 17. Step-4: Query Partition Merging
      • Merges partitions one after another to form a query match.
      • More details in the paper.
      Take-away from Steps 1-4 (also for any scalable Map-Reduce program):
      • The mapper/reducer keys are chosen such that the number of keys is proportional to the number of matches of query Q in the graph.
      • Hence the algorithm scales well for large graphs and complex queries.
  • 18. Results
      Figure: time (sec, axis from 0 to 160) against number of reducers (100, 150, 200, 250) for Edge Selection, Query Partition Matching and Query Partition Merging.
      • Graph of 10 million vertices and 50 million edges.
      • Complex query of 24 vertices.
      • Note that the edge selection time reduces with an increasing number of reducers.
  • 19. In the paper…
      • Detailed map-reduce algorithms for sub-graph match and reachability
      • Theoretical analysis for scalability
      • Construction of the synthetic dataset
      • Methodology and more experiments
      • Reachability query: examples, map-reduce algorithm
      • Related work
  • 20. Future Work
      • Indexing structure for graphs suited for M-R jobs.
      • Compare with a Giraph-based approach.
      • Better batching strategies.
      • The right interface for plugging in custom graph algorithms while WOOster provides automatic batching.
      • More graph mining algorithms implemented.
  • 21. Questions / Comments