Hadoop and Graph Data Management:
   Challenges and Opportunities

                 Daniel Abadi
     Assistant Professor at Yale University
           Chief Scientist at Hadapt
            Tuesday, November 8th

      (Collaboration with Jiewen Huang)
Two Major Trends
 Hadoop
   Becoming de facto standard for large scale data
    processing
   Becoming more than just MapReduce
   Ecosystem growing rapidly --- lots of great tools
    around it
 Graph data is proliferating; huge value if analyzed
 and processed
     Social graphs
     Linked data
     Telecom
     Advertising
     Defense
Use Hadoop to Process Graph Data?
 Of course!
 BUT:
   Little automatic optimization (non-declarative
    interface)
   It’s possible to do graphs with Hadoop:
     Really badly
     Suboptimally
     Really well
   Case in point: subgraph pattern-matching
   benchmark on three different Hadoop-centered
   solutions:
     ~1000s,
     ~100s,
     < ~10s
Case study: Linked Data
Example Linked Data
 Entities (or “resources”)
  are nodes in the graph
 Relationships between
  entities are labeled
  directed edges in the
  graph
 (Resources typically have
  unique URI identifiers to
  allow for global
  references --- not shown
  in example to the right)
 Any resource can
  connect to any other
  resource
Graph Representation
 Linked data graph
  can be parsed
  into a series of
  vertex-edge-
  vertex triples
 First entity
  referred to as the
  subject; second
  as the object;
  edge connecting
  them as the
  predicate
 We will call these
  “RDF triples”
Querying Linked Data
 Linked Data is typically queried in SPARQL
 Processing a SPARQL query is basically
 equivalent to the subgraph pattern matching
 problem
Scalable SPARQL Processing
 Single-node options are abundant, e.g.,
     Sesame
     Jena
     RDF-3X
     3store
 Fewer options that can scale across multiple
 machines
   This is where Hadoop comes in!
     One cool solution: SHARD (presented by Kurt Rohloff at
      HadoopWorld 2010)
       Uses HDFS to store graph, MapReduce jobs for
        subgraph pattern matching
       Much nicer than a naïve Hadoop solution
Another Example: Twitter Data
[Figure: example Twitter graph. Vertices: @joe_hellerstein, @daniel_abadi, @mikeolson, @hadoop_is_my_life, @super_hadooper, @hadoop_is_the_answer. Edges are labeled "follows" or "retweeted".]
Example Query over Twitter Graph

Who has retweeted both @daniel_abadi and @mikeolson?

[Figure: query pattern in which an unknown vertex (???) has "retweeted" edges to both @daniel_abadi and @mikeolson.]
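The query above can be sketched as a tiny subgraph pattern match over triples. This is only an illustration: the edge list is assumed (the figure's exact edges are not recoverable from the text), and `retweeters_of` is a hypothetical helper name.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def retweeters_of(triples: Iterable[Triple], target: str) -> Set[str]:
    """Return every subject that has a 'retweeted' edge pointing at `target`."""
    return {s for s, p, o in triples if p == "retweeted" and o == target}


# Assumed sample edges; the real slide figure may differ.
triples = [
    ("@hadoop_is_my_life", "retweeted", "@daniel_abadi"),
    ("@super_hadooper", "retweeted", "@daniel_abadi"),
    ("@super_hadooper", "retweeted", "@mikeolson"),
    ("@hadoop_is_the_answer", "retweeted", "@mikeolson"),
    ("@daniel_abadi", "follows", "@joe_hellerstein"),
]

# The two-clause pattern is a join on the shared unknown vertex,
# which here reduces to a set intersection.
both = retweeters_of(triples, "@daniel_abadi") & retweeters_of(triples, "@mikeolson")
print(sorted(both))  # ['@super_hadooper']
```

Each clause of the pattern produces a candidate set for the unknown vertex, and the join keeps only the vertices that satisfy every clause.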
Issues With Hadoop Defaults
 Hadoop does not default to pipelined algorithms
 between MapReduce jobs
   SHARD algorithm matches each clause of SPARQL
   subgraph in a separate MapReduce job
     Requires a full barrier synchronization between jobs, which is
      unnecessary
 Hadoop does not default to pipelined algorithms
 between Map and Reduce
   Each job performs a join by vertex, where the object
    of one triple is joined with the subject of another
    triple
   Joins work much faster if you can pipeline between
    Map and Reduce (e.g. pipelined hash join)
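To illustrate what a pipelined hash join buys, here is a minimal single-process sketch of a symmetric hash join that matches the object of one triple with the subject of another. It is a generic textbook technique, not the code used by Hadoop, SHARD, or the system in this talk; the function name and the sample triples are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def symmetric_hash_join(
    stream: Iterable[Tuple[str, Triple]],
) -> Iterator[Tuple[Triple, Triple]]:
    """Join left.object == right.subject, emitting matches as inputs arrive.

    `stream` yields ("L", triple) or ("R", triple) in arrival order, so
    results appear without waiting for either input to finish.
    """
    left_by_object: Dict[str, List[Triple]] = defaultdict(list)
    right_by_subject: Dict[str, List[Triple]] = defaultdict(list)
    for side, triple in stream:
        if side == "L":
            key = triple[2]
            left_by_object[key].append(triple)
            # Probe the other side's table with the newly arrived tuple.
            for match in right_by_subject.get(key, []):
                yield (triple, match)
        else:
            key = triple[0]
            right_by_subject[key].append(triple)
            for match in left_by_object.get(key, []):
                yield (match, triple)


# Assumed sample input, interleaved to mimic tuples arriving over time.
arrivals = [
    ("L", ("player_1", "playsFor", "club_A")),
    ("R", ("club_A", "region", "region_X")),
    ("L", ("player_2", "playsFor", "club_A")),
    ("R", ("club_B", "region", "region_Y")),
]
results = list(symmetric_hash_join(arrivals))
print(len(results))  # 2
```

The first match is produced as soon as the second tuple arrives, whereas a blocking join would wait for one whole input to be materialized first.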
Issues with Hadoop Defaults (cont.)
 Hadoop hash partitions data across nodes
   Data in graph is randomly partitioned across nodes
   Data close in the graph could be on a physically
    distant node
   Therefore each join requires a complete
    redistribution of the entire data set across the
    network (and there are many joins)
 Hadoop defaults to replicating all data 3 times
  with no preferences for what is replicated or
  where it is replicated to
 Hadoop defaults to using the Hadoop Distributed
  File System (HDFS) for storage, which is not
  optimized for graph data
All is not lost! Don’t throw away Hadoop!
 All we have to do is change the defaults and add
 to it a little
System Architecture
Partitioning
 Graphs can be represented as vertex1-edge-
  vertex2 triples
 Hash partitioning by vertex1 is straightforward
 Great for queries like:

Query: Find the names of the strikers that play for FC Barcelona.

SELECT ?name
WHERE { ?player type        footballer      .
          ?player name      ?name           .
          ?player position striker          .
          ?player playsFor FC_Barcelona . }
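A rough sketch of why hash partitioning by vertex1 suits this star-shaped query: every triple about a given player lands on the same machine, so each machine can answer for its own players with no cross-machine join. The hash function (`zlib.crc32`), the helper names, and the sample data are assumptions for illustration, and each subject is assumed to have at most one value per predicate.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def partition_by_subject(
    triples: Iterable[Triple], num_machines: int
) -> Dict[int, List[Triple]]:
    """Assign each triple to a machine by hashing its subject (vertex1)."""
    parts: Dict[int, List[Triple]] = defaultdict(list)
    for triple in triples:
        machine = zlib.crc32(triple[0].encode("utf-8")) % num_machines
        parts[machine].append(triple)
    return dict(parts)


def local_strikers(part: List[Triple], club: str) -> List[str]:
    """Evaluate the star query on one partition only."""
    by_subject: Dict[str, Dict[str, str]] = defaultdict(dict)
    for s, p, o in part:
        by_subject[s][p] = o  # assumes one value per (subject, predicate)
    return sorted(
        props["name"]
        for props in by_subject.values()
        if "name" in props
        and props.get("type") == "footballer"
        and props.get("position") == "striker"
        and props.get("playsFor") == club
    )


# Assumed sample data.
triples = [
    ("player_1", "type", "footballer"), ("player_1", "name", "A. Striker"),
    ("player_1", "position", "striker"), ("player_1", "playsFor", "FC_Barcelona"),
    ("player_2", "type", "footballer"), ("player_2", "name", "B. Keeper"),
    ("player_2", "position", "goalkeeper"), ("player_2", "playsFor", "FC_Barcelona"),
    ("player_3", "type", "footballer"), ("player_3", "name", "C. Forward"),
    ("player_3", "position", "striker"), ("player_3", "playsFor", "Other_FC"),
]
parts = partition_by_subject(triples, 3)
names = sorted(n for part in parts.values() for n in local_strikers(part, "FC_Barcelona"))
print(names)  # ['A. Striker']
```

The final answer is just the union of the per-machine results, which is what makes this query shape cheap under subject hashing.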
The problem with hash partitioning
 …

Find football players playing for clubs in a populous region where they were born.
Graph Partitioning
 Data close together in the graph should be
  physically close together (preferably on the same
  machine)
  Subgraph pattern matching can be done without
   requiring huge amounts of communication via joins
   across the cluster
Graph Partitioning
[Figure: example graph partitioned across Machine 1, Machine 2, and Machine 3.]
Edge/Triple Placement
●   Minimizing data exchange
    ●   Allowing data overlap
●   N-hop guarantee
    ●   The extent of data overlap
    ●   If a vertex is assigned to a machine, any
        vertex that is within n hops of this vertex is
        also stored on this machine
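The n-hop guarantee can be sketched as a breadth-first expansion from each machine's assigned vertices. This is a simplified illustration under stated assumptions: hops are counted over undirected edges, only vertex placement is tracked (not the triples themselves), and `n_hop_placement` and the sample chain graph are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def n_hop_placement(
    triples: Iterable[Triple], owner: Dict[str, int], n: int
) -> Dict[int, Set[str]]:
    """Return, per machine, its assigned vertices plus all vertices within n hops."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for s, _, o in triples:
        neighbors[s].add(o)
        neighbors[o].add(s)  # assumption: hops ignore edge direction

    core: Dict[int, Set[str]] = defaultdict(set)
    for vertex, machine in owner.items():
        core[machine].add(vertex)

    placement: Dict[int, Set[str]] = {}
    for machine, assigned in core.items():
        seen = set(assigned)
        frontier = deque((v, 0) for v in assigned)
        while frontier:
            vertex, dist = frontier.popleft()
            if dist == n:
                continue  # do not expand past the guarantee
            for nxt in neighbors[vertex]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
        placement[machine] = seen
    return placement


# Assumed chain graph a - b - c - d, split across two machines.
chain = [("a", "e", "b"), ("b", "e", "c"), ("c", "e", "d")]
owner = {"a": 0, "b": 0, "c": 1, "d": 1}
for hops in (0, 1, 2):
    print(hops, {m: sorted(vs) for m, vs in n_hop_placement(chain, owner, hops).items()})
```

With 0 hops there is no overlap, with 1 hop each machine also holds the boundary vertex of the other, and with 2 hops both machines hold the whole chain, which shows how the overlap grows with n.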
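0 Hop of Machine 1
[Figure: graph across Machine 1, Machine 2, and Machine 3.]
1 Hop of Machine 1
[Figure: graph across Machine 1, Machine 2, and Machine 3.]
2 Hop of Machine 1
[Figure: graph across Machine 1, Machine 2, and Machine 3.]
0 Hop of Machine 3
[Figure: graph across Machine 1, Machine 2, and Machine 3.]
1 Hop of Machine 3
[Figure: graph across Machine 1, Machine 2, and Machine 3.]
2 Hop of Machine 3
[Figure: graph across Machine 1, Machine 2, and Machine 3.]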
High Degree Vertexes
●   Problem: High-degree vertexes make the
    graph well-connected and difficult to
    partition
●   Solution: Ignore them during graph
    partitioning

●   Problem: High-degree vertexes cause
    data explosion with an n-hop guarantee
●   Solution: Selectively weaken the n-hop
    guarantee
Query Processing
  ●   Query execution is more efficient if
      pushed to optimized storage (RDF-
      stores)
      ●   Minimizing the number of Hadoop jobs
      ●   The larger the hop guarantee, the more
          work is done in RDF-stores
To Exchange, or not to Exchange?
  ●   Given a query and n-hop guarantee, is
      data exchange (Hadoop job) between
      nodes needed?
      ●   Choose the “center” of the query graph
      ●   Calculate the distance from the “center” to
          the furthest edge
      ●   If distance > n, data exchange is needed;
          not needed otherwise
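One plausible reading of the test above, sketched in code: treat the query as an undirected graph, define the distance from a vertex to an edge as one more than the distance to the edge's nearer endpoint, and pick as the "center" the vertex whose furthest edge is closest. These definitions, the function names, and the shape of the second sample query are assumptions; the query graph is assumed to be connected.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (subject, object) of one query clause


def distance_to_furthest_edge(edges: List[Edge], start: str) -> int:
    """BFS from `start`; an edge's distance is 1 + distance to its nearer endpoint."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for nxt in neighbors[vertex]:
            if nxt not in dist:
                dist[nxt] = dist[vertex] + 1
                queue.append(nxt)
    return max(min(dist[a], dist[b]) + 1 for a, b in edges)


def needs_exchange(edges: List[Edge], n: int) -> bool:
    """True if even the best 'center' has an edge further away than n hops."""
    vertices = {v for edge in edges for v in edge}
    center_distance = min(distance_to_furthest_edge(edges, v) for v in vertices)
    return center_distance > n


# Star query (strikers at FC Barcelona): every clause touches ?player.
star = [("?player", "footballer"), ("?player", "?name"),
        ("?player", "striker"), ("?player", "FC_Barcelona")]

# Assumed shape for the "born in the club's region" query.
region = [("?player", "?club"), ("?club", "?region"),
          ("?player", "?region"), ("?region", "?population")]

print(needs_exchange(star, 1))    # False
print(needs_exchange(region, 1))  # True
print(needs_exchange(region, 2))  # False
```

Under these assumptions the star query runs entirely inside the per-machine stores with a 1-hop guarantee, while the region query needs either a 2-hop guarantee or a data exchange step.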
Data Exchange

Find football players playing for clubs in a populous region where they were born.
Experimental Setup
●   20-machine cluster
●   Lehigh University Benchmark (LUBM):
    270 million triples
●   Things to compare:
    ●   Single-node RDF-3X
    ●   SHARD
    ●   Graph partitioning (the proposed system)
    ●   Hash partitioning on subjects
Performance Comparison
Speedup
●   Better than linear speedup
Analysis
 Difference between fastest implementation and
 slowest implementation was a factor of 1340!
   Using Hadoop does not mean that performance is
   fixed
 More improvements are possible
   Experiments used MapReduce whenever data
   communication was necessary
     NextGen Hadoop allows other programming paradigms
     besides MR
      MPI is a good candidate

   Still need to fix the data pipelining problem
  Factor of 1340 possible via focusing on storage ---
  similar in theme to Hadapt
How this fits with Hadapt
[Figure: architecture diagram. Layers: Flexible Query Interface (Full SQL Support, MR, JDBC); Hadoop with Split Query Execution (Patent Pending); Hadapt Storage Engine (Relational + HDFS).]
  Full SQL interface, MapReduce, and JDBC Connector
  10x-50x faster than Hadoop and Hive
    Queries go from hours to minutes, and minutes to seconds
  Analytics across structured and unstructured data in one platform
  3.5 Patents Pending
  $9.5 Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners
Optimized Storage Matters
 HDFS appropriate for unstructured data
 Relational storage appropriate for relational data
 Graph storage appropriate for graph data
 Hadapt allows for pluggable storage inside
 Hadoop (amongst other things)




 Bottom line: Hadoop can be used for scalable
 graph processing, but it might need some
 Hadapting ;)

Hadoop and Graph Data Management: Challenges and Opportunities

  • 1. Hadoop and Graph Data Management: Challenges and Opportunities Daniel Abadi Assistant Professor at Yale University Chief Scientist at Hadapt Tuesday, November 8th (Collaboration with Jiewen Huang)
  • 2. Two Major Trends  Hadoop  Becoming de facto standard for large scale data processing  Becoming more than just MapReduce  Ecosystem growing rapidly --- lot’s of great tools around it  Graph data is proliferating; huge value if analyzed and processed  Social graphs  Linked data  Telecom  Advertising  Defense
  • 3. Use Hadoop to Process Graph Data?  Of course!  BUT:  Little automatic optimization (non-declarative interface)  It’s possible to do graphs with Hadoop:  Really badly  Suboptimally  Really well  Case in point: subgraph pattern-matching benchmark on three different Hadoop-centered solutions:  ~1000s,  ~100s,  < ~10s
  • 5. Example Linked Data ● Entities (or “resources”) are nodes in the graph ● Relationships between entities are labeled directed edges in the graph ● (Resources typically have unique URI identifiers to allow for global references --- not shown in example to the right) ● Any resource can connect to any other resource
  • 6. Graph Representation ● Linked data graph can be parsed into a series of vertex-edge-vertex triples ● First vertex referred to as the subject; second as the object; edge connecting them as the predicate ● We will call these “RDF triples”
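A minimal sketch of the triple representation described on this slide. The `Triple` type, the `out_edges` helper, and the sample resources are hypothetical and serve only as illustration; real linked data would use URIs as identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One labeled, directed edge of the graph."""
    subject: str    # vertex the edge starts from
    predicate: str  # edge label
    obj: str        # vertex the edge points to


# Hypothetical sample graph, flattened into triples.
triples = [
    Triple("player_1", "playsFor", "club_1"),
    Triple("player_1", "position", "striker"),
    Triple("club_1", "region", "region_1"),
]


def out_edges(triples, subject):
    """Return (predicate, object) pairs for every edge leaving `subject`."""
    return [(t.predicate, t.obj) for t in triples if t.subject == subject]
```

Because each triple carries both endpoints and the edge label, the whole graph can be stored as a flat list and scanned or joined like a three-column table.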
  • 7. Querying Linked Data ● Linked Data is typically queried in SPARQL ● Processing a SPARQL query is basically equivalent to the subgraph pattern-matching problem
  • 8. Scalable SPARQL Processing ● Single-node options are abundant, e.g., ● Sesame ● Jena ● RDF-3X ● 3store ● Fewer options that can scale across multiple machines ● This is where Hadoop comes in! ● One cool solution: SHARD (presented by Kurt Rohloff at HadoopWorld 2010) ● Uses HDFS to store graph, MapReduce jobs for subgraph pattern matching ● Much nicer than a naïve Hadoop solution
  • 9. Another Example: Twitter Data [Figure: graph over the accounts @joe_hellerstein, @daniel_abadi, @mikeolson, @hadoop_is_my_life, @super_hadooper, and @hadoop_is_the_answer, with edges labeled “follows” and “retweeted”]
  • 10. Example Query over Twitter Graph ● Who has retweeted both @daniel_abadi and @mikeolson? [Figure: query graph in which an unknown vertex ??? has “retweeted” edges to both @daniel_abadi and @mikeolson]
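The query on this slide is a small subgraph pattern: one unknown vertex with two "retweeted" edges. A rough in-memory sketch follows. The edge list is hypothetical (the figure's actual edges are not recoverable from the transcript), and the helper names are made up for illustration.

```python
# Hypothetical "retweeted" edges as (subject, predicate, object) triples.
triples = [
    ("@hadoop_is_my_life", "retweeted", "@daniel_abadi"),
    ("@hadoop_is_my_life", "retweeted", "@mikeolson"),
    ("@super_hadooper", "retweeted", "@daniel_abadi"),
    ("@hadoop_is_the_answer", "retweeted", "@mikeolson"),
    ("@daniel_abadi", "follows", "@joe_hellerstein"),
]


def retweeters_of(triples, target):
    """Accounts that have a 'retweeted' edge pointing at `target`."""
    return {s for s, p, o in triples if p == "retweeted" and o == target}


def retweeted_both(triples, first, second):
    """Match the pattern: ?x retweeted first . ?x retweeted second ."""
    return retweeters_of(triples, first) & retweeters_of(triples, second)
```

Each clause of the pattern yields a set of candidate bindings for the unknown vertex, and the set intersection plays the role of the join between clauses.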
  • 11. Issues With Hadoop Defaults ● Hadoop does not default to pipelined algorithms between MapReduce jobs ● SHARD algorithm matches each clause of SPARQL subgraph in a separate MapReduce job ● Requires full barrier synchronization between jobs, which is unnecessary ● Hadoop does not default to pipelined algorithms between Map and Reduce ● Each job performs a join by vertex, where the object of one triple is joined with the subject of another triple ● Joins work much faster if you can pipeline between Map and Reduce (e.g. pipelined hash join)
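To illustrate what "pipelined" means for the object-to-subject join mentioned here, the following is a generic in-memory sketch of a symmetric (pipelined) hash join. It is not Hadoop or SHARD code; the tagged-stream input format and the function name are assumptions made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(stream):
    """Join left.object == right.subject, emitting matches as tuples arrive.

    `stream` yields ("L", triple) or ("R", triple) in arrival order
    (an assumed input format). No barrier is needed: each arriving triple
    is inserted into its own hash table and immediately probed against the
    other side's table.
    """
    left_by_object = defaultdict(list)
    right_by_subject = defaultdict(list)
    for side, triple in stream:
        subject, _predicate, obj = triple
        if side == "L":
            left_by_object[obj].append(triple)
            for right in right_by_subject.get(obj, ()):
                yield triple, right
        else:
            right_by_subject[subject].append(triple)
            for left in left_by_object.get(subject, ()):
                yield left, triple
```

A matching pair is produced exactly once, at the moment the later of its two triples arrives, so results start flowing before either input has been read completely.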
  • 12. Issues with Hadoop Defaults (cont.) ● Hadoop hash partitions data across nodes ● Data in graph is randomly partitioned across nodes ● Data close in the graph could be on a physically distant node ● Therefore each join requires a complete redistribution of the entire data set across the network (and there are many joins) ● Hadoop defaults to replicating all data 3 times with no preferences for what is replicated or where it is replicated to ● Hadoop defaults to using the Hadoop Distributed File System (HDFS) for storage, which is not optimized for graph data
  • 13. All is not lost! Don’t throw away Hadoop! ● All we have to do is change the defaults and add to it a little
  • 15. Partitioning ● Graphs can be represented as vertex1-edge-vertex2 triples ● Hash partitioning by vertex1 is straightforward ● Great for queries like: Query: Find the names of the strikers that play for FC Barcelona. SELECT ?name WHERE { ?player type footballer . ?player name ?name . ?player position striker . ?player playsFor FC_Barcelona . }
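A small sketch of hash partitioning by vertex1 (the subject), under the assumption of three machines and made-up sample triples. It shows why star-shaped queries such as the one above work well: every triple with the same subject lands on the same machine, so all clauses about `?player` can be matched locally.

```python
import zlib


def machine_for(vertex, num_machines):
    """Pick a machine from a hash of the vertex name.

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(vertex.encode("utf-8")) % num_machines


def hash_partition(triples, num_machines):
    """Assign each (subject, predicate, object) triple by its subject."""
    partitions = [[] for _ in range(num_machines)]
    for triple in triples:
        partitions[machine_for(triple[0], num_machines)].append(triple)
    return partitions


# Hypothetical sample data.
triples = [
    ("player_1", "type", "footballer"),
    ("player_1", "name", "name_1"),
    ("player_1", "position", "striker"),
    ("player_1", "playsFor", "FC_Barcelona"),
    ("player_2", "type", "footballer"),
    ("player_2", "playsFor", "club_2"),
    ("FC_Barcelona", "region", "region_1"),
]
partitions = hash_partition(triples, 3)
```

The last sample triple hints at the limitation: the triples describing `FC_Barcelona` are placed by their own subject hash, so a query that follows an edge from a player to a club may have to cross machines.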
  • 16. The problem with hash partitioning … Find football players who play for a club in the populous region where they were born.
  • 17. Graph Partitioning ● Data close together in the graph should be physically close together (preferably on the same machine) ● Subgraph pattern matching can be done without requiring huge amounts of communication via joins across the cluster
  • 19. Graph Partitioning Machine 1 Machine 2 Machine 3
  • 20. Edge/Triple Placement ● Minimizing data exchange ● Allowing data overlap ● N-hop guarantee ● The extent of data overlap ● If a vertex is assigned to a machine, any vertex that is within n hops of this vertex is also stored on this machine
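The n-hop guarantee can be sketched as a bounded breadth-first search from the vertices a machine owns. This is an illustration only: it treats edges as undirected (an assumption; the slides do not say whether direction matters), and the function name and sample chain are hypothetical.

```python
from collections import defaultdict, deque


def n_hop_vertices(triples, core, n):
    """Vertices a machine must store under an n-hop guarantee.

    `core` is the set of vertices assigned to the machine by the
    partitioner. Edges are followed in both directions (assumption).
    """
    neighbors = defaultdict(set)
    for subject, _predicate, obj in triples:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)

    distance = {vertex: 0 for vertex in core}
    queue = deque(core)
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == n:
            continue  # do not expand past n hops
        for other in neighbors[vertex]:
            if other not in distance:
                distance[other] = distance[vertex] + 1
                queue.append(other)
    return set(distance)


# Hypothetical chain a - b - c - d.
chain = [("a", "p", "b"), ("b", "p", "c"), ("c", "p", "d")]
```

With `core = {"a"}`, raising n from 0 to 2 grows the stored set from the core alone to the core plus its neighbors and their neighbors, which is the overlap the hop-by-hop figures on the following slides depict.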
  • 21. 0 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  • 22. 1 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  • 23. 2 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  • 24. 0 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  • 25. 1 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  • 26. 2 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  • 27. High Degree Vertexes ● Problem: High-degree vertexes make the graph well-connected and difficult to partition ● Solution: Ignore them during graph partitioning ● Problem: High-degree vertexes cause data explosion with an n-hop guarantee ● Solution: Selectively weaken the n-hop guarantee
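One way to picture "ignore them during graph partitioning" is to filter out edges that touch very high-degree vertexes before handing the graph to the partitioner. The sketch below assumes a fixed degree cutoff (`max_degree`), which is a made-up parameter; the slides do not say how high-degree vertexes are identified.

```python
from collections import Counter


def vertex_degrees(triples):
    """Count how many edges touch each vertex (in either direction)."""
    degree = Counter()
    for subject, _predicate, obj in triples:
        degree[subject] += 1
        degree[obj] += 1
    return degree


def drop_high_degree(triples, max_degree):
    """Keep only edges whose endpoints both have degree <= max_degree."""
    degree = vertex_degrees(triples)
    return [
        t for t in triples
        if degree[t[0]] <= max_degree and degree[t[2]] <= max_degree
    ]


# Hypothetical data: "footballer" is a hub that every player points to.
triples = [
    ("player_1", "type", "footballer"),
    ("player_2", "type", "footballer"),
    ("player_3", "type", "footballer"),
    ("player_1", "playsFor", "club_1"),
]
```

Here the `type` edges would otherwise tie every player to a single hub and make any cut of the graph look expensive; removing them leaves only the edges that carry locality information.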
  • 28. Query Processing ● Query execution is more efficient if pushed to optimized storage (RDF-stores) ● Minimizing the number of Hadoop jobs ● The larger the hop guarantee, the more work is done in RDF-stores
  • 29. To Exchange, or not to Exchange? ● Given a query and n-hop guarantee, is data exchange (Hadoop job) between nodes needed? ● Choose the “center” of the query graph ● Calculate the distance from the “center” to the furthest edge ● If distance > n, data exchange is needed; not needed otherwise
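The decision rule on this slide can be sketched as follows. Two assumptions are made and are not stated in the slides: the distance from a vertex to an edge is taken as the hop count to the edge's nearer endpoint plus one, and the "center" is the query vertex that minimizes the distance to its furthest edge. The query graph is assumed to be connected and is treated as undirected.

```python
from collections import defaultdict, deque


def distance_to_furthest_edge(patterns, start):
    """Hops needed from `start` to cover every edge of the query graph."""
    neighbors = defaultdict(set)
    for subject, _predicate, obj in patterns:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for other in neighbors[vertex]:
            if other not in distance:
                distance[other] = distance[vertex] + 1
                queue.append(other)
    # An edge is covered one hop after its nearer endpoint is reached.
    return max(min(distance[s], distance[o]) + 1 for s, _p, o in patterns)


def needs_data_exchange(patterns, n):
    """True if no choice of center covers the whole query within n hops."""
    vertices = {v for s, _p, o in patterns for v in (s, o)}
    best = min(distance_to_furthest_edge(patterns, v) for v in vertices)
    return best > n


# Star query from the partitioning slide.
star = [
    ("?player", "type", "footballer"),
    ("?player", "name", "?name"),
    ("?player", "position", "striker"),
    ("?player", "playsFor", "FC_Barcelona"),
]
# Hypothetical encoding of the "populous region" query.
region_query = [
    ("?player", "playsFor", "?club"),
    ("?club", "region", "?region"),
    ("?player", "bornIn", "?region"),
    ("?region", "population", "?pop"),
]
```

Under these assumptions the star query is answerable locally with a 1-hop guarantee, while the region query needs either a 2-hop guarantee or a data exchange step.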
  • 30. Data Exchange Find football players who play for a club in the populous region where they were born.
  • 31. Experimental Setup ● 20-machine cluster ● Lehigh University Benchmark (LUBM): 270 million triples ● Things to compare: ● Single-node RDF-3X ● SHARD ● Graph partitioning (the proposed system) ● Hash partitioning on subjects
  • 33. Speedup ● Better than linear speedup
  • 34. Analysis ● Difference between fastest implementation and slowest implementation was a factor of 1340! ● Using Hadoop does not mean that performance is fixed ● More improvements are possible ● Experiments used MapReduce whenever data communication was necessary ● NextGen Hadoop allows other programming paradigms besides MR ● MPI is a good candidate ● Still need to fix the data pipelining problem ● Factor of 1340 possible via focusing on storage --- similar in theme to Hadapt
  • 35. How this fits with Hadapt ● Full SQL interface, MapReduce, and JDBC connector ● 10x-50x faster than Hadoop and Hive ● Queries go from hours to minutes, and minutes to seconds ● Analytics across structured and unstructured data in one platform ● 3.5 Patents Pending ● $9.5 Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners [Diagram labels: Flexible Query Interface (Full SQL Support, MR, JDBC); Split Query Hadoop Execution (Patent Pending); Hadapt Storage Engine (Relational + HDFS)]
  • 36. Optimized Storage Matters ● HDFS appropriate for unstructured data ● Relational storage appropriate for relational data ● Graph storage appropriate for graph data ● Hadapt allows for pluggable storage inside Hadoop (amongst other things) ● Bottom line: Hadoop can be used for scalable graph processing, but it might need some Hadapting ;)