SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
HadoopDB: An open source hybrid of MapReduce
           and DBMS technologies

                Azza Abouzeid, Kamil Bajda-Pawlikowski
                   Daniel J. Abadi, Avi Silberschatz

                                 Yale University
                         http://hadoopdb.sourceforge.net


                               October 2, 2009


HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi
Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Motivation


Major Trends



         1   Data explosion:
                    Automation of business processes, proliferation of digital
                    devices.
                    eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB.
         2   Analysis over raw data




                      Yale University, HadoopWorld 2009   HadoopDB               2/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Motivation


Major Trends



         1   Data explosion:
                    Automation of business processes, proliferation of digital
                    devices.
                    eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB.
         2   Analysis over raw data

      Bottom line
      Analyzing massive structured data on 1000s of shared-nothing
      nodes.




                      Yale University, HadoopWorld 2009   HadoopDB               2/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Motivation


Sales Record Example


      Consider a large data set of sales log records, each consisting of
      sales information including:
         1   a date of sale
         2   a price
      We would like to take the log records and generate a report
      showing the total sales for each year.
      Question:
      How do we generate this report efficiently and cheaply over massive
      data contained in a shared-nothing cluster of 1000s of machines?




                      Yale University, HadoopWorld 2009   HadoopDB         3/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         MapReduce Parallel Databases


MapReduce (Hadoop)


      MapReduce is a programming model which specifies:
             A map function that processes a key/value pair to generate a
             set of intermediate key/value pairs,
             A reduce function that merges all intermediate values
             associated with the same intermediate key.
      Hadoop
             is a MapReduce implementation for processing large data sets
             over 1000s of nodes.
             Maps (and Reduces) run independently of each other over
             blocks of data distributed across a cluster.



                      Yale University, HadoopWorld 2009   HadoopDB                      4/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         MapReduce Parallel Databases


Sales Record Example using Hadoop



      Query: Calculate total sales for each year.

      We write a MapReduce program:
             Map: Takes log records and extracts a key-value pair of year
             and sale price in dollars. Outputs the key-value pairs.
             Shuffle: Hadoop automatically partitions the key-value pairs
             by year to the nodes executing the Reduce function
             Reduce: Simply sums up all the dollar values for a year.




                      Yale University, HadoopWorld 2009   HadoopDB                      5/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         MapReduce Parallel Databases


Relational Databases

      Suppose that the data is stored in a relational database system,
      the sales record example could be expressed in SQL as:

      SELECT YEAR(date) AS year, SUM(price)
      FROM sales
      GROUP BY year

      The execution plan is:

      projection(year,price) → hash aggregation(year,price) .

      Question:
      How do we process this efficiently if the data is very large?


                      Yale University, HadoopWorld 2009   HadoopDB                      6/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         MapReduce Parallel Databases


Parallel Databases

      Parallel Databases are like single-node databases except:
             Data is partitioned across nodes
             Individual relational operations can be executed in parallel
      xxx
      SELECT YEAR(date) AS year, SUM(price)
      FROM sales GROUP BY year

      Execution plan for the query:
      projection(year,price) → partial hash aggregation(year,price) →
      partitioning(year) → final aggregation(year,price) .

      Note that the execution plan resembles the map and reduce phases
      of Hadoop.

                      Yale University, HadoopWorld 2009   HadoopDB                      7/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion


Differences between Parallel Databases and Hadoop




                      Yale University, HadoopWorld 2009   HadoopDB   8/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion


Differences between Parallel Databases and Hadoop




                      Yale University, HadoopWorld 2009   HadoopDB   8/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion


To summarize




                      Yale University, HadoopWorld 2009   HadoopDB   9/24
At Yale, we looked beyond the differences ...
At Yale, we looked beyond the differences ...
and we discovered ...




                                                             Basic design idea
                                                             Multiple, independent, single
                                                             node databases coordinated by
 ... that they complete each other                           Hadoop.
 http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Background Architecture SMS


Hadoop Basics




                      Yale University, HadoopWorld 2009   HadoopDB                     12/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Background Architecture SMS


Architecture




                      Yale University, HadoopWorld 2009   HadoopDB                     13/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Background Architecture SMS


SQL-MR-SQL




      SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

                      Yale University, HadoopWorld 2009   HadoopDB                     14/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Evaluating HadoopDB


      Compare HadoopDB to
         1   Hadoop
         2   Parallel databases (Vertica, DBMS-X)
      Features:
        1 Performance:

                    We expected HadoopDB to approach the performance of
                    parallel databases
         2   Scalability:
                    We expected HadoopDB to scale as well as Hadoop
      We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2
      clusters of 10, 50, 100 nodes.



                      Yale University, HadoopWorld 2009   HadoopDB                                 15/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Load



              1600                                             Vertica     DB-X
                                                               HadoopDB    Hadoop                                                    Vertica    DB-X




                                                                                                  Thousands
              1400                                                                                                                   HadoopDB   Hadoop
                                                                                                              50
              1200
                                                                                                              40
              1000
    seconds




                                                                                        seconds
              800                                                                                             30

              600
                                                                                                              20
              400
                                                                     164


                                                                            161
                                              141
                              139




              200                                                                                             10
                                                        100
                     92




                                                                                  77
                                     47




                                                               43




                0
                          10 nodes                  50 nodes           100 nodes                              0
                                                                                                                   10 nodes   50 nodes      100 nodes


               Random Unstructured Data
                                                                                                  Structured data (20GB/node)
                    (535MB/node)



                                          Yale University, HadoopWorld 2009            HadoopDB                                                         16/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Performance: Grep Task


              70                                    Vertica       DB-X
                                                    HadoopDB      Hadoop
              60                                                                      1   Full table scan, highly
              50
                                                                                          selective filter
              40
                                                                                      2   Random data, no
    seconds




              30
                                                                                          room for indexing
              20
                                                                                      3   Hadoop overhead
                                                                                          outweighs query
              10
                                                                                          processing time in
              0
                        10 nodes             50 nodes          100 nodes
                                                                                          single-node databases

                   SELECT * FROM grep WHERE field LIKE ‘%xyz%’;




                                   Yale University, HadoopWorld 2009       HadoopDB                           17/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Performance: Join Task

              2000                                                      Vertica         DB-X
                                                                        HadoopDB        Hadoop
              1800

              1600

              1400

              1200
    seconds




              1000

              800
                                                                                                               1   No full table scan due
              600                                                                                                  to clustered indexing


                                                                                            300.5
                                                                224.2




              400                                                                                              2   Hash partitioning and
                                    126.4




                                                                              67.7
              200
                                                 34.7




                                                                                     31.9
                                                         29.4
                             28.0
                     20.6




                0
                                                                                                                   efficient join
                            10 nodes                    50 nodes                   100 nodes                       algorithm

   SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
   FROM rankings, uservisits
   WHERE pageURL=destURL
   AND visitDate BETWEEN 2000-1-15 AND 2000-1-22
   GROUP BY sourceIP
   ORDER BY SUM(adRevenue) DESC LIMIT 1;


                                        Yale University, HadoopWorld 2009                           HadoopDB                          18/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Performance: Bottom Line




         1   Unstructured data
                    HadoopDB’s performance matches Hadoop
         2   Structured data
                    HadoopDB’s performance is close to parallel databases




                      Yale University, HadoopWorld 2009   HadoopDB                                 19/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Scalability: Setup




         1   Simple aggregation task - full table scan
         2   Data replicated across 10 nodes
         3   Fault-tolerance: Kill a node halfway
         4   Fluctuation-tolerance: Slow down a node for the entire
             experiment




                      Yale University, HadoopWorld 2009   HadoopDB                                 20/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Hypotheses Load Performance Scalability


Scalability: Results



                          200%                                      Vertica
                                                                                           1   HadoopDB and
                                                                    HadoopDB
                          180%
                                                                    Hadoop
                                                                                               Hadoop take
                          160%                                                                 advantage of runtime
    percentage slowdown




                          140%
                                                                                               scheduling by
                          120%
                          100%
                                                                                               splitting data into
                          80%                                                                  chunks or blocks
                          60%                                                              2   Parallel databases
                          40%
                                                                                               restart entire query on
                          20%
                           0%
                                                                                               node failure or wait
                                 Fault-tolerance        Fluctuation-tolerance                  for the slowest node




                                     Yale University, HadoopWorld 2009          HadoopDB                           21/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Summary Future


To summarize


      HadoopDB ...
         1   is a hybrid of DBMS and MapReduce
         2   scales better than commercial parallel databases
         3   is as fault-tolerant as Hadoop
         4   approaches the performance of parallel databases
         5   is free and open-source

                          http://hadoopdb.sourceforge.net




                     Yale University, HadoopWorld 2009   HadoopDB         22/24
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Summary Future


Future work

      Engineering work:
         1   Full SQL support in SMS
         2   Data compression
         3   Integration with other open source databases
         4   Full automation of the loading and replication process
         5   Out-of-the box deployment
         6   We’re hiring!
      Research work:
             Incremental loading and on-the-fly repartitioning
             Dynamically adjusting fault-tolerance levels based on failure
             rate


                     Yale University, HadoopWorld 2009   HadoopDB            23/24
Thank You ...




     We welcome all thoughts on how to raise HadoopDB ...
                http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg

Weitere ähnliche Inhalte

Was ist angesagt?

Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalknzhang
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinDmitriy Ryaboy
 
HugeTable:Application-Oriented Structure Data Storage System
HugeTable:Application-Oriented Structure Data Storage SystemHugeTable:Application-Oriented Structure Data Storage System
HugeTable:Application-Oriented Structure Data Storage Systemqlw5
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Edureka!
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoopOmar Jaber
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 

Was ist angesagt? (20)

Hadoop Presentation
Hadoop PresentationHadoop Presentation
Hadoop Presentation
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant Twin
 
HugeTable:Application-Oriented Structure Data Storage System
HugeTable:Application-Oriented Structure Data Storage SystemHugeTable:Application-Oriented Structure Data Storage System
HugeTable:Application-Oriented Structure Data Storage System
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 

Ähnlich wie Hw09 Hadoop Db

Improving MySQL performance with Hadoop
Improving MySQL performance with HadoopImproving MySQL performance with Hadoop
Improving MySQL performance with HadoopSagar Jauhari
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Hadoop online training by certified trainer
Hadoop online training by certified trainerHadoop online training by certified trainer
Hadoop online training by certified trainersriram0233
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
The Forrester Wave Enterprise Hadoop Solutions Q1 2012
The Forrester Wave Enterprise Hadoop Solutions Q1 2012The Forrester Wave Enterprise Hadoop Solutions Q1 2012
The Forrester Wave Enterprise Hadoop Solutions Q1 2012m_hepburn
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsDavid Portnoy
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune amrutupre
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introductiondewang_mistry
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014Stratebi
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
מצגת כנס מנתחי מערכות
מצגת כנס מנתחי מערכותמצגת כנס מנתחי מערכות
מצגת כנס מנתחי מערכותPini Mandel
 

Ähnlich wie Hw09 Hadoop Db (20)

Improving MySQL performance with Hadoop
Improving MySQL performance with HadoopImproving MySQL performance with Hadoop
Improving MySQL performance with Hadoop
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Hadoop online training by certified trainer
Hadoop online training by certified trainerHadoop online training by certified trainer
Hadoop online training by certified trainer
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
The Forrester Wave Enterprise Hadoop Solutions Q1 2012
The Forrester Wave Enterprise Hadoop Solutions Q1 2012The Forrester Wave Enterprise Hadoop Solutions Q1 2012
The Forrester Wave Enterprise Hadoop Solutions Q1 2012
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop Implementations
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
 
Hadoop Tutorial for Beginners
Hadoop Tutorial for BeginnersHadoop Tutorial for Beginners
Hadoop Tutorial for Beginners
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
מצגת כנס מנתחי מערכות
מצגת כנס מנתחי מערכותמצגת כנס מנתחי מערכות
מצגת כנס מנתחי מערכות
 
Hadoop online training
Hadoop online trainingHadoop online training
Hadoop online training
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Kürzlich hochgeladen

Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 

Kürzlich hochgeladen (20)

Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 

Hw09 Hadoop Db

  • 1. HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009 HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
  • 2. Introduction Candidates Differences HadoopDB Evaluation Conclusion Motivation Major Trends 1 Data explosion: Automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB. 2 Analysis over raw data Yale University, HadoopWorld 2009 HadoopDB 2/24
  • 3. Introduction Candidates Differences HadoopDB Evaluation Conclusion Motivation Major Trends 1 Data explosion: Automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB. 2 Analysis over raw data Bottom line Analyzing massive structured data on 1000s of shared-nothing nodes. Yale University, HadoopWorld 2009 HadoopDB 2/24
  • 4. Introduction Candidates Differences HadoopDB Evaluation Conclusion Motivation Sales Record Example Consider a large data set of sales log records, each consisting of sales information including: 1 a date of sale 2 a price We would like to take the log records and generate a report showing the total sales for each year. Question: How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines? Yale University, HadoopWorld 2009 HadoopDB 3/24
  • 5. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases MapReduce (Hadoop) MapReduce is a programming model which specifies: A map function that processes a key/value pair to generate a set of intermediate key/value pairs, A reduce function that merges all intermediate values associated with the same intermediate key. Hadoop is a MapReduce implementation for processing large data sets over 1000s of nodes. Maps (and Reduces) run independently of each other over blocks of data distributed across a cluster. Yale University, HadoopWorld 2009 HadoopDB 4/24
  • 6. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases Sales Record Example using Hadoop Query: Calculate total sales for each year. We write a MapReduce program: Map: Takes log records and extracts a key-value pair of year and sale price in dollars. Outputs the key-value pairs. Shuffle: Hadoop automatically partitions the key-value pairs by year to the nodes executing the Reduce function Reduce: Simply sums up all the dollar values for a year. Yale University, HadoopWorld 2009 HadoopDB 5/24
  • 7. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases Relational Databases Suppose that the data is stored in a relational database system, the sales record example could be expressed in SQL as: SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year The execution plan is: projection(year,price) → hash aggregation(year,price) . Question: How do we process this efficiently if the data is very large? Yale University, HadoopWorld 2009 HadoopDB 6/24
  • 8. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases Parallel Databases Parallel Databases are like single-node databases except: Data is partitioned across nodes Individual relational operations can be executed in parallel xxx SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year Execution plan for the query: projection(year,price) → partial hash aggregation(year,price) → partitioning(year) → final aggregation(year,price) . Note that the execution plan resembles the map and reduce phases of Hadoop. Yale University, HadoopWorld 2009 HadoopDB 7/24
  • 9. Introduction Candidates Differences HadoopDB Evaluation Conclusion Differences between Parallel Databases and Hadoop Yale University, HadoopWorld 2009 HadoopDB 8/24
  • 10. Introduction Candidates Differences HadoopDB Evaluation Conclusion Differences between Parallel Databases and Hadoop Yale University, HadoopWorld 2009 HadoopDB 8/24
  • 11. Introduction Candidates Differences HadoopDB Evaluation Conclusion To summarize Yale University, HadoopWorld 2009 HadoopDB 9/24
  • 12. At Yale, we looked beyond the differences ...
  • 13. At Yale, we looked beyond the differences ...
  • 14. and we discovered ... Basic design idea Multiple, independent, single node databases coordinated by ... that they complete each other Hadoop. http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg
  • 15. Introduction Candidates Differences HadoopDB Evaluation Conclusion Background Architecture SMS Hadoop Basics Yale University, HadoopWorld 2009 HadoopDB 12/24
  • 16. Introduction Candidates Differences HadoopDB Evaluation Conclusion Background Architecture SMS Architecture Yale University, HadoopWorld 2009 HadoopDB 13/24
  • 17. Introduction Candidates Differences HadoopDB Evaluation Conclusion Background Architecture SMS SQL-MR-SQL SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate); Yale University, HadoopWorld 2009 HadoopDB 14/24
  • 18. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Evaluating HadoopDB Compare HadoopDB to 1 Hadoop 2 Parallel databases (Vertica, DBMS-X) Features: 1 Performance: We expected HadoopDB to approach the performance of parallel databases 2 Scalability: We expected HadoopDB to scale as well as Hadoop We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, 100 nodes. Yale University, HadoopWorld 2009 HadoopDB 15/24
  • 19. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Load 1600 Vertica DB-X HadoopDB Hadoop Vertica DB-X Thousands 1400 HadoopDB Hadoop 50 1200 40 1000 seconds seconds 800 30 600 20 400 164 161 141 139 200 10 100 92 77 47 43 0 10 nodes 50 nodes 100 nodes 0 10 nodes 50 nodes 100 nodes Random Unstructured Data Structured data (20GB/node) (535MB/node) Yale University, HadoopWorld 2009 HadoopDB 16/24
  • 20. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Performance: Grep Task 70 Vertica DB-X HadoopDB Hadoop 60 1 Full table scan, highly 50 selective filter 40 2 Random data, no seconds 30 room for indexing 20 3 Hadoop overhead outweighs query 10 processing time in 0 10 nodes 50 nodes 100 nodes single-node databases SELECT * FROM grep WHERE field LIKE ‘%xyz%’; Yale University, HadoopWorld 2009 HadoopDB 17/24
  • 21. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Performance: Join Task 2000 Vertica DB-X HadoopDB Hadoop 1800 1600 1400 1200 seconds 1000 800 1 No full table scan due 600 to clustered indexing 300.5 224.2 400 2 Hash partitioning and 126.4 67.7 200 34.7 31.9 29.4 28.0 20.6 0 efficient join 10 nodes 50 nodes 100 nodes algorithm SELECT sourceIP, AVG(pageRank), SUM(adRevenue) FROM rankings, uservisits WHERE pageURL=destURL AND visitDate BETWEEN 2000-1-15 AND 2000-1-22 GROUP BY sourceIP ORDER BY SUM(adRevenue) DESC LIMIT 1; Yale University, HadoopWorld 2009 HadoopDB 18/24
  • 22. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Performance: Bottom Line 1 Unstructured data HadoopDB’s performance matches Hadoop 2 Structured data HadoopDB’s performance is close to parallel databases Yale University, HadoopWorld 2009 HadoopDB 19/24
  • 23. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Scalability: Setup 1 Simple aggregation task - full table scan 2 Data replicated across 10 nodes 3 Fault-tolerance: Kill a node halfway 4 Fluctuation-tolerance: Slow down a node for the entire experiment Yale University, HadoopWorld 2009 HadoopDB 20/24
  • 24. Introduction Candidates Differences HadoopDB Evaluation Conclusion Hypotheses Load Performance Scalability Scalability: Results 200% Vertica 1 HadoopDB and HadoopDB 180% Hadoop Hadoop take 160% advantage of runtime percentage slowdown 140% scheduling by 120% 100% splitting data into 80% chunks or blocks 60% 2 Parallel databases 40% restart entire query on 20% 0% node failure or wait Fault-tolerance Fluctuation-tolerance for the slowest node Yale University, HadoopWorld 2009 HadoopDB 21/24
  • 25. Introduction Candidates Differences HadoopDB Evaluation Conclusion Summary Future To summarize HadoopDB ... 1 is a hybrid of DBMS and MapReduce 2 scales better than commercial parallel databases 3 is as fault-tolerant as Hadoop 4 approaches the performance of parallel databases 5 is free and open-source http://hadoopdb.sourceforge.net Yale University, HadoopWorld 2009 HadoopDB 22/24
  • 26. Introduction Candidates Differences HadoopDB Evaluation Conclusion Summary Future Future work Engineering work: 1 Full SQL support in SMS 2 Data compression 3 Integration with other open source databases 4 Full automation of the loading and replication process 5 Out-of-the box deployment 6 We’re hiring! Research work: Incremental loading and on-the-fly repartitioning Dynamically adjusting fault-tolerance levels based on failure rate Yale University, HadoopWorld 2009 HadoopDB 23/24
  • 27. Thank You ... We welcome all thoughts on how to raise HadoopDB ... http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg