From HadoopDB to Hadapt: A Case
Study of Transitioning a VLDB
paper into Real World Deployments
Daniel Abadi
Yale University
August 28th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Overview of the commercialization process
Technical features missing from HadoopDB that
Hadapt needed to implement
What does this mean for tenure?
Situation in 2008
Hadoop starting to take off as a “Big Data”
processing platform
Parallel database startups such as
Netezza, Vertica, and Greenplum gaining
traction for “Big Data” analysis
2 Schools of Thought
– School 1: They are on a collision course
– School 2: They are complementary
technologies
From 10,000 feet Hadoop and Parallel
Database Systems are Quite Similar
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Each can handle relational and non-relational
queries (DBMS via UDFs)
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
Hardware Setup
100 node cluster
Each node
– 2.4 GHz Core 2 Duo Processors
– 4 GB RAM
– Two 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Join Task
[Figure: join task time (seconds) on 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
UDF Task
[Figure: UDF task time (seconds) on 10, 25, 50, and 100 nodes for DBMS and Hadoop]
Calculate PageRank over a set of HTML documents
Performed via a UDF
DBMS clearly doesn’t scale
Scalability
Except for UDFs all systems scale near
linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare, to not so rare
– It is nearly impossible to maintain
homogeneity at scale
Fault Tolerance and Cluster
Heterogeneity Results
[Figure: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS and Hadoop]
Database systems restart entire query upon a single node
failure, and do not adapt if a node is running slowly
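A toy model helps show why restarting the whole query hurts more than redoing one node's work. This is a minimal sketch under stated assumptions, not the benchmark's methodology: the function names are hypothetical, work is assumed to be split evenly, a single node is assumed to fail partway through, and the surviving nodes are assumed to absorb its share. Failure detection time and scheduling overhead are ignored.

```python
def restart_query_slowdown(failure_point: float, num_nodes: int) -> float:
    """Percentage slowdown if the whole query restarts on the surviving nodes.

    failure_point is the fraction of the query finished when the node dies
    (an assumed input). That elapsed time is wasted, and the rerun uses
    num_nodes - 1 nodes, so it takes num_nodes / (num_nodes - 1) as long.
    """
    total_time = failure_point + num_nodes / (num_nodes - 1)
    return 100.0 * (total_time - 1.0)


def rerun_task_slowdown(num_nodes: int) -> float:
    """Percentage slowdown if only the failed node's share is redone.

    The failed node's share (1 / num_nodes of the work) is spread over the
    num_nodes - 1 survivors, so in this model the extra time is
    1 / (num_nodes - 1) of the original run time, whenever the failure hits.
    """
    return 100.0 / (num_nodes - 1)


restart_pct = restart_query_slowdown(failure_point=0.5, num_nodes=100)
rerun_pct = rerun_task_slowdown(num_nodes=100)
```

In this model, a failure halfway through a 100-node run costs roughly half the run time again under whole-query restart, versus about one percent when only the lost share is redone. The numbers come from the model's assumptions and are not measurements from the talk.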
Benchmark Conclusions
Hadoop had scalability advantages
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop was consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needed better support for compression and direct
operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets
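The co-partitioning point can be illustrated with a small sketch. It is a hedged illustration, not code from Hadoop or any of the benchmarked systems: the table contents, the helper names, and the use of CRC32 as the partitioning hash are all assumptions. When both datasets are partitioned on the join key with the same function, matching rows land on the same node and each node can join its own partitions without moving data across the network.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, num_nodes: int) -> int:
    """Map a join key to a node number with a deterministic hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def co_partition(rows, key_index, num_nodes):
    """Split rows into one list per node, based on the join key column."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[partition_of(row[key_index], num_nodes)].append(row)
    return parts


def local_hash_join(left, right, left_key, right_key):
    """Plain in-memory hash join of two row lists."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Hypothetical data: (url, visitor) and (url, rank).
visits = [("a.com", "u1"), ("b.com", "u2"), ("a.com", "u3"), ("c.com", "u4")]
rankings = [("a.com", 10), ("b.com", 7), ("d.com", 1)]

NUM_NODES = 3
visit_parts = co_partition(visits, 0, NUM_NODES)
ranking_parts = co_partition(rankings, 0, NUM_NODES)

# Each node joins only its own pair of partitions.
partitioned_result = []
for node in range(NUM_NODES):
    partitioned_result += local_hash_join(visit_parts[node], ranking_parts[node], 0, 0)

global_result = local_hash_join(visits, rankings, 0, 0)
```

The partition-wise result contains the same rows as the single global join, which is what makes the local-only join safe.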
Best of Both Worlds Possible?
Connector
Problems With the Connector
Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
Adding DBMS Technology to
Hadoop
Option 1: Keep Hadoop’s storage and build parallel
executor on top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
Option 2: Use relational storage on each node
Accelerates “time to complete system”
We chose this option for HadoopDB
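A rough sketch of what Option 2 looks like in practice follows. It is illustrative only: in-memory SQLite databases stand in for the relational store on each node, and the table, the query, and the function names are hypothetical rather than taken from HadoopDB. The idea shown is that the "map" side pushes a SQL aggregate into each node-local database and the "reduce" side merges the partial results.

```python
import sqlite3
from collections import Counter


def make_node(rows):
    """Create a stand-in for one node's local relational store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# SQL pushed down to every node; each node aggregates only its own rows.
LOCAL_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"


def map_phase(node):
    """Run the pushed-down query inside the node-local database."""
    return node.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Merge the per-node partial sums into global totals."""
    totals = Counter()
    for partial in partials:
        for region, subtotal in partial:
            totals[region] += subtotal
    return dict(totals)


nodes = [
    make_node([("east", 10), ("west", 5)]),
    make_node([("east", 7), ("north", 3)]),
]
totals = reduce_phase(map_phase(node) for node in nodes)
```

Only the small per-node partial sums cross node boundaries here; the row-level scan and grouping stay inside each local database.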
HadoopDB Architecture
SMS Planner
TPC-H Benchmark Results
UDF Task
[Figure: UDF task time (seconds) on 10, 25, and 50 nodes for DBMS, Hadoop, and HadoopDB]
Fault Tolerance and Cluster
Heterogeneity Results
[Figure: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS, Hadoop, and HadoopDB]
HadoopDB Commercialization
Wanted to build a real system
Released initial prototype open source
Blog post about HadoopDB got slashdotted, led
to VC interest
– Initially reluctant to take VC money
Posted a job for an engineer to help build out
open source codebase
– Low quality of applicants
– Not enough government funding for more than 1
engineer
HadoopDB Commercialization
VC money only route to building a
complete system
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Commercializing HadoopDB:
Where does development time go?
Work we expected to transition from
research prototype to commercial product
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation
But what about unexpected work?
Infrastructure Tools
Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times
Automated tools are critical
Just to try some new code, the following needs to
happen:
– Build product
– Provision a cluster
– Deploy build to cluster
– Install dependencies (Hadoop distro, libraries, etc)
– Install Hadapt with correct configuration parameters for that
cluster
– Generate data or copy data files to cluster for load
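The "done n times" point can be made concrete with a dry-run planner. This is a hypothetical sketch, not Hadapt's tooling: the step names paraphrase the list above, and which steps run once versus once per node is an assumption. It only builds the list of actions and does not execute anything.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    per_node: bool  # True if the step must be repeated on every node (assumed)


PIPELINE = [
    Step("build product", per_node=False),
    Step("provision cluster", per_node=False),
    Step("deploy build", per_node=True),
    Step("install dependencies", per_node=True),
    Step("install and configure", per_node=True),
    Step("load data", per_node=True),
]


def plan(nodes: List[str]) -> List[Tuple[str, str]]:
    """Expand the pipeline into (step, target) actions for a given cluster."""
    actions = []
    for step in PIPELINE:
        targets = nodes if step.per_node else ["controller"]
        for target in targets:
            actions.append((step.name, target))
    return actions


small = plan(["node1", "node2", "node3"])
large = plan([f"node{i}" for i in range(100)])
```

Under these assumptions, a 3-node test cluster already needs 14 actions and a 100-node cluster needs 402, which is why doing this by hand does not scale.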
Upgrader
Start-ups need to move fast
Hadapt delivers a new release every
couple of months
Upgrade process must be easy
Downgrade (!) process must be easy
Changes in storage layout or APIs add
complexity to the process
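One common way to keep both directions easy is to pair every storage-layout change with its inverse. The sketch below assumes that approach; it is not a description of Hadapt's upgrader, and the layout versions, field names, and functions are invented for illustration.

```python
# Each layout change is a pair of functions: (upgrade, downgrade).

def up_1_to_2(rows):
    """v1 stores tuples; v2 stores dicts with named fields."""
    return [{"name": n, "value": v} for n, v in rows]


def down_2_to_1(rows):
    return [(r["name"], r["value"]) for r in rows]


def up_2_to_3(rows):
    """v3 adds a 'tags' field with an empty default."""
    return [dict(r, tags=[]) for r in rows]


def down_3_to_2(rows):
    return [{k: v for k, v in r.items() if k != "tags"} for r in rows]


# STEPS[v] moves between layout version v and v + 1.
STEPS = {1: (up_1_to_2, down_2_to_1), 2: (up_2_to_3, down_3_to_2)}


def migrate(store, target):
    """Walk the chain of steps up or down until the target version is reached."""
    version, rows = store["version"], store["rows"]
    while version < target:
        rows = STEPS[version][0](rows)
        version += 1
    while version > target:
        rows = STEPS[version - 1][1](rows)
        version -= 1
    return {"version": version, "rows": rows}


original = {"version": 1, "rows": [("a", 1), ("b", 2)]}
upgraded = migrate(original, 3)
restored = migrate(upgraded, 1)
```

A round trip from version 1 to 3 and back returns the original data. Note that a downgrade like `down_3_to_2` discards whatever was written to the new field, which is one reason layout changes complicate the process.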
UDF Support
HadoopDB supported both MapReduce
and SQL as interfaces
MapReduce was not a sufficient
replacement for database UDFs
Hadapt provides an “HDK” that enables
analysts to create functions that are
invokable from SQL
– Integrates with 3rd party tools
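To show what "invokable from SQL" means, the sketch below registers a user-defined function with SQLite through Python's standard `sqlite3` module. This is an analogy only: it does not use or imitate the HDK's API, and the `url_host` function and `visits` table are hypothetical.

```python
import sqlite3
from urllib.parse import urlparse


def url_host(url):
    """User-defined function: extract the host name from a URL."""
    return urlparse(url).hostname


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it by name with one argument.
conn.create_function("url_host", 1, url_host)

conn.execute("CREATE TABLE visits (url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [("http://example.com/a",), ("http://example.com/b",), ("http://example.org/",)],
)

# The analyst stays in SQL; the custom logic runs inside the query.
rows = conn.execute(
    "SELECT url_host(url), COUNT(*) FROM visits GROUP BY url_host(url) ORDER BY 1"
).fetchall()
```

The query groups visits by host without the analyst writing a separate MapReduce job, which is the usability gap the slide describes.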
Search
Hadoop is increasingly used as a data
landfill
– Granular data
– Messy data
– Unprocessed data
Database for Hadoop cannot assume all
data fits in rows and columns
Search support was the first thing we built
after our A round of financing
Is doing a start-up pre-tenure a
good idea?
Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a
complete description of the technical vision, so
You’re talking to all the VCs to fundraise
You’re talking to all the prospective customers
You’re talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a
good CEO will not let you escape
Ups and downs can be mentally draining
If you do a start-up you will:
Publish less
Advise fewer students
Pursue fewer grants
Avoid university committees as much as
possible
Skip faculty meetings (usually because of
travel)
Attend fewer academic conferences
At the end of the day
Unless there are changes (see SIGMOD panel
from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from
university measurable objectives
Doing a start-up is putting all your eggs in one
basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over
determine success
