Hadoop Summit, June 2013
SQL on Hadoop
Defining the New Generation of Analytic Databases
Speaker Bio: Carl Steinbach
Currently:
Engineer @ Citus Data
PMC Chair, Committer -- Apache Hive Project
Formerly:
Oracle, NetApp, Informatica, Cloudera
Twitter: @cwsteinbach
LinkedIn: carlsteinbach
This talk is about:
A New Type of Distributed Analytic Database
What Is an Analytic Database?
OLAP: Online Analytical Processing
Consolidation (Roll-up)
Drill-down
Slicing and Dicing
No Transactions
Large Sequential Scans
I/O Bound
Motivation:
The Problem with Enterprise Storage
Diagram: a storage tier (NAS/SAN) connected by a "Really Big Pipe" to a server/worker tier of twelve servers.
Google File System (’03)
A Possible Solution?
Design Priorities
• Commodity Hardware
• Fault Tolerance
• Big Files / Big Blocks
• Big Sequential Reads/Writes
Design Tradeoffs
• No random writes (write once/read many)
• Slow random reads
• Not POSIX compliant
So GFS Solved the Problem?
- Yes, but not because of anything described in the original paper
- Client/Server approach won’t scale
- Full scope of GFS revealed one year later with publication of MapReduce (’04) paper.
GFS + MapReduce Key Idea: Eliminate I/O Bottleneck by Colocating Compute and Storage Resources on the Same Node
What’s Good About Hadoop?
Commodity Storage
Scale-out
Fault Tolerance
Flexibility
MapReduce
Multi-structured Data
What’s Bad About Hadoop?
MapReduce!
No Schemas!
Missing Features: Optimizer, Indexes, Views
Incompatibility with Existing Tools: BI, ETL, IDEs
Apache Hive Solved Many of These Problems
SQL to MapReduce Compiler + Execution Engine
Pluggable Storage Layer (SerDes)
Schema-on-Read
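Schema-on-read means the stored file stays as raw bytes and a schema is applied only when a query reads it. A minimal sketch in Python, assuming a hypothetical tab-delimited employee file and a hand-written schema (neither comes from the talk, and Hive's real SerDe interface is a Java API that is not reproduced here):

```python
import io

# Hypothetical schema: (column name, parser applied at read time).
EMP_SCHEMA = [("name", str), ("job", str), ("sal", float)]

def read_with_schema(raw_lines, schema, delimiter="\t"):
    """Yield one dict per line, interpreting the raw text only when it is read."""
    for line in raw_lines:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: parse(value) for (name, parse), value in zip(schema, fields)}

# The stored data is plain text; no schema was enforced when it was written.
raw = io.StringIO("alice\tmanager\t125500.00\nbob\tclerk\t40000.00\n")
rows = list(read_with_schema(raw, EMP_SCHEMA))
```

Because the schema lives outside the data, the same file could be read again later with a different column list without rewriting it.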
But Other Problems Remained
Many Missing Features:
• ANSI SQL
• Cost Based Optimizer
• UDFs
• Data Types
• Security
• …
Biggest Problem:
• MapReduce Latency Overhead
Work in Progress: Hive Improvements
Stinger Initiative:
• Columnar Query Engine
• ORCFile File Format
• Replace MR with Tez (Apache Incubator)
One Solution: MPP Database + Hadoop Connector
Diagram: an MPP database cluster (one MPP master node running the Global Query Executor, four MPP worker nodes each running a Local Query Executor) next to a separate Hadoop cluster of four HDFS datanodes.
Diagram, next build: the MPP database cluster pulls data from the Hadoop cluster ("Pull Data").
Diagram, final build: the pull path between the two clusters is marked "IO Bottleneck".
A Better Solution: New Architecture for SQL on Hadoop
Diagram: the same components (MPP master node with the Global Query Executor, four MPP worker nodes with Local Query Executors, four HDFS datanodes), now labeled "Maintain Data Locality" and "Push Work To Data".
New Architecture for SQL on Hadoop
Data Locality
• Block-Aware Query Planner Pushes Work to Data
Real-Time Query Performance
• Replace MapReduce
Schema-on-Read
• Pluggable Storage Format Handlers
Tight Integration with SQL Ecosystem Tools
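The block-aware planning idea can be sketched as follows. This is a minimal illustration in Python, not any product's planner: the function name `plan_tasks`, the block ids, and the node names are all hypothetical. It assigns each block's work to a node that already stores a replica of that block, choosing the least-loaded replica holder.

```python
from collections import defaultdict

def plan_tasks(block_locations):
    """Assign each block to one node that already stores a replica of it.

    block_locations: dict mapping block id -> list of nodes holding a replica.
    The least-loaded replica holder is chosen so work stays local and balanced.
    """
    load = defaultdict(int)
    assignment = {}
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        # Ties on load are broken by node name to keep the plan deterministic.
        node = min(replicas, key=lambda n: (load[n], n))
        assignment[block_id] = node
        load[node] += 1
    return assignment

# Hypothetical block map, of the kind a NameNode could report.
blocks = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-b", "node-c"],
}
plan = plan_tasks(blocks)
```

Every task in `plan` runs on a node that holds its block, so no block has to cross the network before it is scanned.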
Examples of the New Architecture
Google Dremel
• Interactive ad hoc query system for read-only nested data. Powers BigQuery.
Apache Drill
• Open source version of Dremel. Implemented in Java. Work in progress.
Cloudera Impala
• Heavily influenced by MonetDB/X100. Runtime codegen. CPU cache aware. Implemented in C++.
Citus Data
• Built on PostgreSQL. Powerful cost-based optimizer for disk I/O. Handles failures.
The New Architecture in Detail: CitusDB
Diagram: PostgreSQL tools and ODBC/JDBC clients connect to the CitusDB master node, which holds metadata and runs the Distributed Query Planner and Distributed Query Executor. Each of three HDFS datanodes also runs a Local Query Planner, a Local Query Executor, and Foreign Data Wrappers. The HDFS NameNode holds the Hadoop metadata.
CitusDB: Metadata Synchronization
Diagram: same components as the CitusDB architecture slide, with these labels added:
Metadata Sync
CREATE TABLE emp
CREATE FOREIGN TABLE emp_{block_id} …
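One reading of this slide is that the logical table emp is backed by one foreign table per HDFS block, named by the pattern emp_{block_id}. The sketch below shows only that bookkeeping, in Python, under stated assumptions: the block ids and host names are hypothetical, `shard_catalog` is an illustrative helper, and nothing here reproduces CitusDB's actual sync mechanism or the elided CREATE FOREIGN TABLE options.

```python
def shard_catalog(table, block_locations):
    """Map each per-block foreign table name to the nodes holding that block."""
    return {
        f"{table}_{block_id}": sorted(hosts)
        for block_id, hosts in block_locations.items()
    }

# Hypothetical block ids and datanode hosts.
catalog = shard_catalog("emp", {1001: ["dn2", "dn1"], 1002: ["dn3"]})
```

A catalog of this shape is what would let a planner turn one query on emp into one local query per emp_{block_id} table.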
CitusDB: Query Execution
Diagram: same components as the CitusDB architecture slide. A client query arrives at the master node:
SELECT AVG(sal)
FROM emp
WHERE job = 'manager';
Local Queries
SELECT SUM(sal), COUNT(sal)
FROM emp_{block_id}
WHERE job = 'manager';
Local Results
{842176.53, 8}
{1234283.00, 12}
{0.00, 0}
{125500.00, 1}
{523100.00, 3}
{785300.32, 5}
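An average cannot be computed by averaging per-block averages, because blocks hold different numbers of matching rows. That is why each local query returns SUM and COUNT, and the totals are divided once at the end. A minimal sketch in Python, where `merge_avg` is an illustrative name rather than a CitusDB function:

```python
def merge_avg(partials):
    """Combine (sum, count) pairs from local queries into a global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # Like SQL AVG over zero rows, return no value when nothing matched.
    return total / count if count else None

# Local results as listed on the slide: (SUM(sal), COUNT(sal)) per block.
local_results = [
    (842176.53, 8),
    (1234283.00, 12),
    (0.00, 0),
    (125500.00, 1),
    (523100.00, 3),
    (785300.32, 5),
]
avg_sal = merge_avg(local_results)
```

With these six pairs the totals are 3510359.85 over 29 rows, or about 121046.89. The final figure shown on the slide differs slightly, so the slide's numbers are best read as illustrative.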
Final Result
{121046.58}
Why We Chose PostgreSQL
- Powerful Cost-Based Optimizer
- Designed to minimize disk I/O
- Extensible, Rich Type System
- Pluggable Storage Format Handlers
- Lots of Extensions:
  - Geospatial, Full Text Search, JSON, etc.
- Enterprise Features:
  - ODBC/JDBC
  - Security
  - Internationalization
Defining the New Generation of Distributed Analytic Databases
SQL → Ease of Use, Increased Productivity
Real-time responsiveness → Faster
Data Locality → Proven Scalability
Schema-on-Read → Flexibility, Lower Cost
Where Are We At?
CitusDB SQL on Hadoop is in Open Beta
Download our Binary Packages
Or Use Our EC2 AMI
http://citusdata.com/docs/sql-on-hadoop
We’re Hiring!
http://citusdata.com/job
Questions?

Editor's Notes

  1. Databases are tools that let you ask questions about data. The architecture of a database depends heavily on the design of the system that stores the data. Hadoop, and HDFS in particular, represents a radical change to the underlying storage infrastructure. In order to capitalize on these changes we need to redesign the database from the ground up. That’s the goal of these new systems.
  2. Make sure we’re on the same page. Next: Enterprise Storage Model
  3. Availability: fault tolerance through RAID. Accessibility: shared files, POSIX file API. Problems: cost, scalability. Outro: Folks at Google were aware of these problems when they were building their search engine. Fibre Channel,
  4. Distributed Block Store. ACM interview, Sean Quinlan and Kirk McKusick: http://queue.acm.org/detail.cfm?id=1594206
  5. Did this solve the problem? Commodity: yes. Fault tolerance: yes. Scalability: no. MR is the missing piece. Outro: 2005: Mike Cafarella, Doug Cutting, Nutch. Doug Cutting and Mike Cafarella launched the Hadoop project a year later. HDFS + MapReduce