The analytics and data warehousing industries are in the midst of a major period of transformation. Since the publication of Google’s MapReduce paper, we have witnessed the appearance of Apache Hadoop, followed by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors. This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, these new systems share important architectural details that distinguish them from the previous generation of analytic databases. In this talk, we will discuss the performance limitations of the connector-based approach employed by many established vendors and explain the long-term significance of Apache Hive’s data model. Then, we will unravel the novel architectural features common to next-generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.
SQL on Hadoop: Defining the New Generation of Analytics Databases
1. Hadoop Summit, June 2013
SQL on Hadoop
Defining the New Generation of
Analytic Databases
2. Speaker Bio: Carl Steinbach
Currently:
Engineer @ Citus Data
PMC Chair, Committer -- Apache Hive Project
Formerly:
Oracle, NetApp, Informatica, Cloudera
Twitter: @cwsteinbach
LinkedIn: carlsteinbach
3. This talk is about:
A New Type of
Distributed
Analytic Database
4. What Is an Analytic Database?
OLAP: Online Analytical Processing
Consolidation (Roll-up)
Drill-down
Slicing and Dicing
No Transactions
Large Sequential Scans
I/O Bound
5. Motivation:
The Problem with Enterprise Storage
Storage Tier (NAS/SAN)
Server/Worker Tier
Server Server Server
Server Server Server
Server Server Server
Server Server Server
Really Big Pipe
6. Google File System (’03)
A Possible Solution?
Design Priorities
• Commodity Hardware
• Fault Tolerance
• Big Files / Big Blocks
• Big Sequential Reads/Writes
Design Tradeoffs
• No random writes (write once/read many)
• Slow random reads
• Not POSIX compliant
7. So GFS Solved the problem?
- Yes, but not because of anything described in
the original paper
- Client/Server approach won’t scale
- Full scope of GFS revealed one year later with
publication of the MapReduce (’04) paper.
GFS + MapReduce Key Idea: Eliminate I/O
Bottleneck by Colocating Compute and Storage
Resources on the Same Node
8. What’s Good About Hadoop?
Commodity Storage
Scale-out
Fault Tolerance
Flexibility
MapReduce
Multi-structured Data
9. What’s Bad About Hadoop?
MapReduce!
No Schemas!
Missing Features
Optimizer, Indexes, Views
Incompatibility with Existing Tools
BI, ETL, IDEs
10. Apache Hive Solved Many of these
Problems
SQL to MapReduce
Compiler + Execution Engine
Pluggable Storage Layer
(SerDes)
Schema-on-Read
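Schema-on-read means the raw files are stored untouched and a schema is applied only when a query reads them. The following is a minimal Python sketch of that idea, not Hive's actual SerDe API: the tab-delimited layout, the `EMP_SCHEMA` columns, and the `deserialize` helper are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical schema: (column name, converter) pairs applied at read time.
Schema = List[Tuple[str, Callable[[str], object]]]

EMP_SCHEMA: Schema = [("name", str), ("job", str), ("sal", float)]


def deserialize(
    lines: Iterable[str], schema: Schema, sep: str = "\t"
) -> Iterator[Dict[str, object]]:
    """Turn raw text lines into typed rows by applying the schema while reading."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) != len(schema):
            continue  # skip records that do not fit the schema
        yield {name: conv(value) for (name, conv), value in zip(schema, fields)}


# Raw data as it might sit in a file; nothing was validated at write time.
RAW = ["alice\tmanager\t1000.0\n", "bob\tclerk\t500.0\n", "broken line\n"]

rows = list(deserialize(RAW, EMP_SCHEMA))
managers = [r["name"] for r in rows if r["job"] == "manager"]
```

Because the schema lives outside the data, the same file could be read again later with a different schema or a different format handler without rewriting it.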
11. But Other Problems Remained
Many Missing Features:
• ANSI SQL
• Cost Based Optimizer
• UDFs
• Data Types
• Security
• …
Biggest Problem:
• MapReduce Latency Overhead
12. Work in Progress: Hive Improvements
Stinger Initiative:
• Columnar Query Engine
• ORCFile File Format
• Replace MR with Tez (Apache Incubator)
13. One Solution:
MPP Database + Hadoop Connector
[Diagram: an MPP database cluster (one MPP master node running the global query executor, four MPP worker nodes each running a local query executor) next to a separate Hadoop cluster of four HDFS datanodes.]
14. One Solution:
MPP Database + Hadoop Connector
[Diagram: the MPP worker nodes pull data from the HDFS datanodes of the separate Hadoop cluster.]
15. One Solution:
MPP Database + Hadoop Connector
[Diagram: the MPP worker nodes pull data from the HDFS datanodes of the separate Hadoop cluster; the diagram is labeled “IO Bottleneck”.]
16. A Better Solution:
New Architecture for SQL on Hadoop
[Diagram: the MPP master node (global query executor) and four MPP worker nodes, each pairing a local query executor with an HDFS datanode. Labels: “Push Work To Data”, “Maintain Data Locality”.]
17. New Architecture for SQL on Hadoop
Data Locality
• Block-Aware Query Planner Pushes Work to Data
Real-Time Query Performance
• Replace MapReduce
Schema-on-Read
• Pluggable Storage Format Handlers
Tight Integration with SQL Ecosystem Tools
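The "block-aware planner pushes work to data" point can be sketched in Python. Everything here is illustrative: the `block_locations` map stands in for block metadata of the kind the HDFS NameNode tracks, and `plan_tasks` is a hypothetical greedy assignment, not the planner used by any system named in this talk.

```python
from collections import defaultdict
from typing import Dict, List


def plan_tasks(block_locations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign each block to a node that already stores a replica of it,
    preferring the replica holder with the fewest tasks assigned so far."""
    load: Dict[str, int] = defaultdict(int)
    plan: Dict[str, List[str]] = defaultdict(list)
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        # Ties are broken by node name so the plan is deterministic.
        node = min(replicas, key=lambda n: (load[n], n))
        plan[node].append(block_id)
        load[node] += 1
    return dict(plan)


# Hypothetical block-to-replica map (block id -> nodes holding a copy).
block_locations = {
    "blk_1": ["node_a", "node_b"],
    "blk_2": ["node_a", "node_c"],
    "blk_3": ["node_b", "node_c"],
    "blk_4": ["node_a", "node_b"],
}
plan = plan_tasks(block_locations)
```

Every task in the resulting plan reads a block from local disk, so only the small per-block results have to cross the network.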
18. Examples of the New Architecture
Google Dremel
• Interactive ad hoc query system for read-only
nested data. Powers BigQuery.
Apache Drill
• Open source version of Dremel. Implemented in
Java. Work in progress.
Cloudera Impala
• Heavily Influenced by MonetDB/X100. Runtime
codegen. CPU cache aware. Implemented in C++.
Citus Data
• Built on PostgreSQL. Powerful cost based optimizer
for disk I/O. Handles failures.
19. The New Architecture in Detail:
CitusDB
[Diagram: PostgreSQL tools and ODBC/JDBC clients connect to the CitusDB master node (metadata, distributed query planner, distributed query executor). Each of three datanodes runs HDFS alongside a local query planner, a local query executor, and foreign data wrappers. Hadoop metadata is held by the HDFS NameNode.]
20. CitusDB: Metadata Synchronization
[Diagram: the CitusDB architecture from slide 19, with a “Metadata Sync” link between the Hadoop metadata (HDFS NameNode) and the CitusDB master node.]
Statements shown:
CREATE TABLE emp
CREATE FOREIGN TABLE emp_{block_id} …
21. CitusDB: Query Execution
[Diagram: the CitusDB architecture from slide 19, with the following query shown.]
SELECT AVG(sal)
FROM emp
WHERE job = 'manager';
22. CitusDB: Query Execution
[Diagram: the CitusDB architecture from slide 19, with the local queries shown.]
Local Queries
SELECT SUM(sal), COUNT(sal)
FROM emp_{block_id}
WHERE job = 'manager';
23. CitusDB: Query Execution
[Diagram: the CitusDB architecture from slide 19, with the local results shown.]
Local Results
{842176.53, 8}
{1234283.00, 12}
{0.00, 0}
{125500.00, 1}
{523100.00, 3}
{785300.32, 5}
24. CitusDB: Query Execution
[Diagram: the CitusDB architecture from slide 19, with the final result shown.]
{121046.58}
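The query execution slides rely on AVG being decomposable: each local query returns a (SUM, COUNT) pair for its block, and the pairs are combined into one global average. The Python sketch below illustrates that combination step using the local results listed on the slide. The function name `combine_avg` is hypothetical and this is not CitusDB's code.

```python
from typing import Iterable, Optional, Tuple


def combine_avg(partials: Iterable[Tuple[float, int]]) -> Optional[float]:
    """Combine per-block (sum, count) pairs into a single global average.

    Returns None when no rows matched at all, mirroring SQL's AVG
    over an empty set.
    """
    total = 0.0
    count = 0
    for block_sum, block_count in partials:
        total += block_sum
        count += block_count
    return total / count if count else None


# (SUM(sal), COUNT(sal)) pairs as listed under "Local Results".
local_results = [
    (842176.53, 8),
    (1234283.00, 12),
    (0.00, 0),
    (125500.00, 1),
    (523100.00, 3),
    (785300.32, 5),
]
average = combine_avg(local_results)
```

Averaging the per-block averages instead would give the wrong answer whenever blocks contribute different row counts, which is why the local queries return SUM and COUNT separately.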
25. Why We Chose PostgreSQL
- Powerful Cost-Based Optimizer
- Designed to minimize disk I/O
- Extensible, Rich Type System
- Pluggable Storage Format Handlers
- Lots of Extensions:
- Geospatial, Full Text Search, JSON, etc…
- Enterprise Features:
- ODBC/JDBC
- Security
- Internationalization
26. Defining the New Generation of
Distributed Analytic Databases
SQL: Ease of Use, Increased Productivity
Real-time responsiveness: Faster
Data Locality: Proven Scalability
Schema-on-Read: Flexibility, Lower Cost
27. Where Are We At?
CitusDB SQL on Hadoop is in Open Beta
Download our Binary Packages
Or Use Our EC2 AMI
http://citusdata.com/docs/sql-on-hadoop
Databases are tools that let you ask questions about data. The architecture of a database depends heavily on the design of the system that stores the data. Hadoop, and HDFS in particular, represents a radical change to the underlying storage infrastructure. To capitalize on this change, we need to redesign the database from the ground up. That’s the goal of these new systems.
Make sure we’re on the same page.
Next: Enterprise Storage Model
Availability - Fault tolerance through RAID
Accessibility - Shared files - POSIX file API
Problems:
- Cost
- Scalability
Outro: Folks at Google were aware of these problems when they were building their search engine.
- Fibre channel,
Distributed Block Store
ACM interview with Sean Quinlan and Kirk McKusick: http://queue.acm.org/detail.cfm?id=1594206
Did this solve the problem?
Commodity: yes
Fault tolerance: yes
Scalability: no
MR is the missing piece
Outro:
2005: Mike Cafarella, Doug Cutting
Nutch
Doug Cutting and Mike Cafarella launched the Hadoop project a year later. HDFS + MapReduce