SQL on Hadoop: Defining the New Generation of Analytics Databases

Hadoop Summit, June 2013

Speaker Bio: Carl Steinbach
Currently:
Engineer @ Citus Data
PMC Chair, Committer -- Apache Hive Project
Formerly:
Oracle, NetApp, Informatica, Cloudera
Twitter: @cwsteinbach
LinkedIn: carlsteinbach

This talk is about:
A New Type of Distributed Analytic Database

What Is an Analytic Database?
OLAP: Online Analytical Processing
Consolidation (Roll-up)
Drill-down
Slicing and Dicing
No Transactions
Large Sequential Scans
I/O Bound
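
As a rough illustration of the three OLAP operations named on this slide, the sketch below runs them over a tiny in-memory fact table. The SALES rows and the helper names aggregate and slice_rows are invented for this example and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical fact table: one row per sale.
SALES = [
    {"region": "EU", "country": "DE", "year": 2012, "amount": 10},
    {"region": "EU", "country": "FR", "year": 2012, "amount": 20},
    {"region": "EU", "country": "DE", "year": 2013, "amount": 30},
    {"region": "US", "country": "US", "year": 2013, "amount": 40},
]


def aggregate(rows, dims):
    """Sum 'amount' grouped by the given dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


def slice_rows(rows, **fixed):
    """Keep only rows whose dimensions match the fixed values (a slice)."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


# Roll-up: aggregate at the coarse 'region' level.
rollup = aggregate(SALES, ["region"])
# Drill-down: add the finer 'country' level to the same aggregate.
drilldown = aggregate(SALES, ["region", "country"])
# Slice: fix year = 2013, then aggregate what remains.
sliced = aggregate(slice_rows(SALES, year=2013), ["region"])

if __name__ == "__main__":
    print(rollup)
    print(drilldown)
    print(sliced)
```

Each operation is a full pass over the rows with no updates, which is why such workloads tend to be dominated by large sequential scans.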

Motivation:
The Problem with Enterprise Storage
[Diagram: a storage tier (NAS/SAN) connected through a "really big pipe" to a server/worker tier of three servers.]

Google File System (’03)
A Possible Solution?
Design Priorities
• Commodity Hardware
• Fault Tolerance
• Big Files / Big Blocks
• Big Sequential Reads/Writes
Design Tradeoffs
• No random writes (write once/read many)
• Slow random reads
• Not POSIX compliant

So GFS Solved the Problem?
- Yes, but not because of anything described in the original paper
- Client/Server approach won’t scale
- Full scope of GFS revealed one year later with publication of MapReduce (’04) paper.
GFS + MapReduce Key Idea: Eliminate I/O Bottleneck by Colocating Compute and Storage Resources on the Same Node
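
A back-of-the-envelope sketch of why colocation removes the bottleneck. All bandwidth figures and the function names below are hypothetical, chosen only to make the arithmetic easy to follow. The model assumes data is spread evenly across nodes and ignores replication and any network shuffle.

```python
def shared_storage_scan_seconds(data_gb, pipe_gb_per_s):
    """Every byte crosses the single link between the storage tier and the servers."""
    return data_gb / pipe_gb_per_s


def colocated_scan_seconds(data_gb, nodes, disk_gb_per_s):
    """Each node scans only its local share, so disk bandwidth adds up across nodes."""
    return data_gb / (nodes * disk_gb_per_s)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 GB scanned once.
    print(shared_storage_scan_seconds(10_000, pipe_gb_per_s=2.0))
    print(colocated_scan_seconds(10_000, nodes=100, disk_gb_per_s=0.5))
```

In the shared-storage model, adding servers does not shorten the scan because the pipe is the limit. In the colocated model, scan time falls as nodes are added.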

What’s Good About Hadoop?
Commodity Storage
Scale-out
Fault Tolerance
Flexibility
MapReduce
Multi-structured Data

What’s Bad About Hadoop?
MapReduce!
No Schemas!
Missing Features: Optimizer, Indexes, Views
Incompatibility with Existing Tools: BI, ETL, IDEs

Apache Hive Solved Many of These Problems
SQL to MapReduce
Compiler + Execution Engine
Pluggable Storage Layer (SerDes)
Schema-on-Read
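
To make schema-on-read concrete, here is a minimal sketch in which raw text lines are stored untouched and a schema is applied only when they are read. This is not Hive's SerDe interface. The deserialize function, the schemas, and the sample lines are assumptions made up for illustration.

```python
def deserialize(line, schema, delimiter=","):
    """Turn one raw text line into a typed row using a schema supplied at read time."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


# The stored data is just text; nothing about it fixes the column types.
RAW_LINES = ["alice,manager,1200.50", "bob,clerk,800"]

# Two readers can interpret the same bytes with different schemas.
TYPED_SCHEMA = [("name", str), ("job", str), ("sal", float)]
TEXT_SCHEMA = [("name", str), ("job", str), ("sal", str)]

if __name__ == "__main__":
    for line in RAW_LINES:
        print(deserialize(line, TYPED_SCHEMA))
        print(deserialize(line, TEXT_SCHEMA))
```

Because the schema lives with the reader rather than the file, supporting a new storage format only requires a new deserializer, which is the role the pluggable storage layer plays.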

But Other Problems Remained
Many Missing Features:
• ANSI SQL
• Cost Based Optimizer
• UDFs
• Data Types
• Security
• …
Biggest Problem:
• MapReduce Latency Overhead

Work in Progress: Hive Improvements
Stinger Initiative:
• Columnar Query Engine
• ORCFile File Format
• Replace MR with Tez (Apache Incubator)

One Solution: MPP Database + Hadoop Connector
[Diagram: an MPP database cluster (an MPP master node running the global query executor, plus four MPP worker nodes each running a local query executor) next to a separate Hadoop cluster of four HDFS datanodes.]

One Solution:
[Diagram: the MPP master node's global query executor and four local query executors; the local query executors pull data from four HDFS datanodes.]

One Solution:
[Diagram: the MPP master node's global query executor and four local query executors pulling data from four HDFS datanodes; the pull path is labeled "IO Bottleneck".]

A Better Solution:
New Architecture for SQL on Hadoop
[Diagram: the MPP master node's global query executor and four local query executors, each shown with an HDFS datanode. Labels: "Maintain Data Locality" and "Push Work To Data".]

New Architecture for SQL on Hadoop
Data Locality
• Block-Aware Query Planner Pushes Work to Data
Real-Time Query Performance
• Replace MapReduce
Schema-on-Read
• Pluggable Storage Format Handlers
Tight Integration with SQL Ecosystem Tools
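
One way to picture a block-aware planner is a function that, for every block, picks a node that already stores a replica of it. The sketch below uses a simple greedy load balance. The plan_tasks function, the block ids, and the node names are hypothetical, and no real system's planner is claimed to work exactly this way.

```python
from collections import defaultdict


def plan_tasks(block_locations):
    """Assign each block to one of the nodes holding a replica, balancing load greedily."""
    load = defaultdict(int)
    assignment = {}
    for block_id in sorted(block_locations):
        hosts = block_locations[block_id]
        # Prefer the least-loaded replica holder; break ties by name for determinism.
        chosen = min(hosts, key=lambda h: (load[h], h))
        assignment[block_id] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    # Hypothetical replica map: block id -> nodes storing a copy.
    locations = {
        "blk_1": ["node-a", "node-b"],
        "blk_2": ["node-a", "node-c"],
        "blk_3": ["node-a", "node-b"],
    }
    print(plan_tasks(locations))
```

Every task lands on a node that can read its block from local disk, so no block has to cross the network before it is scanned.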

Examples of the New Architecture
Google Dremel
• Interactive ad hoc query system for read-only nested data. Powers BigQuery.
Apache Drill
• Open source version of Dremel. Implemented in Java. Work in progress.
Cloudera Impala
• Heavily influenced by MonetDB/X100. Runtime codegen. CPU cache aware. Implemented in C++.
Citus Data
• Built on PostgreSQL. Powerful cost-based optimizer for disk I/O. Handles failures.

The New Architecture in Detail: CitusDB
[Diagram: PostgreSQL tools and ODBC/JDBC clients connect to the CitusDB master node, which holds metadata, a distributed query planner, and a distributed query executor. Three HDFS datanodes each run a local query planner; the first is also shown with a local query executor and foreign data wrappers. Hadoop metadata is held by the HDFS NameNode.]

CitusDB: Metadata Synchronization
[Diagram: the CitusDB master node (metadata, distributed query planner, distributed query executor), three HDFS datanodes each running a local query planner, the HDFS NameNode holding Hadoop metadata, and PostgreSQL tools and ODBC/JDBC clients. A "Metadata Sync" arrow links the Hadoop metadata to the CitusDB metadata.]
CREATE TABLE emp
CREATE FOREIGN TABLE emp_{block_id} …
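
The slide suggests one foreign table per HDFS block, named emp_{block_id}. A minimal sketch of that mapping follows, assuming the block list arrives as (block_id, hosts) pairs. That input shape, the shards_for_table helper, and the sample ids and host names are all hypothetical, and the body of the CREATE FOREIGN TABLE statement is left out because the slide elides it.

```python
def shards_for_table(table, blocks):
    """Map each HDFS block of a logical table to a per-block shard name and its hosts."""
    return [
        {"shard_table": f"{table}_{block_id}", "hosts": list(hosts)}
        for block_id, hosts in blocks
    ]


if __name__ == "__main__":
    # Hypothetical block report: (block_id, datanodes holding a replica).
    blocks = [(101, ["dn1", "dn2"]), (102, ["dn2", "dn3"])]
    for shard in shards_for_table("emp", blocks):
        print(shard)
```

Keeping the hosts next to each shard name is what would later let a planner send a shard's query to a node that stores the block.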

CitusDB: Query Execution
[Diagram: PostgreSQL tools and ODBC/JDBC clients submit the query below to the CitusDB master node (metadata, distributed query planner, distributed query executor). Three HDFS datanodes each run a local query planner; Hadoop metadata is held by the HDFS NameNode.]
SELECT AVG(sal)
FROM emp
WHERE job = 'manager';

[Diagram: the CitusDB master node (metadata, distributed query planner, distributed query executor) sends local queries to the local query planners on three HDFS datanodes.]
Local Queries
SELECT SUM(sal), COUNT(sal)
FROM emp_{block_id}
WHERE job = 'manager';
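
The local query above can be tried out with Python's built-in sqlite3 module standing in for the per-block foreign tables. This is only a sketch: the two in-memory shards, their rows, and the helper names are invented for the example, and SQLite is used purely because it is easy to run, not because the system in the talk uses it.

```python
import sqlite3

# Template taken from the slide; {block_id} is filled in per shard.
LOCAL_QUERY = "SELECT SUM(sal), COUNT(sal) FROM emp_{block_id} WHERE job = 'manager'"


def run_local_queries(conn, block_ids):
    """Run the rewritten query against each per-block table and collect (sum, count) pairs."""
    partials = []
    for block_id in block_ids:
        total, count = conn.execute(LOCAL_QUERY.format(block_id=block_id)).fetchone()
        # SUM over zero rows is NULL in SQL; treat it as 0.0 so partials stay additive.
        partials.append((total or 0.0, count))
    return partials


def make_demo_shards():
    """Build two tiny in-memory shards standing in for per-block foreign tables."""
    conn = sqlite3.connect(":memory:")
    rows_by_block = {
        1: [("manager", 100.0), ("clerk", 40.0), ("manager", 300.0)],
        2: [("clerk", 50.0)],
    }
    for block_id, rows in rows_by_block.items():
        conn.execute(f"CREATE TABLE emp_{block_id} (job TEXT, sal REAL)")
        conn.executemany(f"INSERT INTO emp_{block_id} VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(run_local_queries(make_demo_shards(), [1, 2]))
```

A shard with no matching rows still reports a (sum, count) pair, which matches the shape of a zero-valued entry in a list of local results.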

[Diagram: the HDFS datanodes return local results to the CitusDB master node (metadata, distributed query planner, distributed query executor).]
Local Results
{842176.53, 8}
{1234283.00, 12}
{0.00, 0}
{125500.00, 1}
{523100.00, 3}
{785300.32, 5}
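
The master can merge partial results of this shape by summing the sums and the counts separately and dividing once at the end. The sketch below applies that to the six (sum, count) pairs listed on this slide. The merge_partials name is an assumption made for illustration.

```python
# (SUM(sal), COUNT(sal)) pairs copied from the Local Results slide.
LOCAL_RESULTS = [
    (842176.53, 8),
    (1234283.00, 12),
    (0.00, 0),
    (125500.00, 1),
    (523100.00, 3),
    (785300.32, 5),
]


def merge_partials(partials):
    """Combine per-shard (sum, count) pairs into a global average, or None if no rows matched."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(round(merge_partials(LOCAL_RESULTS), 2))
```

Shipping SUM and COUNT rather than a per-shard AVG matters because shards hold different numbers of matching rows, so an average of per-shard averages would weight them incorrectly.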

[Diagram: the CitusDB master node merges the local results and returns the final result to the PostgreSQL tools and ODBC/JDBC clients.]
{121046.89}

Why We Chose PostgreSQL
- Powerful Cost-Based Optimizer
- Designed to minimize disk I/O
- Extensible, Rich Type System
- Pluggable Storage Format Handlers
- Lots of Extensions:
  - Geospatial, Full Text Search, JSON, etc.
- Enterprise Features:
  - ODBC/JDBC
  - Security
  - Internationalization

Defining the New Generation of Distributed Analytic Databases
SQL → Ease of Use, Increased Productivity
Real-time responsiveness → Faster
Data Locality → Proven Scalability
Schema-on-Read → Flexibility, Lower Cost

Where Are We At?
CitusDB SQL on Hadoop is in Open Beta
Download our Binary Packages
Or Use Our EC2 AMI
http://citusdata.com/docs/sql-on-hadoop

We’re Hiring!
http://citusdata.com/job
