SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
1. SQL on Hadoop: Defining the New Generation of Analytic Databases
Strata Conference, February 2013
2. Speaker Bio: Carl Steinbach
Currently:
Engineer at Citus Data
PMC Chair, Committer -- Apache Hive Project
@cwsteinbach on Twitter
Formerly:
Cloudera, Informatica, NetApp, Oracle
3. This is going to sound strange, but…
I used to think
databases were boring
4. Why?
Undergrad at MIT 1997-2001
Number of Database Classes: 0
Number of Database Faculty Members: 0
My Conclusion: Databases are a Dead Field
5. Things Changed Over the Next Couple of Years
I got a job!
Database Group Formed at MIT (2003)
- Mike Stonebraker
- Sam Madden
New Class: 6.830 Database Systems (2005)
6. What Changed?
Web-scale Data
New DB Research: Columnar Storage, NoSQL
MPP Analytic Databases Gained Market Traction
GFS (’03) and MapReduce (’04) Papers
Apache Hadoop – v0.1.0 released in 2006
7. What’s Good About Hadoop?
Commodity Storage
Scale-out
Flexibility
MapReduce
Multi-structured Data
8. What’s Bad About Hadoop?
MapReduce!
No Schemas!
Missing Features
- Optimizer, Indexes, Views
Incompatibility with Existing Tools
- BI, ETL, IDEs
9. Apache Hive Solved Many of These Problems
[Diagram: Hive architecture. Components shown: User clients (Hive CLI; ETL, BI, and SQL IDE tools via Hive ODBC/JDBC) sending SQL queries to HiveServer2; the Hive MetaStore holding catalog metadata (table to files, table to format); the Compiler (SQL to MapReduce) and Rule Based Optimizer producing an MR plan for the Execution Coordinator; three Map/Reduce tasks, each with Hive Operators and Hive SerDes, running over three HDFS datanodes.]
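As a rough illustration of what the "SQL to MapReduce" box in the diagram implies, the sketch below hand-translates a simple GROUP BY count into map, shuffle, and reduce phases in plain Python. The `page_views` table, its columns, and all function names are hypothetical, and this is not the plan Hive itself generates.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table: (user, page)
ROWS = [("ann", "/home"), ("bob", "/home"), ("ann", "/docs"), ("cy", "/home")]


def map_phase(rows):
    """Emit (group key, 1) for every input row, as a mapper would."""
    for _user, page in rows:
        yield page, 1


def shuffle(pairs):
    """Group mapper output by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, as a reducer computing COUNT(*) would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(rows):
    # Similar in spirit to:
    #   SELECT page, COUNT(*) FROM page_views GROUP BY page
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

Even this one-line query needs a full map, shuffle, and reduce pass, which hints at where the per-query latency overhead of MapReduce comes from.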
10. But Other Problems Remained
MapReduce: Latency Overhead
Many Missing Features:
• ANSI SQL
• Cost Based Optimizer
• UDFs
• Data Types
• Security
• …
11. One Solution: Separate MPP DB Cluster
[Diagram: an MPP Database Cluster (one MPP Master Node with a Global Query Executor, four MPP Worker Nodes each with a Local Query Executor) shown next to a separate Hadoop Cluster of four HDFS datanodes.]
12. One Solution: Separate MPP DB Cluster
[Diagram: the same MPP cluster (Master Node with Global Query Executor, four Worker Nodes with Local Query Executors) and four HDFS datanodes, annotated "Pull Data to Work" and "IO Bottleneck".]
13. Better Solution:
A New Architecture for SQL on Hadoop
[Diagram: an MPP Master Node with a Global Query Executor and four Local Query Executors, each placed on an HDFS datanode, annotated "Push Work to Data" and "Maintain Data Locality".]
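To make the contrast between "pull data to work" and "push work to data" concrete, here is a minimal Python sketch that counts how many records cross the network under each approach for a grouped sum. The shard contents, node names, and function names are made up for illustration; nothing here reflects how any particular MPP database or CitusDB is implemented.

```python
# Hypothetical data: each datanode holds a shard of (region, amount) rows.
SHARDS = {
    "datanode1": [("east", 10), ("west", 5), ("east", 7)],
    "datanode2": [("west", 3), ("east", 1)],
    "datanode3": [("west", 8), ("west", 2), ("east", 4)],
}


def pull_data_to_work(shards):
    """Ship every row to a central executor, then aggregate there."""
    shipped = [row for rows in shards.values() for row in rows]
    totals = {}
    for region, amount in shipped:
        totals[region] = totals.get(region, 0) + amount
    return totals, len(shipped)  # (result, records moved)


def push_work_to_data(shards):
    """Aggregate on each node first, then ship only the partial sums."""
    totals, moved = {}, 0
    for rows in shards.values():
        local = {}
        for region, amount in rows:  # this loop runs next to the data
            local[region] = local.get(region, 0) + amount
        moved += len(local)  # one record per group leaves the node
        for region, subtotal in local.items():
            totals[region] = totals.get(region, 0) + subtotal
    return totals, moved


if __name__ == "__main__":
    print(pull_data_to_work(SHARDS))
    print(push_work_to_data(SHARDS))
```

Both functions return the same totals. The number of records moved grows with the row count in the first case and only with the number of groups per node in the second, which is the intuition behind avoiding the IO bottleneck.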
14. The New Architecture in Detail: CitusDB
[Diagram: CitusDB architecture. Clients (PostgreSQL tools, ODBC/JDBC) connect to the CitusDB Master Node, which contains Metadata, a Distributed Query Planner, and a Distributed Query Executor. Hadoop metadata is held by the HDFS NameNode. Each of three HDFS datanodes hosts a Local Query Planner, a Local Query Executor, and Foreign Data Wrappers.]
15. The New Architecture in Detail: CitusDB
[Diagram: same CitusDB architecture, with "Metadata Sync" highlighted between the CitusDB Master Node and the HDFS NameNode.]
Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
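One plausible use of the metadata gathered in Step 1 is locality-aware placement: if the master knows which hosts store each block, it can send each block's fragment query to a worker on one of those hosts. The sketch below is an assumption-laden illustration, not CitusDB's actual algorithm. The block map is invented data in a hypothetical shape, and the real NameNode interfaces are not modeled.

```python
# Hypothetical block-location metadata: {block_id: [hosts holding a replica]}.
BLOCK_LOCATIONS = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-b", "node-c"],
    "blk_3": ["node-a", "node-c"],
    "blk_4": ["node-b", "node-c"],
}


def assign_blocks(block_locations):
    """Assign each block to the least-loaded host that stores a replica of it."""
    load = {}
    assignment = {}
    for block_id in sorted(block_locations):
        hosts = block_locations[block_id]
        # Prefer the replica host with the fewest blocks so far; ties break by name.
        chosen = min(hosts, key=lambda h: (load.get(h, 0), h))
        assignment[block_id] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


if __name__ == "__main__":
    for block, host in assign_blocks(BLOCK_LOCATIONS).items():
        print(block, "->", host)
```

Every block is processed on a host that already stores it, so no block needs to be copied across the network before its fragment query runs.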
16. The New Architecture in Detail: CitusDB
[Diagram: same CitusDB architecture, with "User Query" highlighted between the clients and the CitusDB Master Node.]
Step 2) The user submits a SQL query to the CitusDB master node using the PostgreSQL CLI or a JDBC/ODBC app.
17. The New Architecture in Detail: CitusDB
[Diagram: same CitusDB architecture, with "Local Queries" highlighted between the CitusDB Master Node and the workers.]
Step 3) The Master Node generates an optimized global query plan and sends fragment queries to the workers.
18. The New Architecture in Detail: CitusDB
[Diagram: same CitusDB architecture, with "Local Results" highlighted between the workers and the CitusDB Master Node.]
Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send partial result sets back to the Master Node.
19. The New Architecture in Detail: CitusDB
[Diagram: same CitusDB architecture, with "Query Results" highlighted between the CitusDB Master Node and the clients.]
Step 5) The Master Node merges the partial result sets and returns the final result to the user.
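Steps 3 through 5 follow a scatter-gather pattern, which can be sketched in a few lines of Python. The example computes a distributed average: each worker returns a partial (sum, count) pair and the master merges them. The worker data and function names are hypothetical, threads stand in for remote worker processes, and this is only a sketch of the pattern, not CitusDB's planner or executor.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-worker data: latency samples stored on three workers.
WORKER_DATA = {
    "worker1": [120, 80, 100],
    "worker2": [60],
    "worker3": [90, 150],
}


def run_fragment(rows):
    """Worker side (Step 4): compute a partial result for AVG as (sum, count)."""
    return sum(rows), len(rows)


def merge_partials(partials):
    """Master side (Step 5): combine partial (sum, count) pairs into the final AVG."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


def distributed_avg(worker_data):
    """Master side (Step 3): send one fragment per worker, then merge the results."""
    with ThreadPoolExecutor() as pool:  # threads stand in for remote workers
        partials = list(pool.map(run_fragment, worker_data.values()))
    return merge_partials(partials)


if __name__ == "__main__":
    print(distributed_avg(WORKER_DATA))
```

The fragment returns (sum, count) rather than a local average because averaging the per-worker averages would weight the single-row worker as heavily as the three-row one and give the wrong answer.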
20. CitusDB: Standing on the Shoulders of Giants
+
Mature, Battle-tested
Proven Scalability
Enterprise Class Features
Cost Effectiveness
Has an Elephant Mascot
Has an Elephant Mascot
22. Leveraging PostgreSQL Features:
More than 300 Built-in Functions
QUOTE_LITERAL
REGR_SLOPE
COS
GREATEST
QUOTE_IDENT
SET_BYTE
STRING_TO_ARRAY
ENUM_RANGE
EXTRACT
REGR_SXY
REGR_R2
XMLFOREST
CONVERT_TO
NTH_VALUE
DIV
OVERLAPS
LAG
LAG
DATE_TRUNC
SIN
BTRIM
FLOOR
PI
FORMAT
TO_DATE
TRANSACTION_TIMESTAMP
LOWER
SQRT
TRUNC
ARRAY_AGG
LOWER_INC
REGR_SYY
CONCAT
RTRIM
STRIP
LTRIM
CHAR_LENGTH
IS FALSE
ARRAY_FILL
REGR_AVGY
XMLAGG
BETWEEN
CURRENT_TIMESTAMP
BROADCAST
JUSTIFY_DAYS
IS DISTINCT
UPPER
BOX
ARRAY_LENGTH
ISCLOSED
VAR_POP
TIMEOFDAY
COVAR_POP
CURRVAL
REPEAT
VAR_SAMP
OCTET_LENGTH
LN
NETMASK
LOCALTIME
UPPER
QUERY_TO_XML
STATEMENT_TIMESTAMP
TO_CHAR
FIRST_VALUE
LPAD
CASE
GET_BIT
TAN
TRUNC
LOWER_INF
REGR_AVGX
BOOL_AND
IS NOT UNKNOWN
ARRAY_APPEND
ISNULL
REGR_COUNT
DATE_PART
CORR
ENUM_LAST
XMLCOMMENT
SCHEMA_TO_XML
SET_MASKLEN
ARRAY_TO_STRING
XPATH_EXISTS
NUMNODE
REGEXP_MATCHES
COALESCE
NOW
EXTRACT
RADIUS
SPLIT_PART
CONVERT_FROM
ENUM_FIRST
ISOPEN
UPPER_INC
MOD
REPLACE
XPATH
BIT_AND
REGR_COUNT
TRANSLATE
AREA
EVERY
AT TIME ZONE
RADIANS
NOW
SQRT
ATAN2
IS TRUE
RANDOM
SUM
MIN
NOT LIKE
REGEXP_REPLACE
RPAD
CEILING
TRIM
TO_HEX
LOG
DECODE
NOW
WIDTH
STDDEV_POP
GET_BYTE
DATE_TRUNC
BOOL_OR
REGR_SXX
ROUND
LSEG
XML_IS_WELL_FORMED
VARIANCE
CUME_DIST
PATH
COVAR_SAMP
STRING_AGG
LASTVAL
UNNEST
OVERLAY
PERCENT_RANK
HOSTMASK
PCLOSE
HEIGHT
ANY
POINT
IN
ARRAY_DIMS
MASKLEN
DENSE_RANK
LOCALTIMESTAMP
JUSTIFY_INTERVAL
CURRENT_DATE
CURSOR_TO_XML
LIKE
SETVAL
LENGTH
POWER
UPPER_INF
GENERATE_SUBSCRIPTS
POSITION
LAST_VALUE
INITCAP
IS NOT TRUE
XMLAGG
PG_SLEEP
VAR_POP
STRPOS
SIGN
FORMAT
GENERATE_SERIES
STDDEV_SAMP
DENSE_RANK
COT
SUBSTR
REVERSE
REGR_INTERCEPT
SIMILAR TO
DATABASE_TO_XML
ARRAY_CAT
STDDEV
IS NOT FALSE
DIAMETER
NOTNULL
HOST
TO_ASCII
ABS
ROW_TO_JSON
ROW_NUMBER
SUBSTRING
SETSEED
ISFINITE
SOME
SET_BIT
ARRAY_NDIMS
REGEXP_SPLIT_TO_ARRAY
TO_TIMESTAMP
NOT
MD5
23. Leveraging PostgreSQL Features
Extensible, Rich Type System
Pluggable Format Handlers
Security
Internationalization
Connectivity: ODBC, JDBC
Ecosystem Add-Ons:
PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc.)
24. Where are We Headed?
Distributed. SQL. Anywhere.
[Diagram: the CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor) above three workers, each with a Local Query Planner, a Local Query Executor, and a Foreign Data Wrapper. The three wrappers sit over different stores: HDFS on a Hadoop Datanode, mongod on a MongoDB Shard, and an RDBMS on an RDBMS server.]
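The idea behind the diagram is that one local query layer can run over different stores as long as each store is wrapped behind a common row-scanning interface. The Python sketch below illustrates that idea only. The `ForeignSource` protocol and both source classes are hypothetical names invented here, in-memory lists stand in for HDFS files and MongoDB documents, and none of this is the PostgreSQL foreign data wrapper API.

```python
from typing import Dict, Iterable, Iterator, List, Protocol

Row = Dict[str, object]


class ForeignSource(Protocol):
    """Hypothetical wrapper interface: anything that can yield rows as dicts."""

    def scan(self) -> Iterator[Row]: ...


class DelimitedTextSource:
    """Stands in for a wrapper over delimited text files (for example, files in HDFS)."""

    def __init__(self, lines: Iterable[str], columns: List[str], sep: str = "\t"):
        self.lines, self.columns, self.sep = lines, columns, sep

    def scan(self) -> Iterator[Row]:
        for line in self.lines:
            yield dict(zip(self.columns, line.rstrip("\n").split(self.sep)))


class DocumentSource:
    """Stands in for a wrapper over a document store such as MongoDB."""

    def __init__(self, documents: Iterable[dict], columns: List[str]):
        self.documents, self.columns = documents, columns

    def scan(self) -> Iterator[Row]:
        for doc in self.documents:
            # Missing fields become None so every row has the same columns.
            yield {c: doc.get(c) for c in self.columns}


def count_where(source: ForeignSource, column: str, value: object) -> int:
    """A 'local query' that works unchanged on any wrapped source."""
    return sum(1 for row in source.scan() if row[column] == value)


if __name__ == "__main__":
    text = DelimitedTextSource(["1\tok", "2\tfail", "3\tok"], ["id", "status"])
    docs = DocumentSource([{"id": 1, "status": "ok"}, {"id": 2}], ["id", "status"])
    print(count_where(text, "status", "ok"), count_where(docs, "status", "ok"))
```

The query function never learns which store it is reading from, which is what lets the same planner and executor sit on top of each kind of node.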
25. Defining the New Generation of
Distributed Analytic Databases
SQL → Ease of Use, Increased Productivity
Real-time responsiveness → Faster
Data Locality → Proven Scalability
Schema-on-Read → Flexibility, Lower Cost
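Schema-on-read means the raw files are stored as-is and a schema is applied only when a query reads them. A minimal Python sketch of that idea follows, assuming comma-delimited text. The sample log lines, column names, and `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Raw file contents as they might sit in HDFS: no schema was declared at load time.
RAW = "2013-02-26,/home,200\n2013-02-26,/docs,404\n2013-02-27,/home,200\n"

# Hypothetical schema, supplied only when the data is read.
SCHEMA = [("day", str), ("path", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text and cast each field according to the schema."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW, SCHEMA))

# Similar in spirit to: SELECT path FROM logs WHERE status >= 400
errors = [r["path"] for r in rows if r["status"] >= 400]

if __name__ == "__main__":
    print(errors)
```

Because the schema lives outside the data, a different schema can be applied to the same files later without rewriting or reloading them, which is where the flexibility and lower cost come from.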
26. Where Are We At?
CitusDB SQL on Hadoop is in Open Beta
Download our Binary Packages
Or Use Our EC2 AMI
http://citusdata.com/docs/sql-on-hadoop