HBase and Drill: How Loosely Typed SQL Is Ideal for NoSQL

© 2014 MapR Technologies
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & Mahout
VP of Incubator at the Apache Software Foundation
Email: tdunning@apache.org, tdunning@maprtech.com
Twitter: @ted_dunning
Hashtag today: #BDE2015

Agenda
• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions

What Does Good Mean (for a DB)?
• Expressive
– Must express the concepts we need
• Efficient
– Must run fast enough on cheap enough hardware
• Introspectable
– Must be able to inspect the data and schema and gain understanding

What is New Here
• Introspection is better when
– A minimum of data entities are used to describe our model
– No name overflow
– Referential scoping helps narrow our focus to a simpler problem
– Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational model
• Introspection was therefore not a result either

Older than Dirt
• Relational theory is old (1970)
– Predates data structures
– Predates mainstream recursive procedures
– Predates lexical scoping
– Predates logic programming
– Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
• Some updates are in order to enhance introspection

Contrast Relational and HBase Style NoSQL
Relational
• Rows contain fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions

Contrast Relational and HBase with Structuring
Relational
• Rows contain fields
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase + Structuring
• Rows contain fields
– Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key

Turtle Models for Databases
• Allow complex objects in field values
– JSON style lists and objects
• Allow references to objects via join
– Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables

Proviso and Warning
• This is not your father’s BLOB
• And not the same as arrays with lateral view joins
• Rationale to come as we talk about idioms

A Catalog of NoSQL Idioms

Tables as Objects, Objects as Tables
Row-wise form (columns c1, c2, c3) as a list of objects:
[ { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 } ]
Column-wise form (columns c1, c2, c3) as an object containing lists:
{ c1:[v1, v2, v3],
  c2:[v1, v2, v3],
  c3:[v1, v2, v3] }
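
The two forms carry the same information, so one can be converted into the other and back. A minimal Python sketch of that round trip, assuming every row has the same keys (the function names are illustrative, not part of HBase or Drill):

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn a list of objects (row-wise) into an object of lists (column-wise)."""
    names: list[str] = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)  # keep first-seen column order
    # A key missing from a row becomes None so all columns stay aligned.
    return {name: [row.get(name) for row in rows] for name in names}


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn an object of lists (column-wise) back into a list of objects."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]
```

For uniform rows, `columns_to_rows(rows_to_columns(rows))` gives back the original list.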

Micro Columnar Formats
• An entire table stored in columnar form can be a first-class value using these techniques
• This is very powerful for in-lining one-to-many relations

Note
• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible
• Thus, embedded first-class objects imply late discovery of schema information
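
Late discovery means the schema is read off the data at query time instead of being declared up front. A rough Python sketch of the idea, with a hypothetical `discover_schema` helper (this is not how Drill implements it):

```python
from typing import Any


def discover_schema(rows: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Collect, per field name, the set of value types actually seen in the data."""
    schema: dict[str, set[str]] = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema
```

A field that is an `int` in one row and a `str` in another shows up with both types, and a field that only some rows have still appears in the result.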

A first example: Time-series data

Column names as data
• When column names are not pre-defined, they can convey information
• Examples
– Time offsets within a window for time series
– Top-level domains for web crawlers
– Vendor IDs for customer purchase profiles
• Predefined schema is impossible for this idiom
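
For the time-series case, the column name can hold the offset of a sample within its window. A small Python sketch, assuming a one-hour window and a `metric:window_start` row key (both are illustrative choices, not taken from the slides):

```python
WINDOW_SECONDS = 3600  # assumed window length


def to_cell(metric: str, timestamp: int, value: float) -> tuple[str, str, float]:
    """Map one sample to (row key, column name, value)."""
    window_start = timestamp - timestamp % WINDOW_SECONDS
    offset = timestamp - window_start
    # The column name is data: it is the offset inside the window.
    return f"{metric}:{window_start}", str(offset), value


def from_cell(row_key: str, column: str, value: float) -> tuple[str, int, float]:
    """Recover (metric, timestamp, value) from a stored cell."""
    metric, window_start = row_key.rsplit(":", 1)
    return metric, int(window_start) + int(column), value
```

Because the set of offsets depends on when samples arrive, the column names cannot be listed in a schema ahead of time.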

Relational Model for Time-series

Table Design: Point-by-Point

Table Design: Hybrid Point-by-Point + Sub-table
After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
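
A sketch of the restating step in Python, assuming the point-by-point cells of one row arrive as an `{offset: value}` mapping and the packed table is stored as JSON under a hypothetical `compressed:table` column (the real storage format is not given in the slides):

```python
import json


def restate_window(row: dict[str, float]) -> dict[str, str]:
    """Pack point-by-point cells {offset: value} into one column-wise value."""
    offsets = sorted(row, key=int)  # column names are numeric offsets
    table = {
        "offset": [int(o) for o in offsets],
        "value": [row[o] for o in offsets],
    }
    # One cell now holds the whole window as a small columnar table.
    return {"compressed:table": json.dumps(table)}
```

After this step a reader fetches one cell per window instead of one cell per sample.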

Compression Results
• Samples are 64-bit time, 16-bit sample
• Sampled at 10 kHz
• Sample time jitter makes it important to keep the original time-stamp
• How much overhead to retain the time-stamp?
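
One way to keep the original time-stamps cheaply is to store only each sample's deviation from the nominal 100 µs spacing. The Python sketch below is an assumed encoding for illustration, not the method measured on the slide, and it assumes microsecond time-stamps whose jitter fits in a signed 16-bit value:

```python
import struct

PERIOD_US = 100  # nominal spacing at 10 kHz, in microseconds


def encode_times(times_us: list[int]) -> bytes:
    """Store the first time-stamp in 64 bits, then each step's jitter in 16 bits."""
    jitters = [b - a - PERIOD_US for a, b in zip(times_us, times_us[1:])]
    return struct.pack(f"<q{len(jitters)}h", times_us[0], *jitters)


def decode_times(blob: bytes) -> list[int]:
    """Rebuild the exact original time-stamps from the packed form."""
    count = (len(blob) - 8) // 2
    first, *jitters = struct.unpack(f"<q{count}h", blob)
    times = [first]
    for jitter in jitters:
        times.append(times[-1] + PERIOD_US + jitter)
    return times
```

With this layout each time-stamp after the first costs 2 bytes instead of 8, before any general-purpose compression is applied.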

A second example: Music meta-data

MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
– Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
– (but only 4 for recording!)
– Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
– 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
– And 50 more tables that aren’t documented yet

180 tables not shown

236 tables to describe 7 kinds of things

Can we do better?

artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>, where each alias is { name, begin_date, end_date }
list<release_id>
list<recording_id>

recording
id
gid
list<credit>
name
list<track_ref>

release
id
gid
release_group_id
list<credit>
name
barcode
status
packaging
language
script
list<medium>, where each medium is { id, format, name, list<track> }
and each track is { id, recording_id, name, list<credit>, length }

release_group
id
gid
name
list<credit>
type
list<release_id>

27 tables reduce to 4 so far
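
The reduction comes from folding child tables into list-valued fields on their parent. A minimal Python sketch of that in-lining step, assuming parent rows carry an `id` field and using a hypothetical `inline_children` helper:

```python
from typing import Any


def inline_children(
    parents: list[dict[str, Any]],
    children: list[dict[str, Any]],
    foreign_key: str,
    list_name: str,
) -> list[dict[str, Any]]:
    """Fold a child table into a list-valued field on each parent row."""
    by_parent: dict[Any, list[dict[str, Any]]] = {}
    for child in children:
        child = dict(child)  # copy so the input rows are not modified
        by_parent.setdefault(child.pop(foreign_key), []).append(child)
    return [{**p, list_name: by_parent.get(p["id"], [])} for p in parents]
```

Applied to an artist table and an alias table, this yields one artist record with a `list<alias>` field, and the separate alias table is no longer needed.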

Further Reductions
• All 86 link tables become properties on artists, releases and other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea

Is This Good?
• Expressivity
– The JSON data model is at least as expressive as the original relational model
• Many cases easier to describe in nested data
• No cases are harder
• Efficiency
– Inlining can increase data size. Locality improves, however
– Sessionizing can substantially decrease data size
– Inlining back-references is more efficient than ordinary indexes
– Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)

But How Can We Query This?
• Can’t use SQL
– SQL is strongly typed
– SQL is heavily tied into the original relational model
– SQL generating tools require relational model
• Must use SQL
– Vast numbers of tools and people understand how to write SQL
– SQL is the lingua franca of databases

Squaring the Circle
• Enter Apache Drill
• Drill is SQL compliant
– Uses standard syntax and semantics
• Drill extends SQL
– First class treatment of objects, lists
– Full support for destructuring, flattening
– Full power of relational model can be applied to complex data

Drill Provides Scalable and Extended SQL

Sample Query
• Find Elvis
select distinct a.id, a.name, a.alias.name as alias_name
from (
  -- one output row per element of the alias list
  select t.id, t.name, flatten(t.alias) as alias from artist t
) a
where a.alias.name like 'Elvis%Presley'
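
To make the flatten step concrete, here is a rough Python imitation of what the query does: emit one row per alias, filter on the alias name, then de-duplicate. The helper names and the sample rows in the usage note are made up, and the SQL `%` wildcard is written as `*` because the sketch uses `fnmatchcase`:

```python
from fnmatch import fnmatchcase
from typing import Any, Iterator


def flatten(rows: list[dict[str, Any]], field: str) -> Iterator[dict[str, Any]]:
    """Yield one copy of each row per element of row[field]."""
    for row in rows:
        for item in row.get(field) or []:
            yield {**row, field: item}


def find_by_alias(artists: list[dict[str, Any]], pattern: str) -> list[tuple]:
    """Return distinct (id, name, alias name) where the alias matches pattern."""
    return sorted(
        {
            (row["id"], row["name"], row["alias"]["name"])
            for row in flatten(artists, "alias")
            if fnmatchcase(row["alias"]["name"], pattern)
        }
    )
```

For an artist list holding one record with aliases "Elvis Aaron Presley" and "The King", `find_by_alias(artists, "Elvis*Presley")` keeps only the first alias.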

Example Query
• Find discs where Elvis was credited
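select distinct albums.album_id, albums.name
from
(
  -- one row per (release, credit); assumes each credit carries an artist_id field
  select r.album_id, r.name, r.credit.artist_id as artist_id
  from (select id album_id, name, flatten(credit) credit from release) r
) albums
join
(
  select distinct a.artist_id
  from (select id artist_id, flatten(alias) alias from artist) a
  where a.alias.name like 'Elvis%Presley'
) artists
on albums.artist_id = artists.artist_id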
Summary
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today

Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook

Thank You!

Q&A
@mapr maprtech
tdunning@maprtech.com
Engage with us!
MapR
maprtech
mapr-technologies
