HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL
© 2014 MapR Technologies
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & Mahout
VP of Incubator at Apache Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning
Hashtag today: #BDE2015
Agenda
• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions
What Does Good Mean (for a DB)?
• Expressive
– Must express the concepts we need
• Efficient
– Must run fast enough on cheap enough hardware
• Introspectable
– Must be able to inspect the data and schema and gain understanding
What is New Here
• Introspection is better when
– A minimum of data entities are used to describe our model
– No name overflow
– Referential scoping helps narrow our focus to a simpler problem
– Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational
model
• Introspection was therefore not a result either
Older than Dirt
• Relational theory is old (1970)
– Predates data structures
– Predates mainstream recursive procedures
– Predates lexical scoping
– Predates logic programming
– Predates real functional programming (Church, McCarthy, Iverson and
Backus notwithstanding)
• Some updates are in order to enhance introspection
Contrast Relational and HBase Style noSQL
Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions
Contrast relational and HBase with Structuring
Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase + Structuring
• Rows contain fields
• Fields contain primitive types
– Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key
Turtle Models for Databases
• Allow complex objects in field values
– JSON style lists and objects
• Allow references to objects via join
– Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables
Proviso and Warning
• This is not your father’s BLOB
• And not the same as arrays with lateral view joins
• Rationale to come as we talk about idioms
A Catalog of noSQL Idioms
Tables as Objects, Objects as Tables
Row-wise form: list of objects
[ { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 } ]
Column-wise form: object containing lists
{ c1:[v1, v2, v3],
  c2:[v1, v2, v3],
  c3:[v1, v2, v3] }
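The equivalence between the two forms can be sketched in a few lines of Python. This is only an illustration: the function names `to_columns` and `to_rows` are made up here, and the sketch assumes every row carries the same keys.

```python
def to_columns(rows):
    """Turn a list of objects (row-wise) into an object of lists (column-wise)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_rows(columns):
    """Turn an object of lists (column-wise) back into a list of objects."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"c1": 1, "c2": "a", "c3": True},
    {"c1": 2, "c2": "b", "c3": False},
    {"c1": 3, "c2": "c", "c3": True},
]
columns = to_columns(rows)

# The round trip loses nothing, which is what "isomorphic" means here.
assert to_rows(columns) == rows
```

Because the conversion is lossless in both directions, either form can be stored as a field value and treated as a table when queried.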
Micro Columnar Formats
An entire table stored in columnar form can be a first-class value
using these techniques
This is very powerful for in-lining one-to-many relations.
Note
• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that
elevate tables to top-level are impossible
• Thus, embedded first-class objects imply late discovery of
schema information
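One way to picture late discovery of schema is a pass over the data that records which fields and types actually occur. The sketch below is an assumption-laden illustration (the helper `discover_schema` and the sample records are invented), not a description of how any particular engine does it.

```python
def discover_schema(records):
    """Collect, for each field name, the set of Python type names seen in the data."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


# Ragged, loosely typed records: fields come and go, and "id" changes type.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "tags": ["x", "y"]},
    {"id": "3", "name": "c"},
]
schema = discover_schema(records)
```

Here the schema is a result of reading the data rather than an input to it, which is the point of the note above.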
A first example:
Time-series data
Column names as data
• When column names are not pre-defined, they can convey
information
• Examples
– Time offsets within a window for time series
– Top-level domains for web crawlers
– Vendor IDs for customer purchase profiles
• Predefined schema is impossible for this idiom
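The time-series case can be sketched as follows. The window size, the tuple layout of `points` and the function name `to_wide_rows` are all assumptions made for this illustration; the slides do not specify them.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumed window size, not taken from the slides


def to_wide_rows(points):
    """Group (metric, timestamp, value) points into one row per metric and window.

    Each cell's column name is the offset in seconds from the window start,
    so the column names themselves carry data.
    """
    rows = defaultdict(dict)
    for metric, timestamp, value in points:
        window_start = timestamp - timestamp % WINDOW_SECONDS
        offset = timestamp - window_start
        rows[(metric, window_start)][str(offset)] = value
    return dict(rows)


points = [
    ("cpu", 7200, 0.25),
    ("cpu", 7205, 0.5),
    ("cpu", 10800, 0.75),
]
wide = to_wide_rows(points)
```

The set of column names depends on when samples happened to arrive, so it cannot be declared ahead of time.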
Relational Model for Time-series
Table Design: Point-by-Point
Table Design: Hybrid Point-by-Point + Sub-table
After close of window, data in row is restated as column-oriented
tabular value in different column family.
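The restating step might look like the sketch below. It assumes a row is a mapping from offset-named columns to values, and the output field names `offset` and `value` are hypothetical; the actual on-disk layout used in the talk is not shown in the slides.

```python
def restate_as_columnar(row):
    """Collapse {offset_column: value} cells into one columnar value.

    The result is an object containing two parallel lists, sorted by offset.
    """
    offsets = sorted(int(name) for name in row)
    return {
        "offset": offsets,
        "value": [row[str(offset)] for offset in offsets],
    }


# A closed window whose cells arrived out of order.
row = {"5": 0.5, "0": 0.25, "12": 0.75}
blob = restate_as_columnar(row)
```

After this step the whole window is a single value, which is what makes it a candidate for compact columnar encoding.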
Compression Results
Samples are 64-bit time, 16-bit sample
Sampling rate is 10 kHz
Sample time jitter makes it important to keep the original time-stamp
How much overhead to retain the time-stamp?
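The uncompressed cost follows directly from the figures on the slide, and a simple residual encoding shows why jittered time-stamps can still be kept cheaply. The encoding below is a stand-in chosen for illustration, not the scheme measured in the talk, and it assumes microsecond time-stamps (so the ideal spacing at 10 kHz is 100 µs).

```python
TIME_BYTES = 8      # 64-bit time-stamp
SAMPLE_BYTES = 2    # 16-bit sample
RATE_HZ = 10_000    # 10 kHz sampling

raw_bytes_per_second = (TIME_BYTES + SAMPLE_BYTES) * RATE_HZ
timestamp_share = TIME_BYTES / (TIME_BYTES + SAMPLE_BYTES)


def jitter_residuals(timestamps_us, period_us=100):
    """Return the first time-stamp and each sample's deviation from the ideal grid."""
    first = timestamps_us[0]
    residuals = [t - (first + i * period_us) for i, t in enumerate(timestamps_us)]
    return first, residuals


def restore(first, residuals, period_us=100):
    """Rebuild the original time-stamps exactly from the residuals."""
    return [first + i * period_us + r for i, r in enumerate(residuals)]


stamps = [1_000_000, 1_000_101, 1_000_199, 1_000_300]
first, residuals = jitter_residuals(stamps)
assert restore(first, residuals) == stamps
```

Uncompressed, the time-stamp is 8 of every 10 bytes. The residuals are small integers that a columnar encoder can pack tightly, while the original time-stamps remain exactly recoverable.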
A second example:
Music meta-data
MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
– Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
– (but only 4 for recording!)
– Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
– 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86
link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
– And 50 more tables that aren’t documented yet
180 tables
not shown
236 tables
to describe 7 kinds of things
Can we do better?
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
{ name, begin_date,
end_date }
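As a concrete picture, one artist row with its in-lined lists could look like the Python/JSON value below. The field names follow the slide; every value is made up, and treating `{ name, begin_date, end_date }` as the shape of each alias entry is an assumption about what the annotation refers to.

```python
import json

# Illustrative values only; nothing here is real MusicBrainz data.
artist = {
    "id": 1,
    "gid": "00000000-0000-0000-0000-000000000001",
    "name": "Example Artist",
    "sort_name": "Artist, Example",
    "alias": [
        {"name": "E. Artist", "begin_date": None, "end_date": None},
        {"name": "The Example", "begin_date": "2001-01-01", "end_date": None},
    ],
    "release_id": [10, 11],              # back-references kept in the row
    "recording_id": [100, 101, 102],
}

alias_names = [alias["name"] for alias in artist["alias"]]

# The whole nested row survives a JSON round trip unchanged.
round_tripped = json.loads(json.dumps(artist))
```

Aliases and back-references that would each need their own table in the relational design are ordinary list properties here.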
{id, recording_id, name, list<credit>, length}
recording
id
gid
list<credit>
name
list<track_ref>
release
id
gid
release_group_id
list<credit>
name
barcode
status
packaging
language
script
list<medium>
{id, format, name, list<track>}
release_group
id
gid
name
list<credit>
type
list<release_id>
27 tables reduce to 4
so far
Further Reductions
• All 86 link tables become properties on artists, releases and
other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
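Folding a link table into a list property can be sketched like this. The helper `inline_links`, the `(source_id, target_id)` row shape and the sample data are assumptions for illustration only.

```python
def inline_links(entities, links, property_name):
    """Attach each (source_id, target_id) link row to its source entity as a list."""
    by_id = {entity["id"]: dict(entity, **{property_name: []}) for entity in entities}
    for source_id, target_id in links:
        by_id[source_id][property_name].append(target_id)
    return list(by_id.values())


artists = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
artist_release_links = [(1, 10), (1, 11), (2, 12)]  # rows of a hypothetical link table
inlined = inline_links(artists, artist_release_links, "release_id")
```

Once the links live on the entity, the separate link table is no longer needed, which is how the table count drops.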
Is This Good?
• Expressivity
– The JSON data model is at least as expressive as the original relational
model
• Many cases are easier to describe in nested data
• No cases are harder
• Efficiency
– Inlining can increase data size. Locality improves, however
– Sessionizing can substantially decrease data size
– Inlining back-references is more efficient than ordinary indexes
– Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)
But How Can We Query This?
• Can’t use SQL
– SQL is strongly typed
– SQL is heavily tied into the original relational model
– SQL generating tools require relational model
• Must use SQL
– Vast numbers of tools and people understand how to write SQL
– SQL is the lingua franca of databases
Squaring the Circle
• Enter Apache Drill
• Drill is SQL compliant
– Uses standard syntax and semantics
• Drill extends SQL
– First class treatment of objects, lists
– Full support for destructuring, flattening
– Full power of relational model can be applied to complex data
Drill Provides Scalable and Extended SQL
Sample Query
• Find Elvis
select distinct id, name, t.a.name as alias_name
from (
  select id, name, flatten(alias) as a from artist
) t
where t.a.name like 'Elvis%Presley'
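What flattening does to a row can be modelled in plain Python. This is a model of the idea only, not Drill's implementation; in particular, dropping rows whose list is empty is a choice made in this sketch, and the sample data is invented.

```python
def flatten(rows, field):
    """Emit one output row per element of row[field], copying the other fields."""
    for row in rows:
        for element in row.get(field, []):
            yield {**row, field: element}


artists = [
    {"id": 1, "name": "A", "alias": [{"name": "A1"}, {"name": "A2"}]},
    {"id": 2, "name": "B", "alias": []},
]
flat = list(flatten(artists, "alias"))
```

After flattening, each alias sits in its own ordinary row, so the usual relational operators (filter, distinct, join) apply without change.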
Example Query
• Find discs where Elvis was credited
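select distinct albums.album_id, albums.name
from
(
  -- assumes each credit entry carries an artist_id field
  select r.album_id, r.name, r.c.artist_id as artist_id from (
    select id album_id, name, flatten(credit) c from release
  ) r
) albums
join
(
  select distinct t.artist_id from (
    select id artist_id, flatten(alias) a from artist
  ) t
  where t.a.name like 'Elvis%Presley'
) artists
on albums.artist_id = artists.artist_id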
Summary
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended
relational problems
• You can try this out today
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook
Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Free copies at book signing today
Thank You!
Q&A
@mapr maprtech
tdunning@maprtech.com
Engage with us!
MapR
maprtech
mapr-technologies

More Related Content

What's hot

Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformSyracuse University
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using RVictoria López
 
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013Big Data Spain
 
The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of RAnalyticsWeek
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data scienceSovello Hildebrand
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on rAshraf Uddin
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningJohn Mulhall
 
DrawingML Introduction
DrawingML IntroductionDrawingML Introduction
DrawingML IntroductionShawn Villaron
 
scientific writing 01 - latex
scientific writing   01 - latexscientific writing   01 - latex
scientific writing 01 - latexLeo Chen
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
1.3 introduction to R language, importing dataset in r, data exploration in r
1.3 introduction to R language, importing dataset in r, data exploration in r1.3 introduction to R language, importing dataset in r, data exploration in r
1.3 introduction to R language, importing dataset in r, data exploration in rSimple Research
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview externalmattlieber
 
Graph Databases & OrientDB
Graph Databases & OrientDBGraph Databases & OrientDB
Graph Databases & OrientDBArpit Poladia
 

What's hot (19)

Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013
 
R programming
R programmingR programming
R programming
 
The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of R
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on r
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
 
DrawingML Introduction
DrawingML IntroductionDrawingML Introduction
DrawingML Introduction
 
Hive and Shark
Hive and SharkHive and Shark
Hive and Shark
 
R programming
R programmingR programming
R programming
 
scientific writing 01 - latex
scientific writing   01 - latexscientific writing   01 - latex
scientific writing 01 - latex
 
An Intoduction to R
An Intoduction to RAn Intoduction to R
An Intoduction to R
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
project
projectproject
project
 
1.3 introduction to R language, importing dataset in r, data exploration in r
1.3 introduction to R language, importing dataset in r, data exploration in r1.3 introduction to R language, importing dataset in r, data exploration in r
1.3 introduction to R language, importing dataset in r, data exploration in r
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
 
Graph Databases & OrientDB
Graph Databases & OrientDBGraph Databases & OrientDB
Graph Databases & OrientDB
 
R language
R languageR language
R language
 

Viewers also liked

Fire Door Compliance & Safety
Fire Door Compliance & SafetyFire Door Compliance & Safety
Fire Door Compliance & SafetyBill Stewart
 
Haustein, S. (2017). The evolution of scholarly communication and the reward ...
Haustein, S. (2017). The evolution of scholarly communication and the reward ...Haustein, S. (2017). The evolution of scholarly communication and the reward ...
Haustein, S. (2017). The evolution of scholarly communication and the reward ...Stefanie Haustein
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012MapR Technologies
 
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesMapR Technologies
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapRlohitvijayarenu
 
Spark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleSpark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleMapR Technologies
 
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation WorkshopMapR Technologies
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Significance of Data Mining
Significance of Data MiningSignificance of Data Mining
Significance of Data Mining8trackweb
 
Mobile Phone Repairs & services
Mobile Phone Repairs & servicesMobile Phone Repairs & services
Mobile Phone Repairs & servicesphonewizard
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeMapR Technologies
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションApache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションMapR Technologies Japan
 
20160818巨量資料的分析現況與展望(國發會) 張大明v2.1
20160818巨量資料的分析現況與展望(國發會) 張大明v2.120160818巨量資料的分析現況與展望(國發會) 張大明v2.1
20160818巨量資料的分析現況與展望(國發會) 張大明v2.1張大明 Ta-Ming Chang
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 

Viewers also liked (20)

Fire Door Compliance & Safety
Fire Door Compliance & SafetyFire Door Compliance & Safety
Fire Door Compliance & Safety
 
Haustein, S. (2017). The evolution of scholarly communication and the reward ...
Haustein, S. (2017). The evolution of scholarly communication and the reward ...Haustein, S. (2017). The evolution of scholarly communication and the reward ...
Haustein, S. (2017). The evolution of scholarly communication and the reward ...
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL References
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
 
Spark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleSpark & Hadoop at Production at Scale
Spark & Hadoop at Production at Scale
 
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation Workshop
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Significance of Data Mining
Significance of Data MiningSignificance of Data Mining
Significance of Data Mining
 
Mobile Phone Repairs & services
Mobile Phone Repairs & servicesMobile Phone Repairs & services
Mobile Phone Repairs & services
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションApache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
 
20160818巨量資料的分析現況與展望(國發會) 張大明v2.1
20160818巨量資料的分析現況與展望(國發會) 張大明v2.120160818巨量資料的分析現況與展望(國發會) 張大明v2.1
20160818巨量資料的分析現況與展望(國發會) 張大明v2.1
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 

Similar to HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014John Berns
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hiveSubhas Kumar Ghosh
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Apache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopApache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopDataWorks Summit
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...Jean Ihm
 

Similar to HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL (20)

Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
2014 08-20-pit-hug
2014 08-20-pit-hug2014 08-20-pit-hug
2014 08-20-pit-hug
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Apache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopApache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on Hadoop
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
L17 Data Source Layer
L17 Data Source LayerL17 Data Source Layer
L17 Data Source Layer
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
L15 Data Source Layer
L15 Data Source LayerL15 Data Source Layer
L15 Data Source Layer
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
HUG France - Apache Drill
HUG France - Apache DrillHUG France - Apache Drill
HUG France - Apache Drill
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL

  • 1. © 2014 MapR Technologies
  • 2. Contact Information • Ted Dunning • Chief Applications Architect at MapR Technologies • Committer & PMC for Apache’s Drill, Zookeeper & Mahout • VP of Incubator at Apache Foundation • Email: tdunning@apache.org, tdunning@maprtech.com • Twitter: @ted_dunning • Hashtag today: #BDE2015
  • 3. Agenda • What does good mean? • What do we mean by loose typing? • Examples of what you can do • Real database with 10-20x fewer tables • Looking forward • Questions
  • 4. What Does Good Mean (for a DB)? • Expressive – Must express the concepts we need • Efficient – Must run fast enough on cheap enough hardware
  • 5. What Does Good Mean (for a DB)? • Expressive – Must express the concepts we need • Efficient – Must run fast enough on cheap enough hardware • Introspectable – Must be able to inspect the data and schema and gain understanding
  • 6. What is New Here • Introspection is better when – A minimum of data entities are used to describe our model – No name overflow – Referential scoping helps narrow our focus to a simpler problem – Many-to-one relations can be in-lined • Introspection was not a goal for the design of the relational model • Introspection was therefore not a result either
  • 7. Older than Dirt • Relational theory is old (1970) – Predates data structures – Predates mainstream recursive procedures – Predates lexical scoping – Predates logic programming – Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding) • Some updates are in order to enhance introspection
  • 8. Contrast Relational and HBase Style noSQL • Relational: – Rows containing fields – Fields contain primitive types – Structure is fixed and uniform – Structure is pre-defined – Referential integrity (optional) – Expressions over sets of rows • HBase / MapR DB: – Rows contain fields – Fields are bytes – Structure is flexible – No pre-defined structure – Single key – Column families – Timestamps – Versions
  • 9. Contrast Relational and HBase with Structuring • Relational: – Rows containing fields – Fields contain primitive types – Structure is fixed and uniform – Structure is pre-defined – Referential integrity (optional) – Expressions over sets of rows • HBase + Structuring: – Rows contain fields – Fields contain primitive types, or objects, or lists – Structure is flexible, ragged – No pre-defined structure – Single key
  • 10. Turtle Models for Databases • Allows complex objects in field values – JSON style lists and objects • Allow references to objects via join – Includes references localized within lists • Lists of objects and objects of lists are isomorphic to tables so … • Complex data in tables, • But also tables in complex data, • Even tables containing complex data containing tables
  • 11. Proviso and Warning • This is not your father’s BLOB • And not the same as arrays with lateral view joins • Rationale to come as we talk about idioms
  • 12. A Catalog of noSQL Idioms
  • 13. Tables as Objects, Objects as Tables • Row-wise form (columns c1, c2, c3), a list of objects: [ { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 } ] • Column-wise form (columns c1, c2, c3), an object containing lists: { c1:[v1, v2, v3], c2:[v1, v2, v3], c3:[v1, v2, v3] }
  • 14. Micro Columnar Formats • An entire table stored in columnar form can be a first-class value using these techniques • This is very powerful for in-lining one-to-many relations
  • 15. Note • If embedded tables are first-class, schema becomes data • If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible • Thus, embedded first-class objects imply late discovery of schema information
  • 16. A first example: Time-series data
  • 17. Column names as data • When column names are not pre-defined, they can convey information • Examples – Time offsets within a window for time series – Top-level domains for web crawlers – Vendor IDs for customer purchase profiles • Predefined schema is impossible for this idiom
  • 18. Relational Model for Time-series
  • 19. Table Design: Point-by-Point
  • 20. Table Design: Hybrid Point-by-Point + Sub-table • After close of window, data in the row is restated as a column-oriented tabular value in a different column family
  • 21. Compression Results • Samples are 64-bit time, 16-bit sample • Sample time at 10 kHz • Sample time jitter makes it important to keep original time-stamp • How much overhead to retain time-stamp?
  • 22. A second example: Music meta-data
  • 23. MusicBrainz on NoSQL • Artists, albums, tracks and labels are key objects • Reality check: – Add works (compositions), recordings, release, release group • 7 tables for artist alone • 12 for place, 7 for label, 17 for release/group, 8 for work – (but only 4 for recording!) – Total of 12 + 7 + 17 + 8 + 4 = 48 tables • But wait, there’s more! – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total) – And 50 more tables that aren’t documented yet
  • 25. 180 tables not shown
  • 26. 236 tables to describe 7 kinds of things
  • 27. Can we do better?
  • 28. artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id> • artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>
  • 29. artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id>
  • 30. artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id> • { name, begin_date, end_date }
  • 32. recording: id, gid, list<credit>, name, list<track_ref> • release: id, gid, release_group_id, list<credit>, name, barcode, status, packaging, language, script, list<medium> • release_group: id, gid, name, list<credit>, type, list<release_id> • {id, recording_id, name, list<credit>, length} • {id, format, name, list<track>}
  • 33. 27 tables reduce to 4
  • 34. 27 tables reduce to 4 so far
  • 35. Further Reductions • All 86 link tables become properties on artists, releases and other entities • All 44 tag, rating and annotation tables become list properties • All 5 cover art tables become lists of file references • Current score: 162 tables become 4 • You get the idea
  • 36. Is This Good? • Expressivity – The JSON data model is at least as expressive as the original relational model • Many cases easier to describe in nested data • No cases are harder • Efficiency – Inlining can increase data size. Locality improves, however – Sessionizing can substantially decrease data size – Inlining back-references is more efficient than ordinary indexes – Inlined columnar data allows 1000x speedup for time series • Introspection (you decide)
  • 37. But How Can We Query This? • Can’t use SQL – SQL is strongly typed – SQL is heavily tied into the original relational model – SQL generating tools require relational model • Must use SQL – Vast numbers of tools and people understand how to write SQL – SQL is the lingua franca of databases
  • 38. Squaring the Circle • Enter Apache Drill • Drill is SQL compliant – Uses standard syntax and semantics • Drill extends SQL – First class treatment of objects, lists – Full support for destructuring, flattening – Full power of relational model can be applied to complex data
  • 39. Drill Provides Scalable and Extended SQL
  • 40. Sample Query • Find Elvis: select distinct id, name, alias from ( select id, flatten(alias.name) alias from artist where alias like 'Elvis%Presley' )
  • 41. Example Query • Find discs where Elvis was credited: select distinct album_id, name from ( select id album_id, name, flatten(credit) from release ) albums join ( select distinct artist_id from ( select id artist_id, flatten(alias) from artist where name like 'Elvis%Presley' ) ) artists using artist_id
  • 42. Summary • Extended relational model allows massive simplification – On a real example, we see >20x reduction in number of tables • Simplification drives improved introspection – This is good • Apache Drill gives very high performance execution for extended relational problems • You can try this out today
  • 44. Short Books by Ted Dunning & Ellen Friedman • Published by O’Reilly in 2014 and 2015 • For sale from Amazon or O’Reilly • Free e-books currently available courtesy of MapR: http://bit.ly/ebook-real-world-hadoop http://bit.ly/mapr-tsdb-ebook http://bit.ly/ebook-anomaly http://bit.ly/recommendation-ebook
  • 45. Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly) Free copies at book signing today
  • 46. Thank You!
  • 47. Q&A • Engage with us! @mapr, maprtech, mapr-technologies • tdunning@maprtech.com
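The "Tables as Objects, Objects as Tables" idiom and the flattening step used in the sample queries can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`rows_to_columns`, `columns_to_rows`, `flatten`) and the sample `artists` records are hypothetical, every row is assumed to carry the same keys, and `flatten` only approximates what a FLATTEN-style operator does (one output record per list element, with the other fields repeated). It is not Drill's implementation.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def rows_to_columns(rows: List[Row]) -> Dict[str, List[Any]]:
    """Row-wise form (list of objects) -> column-wise form (object of lists).

    Assumes every row has the same keys; ragged rows would misalign columns.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Row]:
    """Column-wise form (object of lists) -> row-wise form (list of objects)."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def flatten(records: List[Row], field: str) -> List[Row]:
    """Emit one record per element of record[field], repeating other fields.

    Records whose list is empty or missing produce no output in this sketch.
    """
    out: List[Row] = []
    for record in records:
        for item in record.get(field, []):
            flat = dict(record)   # shallow copy of the parent record
            flat[field] = item    # replace the list with a single element
            out.append(flat)
    return out


# Hypothetical sample data, loosely shaped like the nested artist object.
artists = [
    {"id": 1, "name": "Elvis Presley",
     "alias": [{"name": "Elvis Aaron Presley"}, {"name": "The King"}]},
    {"id": 2, "name": "Miles Davis", "alias": []},
]

table = [{"c1": 1, "c2": 2, "c3": 3}, {"c1": 4, "c2": 5, "c3": 6}]
as_columns = rows_to_columns(table)
aliases = flatten(artists, "alias")
```

Because the two forms convert into each other without loss when rows are uniform, either one can be stored as a single field value, which is what makes in-lining a one-to-many relation possible.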

Editor's Notes

  1. Key ideas: Each row has a unique key based on an id for its time series (looked up from a separate look-up table). An important part of the design's efficiency is that each column is a time offset from the start time shown in the row key. Note that data is stored point-by-point in this wide table design. Ted’s notes from his original slide: One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row. Doing this allows data points to be retrieved at a higher speed. Because both HBase and MapR-DB store data ordered by the primary key, this design will cause rows containing data from a single time series to wind up near one another on disk. Retrieving data from a particular time series for a time range will involve largely sequential disk operations and therefore will be much faster than would be the case if the rows were widely scattered. Typically, the time window is adjusted so that 100–1,000 samples are in each row.
  2. Ted’s notes from original slide: The table design is improved by collapsing all of the data for a row into a single data structure known as a blob. This blob can be highly compressed so that less data needs to be read from disk. Also, having a single column per row decreases the per-column overhead incurred by the on-disk format that HBase uses, which further increases performance. Data can be progressively converted to the compressed format as soon as it is known that little or no new data is likely to arrive for that time series and time window. Commonly, once the time window ends, new data will only arrive for a few more seconds, and the compression of the data can begin. Since compressed and uncompressed data can coexist in the same row, if a few samples arrive after the row is compressed, the row can simply be compressed again to merge the blob and the late-arriving samples.
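The two table designs described in these notes can be sketched together in Python. Everything here is an assumption made for illustration: an in-memory dict stands in for the HBase / MapR-DB table, `WINDOW_MS` is an arbitrary window length, offsets are packed as 32-bit integers and values as 16-bit integers, and `zlib` is swapped in as a generic compressor. The notes do not say which encoding or compression scheme was actually used.

```python
import struct
import zlib

WINDOW_MS = 1000  # hypothetical window length in milliseconds


def put_sample(table, series_id, timestamp_ms, value):
    """Point-by-point write: row key is (series id, window start), and the
    column name is the sample's time offset from the window start."""
    window_start = timestamp_ms - timestamp_ms % WINDOW_MS
    row = table.setdefault((series_id, window_start), {"points": {}, "blob": None})
    row["points"][timestamp_ms - window_start] = value


def pack(points):
    """Restate {offset: value} as a compressed columnar blob:
    a count, then all offsets, then all values."""
    offsets = sorted(points)
    values = [points[o] for o in offsets]
    n = len(offsets)
    raw = struct.pack(f"<I{n}I{n}h", n, *offsets, *values)
    return zlib.compress(raw)


def unpack(blob):
    """Inverse of pack(): recover the {offset: value} mapping."""
    raw = zlib.decompress(blob)
    (n,) = struct.unpack_from("<I", raw, 0)
    fields = struct.unpack_from(f"<{n}I{n}h", raw, 4)
    return dict(zip(fields[:n], fields[n:]))


def read_row(row):
    """Merge the compressed blob (if any) with uncompressed late samples."""
    merged = unpack(row["blob"]) if row["blob"] else {}
    merged.update(row["points"])
    return merged


def compress_row(row):
    """Collapse a row into a single blob; safe to call again after late
    samples arrive, since it re-merges the blob and the new points."""
    row["blob"] = pack(read_row(row))
    row["points"] = {}
```

In this sketch a reader calls `read_row` and never needs to know whether a row has been compressed yet, which mirrors the point that compressed and uncompressed data can coexist in the same row.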