SlideShare a Scribd company logo
1 of 46
Introduction to Apache Tajo:
Data Warehouse for Big Data
Jihoon Son / Gruter inc.
About Me
Jihoon Son (@jihoonson)
Tajo project co-founder
Committer and PMC member of Apache Tajo
Research engineer at Gruter
2
Outline
About Tajo
Features of the Recent Release
Demo
Roadmap
3
What is Tajo?
Tajo / tάːzo / 타조
An ostrich in Korean
The world's fastest two-legged animal
4
What is Tajo?
Apache Top-level Project
Big data warehouse system
ANSI-SQL compliant
Mature SQL features
Various types of join, window functions
Rapid query execution with own distributed DAG engine
Low latency, and long running batch queries with a single
system
Fault-tolerance
Beyond SQL-on-Hadoop
Support various types of storage
5
Tajo Master
Catalog Server
Tajo Master
Catalog Server
Architecture Overview
6
DBMS
HCatalog
Tajo Master
Catalog Server
Tajo Worker
Query Master
Query Executor
Storage Service
Tajo Worker
Query Master
Query Executor
Storage Service
Tajo Worker
Query Master
Query Executor
Storage Service
JDBC client
TSQLWebUI
REST API
Storage
Submit
a query
Manage
metadataAllocate
a query
Send tasks
& monitor
Send tasks
& monitor
Who are Using Tajo?
Use cases: replacement of commercial DW
1st telco in South Korea
Replacement of long-running ETL workloads on several TB
datasets
Lots of daily reports about user behavior
Ad-‐hoc analysis on TB datasets
Benefits
Simplified architecture for data analysis
An unified system for DW ETL, OLAP, and Hadoop ETL
Much less cost, more data analysis within same SLA
Saved license fee of commercial DW
7
Who are Using Tajo?
Use cases: data discovery
Music streaming service (26 million users)
Analysis of purchase history for target marketing
Benefits
Interactive query on large datasets
Data analysis with familiar BI tools
8
Recent Release: 0.11
Feature highlights
Query federation
JDBC-based storage support
Self-describing data formats support
Multi-query support
More stable and efficient join execution
Index support
Python UDF/UDAF support
9
Recent Release: 0.11
Today's topic
Query federation
JDBC-based storage support
Self-describing data formats support
Multi-query support
More stable and efficient join execution
Index support
Python UDF/UDAF support
10
Query Federation with Tajo
11
Your Data
Your data might be spread on multiple heterogeneous
sites
Cloud, DBMS, Hadoop, NoSQL, …
12
DBMS
Application
Cloud storage
On-premise
storage
NoSQL
Your Data
Even in a single site, your data might be stored in
different data formats
13
JSONCSV
Parque
t
ORC Log
...
Your Data
How to analyze distributed data?
Traditionally ...
14
DBMSApplication
Cloud storage
On-premise
storage
NoSQL
Global view
ETL transform
● Long delivery
● Complex data flow
● Human-intensive
Your Data with Tajo
Query federation
15
DBMSApplication
Cloud storage
On-premise
storage
NoSQL
Global view
● Fast delivery
● Easy maintenance
● Simple data flow
Storage and Data Format Support
16
Data
formats
Storage
types
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH
('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1';
> CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc WITH
('orc.compression.kind'='snappy') LOCATION 's3://user';
> CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase LOCATION
'hbase:zk://localhost:2181/uptodate';
> ...
17
Data
formatStorage
URI
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH
('text.delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compr
ess.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION
'hdfs://localhost:8020/tajo/warehouse/archive1';
> CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text WITH
('text.delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compr
ess.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION
'hdfs://localhost:8020/tajo/warehouse/archive2';
> CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text WITH
('text.delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compr
ess.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION
'hdfs://localhost:8020/tajo/warehouse/archive3';
> ...
18
Too tedious!
Introduction to Tablespace
Tablespace
Registered storage space
A tablespace is identified by an unique URI
Configurations and policies are shared by all tables in a
tablespace
Storage type
Default data format and supported data formats
It allows users to reuse registered storage configurations
and policies
19
Tablespaces, Databases, and Tables
20
Namespace
Storage1
Storage2
...
...
...
Tablespace1
Tablespace2
Tablespace3
Physical space
Table1
Table2
Table3
Database1
Database1
...
Tablespace Configuration
{
"spaces" : {
"warehouse" : {
"uri" : "hdfs://localhost:8020/tajo/warehouse",
"configs" : [
{'text.delimiter'='|'},
{'text.null'='N'},
{'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'},
{'timezone'='UTC+9'},
{'text.skip.headerlines'='2'}
]
},
"hbase1" : {
"uri" : "hbase:zk://localhost:2181/table1"
}
}
}
21
Tablespace name
Tablespace URI
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
22
Tablespace
nameData format is omitted. Default data format is TEXT.
"warehouse" : {
"uri" : "hdfs://localhost:8020/tajo/warehouse",
"configs" : [
{'text.delimiter'='|'},
{'text.null'='N'},
{'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'},
{'timezone'='UTC+9'},
{'text.skip.headerlines'='2'}
]
},
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws USING orc
WITH ('orc.compression.kind'='snappy');
> CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1;
> ...
23
HDFS HBase
Tajo Worker
Query
Engine
Storage
ServiceHDFS
handler
Tajo Worker
Query
Engine
Storage
ServiceHDFS
handler
Tajo Worker
Query
Engine
Storage
ServiceHBase
handler
Querying on Different Data Silos
How does a worker access different data sources?
Storage service
Return a proper handler for underlying storage
> SELECT ... FROM hdfs_table, hbase_table, ...
24
JDBC-based Storage Support
25
jdbc_db1 tajo_db1
JDBC-based Storage
Storage providing the JDBC interface
PostgreSQL, MySQL, MariaDB, ...
Databases of JDBC-based storage are mapped to Tajo
databases
26
Table
1
Table
2
Table
3
Table
1
Table
2
Table
3
tajo_db2
Table
1
Table
2
Table
3
…
jdbc_db2
Table
1
Table
2
Table
3
…
JDBC-based storage Tajo
Tablespace Configuration
{
"spaces": {
"pgsql_db1": {
"uri": "jdbc:postgresql://hostname:port/db1"
"configs": {
"mapped_database": "tajo_db1"
"connection_properties": {
"user": "tajo",
"password": "xxxx"
}
}
}
}
}
27
PostgreSQL
database name
Tajo
database name
Tablespace name
Return to Query Federation
How to correlate data on JDBC-based storage and
others?
Need to have a global view of metadata across different
storage types
Tajo also has its own metadata for its data
Each JDBC-based storage has own metadata for its data
Each NoSQL storage has metadata for its data
…
28
Metadata Federation
Federating metadata of underlying storage
29
DBMS metadata provider NoSQL metadata provider
Linked Metadata Manager
DBMS HCatalog
Tajo catalog metadata provider
Catalog Interface
● Tablespace
● Database
● Tables
● Schema names
...
Querying on JDBC-based Storage
A plan is converted into a SQL string
Query generation
Diverse SQL syntax of different types of storage
Different SQL builder for each storage type
30
Tajo Master Tajo Worker JDBC-based
storage
SELECT ...
Query plan
SELECT ...
Operation Push Down
Tajo can exploit the processing capability of underlying
storage
DBMSs, MongoDB, HBase, …
Operations are pushed down into underlying storage
Leveraging the advanced features provided by underlying
storage
Ex) DBMSs' query optimization, index, ...
31
Example 1
32
SELECT
count(*)
FROM
account ac, archive ar
WHERE
ac.key = ar.id and
ac.name = 'tajo'
account
DBMS
archive
HDFS
scan archive
scan account
ac.name = 'tajo'
join
ac.key = ar.id
group by
count(*)
group by
count(*)
Full scan Result only
Push
operation
Example 2
33
SELECT
ac.name, count(*)
FROM
account ac
GROUP BY
ac.name
account
DBMS
scan account
group by
count(*)
Result only
Push
operation
Self-describing Data Formats
Support
34
Self-describing Data Formats
Some data formats include schema information as well
as data
JSON, ORC, Parquet, …
Tajo 0.11 natively supports self-describing data formats
Since they already have schema information, Tajo doesn't
need to store it aside
Instead, Tajo can infer the schema at query execution
time
35
Create Table with Nested Data Format
{ "title" : "Hand of the King", "name" : { "first_name": "Eddard", "last_name": "Stark"}}
{ "title" : "Assassin", "name" : { "first_name": "Arya", "last_name": "Stark"}}
{ "title" : "Dancing Master", "name" : { "first_name": "Syrio", "last_name": "Forel"}}
> CREATE EXTERNAL TABLE schemaful_table (
title TEXT,
name RECORD (
first_name TEXT,
last_name TEXT
)
) USING json LOCATION 'hdfs:///json_table';
36
Nested type
How about This Data?
{"id":"2937257761","type":"ForkEvent","actor":{"id":1088854,"login":"CAOakleyII","gravatar_id":"","url":"https://api.github.com/users/CAOakleyII","avatar_url":"https://avatars.githubusercontent.com/
u/1088854?"},"repo":{"id":11909954,"name":"skycocker/chromebrew","url":"https://api.github.com/repos/skycocker/chromebrew"},"payload":{"forkee":{"id":38339291,"name":"chromebrew","full_na
me":"CAOakleyII/chromebrew","owner":{"login":"CAOakleyII","id":1088854,"avatar_url":"https://avatars.githubusercontent.com/u/1088854?v=3","gravatar_id":"","url":"https://api.github.com/users/C
AOakleyII","html_url":"https://github.com/CAOakleyII","followers_url":"https://api.github.com/users/CAOakleyII/followers","following_url":"https://api.github.com/users/CAOakleyII/following{/other_
user}","gists_url":"https://api.github.com/users/CAOakleyII/gists{/gist_id}","starred_url":"https://api.github.com/users/CAOakleyII/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/
users/CAOakleyII/subscriptions","organizations_url":"https://api.github.com/users/CAOakleyII/orgs","repos_url":"https://api.github.com/users/CAOakleyII/repos","events_url":"https://api.github.com/
users/CAOakleyII/events{/privacy}","received_events_url":"https://api.github.com/users/CAOakleyII/received_events","type":"User","site_admin":false},"private":false,"html_url":"https://github.com/C
AOakleyII/chromebrew","description":"Package manager for Chrome
OS","fork":true,"url":"https://api.github.com/repos/CAOakleyII/chromebrew","forks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/forks","keys_url":"https://api.github.com/repos/CAOa
kleyII/chromebrew/keys{/key_id}","collaborators_url":"https://api.github.com/repos/CAOakleyII/chromebrew/collaborators{/collaborator}","teams_url":"https://api.github.com/repos/CAOakleyII/chro
mebrew/teams","hooks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/hooks","issue_events_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/events{/number}","event
s_url":"https://api.github.com/repos/CAOakleyII/chromebrew/events","assignees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/assignees{/user}","branches_url":"https://api.github.com
/repos/CAOakleyII/chromebrew/branches{/branch}","tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/tags","blobs_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/blo
bs{/sha}","git_tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/tags{/sha}","git_refs_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/refs{/sha}","trees_url":"https:/
/api.github.com/repos/CAOakleyII/chromebrew/git/trees{/sha}","statuses_url":"https://api.github.com/repos/CAOakleyII/chromebrew/statuses/{sha}","languages_url":"https://api.github.com/repos/C
AOakleyII/chromebrew/languages","stargazers_url":"https://api.github.com/repos/CAOakleyII/chromebrew/stargazers","contributors_url":"https://api.github.com/repos/CAOakleyII/chromebrew/cont
ributors","subscribers_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscribers","subscription_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscription","commits_url":
"https://api.github.com/repos/CAOakleyII/chromebrew/commits{/sha}","git_commits_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/commits{/sha}","comments_url":"https://api.gith
ub.com/repos/CAOakleyII/chromebrew/comments{/number}","issue_comment_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/comments{/number}","contents_url":"https://api.git
hub.com/repos/CAOakleyII/chromebrew/contents/{+path}","compare_url":"https://api.github.com/repos/CAOakleyII/chromebrew/compare/{base}...{head}","merges_url":"https://api.github.com/repo
s/CAOakleyII/chromebrew/merges","archive_url":"https://api.github.com/repos/CAOakleyII/chromebrew/{archive_format}{/ref}","downloads_url":"https://api.github.com/repos/CAOakleyII/chromebr
ew/downloads","issues_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues{/number}","pulls_url":"https://api.github.com/repos/CAOakleyII/chromebrew/pulls{/number}","milestones_
url":"https://api.github.com/repos/CAOakleyII/chromebrew/milestones{/number}","notifications_url":"https://api.github.com/repos/CAOakleyII/chromebrew/notifications{?since,all,participating}","lab
els_url":"https://api.github.com/repos/CAOakleyII/chromebrew/labels{/name}","releases_url":"https://api.github.com/repos/CAOakleyII/chromebrew/releases{/id}","created_at":"2015-07-
01T00:00:00Z","updated_at":"2015-06-28T10:11:09Z","pushed_at":"2015-06-
09T07:46:57Z","git_url":"git://github.com/CAOakleyII/chromebrew.git","ssh_url":"git@github.com:CAOakleyII/chromebrew.git","clone_url":"https://github.com/CAOakleyII/chromebrew.git","svn_url":"
https://github.com/CAOakleyII/chromebrew","homepage":"http://skycocker.github.io/chromebrew/","size":846,"stargazers_count":0,"watchers_count":0,"language":null,"has_issues":false,"has_downl
oads":true,"has_wiki":true,"has_pages":false,"forks_count":0,"mirror_url":null,"open_issues_count":0,"forks":0,"open_issues":0,"watchers":0,"default_branch":"master","public":true}},"public":true,"cre
ated_at":"2015-07-01T00:00:01Z"}
...
37
Create Schemaless Table
> CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION
'hdfs:///json_table';
That's all!
38
Allow any schema
Schema-free Query Execution
> CREATE EXTERNAL TABLE schemaful_table (id BIGINT, name TEXT, ...)
USING text LOCATION 'hdfs:///csv_table;
> CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION
'hdfs:///json_table';
> SELECT name.first_name, name.last_name from schemaless_table;
> SELECT title, count(*) FROM schemaful_table, schemaless_table WHERE
name = name.last_name GROUP BY title;
39
Schema Inference
Table schema is inferred at query time
Example
40
SELECT
a, b.b1, b.b2.c1
FROM
t;
(
a text,
b record (
b1 text,
b2 record (
c1 text
)
)
)
Inferred schemaQuery
Demo
41
Demo with Command line
42
Roadmap
43
Roadmap
0.12
Improved Yarn integration
Authentication support
JavaScript stored procedure support
Scalar subquery support
Hive UDF support
44
Roadmap
Next generation (beyond 0.12)
Exploiting modern hardware
Approximate query processing
Genetic query optimization
And more …
45
tajo> select question from you;
46

More Related Content

What's hot

Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
Hyunsik Choi
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
Hyunsik Choi
 

What's hot (18)

Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native way
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
User Defined Partitioning on PlazmaDB
User Defined Partitioning on PlazmaDBUser Defined Partitioning on PlazmaDB
User Defined Partitioning on PlazmaDB
 
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Apache Kite
Apache KiteApache Kite
Apache Kite
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 

Viewers also liked

Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 

Viewers also liked (11)

Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
 
Introduction to Apache Tajo
Introduction to Apache TajoIntroduction to Apache Tajo
Introduction to Apache Tajo
 
Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
 

Similar to Introduction to Apache Tajo: Data Warehouse for Big Data

171_74_216_Module_5-Non_relational_database_-mongodb.pptx
171_74_216_Module_5-Non_relational_database_-mongodb.pptx171_74_216_Module_5-Non_relational_database_-mongodb.pptx
171_74_216_Module_5-Non_relational_database_-mongodb.pptx
sukrithlal008
 
DB2UDB_the_Basics Day2
DB2UDB_the_Basics Day2DB2UDB_the_Basics Day2
DB2UDB_the_Basics Day2
Pranav Prakash
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
Henk van der Valk
 

Similar to Introduction to Apache Tajo: Data Warehouse for Big Data (20)

MongoDB - A next-generation database that lets you create applications never ...
MongoDB - A next-generation database that lets you create applications never ...MongoDB - A next-generation database that lets you create applications never ...
MongoDB - A next-generation database that lets you create applications never ...
 
Homework help on oracle
Homework help on oracleHomework help on oracle
Homework help on oracle
 
ora_sothea
ora_sotheaora_sothea
ora_sothea
 
Design Considerations For Storing With Windows Azure
Design Considerations For Storing With Windows AzureDesign Considerations For Storing With Windows Azure
Design Considerations For Storing With Windows Azure
 
DBVersity MongoDB Online Training Presentations
DBVersity MongoDB Online Training PresentationsDBVersity MongoDB Online Training Presentations
DBVersity MongoDB Online Training Presentations
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architecture
 
171_74_216_Module_5-Non_relational_database_-mongodb.pptx
171_74_216_Module_5-Non_relational_database_-mongodb.pptx171_74_216_Module_5-Non_relational_database_-mongodb.pptx
171_74_216_Module_5-Non_relational_database_-mongodb.pptx
 
Introduction to MongoDB with PHP
Introduction to MongoDB with PHPIntroduction to MongoDB with PHP
Introduction to MongoDB with PHP
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
 
Obevo Javasig.pptx
Obevo Javasig.pptxObevo Javasig.pptx
Obevo Javasig.pptx
 
DB2UDB_the_Basics Day2
DB2UDB_the_Basics Day2DB2UDB_the_Basics Day2
DB2UDB_the_Basics Day2
 
MongoDB-presentation.pptx
MongoDB-presentation.pptxMongoDB-presentation.pptx
MongoDB-presentation.pptx
 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 Flow
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On Workshop
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 

More from Gruter

More from Gruter (18)

스타트업사례로 본 로그 데이터분석 : Tajo on AWS
스타트업사례로 본 로그 데이터분석 : Tajo on AWS스타트업사례로 본 로그 데이터분석 : Tajo on AWS
스타트업사례로 본 로그 데이터분석 : Tajo on AWS
 
Efficient In­‐situ Processing of Various Storage Types on Apache Tajo
Efficient In­‐situ Processing of Various Storage Types on Apache TajoEfficient In­‐situ Processing of Various Storage Types on Apache Tajo
Efficient In­‐situ Processing of Various Storage Types on Apache Tajo
 
Tajo TPC-H Benchmark Test on AWS
Tajo TPC-H Benchmark Test on AWSTajo TPC-H Benchmark Test on AWS
Tajo TPC-H Benchmark Test on AWS
 
Data analysis with Tajo
Data analysis with TajoData analysis with Tajo
Data analysis with Tajo
 
Gruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigDataGruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigData
 
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
 
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
 
Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Apache Tajo - BWC 2014
Apache Tajo - BWC 2014
 
Hadoop security DeView 2014
Hadoop security DeView 2014Hadoop security DeView 2014
Hadoop security DeView 2014
 
Vectorized processing in_a_nutshell_DeView2014
Vectorized processing in_a_nutshell_DeView2014Vectorized processing in_a_nutshell_DeView2014
Vectorized processing in_a_nutshell_DeView2014
 
Cloumon sw제품설명회 발표자료
Cloumon sw제품설명회 발표자료Cloumon sw제품설명회 발표자료
Cloumon sw제품설명회 발표자료
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013
 
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo,  and application case of SK TelecomSQL-on-Hadoop with Apache Tajo,  and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
 
Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105
 
DeView2013 Big Data Platform Architecture with Hadoop - Hyeong-jun Kim
DeView2013 Big Data Platform Architecture with Hadoop - Hyeong-jun KimDeView2013 Big Data Platform Architecture with Hadoop - Hyeong-jun Kim
DeView2013 Big Data Platform Architecture with Hadoop - Hyeong-jun Kim
 
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Tajo와 SQL-on-Hadoop
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Tajo와 SQL-on-HadoopGRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Tajo와 SQL-on-Hadoop
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Tajo와 SQL-on-Hadoop
 
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 온라인 컨텐츠 서비스를 위한 빅데이터 구축 사례
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 온라인 컨텐츠 서비스를 위한 빅데이터 구축 사례GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 온라인 컨텐츠 서비스를 위한 빅데이터 구축 사례
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 온라인 컨텐츠 서비스를 위한 빅데이터 구축 사례
 
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Bioinformatics Data를 위한 Hadoop기반...
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Bioinformatics Data를 위한 Hadoop기반...GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Bioinformatics Data를 위한 Hadoop기반...
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: Bioinformatics Data를 위한 Hadoop기반...
 

Recently uploaded

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 

Recently uploaded (20)

Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Introduction to Apache Tajo: Data Warehouse for Big Data

  • 1. Introduction to Apache Tajo: Data Warehouse for Big Data Jihoon Son / Gruter inc.
  • 2. About Me Jihoon Son (@jihoonson) Tajo project co-founder Committer and PMC member of Apache Tajo Research engineer at Gruter 2
  • 3. Outline About Tajo Features of the Recent Release Demo Roadmap 3
  • 4. What is Tajo? Tajo / tάːzo / 타조 An ostrich in Korean The world's fastest two-legged animal 4
  • 5. What is Tajo? Apache Top-level Project Big data warehouse system ANSI-SQL compliant Mature SQL features Various types of join, window functions Rapid query execution with own distributed DAG engine Low latency, and long running batch queries with a single system Fault-tolerance Beyond SQL-on-Hadoop Support various types of storage 5
  • 6. Tajo Master Catalog Server Tajo Master Catalog Server Architecture Overview 6 DBMS HCatalog Tajo Master Catalog Server Tajo Worker Query Master Query Executor Storage Service Tajo Worker Query Master Query Executor Storage Service Tajo Worker Query Master Query Executor Storage Service JDBC client TSQLWebUI REST API Storage Submit a query Manage metadataAllocate a query Send tasks & monitor Send tasks & monitor
  • 7. Who are Using Tajo? Use cases: replacement of commercial DW 1st telco in South Korea Replacement of long-running ETL workloads on several TB datasets Lots of daily reports about user behavior Ad-‐hoc analysis on TB datasets Benefits Simplified architecture for data analysis An unified system for DW ETL, OLAP, and Hadoop ETL Much less cost, more data analysis within same SLA Saved license fee of commercial DW 7
  • 8. Who are Using Tajo? Use cases: data discovery Music streaming service (26 million users) Analysis of purchase history for target marketing Benefits Interactive query on large datasets Data analysis with familiar BI tools 8
  • 9. Recent Release: 0.11 Feature highlights Query federation JDBC-based storage support Self-describing data formats support Multi-query support More stable and efficient join execution Index support Python UDF/UDAF support 9
  • 10. Recent Release: 0.11 Today's topic Query federation JDBC-based storage support Self-describing data formats support Multi-query support More stable and efficient join execution Index support Python UDF/UDAF support 10
  • 12. Your Data Your data might be spread on multiple heterogeneous sites Cloud, DBMS, Hadoop, NoSQL, … 12 DBMS Application Cloud storage On-premise storage NoSQL
  • 13. Your Data Even in a single site, your data might be stored in different data formats 13 JSONCSV Parque t ORC Log ...
  • 14. Your Data How to analyze distributed data? Traditionally ... 14 DBMSApplication Cloud storage On-premise storage NoSQL Global view ETL transform ● Long delivery ● Complex data flow ● Human-intensive
  • 15. Your Data with Tajo Query federation 15 DBMSApplication Cloud storage On-premise storage NoSQL Global view ● Fast delivery ● Easy maintenance ● Simple data flow
  • 16. Storage and Data Format Support 16 Data formats Storage types
  • 17. Create Table > CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1'; > CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc WITH ('orc.compression.kind'='snappy') LOCATION 's3://user'; > CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase LOCATION 'hbase:zk://localhost:2181/uptodate'; > ... 17 Data formatStorage URI
  • 18. Create Table > CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compr ess.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive1'; > CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compr ess.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive2'; > CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compr ess.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive3'; > ... 18 Too tedious!
  • 19. Introduction to Tablespace Tablespace Registered storage space A tablespace is identified by an unique URI Configurations and policies are shared by all tables in a tablespace Storage type Default data format and supported data formats It allows users to reuse registered storage configurations and policies 19
  • 20. Tablespaces, Databases, and Tables 20 Namespace Storage1 Storage2 ... ... ... Tablespace1 Tablespace2 Tablespace3 Physical space Table1 Table2 Table3 Database1 Database1 ...
  • 21. Tablespace Configuration { "spaces" : { "warehouse" : { "uri" : "hdfs://localhost:8020/tajo/warehouse", "configs" : [ {'text.delimiter'='|'}, {'text.null'='N'}, {'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'}, {'timezone'='UTC+9'}, {'text.skip.headerlines'='2'} ] }, "hbase1" : { "uri" : "hbase:zk://localhost:2181/table1" } } } 21 Tablespace name Tablespace URI
  • 22. Create Table > CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse; 22 Tablespace nameData format is omitted. Default data format is TEXT. "warehouse" : { "uri" : "hdfs://localhost:8020/tajo/warehouse", "configs" : [ {'text.delimiter'='|'}, {'text.null'='N'}, {'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'}, {'timezone'='UTC+9'}, {'text.skip.headerlines'='2'} ] },
  • 23. Create Table > CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse; > CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse; > CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse; > CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws USING orc WITH ('orc.compression.kind'='snappy'); > CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1; > ... 23
  • 24. HDFS HBase Tajo Worker Query Engine Storage ServiceHDFS handler Tajo Worker Query Engine Storage ServiceHDFS handler Tajo Worker Query Engine Storage ServiceHBase handler Querying on Different Data Silos How does a worker access different data sources? Storage service Return a proper handler for underlying storage > SELECT ... FROM hdfs_table, hbase_table, ... 24
  • 26. jdbc_db1 tajo_db1 JDBC-based Storage Storage providing the JDBC interface PostgreSQL, MySQL, MariaDB, ... Databases of JDBC-based storage are mapped to Tajo databases 26 Table 1 Table 2 Table 3 Table 1 Table 2 Table 3 tajo_db2 Table 1 Table 2 Table 3 … jdbc_db2 Table 1 Table 2 Table 3 … JDBC-based storage Tajo
  • 27. Tablespace Configuration { "spaces": { "pgsql_db1": { "uri": "jdbc:postgresql://hostname:port/db1" "configs": { "mapped_database": "tajo_db1" "connection_properties": { "user": "tajo", "password": "xxxx" } } } } } 27 PostgreSQL database name Tajo database name Tablespace name
  • 28. Return to Query Federation How to correlate data on JDBC-based storage and others? Need to have a global view of metadata across different storage types Tajo also has its own metadata for its data Each JDBC-based storage has own metadata for its data Each NoSQL storage has metadata for its data … 28
  • 29. Metadata Federation Federating metadata of underlying storage 29 DBMS metadata provider NoSQL metadata provider Linked Metadata Manager DBMS HCatalog Tajo catalog metadata provider Catalog Interface ● Tablespace ● Database ● Tables ● Schema names ...
  • 30. Querying on JDBC-based Storage A plan is converted into a SQL string Query generation Diverse SQL syntax of different types of storage Different SQL builder for each storage type 30 Tajo Master Tajo Worker JDBC-based storage SELECT ... Query plan SELECT ...
  • 31. Operation Push Down Tajo can exploit the processing capability of underlying storage DBMSs, MongoDB, HBase, … Operations are pushed down into underlying storage Leveraging the advanced features provided by underlying storage Ex) DBMSs' query optimization, index, ... 31
  • 32. Example 1 32 SELECT count(*) FROM account ac, archive ar WHERE ac.key = ar.id and ac.name = 'tajo' account DBMS archive HDFS scan archive scan account ac.name = 'tajo' join ac.key = ar.id group by count(*) group by count(*) Full scan Result only Push operation
  • 33. Example 2 33 SELECT ac.name, count(*) FROM account ac GROUP BY ac.name account DBMS scan account group by count(*) Result only Push operation
  • 35. Self-describing Data Formats Some data formats include schema information as well as data JSON, ORC, Parquet, … Tajo 0.11 natively supports self-describing data formats Since they already have schema information, Tajo doesn't need to store it aside Instead, Tajo can infer the schema at query execution time 35
  • 36. Create Table with Nested Data Format { "title" : "Hand of the King", "name" : { "first_name": "Eddard", "last_name": "Stark"}} { "title" : "Assassin", "name" : { "first_name": "Arya", "last_name": "Stark"}} { "title" : "Dancing Master", "name" : { "first_name": "Syrio", "last_name": "Forel"}} > CREATE EXTERNAL TABLE schemaful_table ( title TEXT, name RECORD ( first_name TEXT, last_name TEXT ) ) USING json LOCATION 'hdfs:///json_table'; 36 Nested type
  • 37. How about This Data? {"id":"2937257761","type":"ForkEvent","actor":{"id":1088854,"login":"CAOakleyII","gravatar_id":"","url":"https://api.github.com/users/CAOakleyII","avatar_url":"https://avatars.githubusercontent.com/ u/1088854?"},"repo":{"id":11909954,"name":"skycocker/chromebrew","url":"https://api.github.com/repos/skycocker/chromebrew"},"payload":{"forkee":{"id":38339291,"name":"chromebrew","full_na me":"CAOakleyII/chromebrew","owner":{"login":"CAOakleyII","id":1088854,"avatar_url":"https://avatars.githubusercontent.com/u/1088854?v=3","gravatar_id":"","url":"https://api.github.com/users/C AOakleyII","html_url":"https://github.com/CAOakleyII","followers_url":"https://api.github.com/users/CAOakleyII/followers","following_url":"https://api.github.com/users/CAOakleyII/following{/other_ user}","gists_url":"https://api.github.com/users/CAOakleyII/gists{/gist_id}","starred_url":"https://api.github.com/users/CAOakleyII/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/ users/CAOakleyII/subscriptions","organizations_url":"https://api.github.com/users/CAOakleyII/orgs","repos_url":"https://api.github.com/users/CAOakleyII/repos","events_url":"https://api.github.com/ users/CAOakleyII/events{/privacy}","received_events_url":"https://api.github.com/users/CAOakleyII/received_events","type":"User","site_admin":false},"private":false,"html_url":"https://github.com/C AOakleyII/chromebrew","description":"Package manager for Chrome OS","fork":true,"url":"https://api.github.com/repos/CAOakleyII/chromebrew","forks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/forks","keys_url":"https://api.github.com/repos/CAOa kleyII/chromebrew/keys{/key_id}","collaborators_url":"https://api.github.com/repos/CAOakleyII/chromebrew/collaborators{/collaborator}","teams_url":"https://api.github.com/repos/CAOakleyII/chro mebrew/teams","hooks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/hooks","issue_events_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/events{/number}","event s_url":"https://api.github.com/repos/CAOakleyII/chromebrew/events","assignees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/assignees{/user}","branches_url":"https://api.github.com /repos/CAOakleyII/chromebrew/branches{/branch}","tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/tags","blobs_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/blo bs{/sha}","git_tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/tags{/sha}","git_refs_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/refs{/sha}","trees_url":"https:/ /api.github.com/repos/CAOakleyII/chromebrew/git/trees{/sha}","statuses_url":"https://api.github.com/repos/CAOakleyII/chromebrew/statuses/{sha}","languages_url":"https://api.github.com/repos/C AOakleyII/chromebrew/languages","stargazers_url":"https://api.github.com/repos/CAOakleyII/chromebrew/stargazers","contributors_url":"https://api.github.com/repos/CAOakleyII/chromebrew/cont ributors","subscribers_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscribers","subscription_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscription","commits_url": "https://api.github.com/repos/CAOakleyII/chromebrew/commits{/sha}","git_commits_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/commits{/sha}","comments_url":"https://api.gith ub.com/repos/CAOakleyII/chromebrew/comments{/number}","issue_comment_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/comments{/number}","contents_url":"https://api.git hub.com/repos/CAOakleyII/chromebrew/contents/{+path}","compare_url":"https://api.github.com/repos/CAOakleyII/chromebrew/compare/{base}...{head}","merges_url":"https://api.github.com/repo s/CAOakleyII/chromebrew/merges","archive_url":"https://api.github.com/repos/CAOakleyII/chromebrew/{archive_format}{/ref}","downloads_url":"https://api.github.com/repos/CAOakleyII/chromebr ew/downloads","issues_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues{/number}","pulls_url":"https://api.github.com/repos/CAOakleyII/chromebrew/pulls{/number}","milestones_ url":"https://api.github.com/repos/CAOakleyII/chromebrew/milestones{/number}","notifications_url":"https://api.github.com/repos/CAOakleyII/chromebrew/notifications{?since,all,participating}","lab els_url":"https://api.github.com/repos/CAOakleyII/chromebrew/labels{/name}","releases_url":"https://api.github.com/repos/CAOakleyII/chromebrew/releases{/id}","created_at":"2015-07- 01T00:00:00Z","updated_at":"2015-06-28T10:11:09Z","pushed_at":"2015-06- 09T07:46:57Z","git_url":"git://github.com/CAOakleyII/chromebrew.git","ssh_url":"git@github.com:CAOakleyII/chromebrew.git","clone_url":"https://github.com/CAOakleyII/chromebrew.git","svn_url":" https://github.com/CAOakleyII/chromebrew","homepage":"http://skycocker.github.io/chromebrew/","size":846,"stargazers_count":0,"watchers_count":0,"language":null,"has_issues":false,"has_downl oads":true,"has_wiki":true,"has_pages":false,"forks_count":0,"mirror_url":null,"open_issues_count":0,"forks":0,"open_issues":0,"watchers":0,"default_branch":"master","public":true}},"public":true,"cre ated_at":"2015-07-01T00:00:01Z"} ... 37
  • 38. Create Schemaless Table > CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table'; That's all! 38 Allow any schema
  • 39. Schema-free Query Execution > CREATE EXTERNAL TABLE schemaful_table (id BIGINT, name TEXT, ...) USING text LOCATION 'hdfs:///csv_table; > CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table'; > SELECT name.first_name, name.last_name from schemaless_table; > SELECT title, count(*) FROM schemaful_table, schemaless_table WHERE name = name.last_name GROUP BY title; 39
  • 40. Schema Inference Table schema is inferred at query time Example 40 SELECT a, b.b1, b.b2.c1 FROM t; ( a text, b record ( b1 text, b2 record ( c1 text ) ) ) Inferred schemaQuery
  • 42. Demo with Command line 42
  • 44. Roadmap 0.12 Improved Yarn integration Authentication support JavaScript stored procedure support Scalar subquery support Hive UDF support 44
  • 45. Roadmap Next generation (beyond 0.12) Exploiting modern hardware Approximate query processing Genetic query optimization And more … 45
  • 46. tajo> select question from you; 46