Apache Tajo:
A Big Data Warehouse System
on Hadoop
Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014
Talk Outline
• Introduction to Apache Tajo
• What you can do with Tajo
• Why you should use Tajo
• Current Status of Tajo Project
• Demonstration
About Me
• Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
• PhD (Computer Science & Engineering, 2013), Korea Univ.
• Director of Research, Gruter Corp
• Open-source Involvement
– Full-time contributor to Apache Tajo (2013.6 ~ )
– Apache Tajo PMC member and committer (2013.3 ~ )
– Apache Giraph PMC member and committer (2011. 8 ~ )
• Contact Info
– Email: hyunsik@apache.org
– LinkedIn: http://linkedin.com/in/hyunsikchoi/
Apache Tajo
• Open-source “SQL-on-Hadoop” “Big DW” system
• Apache Top-level project since March 2014
• Supports SQL standards
• Low-latency queries and long-running batch queries
• Features
– Supports Joins (inner and all outer), Groupby, and Sort
– Window function
– Most SQL data types supported (except for Decimal)
• Recent 0.8.0 release
– https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
Overall Architecture
What You Can Do with Tajo
• Batch queries
– Long-running queries (~ hours)
• Dynamic Scheduling
• Fault Tolerance
– ETL workloads
• Interactive Ad-hoc Queries
– Very low-latency (100 ms ~)
– A few seconds on datasets of several TB, if your cluster capacity is sufficient
Why You Should Use Tajo
• SQL Standards
– Non-standard features follow PgSQL and Oracle
• Simple Installation and Operation
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Simple Software Stack Requirement
– No MapReduce and No Tez
– YARN is supported but not mandatory
– Tajo + Linux system for single node cluster
– Tajo + HDFS for a distributed cluster
Why You Should Use Tajo
• Mature SQL Feature Set
– Fully distributed query executions
• Inner join, and left/right/full outer join
• Groupby, sort, multiple distinct aggregation, window function
– SQL data types
• CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
• TIMESTAMP, DATE, TIME, and INTERVAL
• DECIMAL (work in progress)
– Various file formats
• Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
Why You Should Use Tajo
• Fully community-driven open source
• Stable development team
– 5 full-time contributors + many other contributors
• Performance and speed
– Faster than Hive 0.10 (1.5 – 10 times)
– Tajo vs. Hive 0.13?
– Tajo vs. Impala?
Why You Should Use Tajo
• Integration with Hadoop Ecosystem
– Hadoop 2.2.0 – 2.4.0 support
– Can connect to the Hive Metastore
– Directly process tables managed by Hive
– YARN support (backport)
• Enables Tajo to be deployed and run on a YARN cluster
• Allows users to add/remove nodes to/from a Tajo cluster at runtime
• Contributed by Min Zhou (committer), LinkedIn engineer
• https://github.com/coderplay/tajo-yarn
Current Status – Overall
• In beta stage – the majority of key features are nearly ready
• Most SQL features implemented
• Working on hundreds of clusters for production
– Collaboration with the biggest telco in S. Korea
• We have just started work on low-level optimizations.
– Runtime byte code generation (v0.9)
– Unsafe-based hash table for hash aggregation/join
– Vectorized execution engine
Current Status – Logical Plan Optimizer
• Basic Rewrite Rule
– Common sub expression elimination
– Constant folding (CF), and Null propagation
• Projection Push Down (PPD)
– Push expressions down to the lowest possible operators
– Narrow the set of columns read
– Remove duplicated expressions
• if several expressions share a common subexpression
• Filter Push Down (FPD)
– Reduce the rows to be processed as early as possible
• Extensible Rewrite Rule
– Allow developers to write their own rewrite rules
Current Status – Logical Plan Optimizer
Original:
SELECT
item_id,
order_id,
sum_price * (1.2 * 0.3) AS total
FROM (
SELECT
item_id,
order_id,
sum(price) AS sum_price
FROM
ITEMS
GROUP BY item_id, order_id
) a
WHERE item_id = 17234

Rewritten:
SELECT
item_id,
order_id,
sum(price) * (0.36) AS total -- CF + PPD
FROM
ITEMS
WHERE item_id = 17234 -- FPD
GROUP BY
item_id,
order_id
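To make the constant folding (CF) step concrete, here is a minimal Python sketch. It is not Tajo's optimizer code: the tuple-based expression tree and the `fold` helper are assumptions made up for illustration. It only shows the idea that a subexpression made entirely of constants, such as (1.2 * 0.3), can be evaluated once at planning time.

```python
import operator

# Hypothetical expression tree: a number, a column name (str),
# or a tuple (op, left, right). Not Tajo's internal representation.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Recursively replace constant-only subtrees with their value."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a column reference
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # both sides known at planning time
    return (op, left, right)

# sum_price * (1.2 * 0.3): only the inner product is constant.
expr = ("*", "sum_price", ("*", 1.2, 0.3))
folded = fold(expr)
print(folded)
```

The column reference `sum_price` is left untouched, so the folded tree still multiplies it by a single precomputed constant (approximately 0.36).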
Current Status – Logical Plan Optimizer
• Cost-based Join Order (since v0.2)
– No need to guess the right join order anymore
– Greedy heuristic algorithm
• Resulting in a bushy join tree instead of left-deep join tree
[Figure: left-deep join tree vs. bushy join tree]
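The following Python sketch shows one way a greedy heuristic can end up with a bushy join tree. It is an illustration only and not Tajo's actual algorithm: the function `greedy_join_order`, the row counts, and the selectivities are all hypothetical, and the cost estimate (product of input sizes scaled by predicate selectivity) is a deliberately crude assumption.

```python
from itertools import combinations

def greedy_join_order(sizes, selectivity):
    """sizes: {relation: row_count}
    selectivity: {frozenset({a, b}): fraction of the cross product kept}
    Returns a nested tuple describing the chosen join tree."""
    # Each forest entry: (tree, base relations covered, estimated rows)
    forest = [(name, frozenset([name]), rows) for name, rows in sizes.items()]

    def estimate(left, right):
        # Cross product scaled by every predicate linking the two sides.
        rows = left[2] * right[2]
        for a in left[1]:
            for b in right[1]:
                rows *= selectivity.get(frozenset([a, b]), 1.0)
        return rows

    while len(forest) > 1:
        # Greedy step: join the pair with the smallest estimated result.
        left, right = min(combinations(forest, 2), key=lambda p: estimate(*p))
        forest.remove(left)
        forest.remove(right)
        forest.append(((left[0], right[0]),
                       left[1] | right[1],
                       estimate(left, right)))
    return forest[0][0]

# Hypothetical statistics for four relations.
sizes = {"a": 1000, "b": 1000, "c": 1000, "d": 1000}
selectivity = {
    frozenset(["a", "b"]): 0.001,
    frozenset(["c", "d"]): 0.002,
    frozenset(["b", "c"]): 0.01,
}
tree = greedy_join_order(sizes, selectivity)
print(tree)  # (('a', 'b'), ('c', 'd'))
```

Because any two subtrees may be joined, not just the running result with one more base table, the outcome here joins (a, b) and (c, d) independently before combining them, which is the bushy shape.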
Current Status – Window Function
• OVER clause
– row_number() and rank()
– Aggregation function support
– PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date FROM (
SELECT
depname, empno, salary, enroll_date,
rank() OVER (PARTITION BY depname
ORDER BY salary DESC, empno) AS pos
FROM empsalary
) AS ss
WHERE
pos < 3;
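As a rough illustration of what the query above computes, the sketch below reproduces the semantics of rank() OVER (PARTITION BY ... ORDER BY ...) in plain Python. It says nothing about how Tajo executes window functions; the helper `rank_over` and the sample rows are assumptions invented for this example.

```python
from itertools import groupby

def rank_over(rows, partition_key, order_key):
    """Attach a SQL-style rank() as 'pos': ties share a rank, gaps follow."""
    out = []
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, part in groupby(ordered, key=partition_key):
        prev, rank = None, 0
        for pos, row in enumerate(part, start=1):
            key = order_key(row)
            if key != prev:  # a new ordering value starts a new rank
                rank, prev = pos, key
            out.append({**row, "pos": rank})
    return out

# Hypothetical empsalary rows.
empsalary = [
    {"depname": "dev", "empno": 1, "salary": 5200},
    {"depname": "dev", "empno": 2, "salary": 6000},
    {"depname": "dev", "empno": 3, "salary": 4500},
    {"depname": "sales", "empno": 4, "salary": 4800},
]

ranked = rank_over(
    empsalary,
    partition_key=lambda r: r["depname"],
    order_key=lambda r: (-r["salary"], r["empno"]),  # salary DESC, empno
)
# WHERE pos < 3: keep the two best-paid employees per department.
top = [(r["depname"], r["empno"]) for r in ranked if r["pos"] < 3]
print(top)  # [('dev', 2), ('dev', 1), ('sales', 4)]
```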
Current Status – Join
• Join
– NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
– SEMI, ANTI Join (planned for v0.9)
• Join Predicates
– WHERE and ON predicates
– De facto standard outer join behavior with both predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';
SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num AND t2.value = 'xxx';
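Where a predicate sits matters for outer joins. The Python sketch below illustrates the usual SQL semantics under stated assumptions: an ON predicate decides which rows match before unmatched left rows are padded with NULLs, while a WHERE predicate filters the result after padding. The `left_join` helper and the sample tables are hypothetical and do not describe Tajo's implementation.

```python
def left_join(left, right, on):
    """Nested-loop LEFT JOIN: unmatched left rows are paired with None."""
    out = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))  # NULL-extended row
    return out

# Hypothetical tables.
t1 = [{"num": 1}, {"num": 2}]
t2 = [{"num": 1, "value": "xxx"}, {"num": 2, "value": "yyy"}]

# ... LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx'
joined = left_join(t1, t2, lambda l, r: l["num"] == r["num"])
where_version = [(l, r) for l, r in joined
                 if r is not None and r["value"] == "xxx"]

# ... LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx'
on_version = left_join(
    t1, t2, lambda l, r: l["num"] == r["num"] and r["value"] == "xxx"
)

print(len(where_version))  # 1
print(len(on_version))     # 2
```

With the value test in WHERE, the row for num = 2 is dropped. With the same test in ON, that row survives with NULLs on the t2 side.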
Current Status – Table Partitions
• Column Value Partition
– Hive Compatible Partition
• Range Partition (planned for 1.0)
– Table will be partitioned by disjoint ranges.
– Will remove the partition granularity problem of
Hive Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
USING PARQUET
WITH ('parquet.compression' = 'SNAPPY')
PARTITION BY COLUMN (C3 INT, C4 TEXT);
Future Work
• Multi-tenant Scheduler (v0.9)
– Support multiple users and multiple queries
• Runtime byte code generation for expressions (v0.9)
– Eliminate the interpretation overhead of expression evaluation
• Authentication and SQL Standard Access Control
• JIT-based Vectorized Processing Engine
– Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)
Get Involved!
• We are recruiting contributors!
• General
– http://tajo.apache.org
• Getting Started
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Downloads
– http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
• Jira – Issue Tracker
– https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
– dev-subscribe@tajo.apache.org
– issues-subscribe@tajo.apache.org
Demonstration

How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Kürzlich hochgeladen (20)

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

  • 1. Apache Tajo: A Big Data Warehouse System on Hadoop Hyunsik Choi Director of Research, Gruter Big Data Camp LA 2014
  • 2. Talk Outline • Introduction to Apache Tajo • What you can do with Tajo • Why you should use Tajo • Current Status of Tajo Project • Demonstration
  • 3. About Me • Hyunsik Choi (pronounced “Hyeon-shick Cheh”) • PhD (Computer Science & Engineering, 2013), Korea Univ. • Director of Research, Gruter Corp • Open-source Involvement – Full-time contributor to Apache Tajo (2013.6 ~ ) – Apache Tajo PMC member and committer (2013.3 ~ ) – Apache Giraph PMC member and committer (2011.8 ~ ) • Contact Info – Email: hyunsik@apache.org – Linkedin: http://linkedin.com/in/hyunsikchoi/
  • 4. Apache Tajo • Open-source “SQL-on-H” “Big DW” system • Apache Top-level project since March 2014 • Supports SQL standards • Low latency, long running batch queries • Features – Supports Joins (inner and all outer), Groupby, and Sort – Window function – Most SQL data types supported (except for Decimal) • Recent 0.8.0 release – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
  • 6. What You Can Do with Tajo • Batch queries – Long-running queries (~ hours) • Dynamic Scheduling • Fault Tolerance – ETL workloads • Interactive Ad-hoc Queries – Very low-latency (100 ms ~) – A few seconds on a dataset of several TB if your cluster capacity is sufficient
  • 7. Why You Should Use Tajo • SQL Standards – Non-standard features – PgSQL and Oracle • Simple Installation and Operation – http://tajo.apache.org/docs/0.8.0/getting_started.html • Simple Software Stack Requirement – No MapReduce and No Tez – YARN support, but not mandatory – Tajo + Linux system for a single-node cluster – Tajo + HDFS for a distributed cluster
  • 8. Why You Should Use Tajo • Mature SQL Feature Set – Fully distributed query executions • Inner join, and left/right/full outer join • Groupby, sort, multiple distinct aggregation, window function – SQL data types • CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT • TIMESTAMP, DATE, TIME, and INTERVAL • DECIMAL (in progress) – Various file formats • Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
  • 9. Why You Should Use Tajo • Fully community-driven open source • Stable development team – 5 full-time contributors + many other contributors • Performance and speed – Faster than Hive 0.10 (1.5 – 10 times) – Tajo vs. Hive 0.13? – Tajo vs. Impala?
  • 10. Why You Should Use Tajo • Integration with Hadoop Ecosystem – Hadoop 2.2.0 – 2.4.0 support – Can connect to Hive Metastore – Directly processes tables managed by Hive – YARN support (backport) • Enables Tajo to deploy and run on a YARN cluster • Allows users to add/remove cluster nodes to/from a Tajo cluster at runtime • Contributed by Min Zhou (committer), Linkedin Engineer • https://github.com/coderplay/tajo-yarn
  • 11. Current Status – Overall • In beta stage – the majority of key features are getting ready • Most SQL features implemented • Working on hundreds of clusters for production – Collaboration with the biggest telco in S. Korea • We’ve just started work on low-level optimization. – Runtime byte code generation (v0.9) – Unsafe-based hash table for hash aggregation/join – Vectorized execution engine
  • 12. Current Status – Logical Plan Optimizer • Basic Rewrite Rule – Common subexpression elimination – Constant folding (CF), and Null propagation • Projection Push Down (PPD) – push expressions down to operators as low as possible – narrow the columns to be read – remove duplicated expressions • if some expressions share a common subexpression • Filter Push Down (FPD) – reduce the rows to be processed as early as possible • Extensible Rewrite Rule – Allow developers to write their own rewrite rules
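To make the basic rewrite rules on slide 12 concrete, here is a minimal sketch of constant folding and null propagation over a toy expression tree. The classes `Const`, `Column`, `BinOp` and the function `fold` are hypothetical names invented for this illustration; they are not Tajo's planner classes, and Tajo's real rules cover far more operators.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Const:
    value: Optional[float]  # None models SQL NULL


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # "+" or "*" in this sketch
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, BinOp]


def fold(expr: Expr) -> Expr:
    """Fold constant subtrees bottom-up and propagate NULL through arithmetic."""
    if not isinstance(expr, BinOp):
        return expr
    left, right = fold(expr.left), fold(expr.right)
    # Null propagation: arithmetic with a NULL operand yields NULL.
    for side in (left, right):
        if isinstance(side, Const) and side.value is None:
            return Const(None)
    # Constant folding: both operands are already known at planning time.
    if isinstance(left, Const) and isinstance(right, Const):
        if expr.op == "+":
            return Const(left.value + right.value)
        if expr.op == "*":
            return Const(left.value * right.value)
    return BinOp(expr.op, left, right)


# sum_price * (1.2 * 0.3): the constant subtree collapses to a single literal.
folded = fold(BinOp("*", Column("sum_price"), BinOp("*", Const(1.2), Const(0.3))))
# price + (NULL * 2.0): the whole expression collapses to NULL.
nulled = fold(BinOp("+", Column("price"), BinOp("*", Const(None), Const(2.0))))
```

After folding, the multiplication by the constant pair is evaluated once during planning instead of once per row.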
  • 13. Current Status – Logical Plan Optimizer Original: SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total FROM ( SELECT item_id, order_id, sum(price) AS sum_price FROM ITEMS GROUP BY item_id, order_id ) a WHERE item_id = 17234 Rewritten (CF + PPD + FPD): SELECT item_id, order_id, sum(price) * 0.36 AS total FROM ITEMS WHERE item_id = 17234 GROUP BY item_id, order_id
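The rewrite on slide 13 can be checked for equivalence on a small table. The sketch below uses Python's built-in `sqlite3` module rather than Tajo, so it only shows that the two query shapes return the same rows; the sample data is invented. It assumes the folded constant is 0.36 (the product of 1.2 and 0.3) and that the pushed-down filter sits before GROUP BY, as valid SQL requires. An ORDER BY is added only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, order_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(17234, 1, 10.0), (17234, 1, 5.0), (17234, 2, 20.0), (99, 3, 7.0)],
)

# Original form: aggregate everything, then filter and compute in the outer query.
original = """
    SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total
    FROM (
        SELECT item_id, order_id, sum(price) AS sum_price
        FROM items
        GROUP BY item_id, order_id
    ) a
    WHERE item_id = 17234
    ORDER BY order_id
"""

# Rewritten form: constant folded, projection and filter pushed below the aggregation.
rewritten = """
    SELECT item_id, order_id, sum(price) * 0.36 AS total
    FROM items
    WHERE item_id = 17234
    GROUP BY item_id, order_id
    ORDER BY order_id
"""

rows_original = conn.execute(original).fetchall()
rows_rewritten = conn.execute(rewritten).fetchall()


def same_rows(left, right, tol=1e-9):
    """Compare keys exactly and the floating-point total within a tolerance."""
    return len(left) == len(right) and all(
        a[:2] == b[:2] and abs(a[2] - b[2]) < tol for a, b in zip(left, right)
    )
```

In the rewritten form the filter removes rows for other items before aggregation, so the engine groups fewer rows.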
  • 14. Current Status – Logical Plan Optimizer • Cost-based Join Order (since v0.2) – No need to guess the right join order anymore – Greedy heuristic algorithm • Resulting in a bushy join tree instead of a left-deep join tree (Figure: left-deep join tree vs. bushy join tree)
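As a rough illustration of how a greedy heuristic can produce a bushy tree, the sketch below repeatedly joins the pair of subplans with the smallest estimated output. This is a generic textbook-style heuristic written for this note, not Tajo's actual algorithm or cost model. The function `greedy_join_order`, the cardinalities, and the selectivities are all assumptions.

```python
from itertools import combinations


def greedy_join_order(cardinalities, selectivities):
    """Greedily merge the cheapest pair of subplans until one plan remains.

    cardinalities: {relation_name: estimated_rows}
    selectivities: {frozenset({a, b}): fraction} for relation pairs with a join predicate
    Returns (plan, estimated_rows), where plan is a nested tuple of relation names.
    """
    # Each subplan is (tree, set_of_relations, estimated_rows).
    plans = [(name, frozenset([name]), rows) for name, rows in cardinalities.items()]

    def joined_rows(a, b):
        # Start from the cross product, then apply every predicate linking the two sides.
        rows = a[2] * b[2]
        for pair, sel in selectivities.items():
            x, y = tuple(pair)
            if (x in a[1] and y in b[1]) or (y in a[1] and x in b[1]):
                rows *= sel
        return rows

    while len(plans) > 1:
        a, b = min(combinations(plans, 2), key=lambda p: joined_rows(*p))
        plans.remove(a)
        plans.remove(b)
        plans.append(((a[0], b[0]), a[1] | b[1], joined_rows(a, b)))
    return plans[0][0], plans[0][2]


# Hypothetical statistics: A-B and C-D are selective joins, B-C is less selective.
plan, estimated_rows = greedy_join_order(
    {"A": 100, "B": 100, "C": 100, "D": 100},
    {
        frozenset({"A", "B"}): 0.001,
        frozenset({"C", "D"}): 0.002,
        frozenset({"B", "C"}): 0.1,
    },
)
```

With these numbers the heuristic joins A with B and C with D first, then joins the two intermediate results, which is a bushy shape that a left-deep planner could not produce.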
  • 15. Current Status – Window Function • OVER clause – row_number() and rank() – Aggregation function support – PARTITION and ORDER BY clause SELECT depname, empno, salary, enroll_date FROM ( SELECT depname, empno, salary, enroll_date, rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos FROM empsalary ) AS ss WHERE pos < 3;
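To show what the query on slide 15 computes, here is a small pure-Python emulation of `rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno)` followed by the `pos < 3` filter. The sample rows are invented, the `enroll_date` column is left out for brevity, and `top_per_department` is a hypothetical helper rather than anything from Tajo.

```python
from itertools import groupby

# Hypothetical sample rows: (depname, empno, salary)
empsalary = [
    ("develop", 11, 5200), ("develop", 7, 4200), ("develop", 9, 4500),
    ("sales", 1, 5000), ("sales", 3, 4800), ("sales", 4, 4800),
]


def top_per_department(rows, limit=3):
    """Rank rows inside each department and keep those with rank below `limit`."""
    result = []
    # Sort by partition key first, then by the window's ORDER BY keys.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2], r[1]))
    for depname, group in groupby(ordered, key=lambda r: r[0]):
        prev_key, pos = None, 0
        for index, row in enumerate(group, start=1):
            sort_key = (row[2], row[1])
            # rank(): rows that tie on every ORDER BY key share the same position.
            if sort_key != prev_key:
                pos = index
                prev_key = sort_key
            if pos < limit:
                result.append((depname, row[1], row[2], pos))
    return result


top_two = top_per_department(empsalary)
```

Because `empno` is part of the ordering, two employees with the same salary still get different ranks here, so each department contributes exactly two rows.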
  • 16. Current Status – Join • Join – NATURAL, INNER, OUTER (LEFT, RIGHT, FULL) – SEMI, ANTI Join (planned for v0.9) • Join Predicates – WHERE and ON predicates – de facto standard outer join behavior with both predicates SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx'; SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num and t2.value = 'xxx';
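The difference between ON and WHERE predicates in an outer join is easy to miss, so here is a small demonstration using Python's built-in `sqlite3` module as a stand-in engine. The table contents are invented, and the second query places the extra predicate in the ON clause, which is an assumption about the contrast slide 16 is drawing rather than a copy of its second query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (num INTEGER, name TEXT);
    CREATE TABLE t2 (num INTEGER, value TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO t2 VALUES (1, 'xxx'), (3, 'yyy');
""")

# Predicate in WHERE: applied after the join, so null-extended rows are filtered out.
where_filtered = conn.execute("""
    SELECT t1.num, t2.value FROM t1
    LEFT JOIN t2 ON t1.num = t2.num
    WHERE t2.value = 'xxx'
    ORDER BY t1.num
""").fetchall()

# Predicate in ON: applied while matching, so every t1 row is preserved.
on_filtered = conn.execute("""
    SELECT t1.num, t2.value FROM t1
    LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx'
    ORDER BY t1.num
""").fetchall()
```

The WHERE form returns only the matching row, while the ON form returns all three rows of `t1` with NULL in `value` where no match qualifies.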
  • 17. Current Status – Table Partitions • Column Value Partition – Hive Compatible Partition • Range Partition (planned for 1.0) – Table will be partitioned by disjoint ranges. – Will remove the partition granularity problem of Hive Partition CREATE TABLE T1 (C1 INT, C2 TEXT) USING PARQUET WITH ('parquet.compression' = 'SNAPPY') PARTITION BY COLUMN (C3 INT, C4 TEXT);
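Hive-style column value partitioning stores each distinct combination of partition column values in its own `column=value` directory, and the partition columns are not repeated inside the data files. The sketch below routes rows to such directories. The helper names, the `/warehouse/t1` path, and the sample rows are hypothetical, and it assumes Tajo's layout follows the Hive convention because the slide describes the feature as Hive compatible.

```python
from collections import defaultdict


def partition_path(table_dir, partition_columns, row):
    """Build a Hive-style directory path such as <table>/c3=10/c4=kr."""
    parts = [f"{col}={row[col]}" for col in partition_columns]
    return "/".join([table_dir, *parts])


def route_rows(table_dir, partition_columns, rows):
    """Group rows by partition directory, dropping the partition columns from the data."""
    buckets = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_columns}
        buckets[partition_path(table_dir, partition_columns, row)].append(data)
    return dict(buckets)


# Hypothetical rows for a table shaped like T1 (C1 INT, C2 TEXT) partitioned by (C3, C4).
rows = [
    {"c1": 1, "c2": "x", "c3": 10, "c4": "kr"},
    {"c1": 2, "c2": "y", "c3": 10, "c4": "kr"},
    {"c1": 3, "c2": "z", "c3": 20, "c4": "us"},
]
layout = route_rows("/warehouse/t1", ["c3", "c4"], rows)
```

A query that filters on `c3` or `c4` only needs to read the matching directories. The granularity problem mentioned on the slide arises when a partition column has many distinct values and therefore creates many small directories.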
  • 18. Future Work • Multi-tenant Scheduler (v0.9) – Support multiple users and multiple queries • Runtime byte code generation for expressions (v0.9) – Eliminate the interpretation overhead of expression evaluation • Authentication and SQL Standard Access Control • JIT-based Vectorized Processing Engine – Refer to Hadoop Summit 2014 Slide (http://goo.gl/jWghhp)
  • 19. Get Involved! • We are recruiting contributors! • General – http://tajo.apache.org • Getting Started – http://tajo.apache.org/docs/0.8.0/getting_started.html • Downloads – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html • Jira – Issue Tracker – https://issues.apache.org/jira/browse/TAJO • Join the mailing list – dev-subscribe@tajo.apache.org – issues-subscribe@tajo.apache.org