SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Apache Tajo:
A Big Data Warehouse System
on Hadoop
Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014
Talk Outline
• Introduction to Apache Tajo
• What you can do with Tajo
• Why you should use Tajo
• Current Status of Tajo Project
• Demonstration
About Me
• Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
• PhD (Computer Science & Engineering, 2013), Korea Univ
.
• Director of Research, Gruter Corp
• Open-source Involvement
– Full-time contributor to Apache Tajo (2013.6 ~ )
– Apache Tajo PMC member and committer (2013.3 ~ )
– Apache Giraph PMC member and committer (2011. 8 ~ )
• Contact Info
– Email: hyunsik@apache.org
– Linkedin: http://linkedin.com/in/hyunsikchoi/
Apache Tajo
• Open-source “SQL-on-H” “Big DW” system
• Apache Top-level project since March 2014
• Supports SQL standards
• Low latency, long running batch queries
• Features
– Supports Joins (inner and all outer), Groupby, and Sort
– Window function
– Most SQL data types supported (except for Decimal)
• Recent 0.8.0 release
– https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
Overall Architecture
What You Can Do with Tajo
• Batch queries
– Long-running queries (~ hours)
• Dynamic Scheduling
• Fault Tolerance
– ETL workloads
• Interactive Ad-hoc Queries
– Very low-latency (100 ms ~)
– Few seconds on several TB dataset if you cluster
capability is enough
Why You Should Use Tajo
• SQL Standards
– Non standard features – PgSQL and Oracle
• Simple Installation and Operation
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Simple Software Stack Requirement
– No MapReduce and No Tez
– Yarn support but not mandatory
– Tajo + Linux system for single node cluster
– Tajo + HDFS for a distributed cluster
Why You Should Use Tajo
• Mature SQL Feature Set
– Fully distributed query executions
• Inner join, and left/right/full outer join
• Groupby, sort, multiple distinct aggregation, window function
– SQL data types
• CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
• TIMESTAMP, DATE, TIME, and INTERVAL
• DECIMAL (working)
– Various file formats
• Text file (CSV), RCFile, Parquet (flat schema), and
Avro (flat schema)
Why You Should Use Tajo
• Fully community-driven open source
• Stable development team
– 5 fulltime contributors + many contributors
• Performance and speed
– Faster than Hive 0.10 (1.5 – 10 times)
– Tajo v.s. Hive 0.13 ?
– Tajo v.s. Impala ?
Why You Should Use Tajo
• Integration with Hadoop Ecosystem
– Hadoop 2.2.0 – 2.4.0 support
– Be able to connect to Hive Metastore
– Directly process tables managed by Hive
– Yarn support (backport)
• Enable Tajo to deploy and run on Yarn cluster
• Allow users to add/remove cluster nodes to/from Tajo
cluster in runtime
• Contributed by Min Zhou (committer), Linkedin Engineer
• https://github.com/coderplay/tajo-yarn
Current Status – Overall
• Under beta stage – majority of key features are getting ready
• Most of SQL features implemented
• Working on hundreds of clusters for
production
– Collaboration with the biggest telco in S. Korea
• We’ve just started works on low-level
optimization.
– Runtime byte code generation (v0.9)
– Unsafe-based hash table for hash aggregation/join
– Vectorized execution engine
Current Status – Logical Plan Optimizer
• Basic Rewrite Rule
– Common sub expression elimination
– Constant folding (CF), and Null propagation
• Projection Push Down (PPD)
– push expressions to operators lower as possible
– narrow read columns
– remove duplicated expressions
• if some expressions has common expression
• Filter Push Down (FPD)
– reduce rows to be processed earlier as possible
• Extensible Rewrite Rule
– Allow developers to write their own rewrite rules
Current Status – Logical Plan Optimizer
SELECT
item_id,
order_id
sum_price * (1.2 * 0.3)
as total,
FROM (
SELECT
item_id,
order_id,
sum(price) as sum_price
FROM
ITEMS
GROUP BY item_id, order_id
) a
WHERE item_id = 17234
SELECT
item_id,
order_id,
sum(price) * (3.6)
FROM
ITEMS
GROUP BY
item_id,
order_id
WHERE item_id = 17234
Original Rewritten
CF + PPD
FPD
Current Status – Logical Plan Optimizer
• Cost-based Join Order (since v0.2)
– Don’t need to guess right join orders anymore
– Greedy heuristic algorithm
• Resulting in a bushy join tree instead of left-deep join tree
Left-deep Join Tree Bush Join Tree
Current Status – Window Function
• OVER clause
– row_number() and rank()
– Aggregation function support
– PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date FROM (
SELECT
depname, empno, salary, enroll_date,
rank() OVER (PARTITION BY depname
ORDER BY salary DESC, empno) AS pos
FROM empsalary
) AS ss
WHERE
pos < 3;
Current Status – Join
• Join
– NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
– SEMI, ANTI Join (planned for v0.9)
• Join Predicates
– WHERE and ON predicates
– de-factor standard outer join behavior with both
predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';
SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.n
um and t2.value = ‘xxx’;
Current Status – Table Partitions
• Column Value Partition
– Hive Compatible Partition
• Range Partition (planned for 1.0)
– Table will be partitioned by disjoint ranges.
– Will remove the partition granularity problem of
Hive Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
using PARQUET
WITH (‘parquet.compression’ = ‘SNAPPY’)
PARTITION BY COLUMN (C3 INT, C4 TEXT);
Future Works
• Multi-tenant Scheduler (v0.9)
– Support multiple users and multiple queries
• Runtime byte code generation for
expressions (v0.9)
– Eliminate interpret overhead of expression evaluation
• Authentication and SQL Standard Access Control
• JIT-based Vectorized Processing Engine
– Refer to Hadoop Summit 2014 Slide
(http://goo.gl/jWghhp)
Get Involved!
• We are recruiting contributors!
• General
– http://tajo.apache.org
• Getting Started
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Downloads
– http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
• Jira – Issue Tracker
– https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
– dev-subscribe@tajo.apache.org
– issues-subscribe@tajo.apache.org
Demonstration

Weitere ähnliche Inhalte

Was ist angesagt?

Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
Hyunsik Choi
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Gruter
 

Was ist angesagt? (20)

Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Apache drill
Apache drillApache drill
Apache drill
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 

Ähnlich wie Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
HBaseCon2015-final
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
Maryann Xue
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
alanfgates
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 

Ähnlich wie Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop (20)

Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
 
Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Apache Tajo - BWC 2014
Apache Tajo - BWC 2014
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
 
Presto updates to 0.178
Presto updates to 0.178Presto updates to 0.178
Presto updates to 0.178
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Introduction to Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Introduction to Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventIntroduction to Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Introduction to Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
 
Tajo Seoul Meetup-201501
Tajo Seoul Meetup-201501Tajo Seoul Meetup-201501
Tajo Seoul Meetup-201501
 
HBaseCon2015-final
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
 
Python redis talk
Python redis talkPython redis talk
Python redis talk
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
[FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo
[FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo[FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo
[FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Defending the Enterprise with Evernote at SourceBoston on May 27, 2015
Defending the Enterprise with Evernote at SourceBoston on May 27, 2015Defending the Enterprise with Evernote at SourceBoston on May 27, 2015
Defending the Enterprise with Evernote at SourceBoston on May 27, 2015
 

Mehr von Gruter

Mehr von Gruter (20)

MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
 
Introduction to Apache Tajo
Introduction to Apache TajoIntroduction to Apache Tajo
Introduction to Apache Tajo
 
스타트업사례로 본 로그 데이터분석 : Tajo on AWS
스타트업사례로 본 로그 데이터분석 : Tajo on AWS스타트업사례로 본 로그 데이터분석 : Tajo on AWS
스타트업사례로 본 로그 데이터분석 : Tajo on AWS
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
 
Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)
 
Efficient In­‐situ Processing of Various Storage Types on Apache Tajo
Efficient In­‐situ Processing of Various Storage Types on Apache TajoEfficient In­‐situ Processing of Various Storage Types on Apache Tajo
Efficient In­‐situ Processing of Various Storage Types on Apache Tajo
 
Tajo TPC-H Benchmark Test on AWS
Tajo TPC-H Benchmark Test on AWSTajo TPC-H Benchmark Test on AWS
Tajo TPC-H Benchmark Test on AWS
 
Data analysis with Tajo
Data analysis with TajoData analysis with Tajo
Data analysis with Tajo
 
Gruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigDataGruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigData
 
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)
 
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
 
Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014
 
Hadoop security DeView 2014
Hadoop security DeView 2014Hadoop security DeView 2014
Hadoop security DeView 2014
 
Vectorized processing in_a_nutshell_DeView2014
Vectorized processing in_a_nutshell_DeView2014Vectorized processing in_a_nutshell_DeView2014
Vectorized processing in_a_nutshell_DeView2014
 
Cloumon sw제품설명회 발표자료
Cloumon sw제품설명회 발표자료Cloumon sw제품설명회 발표자료
Cloumon sw제품설명회 발표자료
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013
 
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo,  and application case of SK TelecomSQL-on-Hadoop with Apache Tajo,  and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
 
Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105
 

Kürzlich hochgeladen

CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
anilsa9823
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Kürzlich hochgeladen (20)

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 

Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

  • 1. Apache Tajo: A Big Data Warehouse System on Hadoop Hyunsik Choi Director of Research, Gruter Big Data Camp LA 2014
  • 2. Talk Outline • Introduction to Apache Tajo • What you can do with Tajo • Why you should use Tajo • Current Status of Tajo Project • Demonstration
  • 3. About Me • Hyunsik Choi (pronounced “Hyeon-shick Cheh”) • PhD (Computer Science & Engineering, 2013), Korea Univ . • Director of Research, Gruter Corp • Open-source Involvement – Full-time contributor to Apache Tajo (2013.6 ~ ) – Apache Tajo PMC member and committer (2013.3 ~ ) – Apache Giraph PMC member and committer (2011. 8 ~ ) • Contact Info – Email: hyunsik@apache.org – Linkedin: http://linkedin.com/in/hyunsikchoi/
  • 4. Apache Tajo • Open-source “SQL-on-H” “Big DW” system • Apache Top-level project since March 2014 • Supports SQL standards • Low latency, long running batch queries • Features – Supports Joins (inner and all outer), Groupby, and Sort – Window function – Most SQL data types supported (except for Decimal) • Recent 0.8.0 release – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
  • 6. What You Can Do with Tajo • Batch queries – Long-running queries (~ hours) • Dynamic Scheduling • Fault Tolerance – ETL workloads • Interactive Ad-hoc Queries – Very low-latency (100 ms ~) – Few seconds on several TB dataset if you cluster capability is enough
  • 7. Why You Should Use Tajo • SQL Standards – Non standard features – PgSQL and Oracle • Simple Installation and Operation – http://tajo.apache.org/docs/0.8.0/getting_started.html • Simple Software Stack Requirement – No MapReduce and No Tez – Yarn support but not mandatory – Tajo + Linux system for single node cluster – Tajo + HDFS for a distributed cluster
  • 8. Why You Should Use Tajo • Mature SQL Feature Set – Fully distributed query executions • Inner join, and left/right/full outer join • Groupby, sort, multiple distinct aggregation, window function – SQL data types • CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT • TIMESTAMP, DATE, TIME, and INTERVAL • DECIMAL (working) – Various file formats • Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
  • 9. Why You Should Use Tajo • Fully community-driven open source • Stable development team – 5 fulltime contributors + many contributors • Performance and speed – Faster than Hive 0.10 (1.5 – 10 times) – Tajo v.s. Hive 0.13 ? – Tajo v.s. Impala ?
  • 10. Why You Should Use Tajo • Integration with Hadoop Ecosystem – Hadoop 2.2.0 – 2.4.0 support – Be able to connect to Hive Metastore – Directly process tables managed by Hive – Yarn support (backport) • Enable Tajo to deploy and run on Yarn cluster • Allow users to add/remove cluster nodes to/from Tajo cluster in runtime • Contributed by Min Zhou (committer), Linkedin Engineer • https://github.com/coderplay/tajo-yarn
  • 11. Current Status – Overall • Under beta stage – majority of key features are getting ready • Most of SQL features implemented • Working on hundreds of clusters for production – Collaboration with the biggest telco in S. Korea • We’ve just started works on low-level optimization. – Runtime byte code generation (v0.9) – Unsafe-based hash table for hash aggregation/join – Vectorized execution engine
  • 12. Current Status – Logical Plan Optimizer • Basic Rewrite Rule – Common sub expression elimination – Constant folding (CF), and Null propagation • Projection Push Down (PPD) – push expressions to operators lower as possible – narrow read columns – remove duplicated expressions • if some expressions has common expression • Filter Push Down (FPD) – reduce rows to be processed earlier as possible • Extensible Rewrite Rule – Allow developers to write their own rewrite rules
  • 13. Current Status – Logical Plan Optimizer SELECT item_id, order_id sum_price * (1.2 * 0.3) as total, FROM ( SELECT item_id, order_id, sum(price) as sum_price FROM ITEMS GROUP BY item_id, order_id ) a WHERE item_id = 17234 SELECT item_id, order_id, sum(price) * (3.6) FROM ITEMS GROUP BY item_id, order_id WHERE item_id = 17234 Original Rewritten CF + PPD FPD
  • 14. Current Status – Logical Plan Optimizer • Cost-based Join Order (since v0.2) – Don’t need to guess right join orders anymore – Greedy heuristic algorithm • Resulting in a bushy join tree instead of left-deep join tree Left-deep Join Tree Bush Join Tree
  • 15. Current Status – Window Function • OVER clause – row_number() and rank() – Aggregation function support – PARTITION and ORDER BY clause SELECT depname, empno, salary, enroll_date FROM ( SELECT depname, empno, salary, enroll_date, rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos FROM empsalary ) AS ss WHERE pos < 3;
  • 16. Current Status – Join • Join – NATURAL, INNER, OUTER (LEFT, RIGHT, FULL) – SEMI, ANTI Join (planned for v0.9) • Join Predicates – WHERE and ON predicates – de-factor standard outer join behavior with both predicates SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx'; SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.n um and t2.value = ‘xxx’;
  • 17. Current Status – Table Partitions • Column Value Partition – Hive Compatible Partition • Range Partition (planned for 1.0) – Table will be partitioned by disjoint ranges. – Will remove the partition granularity problem of Hive Partition CREATE TABLE T1 (C1 INT, C2 TEXT) using PARQUET WITH (‘parquet.compression’ = ‘SNAPPY’) PARTITION BY COLUMN (C3 INT, C4 TEXT);
  • 18. Future Works • Multi-tenant Scheduler (v0.9) – Support multiple users and multiple queries • Runtime byte code generation for expressions (v0.9) – Eliminate interpret overhead of expression evaluation • Authentication and SQL Standard Access Control • JIT-based Vectorized Processing Engine – Refer to Hadoop Summit 2014 Slide (http://goo.gl/jWghhp)
  • 19. Get Involved! • We are recruiting contributors! • General – http://tajo.apache.org • Getting Started – http://tajo.apache.org/docs/0.8.0/getting_started.html • Downloads – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html • Jira – Issue Tracker – https://issues.apache.org/jira/browse/TAJO • Join the mailing list – dev-subscribe@tajo.apache.org – issues-subscribe@tajo.apache.org