Page1 © Hortonworks Inc. 2014
Cost-based query optimization in
Apache Hive
Julian Hyde
June 4th, 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
(Thanks to John Pullokkaran and Harish Butani for presentation content and actually doing the work.)
Apache Hive
The original “SQL on Hadoop”
Undergoing extensive renovation
• Tez execution engine
• YARN execution environment
• Vectorized data representation
• Column-oriented data storage (ORC)
• ACID transactions
• SQL standards compliance
• SQL authorization model
• Cost-based query optimization (CBO) What? Why? How? When?
“Stinger Initiative”
Incremental cutover to cost-based optimization
Release (date): remarks
Apache Hive 0.12 (October 2013):
• Rule-based optimizations
• No join reordering
• Main optimizations: predicate push-down & partition pruning
• Semantic info and operator tree tightly coupled
Apache Hive 0.13 (April 2014):
• “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE …
• CBO just missed the deadline
HDP 2.1 (April 2014):
• Cost-based ordering of joins
• HIVE-6439 “Introduce CBO step in Semantic Analyzer”
• HIVE-5775 “Introduce Cost Based Optimizer in Hive”
Apache Hive 0.14 (?):
• CBO patches
• More rework of internals
• More cost-based features…
Apache Optiq
(incubating)
Apache Optiq
Apache incubator project since May, 2014
Query planning framework
• Extensible
• Usable standalone (JDBC) or embedded
Adoption
Lingual – SQL interface to Cascading
Apache Drill
Apache Hive
Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory data
Conventional DB architecture
Optiq architecture
Optiq – APIs and SPIs
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• MergeFilterRule
• PushAggregateThroughUnionRule
• RemoveCorrelationForScalarProjectRule
• 100+ more
Unification (materialized view)
Column trimming
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• TBD (bucketedness/distribution)
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Now… back to Hive
CBO in Hive
Why cost-based optimization?
Ease of Use – Join Reordering
View Chaining
Ad hoc queries involving multiple views
Enables BI Tools as front ends to Hive
First version
Modest goal
Concrete results
Join re-ordering
Query preparation – Hive 0.13
[Diagram: pipeline stages SQL parser → Semantic analyzer → Logical Optimizer → Physical Optimizer; labels: Hive SQL, Abstract Syntax Tree (AST), Annotated AST, Plan, Tuned Plan, Tez]
Query preparation – full CBO
[Diagram: pipeline stages SQL parser → Semantic analyzer → Translate to algebra → Optiq optimizer → Physical Optimizer; labels: Hive SQL, Abstract Syntax Tree (AST), Annotated AST, RelNode, Tuned Plan, Tez]
Query preparation – initial CBO
[Diagram: main pipeline SQL parser → Semantic analyzer → Logical Optimizer → Physical Optimizer, with a side path Translate to algebra → Optiq optimizer; labels: Hive SQL, AST with optimized join-ordering, Tuned Plan, Tez]
Query Execution – The basics
SELECT R1.x
FROM R1
JOIN R2 ON R1.x = R2.x
JOIN R3 on R1.x = R3.x AND R2.x = R3.x
WHERE R1.z > 10;
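[Diagram: logical plan with projection (π) and selection (σ) above two joins (⋈) of R1, R2 and R3; physical operators: TS [R1], TS [R2], RS, RS, Shuffle Join, TS [R3], Map Join, Filter, FS]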
Query Optimization – Rule Based vs. Cost Based
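[Diagram: four alternative plans, each a projection (π) and selection (σ) above two joins (⋈) of R1, R2 and R3; leaf orders shown: (R1, R2, R3), (R1, R2, R3), (R1, R3, R2), (R2, R3, R1)]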
Introduction of CBO into Hive Planning
Series of gating factors to get a CBO plan:
1. CBO enabled? If no, use regular planning.
2. Generate plan without multi-way joins.
3. Can CBO handle the plan? If no, fall back to regular planning, as though CBO is disabled. Conditions:
- < 10 total Join Ops
- No Outer Joins
- No Windowing, Lateral Views, Script Op.
4. Pre-CBO Optimizer:
- Predicate Pushdown
- Partition Pruning
- Column Pruning
- Stats Annotation
5. Column stats available? If no, fall back to regular planning.
6. Optiq-based Planner: Hive Plan → Revised AST.
7. Regular planning route on the new AST, with CBO turned off.
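The gating factors above can be sketched as a single decision function. This is a minimal illustration, not Hive's code: `QueryInfo`, its fields, and `choose_planning_route` are hypothetical names, and the checks simply mirror the conditions listed on the slide.

```python
from dataclasses import dataclass

MAX_JOINS = 10  # the slide's "< 10 total Join Ops" gate


@dataclass
class QueryInfo:
    """Hypothetical summary of a query; not a Hive data structure."""
    join_count: int
    has_outer_join: bool = False
    has_windowing: bool = False
    has_lateral_view: bool = False
    has_script_op: bool = False
    column_stats_available: bool = True


def choose_planning_route(query: QueryInfo, cbo_enabled: bool) -> str:
    """Return "cbo" when every gate passes, otherwise "regular"."""
    if not cbo_enabled:
        return "regular"
    if query.join_count >= MAX_JOINS:
        return "regular"
    if (query.has_outer_join or query.has_windowing
            or query.has_lateral_view or query.has_script_op):
        return "regular"
    if not query.column_stats_available:
        return "regular"
    return "cbo"


if __name__ == "__main__":
    print(choose_planning_route(QueryInfo(join_count=4), cbo_enabled=True))
    print(choose_planning_route(QueryInfo(join_count=4, has_outer_join=True),
                                cbo_enabled=True))
```

Every failed gate leads to the same outcome, which matches the slide's single fallback path: the query is planned as though CBO were disabled.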
Optiq Planner Process
[Diagram: Hive Plan → RelNode Graph → Planner → Best RelNode Graph → AST Converter → Revised AST]
Converters:
• RelNode Converter: Hive Op → RelNode
• RexNode Converter: Hive Expr → RexNode
Logical Plan. Physical traits: Table Part./Buckets; RedSink Ops removed
1. Plan Graph
• Node for each node in Input Plan
• Each node is a Set of alternate Sub Plans
• Set further divided into Subsets: based on traits like sortedness
2. Rules
• Rule: specifies an Operator sub-graph to match and logic to generate an equivalent ‘better’ sub-graph.
• We only have Join Reordering Rules.
3. Cost Model
• RelNodes have Cost (& Cumulative Cost)
• We only use Cardinality for Cost.
4. Metadata Providers
- Used to plug in Schema, Cost Formulas: Selectivity, NDV calculations etc.
- We only added Selectivity and NDV formulas; Schema is only available at the Node level
Rule Match Queue
- Add Rule matches to Queue
- Apply Rule match transformations to Plan Graph
- Iterate for fixed iterations or until Cost doesn’t change.
- Match importance based on Cost of RelNode and height.
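The queue-driven search described above can be illustrated with a toy planner. This is a simplified stand-in, not Optiq's planner: it enumerates whole plans breadth-first rather than sharing equivalent sub-plans in sets, and it does not rank matches by importance. Join trees are nested 2-tuples of table names; the row counts, the single NDV constant, and all function names are assumptions made for the example.

```python
from collections import deque

ROWS = {"a": 1000, "b": 10, "c": 100}  # hypothetical table row counts
JOIN_KEY_NDV = 100                     # hypothetical NDV shared by all join keys


def cardinality(tree):
    """Estimated row count of a table (str) or a join (2-tuple)."""
    if isinstance(tree, str):
        return ROWS[tree]
    left, right = tree
    return cardinality(left) * cardinality(right) / JOIN_KEY_NDV


def cumulative_cost(tree):
    """Sum of the cardinalities of all joins; table scans cost nothing."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return cardinality(tree) + cumulative_cost(left) + cumulative_cost(right)


def rule_matches(tree):
    """Yield every tree reachable by applying one rule at one node."""
    if isinstance(tree, str):
        return
    left, right = tree
    yield (right, left)                  # swap join
    if not isinstance(left, str):        # push join through join
        a, b = left
        yield ((a, right), b)
    for new_left in rule_matches(left):  # rules applied below the root
        yield (new_left, right)
    for new_right in rule_matches(right):
        yield (left, new_right)


def optimize(plan, max_iterations=1000):
    """Explore rule matches from a queue and keep the cheapest plan seen."""
    best, seen, queue = plan, {plan}, deque([plan])
    for _ in range(max_iterations):
        if not queue:
            break
        current = queue.popleft()
        for candidate in rule_matches(current):
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
                if cumulative_cost(candidate) < cumulative_cost(best):
                    best = candidate
    return best


if __name__ == "__main__":
    start = (("a", "b"), "c")
    best = optimize(start)
    print(start, cumulative_cost(start))
    print(best, cumulative_cost(best))
```

With these sample numbers the search moves the join of the two smallest tables, b and c, to the bottom of the tree, which is the effect that cardinality-only costing is meant to produce.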
Join Reordering Rules
1. Swap Join Rule
a ⋈ b = b ⋈ a
2. Push Join Through Join Rule
(a ⋈ b) ⋈ c = (a ⋈ c) ⋈ b
but is really: [diagram: join tree over b, a, c] (Optiq schema is position based)
3. So: [diagram: two join trees over a, b, c, d that are not equivalent (≠)]
4. Pull Up Project above Join
[diagram: two equivalent join trees over a, b, c, d (=)]
Added bonus: join permutations across sub-query blocks
5. Merge Projects
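The first two rules can be sketched on join trees written as nested 2-tuples of table names. This is an illustration only: the function names are hypothetical, and join conditions are ignored, whereas a real rule has to check and rewrite them. The `leaves` helper shows why position-based schemas matter: the rewritten tree emits its inputs in a different order, so a Project is needed to restore the original column positions.

```python
def swap_join(tree):
    """Swap Join Rule: a JOIN b  ->  b JOIN a."""
    left, right = tree
    return (right, left)


def push_join_through_join(tree):
    """Push Join Through Join Rule: (a JOIN b) JOIN c  ->  (a JOIN c) JOIN b."""
    left, c = tree
    if isinstance(left, str):
        raise ValueError("left input must itself be a join")
    a, b = left
    return ((a, c), b)


def leaves(tree):
    """Inputs in output order, i.e. the order in which columns are laid out."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


if __name__ == "__main__":
    original = (("a", "b"), "c")
    rewritten = push_join_through_join(original)
    print(leaves(original))   # inputs in the original column order
    print(leaves(rewritten))  # same inputs, different positions
```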
Summary
Join re-ordering
Join cardinality is used for cost
All other operators are assumed to have tiny cost
Cardinality of filter, join, group-by is based on selectivity
Selectivity is computed based on number-of-distinct-values (NDV)
Table Stats and Column stats are required
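A minimal sketch of how selectivity and NDV can yield cardinality estimates. The formulas are the standard textbook ones, assuming uniformly distributed values, and are not necessarily the exact formulas Hive uses; the function names and sample numbers are illustrative.

```python
def equality_selectivity(ndv: int) -> float:
    """Fraction of rows kept by `col = constant`, assuming uniform values."""
    return 1.0 / max(ndv, 1)


def filter_cardinality(rows: float, selectivity: float) -> float:
    """Rows remaining after a filter with the given selectivity."""
    return rows * selectivity


def join_cardinality(left_rows: float, right_rows: float,
                     left_ndv: int, right_ndv: int) -> float:
    """Equi-join estimate: |L| * |R| / max(NDV(L.key), NDV(R.key))."""
    return left_rows * right_rows / max(left_ndv, right_ndv, 1)


def group_by_cardinality(rows: float, key_ndv: int) -> float:
    """A group-by cannot produce more groups than rows or distinct keys."""
    return min(rows, key_ndv)


if __name__ == "__main__":
    # Hypothetical statistics: a fact table joined to a small dimension.
    print(join_cardinality(1_000_000, 1_000, 1_000, 1_000))
    print(filter_cardinality(1_000, equality_selectivity(50)))
    print(group_by_cardinality(1_000_000, 1_000))
```

Because every estimate depends on row counts and NDVs, missing table or column statistics leave the cost model with nothing to work from, which is why they are required.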
Current limitations
Only supports: filter, inner join, group-by, project, order-by, limit
Not all UDFs
Does not attempt all join permutations (e.g. bushy trees; 10-way joins or more)
May not work well for Bucket, SMB & Skew Joins
TPC-DS Query 50
Joins the Store Sales and Store Returns fact tables.
Each of the fact tables is independently restricted by date.
Analysis at Store grain, so this dimension is also joined in.
As specified, the query starts by joining the 2 fact tables.
select
s_store_name , .. other store details
,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30 ) then 1 else 0 end) as `30 days`, …
from
store_sales ss,store_returns sr,store s ,date_dim d1 ,date_dim d2
where
d2.d_year = 2000 and d2.d_moy = 9
and ss.ss_ticket_number = sr.sr_ticket_number and ss.ss_item_sk = sr.sr_item_sk
and ss.ss_sold_date_sk = d1.d_date_sk and sr.sr_returned_date_sk = d2.d_date_sk
and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_store_sk = s.s_store_sk
group by store details
order by store details limit 100;
[Figure: Join Graph]
TPC-DS Query 50
[Figure panels: Specified Join Tree; Non CBO Plan; CBO Plan]
TPC-DS Query 50
          Run 1   Run 2
Non CBO   53.1    53.4
CBO       22.5    21.9
• 1 year test
• > 10 mins for Non CBO
• CBO time was about the same
• Fact tables partitioned by Day, bucketed by Item
• Bucketing off
• Bucketing should help CBO plan.
• SR table much smaller. Better chance of Bucket Join in place of Shuffle Join.
Join Ordering: Cost Estimate
['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']]: 515074768.659
['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']]: 448155.355
…
['store_returns', 'd2']: 9938.93
['store_sales', 'store_returns']: 156727295.634
['d1', 'store_sales']: 123675664.449
Facts restricted to 3 months
Orderings considered by Planner
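The run times reported for Query 50 can be turned into a speedup figure with a few lines of arithmetic. The numbers are copied from the slide, which does not state the unit; the function name is illustrative.

```python
from statistics import mean

# Run times copied from the slide (the slide does not state the unit).
NON_CBO_RUNS = [53.1, 53.4]
CBO_RUNS = [22.5, 21.9]


def speedup(baseline_runs, optimized_runs):
    """Ratio of mean baseline time to mean optimized time."""
    return mean(baseline_runs) / mean(optimized_runs)


if __name__ == "__main__":
    print(f"CBO speedup on Query 50: {speedup(NON_CBO_RUNS, CBO_RUNS):.2f}x")
```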
TPC-DS Query 17
Joins the Store Sales, Store Returns and Catalog Sales fact tables.
Each of the fact tables is independently restricted by time.
Analysis at Item and Store grain, so these dimensions are also joined in.
As specified, the query starts by joining the 3 fact tables.
select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,….
from store_sales ss, store_returns sr, catalog_sales cs,
date_dim d1, date_dim d2, date_dim d3, store s, item i
where d1.d_quarter_name = '2000Q1'
and d1.d_date_sk = ss.ss_sold_date_sk
and i.i_item_sk = ss.ss_item_sk and …
group by i_item_id, i_item_desc, s_state
order by i_item_id, i_item_desc, s_state
limit 100;
TPC-DS Query 17
[Figure panels: Specified Join Tree; Non CBO Plan; CBO Plan]
TPC-DS Query 17
          Run 1    Run 2
Non CBO   100.71   127.53
CBO       50.9     44.52
• 1 year test
• > 10 mins for Non CBO
• CBO time was about the same
• Fact tables partitioned by Day, bucketed by Item
• Bucketing off
• Bucketing should help CBO plan.
• SR table much smaller. Better chance of Bucket Join in place of Shuffle Join.
Join Ordering: Cost Estimate
['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']]: 3547898.061
…
['store_returns', 'd2']: 19224.71
['store_sales', 'store_returns']: 23057497.991
['d1', 'store_sales']: 26142.943
Facts restricted to 3 months
Orderings considered by Planner
Next?
Outer joins
Scale to larger numbers of joins
Support all expressions (UDFs)
Join algorithm selection
Sortedness & distribution as a trait
Trait propagation
Better cost model
More statistics
Move all pre-planning and logical planning to Optiq
Use Optiq costs/statistics to help physical planning
Constant reduction & tree pruning
Rewrite query to use materialized view
Thank you!
@julianhyde
http://hive.apache.org/
http://incubator.apache.org/projects/optiq.html

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 

Kürzlich hochgeladen

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Kürzlich hochgeladen (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Cost-based query optimization in Apache Hive

  • 1. Page1 © Hortonworks Inc. 2014 Cost-based query optimization in Apache Hive. Julian Hyde, June 4th, 2014
  • 2. About me: Julian Hyde, Architect at Hortonworks
      - Open source: Founder & lead, Apache Optiq (query optimization framework); Founder & lead, Pentaho Mondrian (analysis engine); Committer, Apache Drill; Contributor, Apache Hive; Contributor, Cascading Lingual (SQL interface to Cascading)
      - Past: SQLstream (streaming SQL); Broadbase (data warehouse); Oracle (SQL kernel development)
  • 3. (Thanks to John Pullokkaran and Harish Butani for presentation content and actually doing the work.)
  • 4. Apache Hive: the original “SQL on Hadoop”. Undergoing extensive renovation (“Stinger Initiative”): Tez execution engine; YARN execution environment; vectorized data representation; column-oriented data storage (ORC); ACID transactions; SQL standards compliance; SQL authorization model; cost-based query optimization (CBO). What? Why? How? When?
  • 5. Incremental cutover to cost-based optimization (Release | Date | Remarks)
      - Apache Hive 0.12 | October 2013 | Rule-based optimizations; no join reordering; main optimizations: predicate push-down & partition pruning; semantic info and operator tree tightly coupled
      - Apache Hive 0.13 | April 2014 | “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE …; CBO just missed the deadline
      - HDP 2.1 | April 2014 | Cost-based ordering of joins; HIVE-6439 “Introduce CBO step in Semantic Analyzer”; HIVE-5775 “Introduce Cost Based Optimizer in Hive”
      - Apache Hive 0.14 | ? | CBO patches; more rework of internals; more cost-based features…
  • 6. Apache Optiq (incubating)
  • 7. Apache Optiq: Apache incubator project since May, 2014. Query planning framework: extensible; usable standalone (JDBC) or embedded. Adoption: Lingual (SQL interface to Cascading), Apache Drill, Apache Hive. Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, in-memory data
  • 8. Conventional DB architecture
  • 9. Optiq architecture
  • 10. Optiq – APIs and SPIs
      - Cost, statistics: RelOptCost; RelOptCostFactory; RelMetadataProvider (RelMdColumnUniqueness, RelMdDistinctRowCount, RelMdSelectivity)
      - SQL parser: SqlNode; SqlParser; SqlValidator
      - Transformation rules: RelOptRule (MergeFilterRule, PushAggregateThroughUnionRule, RemoveCorrelationForScalarProjectRule, 100+ more); unification (materialized view); column trimming
      - Relational algebra: RelNode (operator: TableScan, Filter, Project, Union, Aggregate, …); RelDataType (type); RexNode (expression); RelTrait (physical property: RelConvention (calling-convention), RelCollation (sortedness), TBD (bucketedness/distribution))
      - JDBC driver
      - Metadata: Schema; Table; Function (TableFunction, TableMacro)
  • 11. Now… back to Hive
  • 12. CBO in Hive
      - Why cost-based optimization? Ease of use (join reordering); view chaining; ad hoc queries involving multiple views; enables BI tools as front ends to Hive
      - First version: modest goal; concrete results; join re-ordering
  • 13. Query preparation – Hive 0.13 [diagram labels: Hive SQL; SQL parser; Abstract Syntax Tree (AST); Semantic analyzer; Annotated AST; Logical Optimizer; Plan; Physical Optimizer; Tuned Plan; Tez]
  • 14. Query preparation – full CBO [diagram labels: Hive SQL; SQL parser; Abstract Syntax Tree (AST); Semantic analyzer; Annotated AST; Translate to algebra; RelNode; Optiq optimizer; Physical Optimizer; Tuned Plan; Tez]
  • 15. Query preparation – initial CBO [diagram labels: Hive SQL; SQL parser; Semantic analyzer; Translate to algebra; Optiq optimizer; AST with optimized join-ordering; Logical Optimizer; Physical Optimizer; Tuned Plan; Tez]
  • 16. Query Execution – The basics
        SELECT R1.x FROM R1 JOIN R2 ON R1.x = R2.x JOIN R3 ON R1.x = R3.x AND R2.x = R3.x WHERE R1.z > 10;
      [diagram: relational algebra tree π, σ, ⋈, ⋈ over R1, R2, R3; operator tree with TS [R1], TS [R2], RS, RS, Shuffle Join, TS [R3], Map Join, Filter, FS]
  • 17. Query Optimization – Rule Based vs. Cost Based [diagram: four π/σ/⋈/⋈ trees with the inputs ordered R1 R2 R3; R1 R2 R3; R1 R3 R2; R2 R3 R1]
  • 18. Introduction of CBO into Hive Planning [flowchart]
      - A series of gating factors must pass to get a CBO plan: CBO enabled? Can CBO handle the plan? Column stats available?
      - CBO can handle the plan only with: < 10 total join ops; no outer joins; no windowing, lateral views, script op
      - On any “No”: fall back to regular planning, as though CBO is disabled
      - CBO path: generate plan without multi-way joins; pre-CBO optimizer (predicate pushdown, partition pruning, column pruning, stats annotation); Optiq-based planner takes the Hive plan and produces a revised AST; regular planning route on the new AST with CBO turned off
  • 19. Optiq Planner Process
      - Flow: Hive Plan → RelNode Converter (Hive Op → RelNode) and RexNode Converter (Hive Expr → RexNode) → RelNode Graph → Planner → Best RelNode Graph → AST Converter → Revised AST
      - Logical plan: physical traits (table partitioning/buckets) and RedSink ops removed
      - 1. Plan Graph: a node for each node in the input plan; each node is a set of alternate sub-plans; each set is further divided into subsets based on traits like sortedness
      - 2. Rules: a rule specifies an operator sub-graph to match and logic to generate an equivalent ‘better’ sub-graph. We only have join reordering rules.
      - 3. Cost Model: RelNodes have cost (& cumulative cost). We only use cardinality for cost.
      - 4. Metadata Providers: used to plug in schema and cost formulas (selectivity, NDV calculations etc.). We only added selectivity and NDV formulas; schema is only available at the node level.
      - Rule Match Queue: add rule matches to the queue; apply rule match transformations to the plan graph; iterate for fixed iterations or until cost doesn’t change; match importance is based on cost of RelNode and height
  • 20. Join Reordering Rules [tree diagrams omitted]
      - 1. Swap Join Rule: a ⋈ b = b ⋈ a
      - 2. Push Join Through Join Rule: (a ⋈ b) ⋈ c = (a ⋈ c) ⋈ b
      - 3. But the Optiq schema is position based, so a b c d ≠ a c d b
      - 4. Pull Up Project above Join
      - 5. Merge Projects
      - Added bonus: join permutations across sub-query blocks
  • 21. Summary
      - Join re-ordering: join cardinality is used for cost; all other operators are assumed to have tiny cost; cardinality of filter, join, group-by is based on selectivity; selectivity is computed based on number-of-distinct-values (NDV); table stats and column stats are required
      - Current limitations: only supports filter, inner join, group-by, project, order-by, limit; not all UDFs; does not attempt all join permutations (e.g. bushy trees; 10-way joins or more); may not work well for Bucket, SMB & Skew Joins
  • 22. TPC-DS Query 50. Joins the Store Sales and Store Returns fact tables. Each fact table is independently restricted by date. Analysis is at Store grain, so this dimension is also joined in. As specified, the query starts by joining the 2 fact tables. [diagram: join graph]
        select s_store_name, .. other store details,
               sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30) then 1 else 0 end) as `30 days`, …
        from store_sales ss, store_returns sr, store s, date_dim d1, date_dim d2
        where d2.d_year = 2000 and d2.d_moy = 9
          and ss.ss_ticket_number = sr.sr_ticket_number
          and ss.ss_item_sk = sr.sr_item_sk
          and ss.ss_sold_date_sk = d1.d_date_sk
          and sr.sr_returned_date_sk = d2.d_date_sk
          and ss.ss_customer_sk = sr.sr_customer_sk
          and ss.ss_store_sk = s.s_store_sk
        group by store details
        order by store details
        limit 100;
  • 23. TPC-DS Query 50 [diagrams: specified join tree; non-CBO plan; CBO plan]
  • 24. TPC-DS Query 50
      - Facts restricted to 3 months (Run 1 | Run 2): Non CBO 53.1 | 53.4; CBO 22.5 | 21.9
      - 1 year test: > 10 mins for Non CBO; CBO time was about the same
      - Fact tables: partitioned by Day, bucketed by Item
      - Bucketing off: bucketing should help the CBO plan; SR table much smaller; better chance of Bucket Join in place of Shuffle Join
      - Orderings considered by planner (Join Ordering | Cost Estimate):
          ['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] | 515074768.659
          ['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] | 448155.355
          …
          ['store_returns', 'd2'] | 9938.93
          ['store_sales', 'store_returns'] | 156727295.634
          ['d1', 'store_sales'] | 123675664.449
  • 25. TPC-DS Query 17. Joins the Store Sales, Store Returns and Catalog Sales fact tables. Each fact table is independently restricted by time. Analysis is at Item and Store grain, so these dimensions are also joined in. As specified, the query starts by joining the 3 fact tables.
        select i_item_id, i_item_desc, s_state,
               count(ss_quantity) as store_sales_quantitycount, ….
        from store_sales ss, store_returns sr, catalog_sales cs,
             date_dim d1, date_dim d2, date_dim d3, store s, item i
        where d1.d_quarter_name = '2000Q1'
          and d1.d_date_sk = ss.ss_sold_date_sk
          and i.i_item_sk = ss.ss_item_sk
          and …
        group by i_item_id, i_item_desc, s_state
        order by i_item_id, i_item_desc, s_state
        limit 100;
  • 26. TPC-DS Query 17 [diagrams: specified join tree; non-CBO plan; CBO plan]
  • 27. TPC-DS Query 17
      - Facts restricted to 3 months (Run 1 | Run 2): Non CBO 100.71 | 127.53; CBO 50.9 | 44.52
      - 1 year test: > 10 mins for Non CBO; CBO time was about the same
      - Fact tables: partitioned by Day, bucketed by Item
      - Bucketing off: bucketing should help the CBO plan; SR table much smaller; better chance of Bucket Join in place of Shuffle Join
      - Orderings considered by planner (Join Ordering | Cost Estimate):
          ['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] | 3547898.061
          …
          ['store_returns', 'd2'] | 19224.71
          ['store_sales', 'store_returns'] | 23057497.991
          ['d1', 'store_sales'] | 26142.943
  • 28. Next? Outer joins; scale to larger numbers of joins; support all expressions (UDFs); join algorithm selection; sortedness & distribution as a trait; trait propagation; better cost model; more statistics; move all pre-planning and logical planning to Optiq; use Optiq costs/statistics to help physical planning; constant reduction & tree pruning; rewrite query to use materialized view
  • 29. Thank you! @julianhyde http://hive.apache.org/ http://incubator.apache.org/projects/optiq.html
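
Slide 21 says that join cardinality is the only cost and that selectivity is derived from the number of distinct values (NDV). The slides do not give the formulas, so the following is only a minimal sketch under stated assumptions: it uses the textbook equi-join selectivity 1 / max(NDV), sums the intermediate join cardinalities of a left-deep order, and tries every order. The row counts, NDVs and the names `ROWS`, `SELECTIVITY`, `join_selectivity`, `plan_cost` and `best_left_deep_order` are all hypothetical and are not Hive or Optiq APIs.

```python
from itertools import permutations

# Hypothetical statistics, not taken from the slides.
ROWS = {"store_sales": 1_000_000, "store_returns": 100_000, "d2": 30}


def join_selectivity(left_ndv, right_ndv):
    """Textbook equi-join selectivity (an assumption): 1 / max(NDV of the join keys)."""
    return 1.0 / max(left_ndv, right_ndv)


# Selectivity per joined pair; a pair with no entry is a cross product (1.0).
SELECTIVITY = {
    frozenset(("store_sales", "store_returns")): join_selectivity(1_000_000, 100_000),
    frozenset(("store_returns", "d2")): join_selectivity(365, 30),
}


def plan_cost(order, rows, selectivity):
    """Cost of a left-deep join order: the sum of intermediate join cardinalities."""
    joined = [order[0]]
    cardinality = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        cardinality *= rows[table]
        # Apply the selectivity of every join edge to an already-joined table.
        for other in joined:
            cardinality *= selectivity.get(frozenset((other, table)), 1.0)
        joined.append(table)
        cost += cardinality
    return cost


def best_left_deep_order(rows, selectivity):
    """Try every left-deep order and keep the cheapest one."""
    return min(permutations(rows), key=lambda o: plan_cost(o, rows, selectivity))


if __name__ == "__main__":
    best = best_left_deep_order(ROWS, SELECTIVITY)
    print(best, plan_cost(best, ROWS, SELECTIVITY))
```

With these made-up numbers the cheapest orders begin by joining `store_returns` with the filtered `d2`, which has the same shape as the lowest-cost pair in the slide 24 table. Exhaustive enumeration grows factorially with the number of tables, which is consistent with the slides' remark that the planner does not attempt all join permutations.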

Editor's notes

  1. Hive CBO didn’t quite make it into Apache Hive 0.13. This talk: What is CBO? Why are we putting it in Hive? How did we do it? When is it released? And what next?
  2. Step 0: converters translate the Hive operator graph into an Optiq representation, in which RelNodes and RexNodes take the place of Operators and ExprNodes. The conversion creates a ‘logical’ plan: RedSinks are dropped, and physical traits such as partitioning and bucketing are lost. The Plan Graph is the central data structure of the planner. There is a node for each node in the input plan. A node represents a set of equivalent sub-graphs (plans). Each set is further divided into subsets based on traits, which capture physical attributes such as sortedness and bucketing. Rules consist of a match graph template and an onMatch action. The action generates a ‘better’ equivalent plan, so rule match actions populate the Plan Graph sets. Metadata Providers supply all metadata to the planner: schema, but also cost formulas such as selectivity and NDV calculations. RelNodes have a cost, and the cost model encapsulates the cost calculations. The Rule Match Queue is a queue of rule matches. The planner runs until the queue is empty or for a fixed number of iterations. Applying a rule match adds to the Plan Graph and also adds new rule matches to the queue. Rule matches are ordered by importance, which is based on RelNode cost and the distance of the node from the root of the plan.
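
The queue-driven loop described in this note can be sketched as a toy search. This is not Optiq's planner: plans are nested tuples of table names, the only rules are the swap and push-join-through-join rewrites named on slide 20, the priority is simply plan cost (the note also mentions distance from the root, which is ignored here), and every name and statistic below (`ROWS`, `SEL`, `optimize`, `rewrites`, and so on) is a hypothetical stand-in chosen for illustration.

```python
import heapq
from itertools import count

# Hypothetical statistics; a pair missing from SEL is a cross product.
ROWS = {"a": 1000, "b": 10, "c": 100_000}
SEL = {frozenset(("a", "b")): 0.1, frozenset(("a", "c")): 0.001}


def tables(plan):
    """Leaf table names of a plan, left to right."""
    return [plan] if isinstance(plan, str) else tables(plan[0]) + tables(plan[1])


def cardinality(plan):
    """Estimated rows: product of table sizes and of all applicable selectivities."""
    names = tables(plan)
    rows = 1.0
    for name in names:
        rows *= ROWS[name]
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            rows *= SEL.get(frozenset((left, right)), 1.0)
    return rows


def cost(plan):
    """Cumulative cost: cardinality of every join in the tree; scans are free."""
    if isinstance(plan, str):
        return 0.0
    return cardinality(plan) + cost(plan[0]) + cost(plan[1])


def swap(plan):
    """Swap join: a JOIN b -> b JOIN a."""
    yield (plan[1], plan[0])


def push_through(plan):
    """Push join through join: (a JOIN b) JOIN c -> (a JOIN c) JOIN b."""
    if isinstance(plan[0], tuple):
        (a, b), c = plan
        yield ((a, c), b)


RULES = [swap, push_through]


def rewrites(plan):
    """Every plan reachable by applying one rule at one join node."""
    if isinstance(plan, str):
        return
    for rule in RULES:
        yield from rule(plan)
    left, right = plan
    for alt in rewrites(left):
        yield (alt, right)
    for alt in rewrites(right):
        yield (left, alt)


def optimize(plan, max_iterations=100):
    """Pop the cheapest pending plan, expand it, stop on empty queue or iteration cap."""
    seen = {plan}
    tie_breaker = count()  # keeps heap entries comparable when costs are equal
    queue = [(cost(plan), next(tie_breaker), plan)]
    best = plan
    for _ in range(max_iterations):
        if not queue:
            break
        _, _, current = heapq.heappop(queue)
        for alt in rewrites(current):
            if alt not in seen:
                seen.add(alt)
                heapq.heappush(queue, (cost(alt), next(tie_breaker), alt))
                if cost(alt) < cost(best):
                    best = alt
    return best


if __name__ == "__main__":
    start = (("a", "c"), "b")
    best = optimize(start)
    print(start, cost(start), best, cost(best))
```

The `seen` set plays the role of the set of equivalent plans for the root node: each rule application adds an alternative to it and enqueues new candidates, which mirrors the note's statement that applying a rule match both grows the Plan Graph and feeds the queue.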