SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Page1 © Hortonworks Inc. 2014
Cost-based query optimization in
Apache Hive
Julian Hyde Julian Hyde
June 4th, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
(Thanks to
John Pullokkaran,
Harish Butani
for presentation content
and actually doing the work.)
Page4 © Hortonworks Inc. 2014
Apache Hive
The original “SQL on Hadoop”
Undergoing extensive renovation
• Tez execution engine
• YARN execution environment
• Vectorized data representation
• Column-oriented data storage (ORC)
• ACID transactions
• SQL standards compliance
• SQL authorization model
• Cost-based query optimization (CBO) What? Why? How? When?
“Stinger
Initiative”
Page5 © Hortonworks Inc. 2014
Incremental cutover to cost-based optimization
Release Date Remarks
Apache Hive 0.12 October 2013 • Rule-based Optimizations
• No join reordering
• Main optimizations: predicate push-
down & partition pruning
• Semantic info and operator tree tightly
coupled
Apache Hive 0.13 April 2014 “Old-style” JOIN & push-down conditions:
… FROM t1, t2 WHERE …
CBO just missed the deadline 
HDP 2.1 April 2014 Cost-based ordering of joins
• HIVE-6439 “Introduce CBO step in
Semantic Analyzer”
• HIVE-5775 “Introduce Cost Based
Optimizer in Hive”
Apache Hive 0.14 ? CBO patches
More rework of internals
More cost-based features…
Page6 © Hortonworks Inc. 2014
Apache Optiq
(incubating)
Page7 © Hortonworks Inc. 2014
Apache Optiq
Apache incubator project since May, 2014
Query planning framework
• Extensible
• Usable standalone (JDBC) or embedded
Adoption
Lingual – SQL interface to Cascading
Apache Drill
Apache Hive
Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory
data
Page8 © Hortonworks Inc. 2014
Conventional DB architecture
Page9 © Hortonworks Inc. 2014
Optiq architecture
Page10 © Hortonworks Inc. 2014
Optiq – APIs and SPIs
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniquensss
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• MergeFilterRule
• PushAggregateThroughUni
onRule
• RemoveCorrelationForScal
arProjectRule
• 100+ more
Unification (materialized view)
Column trimming
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• TBD (bucketedness/distribution) JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Page11 © Hortonworks Inc. 2014
Now… back to Hive
Page12 © Hortonworks Inc. 2014
CBO in Hive
Why cost-based optimization?
Ease of Use – Join Reordering
View Chaining
Ad hoc queries involving multiple views
Enables BI Tools as front ends to Hive
First version
Modest goal
Concrete results
Join re-ordering
Page 12
Page13 © Hortonworks Inc. 2014
Query preparation – Hive 0.13
SQL
parser
Semantic
analyzer
Logical
Optimizer
Physical
Optimizer
Abstract
Syntax
Tree (AST)
Hive SQL
Annotated
AST
Plan
Tez
Tuned
Plan
Page14 © Hortonworks Inc. 2014
Query preparation – full CBO
SQL
parser
Semantic
analyzer
Translate
to algebra
Physical
Optimizer
Abstract
Syntax
Tree (AST)
Hive SQL
Tez
Tuned
Plan
Optiq
optimizer
RelNode
Annotated
AST
Page15 © Hortonworks Inc. 2014
Query preparation – initial CBO
SQL
parser
Semantic
analyzer
Logical
Optimizer
Physical
Optimizer
Hive SQL
AST with optimized
join-ordering
Tez
Tuned
Plan
Translate
to algebra
Optiq
optimizer
Page16 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Query Execution – The basics
Page 16
SELECT R1.x
FROM R1
JOIN R2 ON R1.x = R2.x
JOIN R3 on R1.x = R3.x AND R2.x = R3.x
WHERE R1.z > 10;
p
s


R1 R2
R3
TS [R1]
TS [R2]
RS
RS
Shuffle
Join
TS [R3]
Map
Join
Filter FS
Page17 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Query Optimization – Rule Based vs. Cost Based
Page 17
p
s


R1 R2
R3
p
s


R1
R2
R3
p
s


R1
R3
R2
p
s


R2
R3
R1
Page18 © Hortonworks Inc. 2014
Introduction of CBO into Hive Planning
cbo
enabled?
No
Generate Plan w/o
multi-way joins
Can
cbo handle
plan?
No
- Predicate Pushdown
- Part. Pruning
- Column Pruning
- Stats Annotation
Pre CBO Optimizer
Col stats
available?
No
Optiq-based
Planner
Hive Plan
Revised AST
Regular Planning route on
new AST with CBO
turned off.
Fallback to Regular
planning: as though cbo
is disabled.
- < 10 total Join
Ops
- No Outer Joins
- No Windowing,
Lateral Views,
Script Op.
Series of gating
factors to get a CBO
Plan.
Page19 © Hortonworks Inc. 2014
Optiq Planner Process
Hive
Plan
Planner
RelNode
GraphRelNode Converter
RexNode Converter
Hive Op  RelNode
Hive Expr  RexNode
• Node for each node in
Input Plan
• Each node is a Set of
alternate Sub Plans
• Set further divided into
Subsets: based on
traits like sortedness
1. Plan Graph
• Rule: specifies a Operator
sub-graph to match and
logic to generate equivalent
‘better’ sub-graph.
• We only have Join
Reordering Rules.
2. Rules
• RelNodes have Cost (&
Cumulative Cost)
• We only use Cardinality
for Cost.
3. Cost Model
- Used to Plugin Schema,
Cost Formulas:
Selectivity, NDV
calculations etc.
- We only added
Selectivity and NDV
formulas; Schema is
only available at the
Node level
4. Metadata Providers
Rule Match Queue
- Add Rule matches to Queue
- Apply Rule match
transformations to Plan Graph
- Iterate for fixed iterations or
until Cost doesn’t change.
- Match importance based on
Cost of RelNode and height.
Best
RelNode
Graph
AST Converter
Revised
AST
Logical Plan
Physical traits:
Table Part./Buckets;
RedSink Ops
removed
Page20 © Hortonworks Inc. 2014
Join Reordering Rules
a b
=
b a
1. Swap Join Rule
a b
=
2. Push Join Through Join Rule
c
a c
b
c b
a=
but is really:
Optiq
schema is
position
based
b
a c
3. So
a b
c
d
≠
a c
d
b
4. Pull Up Project above Join
b
a c
d
a c
b
d
=
Added bonus
Join permutations
across sub-query
blocks
5. Merge Projects
Page21 © Hortonworks Inc. 2014
Summary
Join re-ordering
Join cardinality is used for cost
All other operators are assumed to have tiny cost
Cardinality of filter, join, group-by is based on selectivity
Selectivity is computed based on number-of-distinct-values (NDV)
Table Stats and Column stats are required
Current limitations
Only supports: filter, inner join, group-by, project, order-by, limit
Not all UDFs
Does not attempt all join permutations (e.g. bushy trees; 10-way joins or more)
May not work well for Bucket, SMB & Skew Joins
Page 21
Page22 © Hortonworks Inc. 2014
TPC-DS Query 50
Joins Store Sales, and Store Returns fact tables.
Each of the fact tables are independently restricted by date.
Analysis at Store grain, so this dimension also joined in.
As specified Query starts by joining the 2 Fact tables.
select
s_store_name , .. other store details
,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30 ) then 1 else 0 end) as `30 days`, …
from
store_sales ss,store_returns sr,store s ,date_dim d1 ,date_dim d2
where
d2.d_year = 2000 and d2.d_moy = 9
and ss.ss_ticket_number = sr.sr_ticket_number and ss.ss_item_sk = sr.sr_item_sk
and ss.ss_sold_date_sk = d1.d_date_sk and sr.sr_returned_date_sk = d2.d_date_sk
and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_store_sk = s.s_store_sk
group by store details
order by store details limit 100;
Join Graph
Page23 © Hortonworks Inc. 2014
TPC-DS Query 50
Specified
Join Tree
Non CBO Plan
CBO Plan
Page24 © Hortonworks Inc. 2014
TPC-DS Query 50
Run 1 Run 2
Non CBO 53.1 53.4
CBO 22.5 21.9
 1 year test
 > 10 mins for Non CBO
 CBO time was about the same
 Fact tables
 partitioned by Day,
 bucketed by Item
 Bucketing off
 Bucketing should help CBO plan.
 SR table much smaller. Better chance of Bucket Join in place of Shuffle
Join.
Join Ordering Cost Estimate
['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] 515074768.659
['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] 448155.355
…
['store_returns', 'd2'] 9938.93
['store_sales', 'store_returns'] 156727295.634
['d1', 'store_sales'] 123675664.449
Facts restricted to 3 months
Orderings considered by Planner
Page25 © Hortonworks Inc. 2014
TPC-DS Query 17
Joins Store Sales, Store Returns and Catalog
Sales fact tables.
Each of the fact tables are independently
restricted by time.
Analysis at Item and Store grain, so these
dimensions are also joined in.
As specified Query starts by joining the 3 Fact
tables.
select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,….
from store_sales ss ,store_returns sr, catalog_sales cs,
date_dim d1, date_dim d2, date_dim d3, store s, item I
where d1.d_quarter_name = '2000Q1’
and d1.d_date_sk = ss.ss_sold_date_sk
and i.i_item_sk = ss.ss_item_sk and …
group by i_item_id ,i_item_desc, ,s_state
order by i_item_id ,i_item_desc, s_state
limit 100;
Page26 © Hortonworks Inc. 2014
TPC-DS Query 17
Specified
Join Tree
Non CBO Plan
CBO Plan
Page27 © Hortonworks Inc. 2014
TPC-DS Query 17
Run 1 Run 2
Non CBO 100.71 127.53
CBO 50.9 44.52
 1 year test
 > 10 mins for Non CBO
 CBO time was about the same
 Fact tables
 partitioned by Day,
 bucketed by Item
 Bucketing off
 Bucketing should help CBO plan.
 SR table much smaller. Better chance of Bucket Join in place of Shuffle
Join.
Join Ordering Cost Estimate
['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] 3547898.061
…
['store_returns', 'd2’] 19224.71
['store_sales', 'store_returns’] 23057497.991
['d1', 'store_sales'] 26142.943
Facts restricted to 3 months
Orderings considered by Planner
Page28 © Hortonworks Inc. 2014
Next?
Outer joins
Scale to larger numbers of joins
Support all expressions (UDFs)
Join algorithm selection
Sortedness & distribution as a trait
Trait propagation
Better cost model
More statistics
Move all pre-planning and logical planning to Optiq
Use Optiq costs/statistics to help physical planning
Constant reduction & tree pruning
Rewrite query to use materialized view
Page29 © Hortonworks Inc. 2014
Thank you!
@julianhyde
http://hive.apache.org/
http://incubator.apache.org/projects/optiq.html

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022StreamNative
 
Writing powerful stored procedures in PL/SQL
Writing powerful stored procedures in PL/SQLWriting powerful stored procedures in PL/SQL
Writing powerful stored procedures in PL/SQLMariaDB plc
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark JobsDataWorks Summit
 
PostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performancePostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performanceVladimir Sitnikov
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
Oracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance TuningOracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance TuningTanel Poder
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateMichael Rainey
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark InternalsKnoldus Inc.
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 

Was ist angesagt? (20)

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
 
Writing powerful stored procedures in PL/SQL
Writing powerful stored procedures in PL/SQLWriting powerful stored procedures in PL/SQL
Writing powerful stored procedures in PL/SQL
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
6.hive
6.hive6.hive
6.hive
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
 
PostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performancePostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performance
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Oracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance TuningOracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance Tuning
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Spark core
Spark coreSpark core
Spark core
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 

Ähnlich wie Cost-based Query Optimization in Hive

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Presentation v mware roi tco calculator
Presentation   v mware roi tco calculatorPresentation   v mware roi tco calculator
Presentation v mware roi tco calculatorsolarisyourep
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in sparkChester Chen
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with SupersetDataWorks Summit
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeDataWorks Summit
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Powerpivot web wordpress present
Powerpivot web wordpress presentPowerpivot web wordpress present
Powerpivot web wordpress presentMariAnne Woehrle
 

Ähnlich wie Cost-based Query Optimization in Hive (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Presentation v mware roi tco calculator
Presentation   v mware roi tco calculatorPresentation   v mware roi tco calculator
Presentation v mware roi tco calculator
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with Superset
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application code
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Powerpivot web wordpress present
Powerpivot web wordpress presentPowerpivot web wordpress present
Powerpivot web wordpress present
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 

Kürzlich hochgeladen (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Cost-based Query Optimization in Hive

  • 1. Page1 © Hortonworks Inc. 2014 Cost-based query optimization in Apache Hive Julian Hyde Julian Hyde June 4th, 2014
  • 2. Page2 © Hortonworks Inc. 2014 About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Optiq (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development)
  • 3. Page3 © Hortonworks Inc. 2014 (Thanks to John Pullokkaran, Harish Butani for presentation content and actually doing the work.)
  • 4. Page4 © Hortonworks Inc. 2014 Apache Hive The original “SQL on Hadoop” Undergoing extensive renovation • Tez execution engine • YARN execution environment • Vectorized data representation • Column-oriented data storage (ORC) • ACID transactions • SQL standards compliance • SQL authorization model • Cost-based query optimization (CBO) What? Why? How? When? “Stinger Initiative”
  • 5. Page5 © Hortonworks Inc. 2014 Incremental cutover to cost-based optimization Release Date Remarks Apache Hive 0.12 October 2013 • Rule-based Optimizations • No join reordering • Main optimizations: predicate push- down & partition pruning • Semantic info and operator tree tightly coupled Apache Hive 0.13 April 2014 “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE … CBO just missed the deadline  HDP 2.1 April 2014 Cost-based ordering of joins • HIVE-6439 “Introduce CBO step in Semantic Analyzer” • HIVE-5775 “Introduce Cost Based Optimizer in Hive” Apache Hive 0.14 ? CBO patches More rework of internals More cost-based features…
  • 6. Page6 © Hortonworks Inc. 2014 Apache Optiq (incubating)
  • 7. Page7 © Hortonworks Inc. 2014 Apache Optiq Apache incubator project since May, 2014 Query planning framework • Extensible • Usable standalone (JDBC) or embedded Adoption Lingual – SQL interface to Cascading Apache Drill Apache Hive Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory data
  • 8. Page8 © Hortonworks Inc. 2014 Conventional DB architecture
  • 9. Page9 © Hortonworks Inc. 2014 Optiq architecture
  • 10. Page10 © Hortonworks Inc. 2014 Optiq – APIs and SPIs Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • MergeFilterRule • PushAggregateThroughUni onRule • RemoveCorrelationForScal arProjectRule • 100+ more Unification (materialized view) Column trimming Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • TBD (bucketedness/distribution) JDBC driver Metadata Schema Table Function • TableFunction • TableMacro
  • 11. Page11 © Hortonworks Inc. 2014 Now… back to Hive
  • 12. Page12 © Hortonworks Inc. 2014 CBO in Hive Why cost-based optimization? Ease of Use – Join Reordering View Chaining Ad hoc queries involving multiple views Enables BI Tools as front ends to Hive First version Modest goal Concrete results Join re-ordering Page 12
  • 13. Page13 © Hortonworks Inc. 2014 Query preparation – Hive 0.13 SQL parser Semantic analyzer Logical Optimizer Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Annotated AST Plan Tez Tuned Plan
  • 14. Page14 © Hortonworks Inc. 2014 Query preparation – full CBO SQL parser Semantic analyzer Translate to algebra Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Tez Tuned Plan Optiq optimizer RelNode Annotated AST
  • 15. Page15 © Hortonworks Inc. 2014 Query preparation – initial CBO SQL parser Semantic analyzer Logical Optimizer Physical Optimizer Hive SQL AST with optimized join-ordering Tez Tuned Plan Translate to algebra Optiq optimizer
  • 16. Page16 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Query Execution – The basics Page 16 SELECT R1.x FROM R1 JOIN R2 ON R1.x = R2.x JOIN R3 on R1.x = R3.x AND R2.x = R3.x WHERE R1.z > 10; p s   R1 R2 R3 TS [R1] TS [R2] RS RS Shuffle Join TS [R3] Map Join Filter FS
  • 17. Page17 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Query Optimization – Rule Based vs. Cost Based Page 17 p s   R1 R2 R3 p s   R1 R2 R3 p s   R1 R3 R2 p s   R2 R3 R1
  • 18. Page18 © Hortonworks Inc. 2014 Introduction of CBO into Hive Planning cbo enabled? No Generate Plan w/o multi-way joins Can cbo handle plan? No - Predicate Pushdown - Part. Pruning - Column Pruning - Stats Annotation Pre CBO Optimizer Col stats available? No Optiq-based Planner Hive Plan Revised AST Regular Planning route on new AST with CBO turned off. Fallback to Regular planning: as though cbo is disabled. - < 10 total Join Ops - No Outer Joins - No Windowing, Lateral Views, Script Op. Series of gating factors to get a CBO Plan.
  • 19. Page19 © Hortonworks Inc. 2014 Optiq Planner Process Hive Plan Planner RelNode GraphRelNode Converter RexNode Converter Hive Op  RelNode Hive Expr  RexNode • Node for each node in Input Plan • Each node is a Set of alternate Sub Plans • Set further divided into Subsets: based on traits like sortedness 1. Plan Graph • Rule: specifies a Operator sub-graph to match and logic to generate equivalent ‘better’ sub-graph. • We only have Join Reordering Rules. 2. Rules • RelNodes have Cost (& Cumulative Cost) • We only use Cardinality for Cost. 3. Cost Model - Used to Plugin Schema, Cost Formulas: Selectivity, NDV calculations etc. - We only added Selectivity and NDV formulas; Schema is only available at the Node level 4. Metadata Providers Rule Match Queue - Add Rule matches to Queue - Apply Rule match transformations to Plan Graph - Iterate for fixed iterations or until Cost doesn’t change. - Match importance based on Cost of RelNode and height. Best RelNode Graph AST Converter Revised AST Logical Plan Physical traits: Table Part./Buckets; RedSink Ops removed
  • 20. Page20 © Hortonworks Inc. 2014 Join Reordering Rules a b = b a 1. Swap Join Rule a b = 2. Push Join Through Join Rule c a c b c b a= but is really: Optiq schema is position based b a c 3. So a b c d ≠ a c d b 4. Pull Up Project above Join b a c d a c b d = Added bonus Join permutations across sub-query blocks 5. Merge Projects
  • 21. Page21 © Hortonworks Inc. 2014 Summary Join re-ordering Join cardinality is used for cost All other operators are assumed to have tiny cost Cardinality of filter, join, group-by is based on selectivity Selectivity is computed based on number-of-distinct-values (NDV) Table Stats and Column stats are required Current limitations Only supports: filter, inner join, group-by, project, order-by, limit Not all UDFs Does not attempt all join permutations (e.g. bushy trees; 10-way joins or more) May not work well for Bucket, SMB & Skew Joins Page 21
  • 22. Page22 © Hortonworks Inc. 2014 TPC-DS Query 50 Joins Store Sales, and Store Returns fact tables. Each of the fact tables are independently restricted by date. Analysis at Store grain, so this dimension also joined in. As specified Query starts by joining the 2 Fact tables. select s_store_name , .. other store details ,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30 ) then 1 else 0 end) as `30 days`, … from store_sales ss,store_returns sr,store s ,date_dim d1 ,date_dim d2 where d2.d_year = 2000 and d2.d_moy = 9 and ss.ss_ticket_number = sr.sr_ticket_number and ss.ss_item_sk = sr.sr_item_sk and ss.ss_sold_date_sk = d1.d_date_sk and sr.sr_returned_date_sk = d2.d_date_sk and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_store_sk = s.s_store_sk group by store details order by store details limit 100; Join Graph
  • 23. Page23 © Hortonworks Inc. 2014 TPC-DS Query 50 Specified Join Tree Non CBO Plan CBO Plan
  • 24. Page24 © Hortonworks Inc. 2014 TPC-DS Query 50 Run 1 Run 2 Non CBO 53.1 53.4 CBO 22.5 21.9  1 year test  > 10 mins for Non CBO  CBO time was about the same  Fact tables  partitioned by Day,  bucketed by Item  Bucketing off  Bucketing should help CBO plan.  SR table much smaller. Better chance of Bucket Join in place of Shuffle Join. Join Ordering Cost Estimate ['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] 515074768.659 ['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] 448155.355 … ['store_returns', 'd2'] 9938.93 ['store_sales', 'store_returns'] 156727295.634 ['d1', 'store_sales'] 123675664.449 Facts restricted to 3 months Orderings considered by Planner
  • 25. Page25 © Hortonworks Inc. 2014 TPC-DS Query 17 Joins Store Sales, Store Returns and Catalog Sales fact tables. Each of the fact tables are independently restricted by time. Analysis at Item and Store grain, so these dimensions are also joined in. As specified Query starts by joining the 3 Fact tables. select i_item_id ,i_item_desc ,s_state ,count(ss_quantity) as store_sales_quantitycount ,…. from store_sales ss ,store_returns sr, catalog_sales cs, date_dim d1, date_dim d2, date_dim d3, store s, item I where d1.d_quarter_name = '2000Q1’ and d1.d_date_sk = ss.ss_sold_date_sk and i.i_item_sk = ss.ss_item_sk and … group by i_item_id ,i_item_desc, ,s_state order by i_item_id ,i_item_desc, s_state limit 100;
  • 26. Page26 © Hortonworks Inc. 2014 TPC-DS Query 17 Specified Join Tree Non CBO Plan CBO Plan
  • 27. Page27 © Hortonworks Inc. 2014 TPC-DS Query 17 Run 1 Run 2 Non CBO 100.71 127.53 CBO 50.9 44.52  1 year test  > 10 mins for Non CBO  CBO time was about the same  Fact tables  partitioned by Day,  bucketed by Item  Bucketing off  Bucketing should help CBO plan.  SR table much smaller. Better chance of Bucket Join in place of Shuffle Join. Join Ordering Cost Estimate ['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] 3547898.061 … ['store_returns', 'd2’] 19224.71 ['store_sales', 'store_returns’] 23057497.991 ['d1', 'store_sales'] 26142.943 Facts restricted to 3 months Orderings considered by Planner
  • 28. Page28 © Hortonworks Inc. 2014 Next? Outer joins Scale to larger numbers of joins Support all expressions (UDFs) Join algorithm selection Sortedness & distribution as a trait Trait propagation Better cost model More statistics Move all pre-planning and logical planning to Optiq Use Optiq costs/statistics to help physical planning Constant reduction & tree pruning Rewrite query to use materialized view
  • 29. Page29 © Hortonworks Inc. 2014 Thank you! @julianhyde http://hive.apache.org/ http://incubator.apache.org/projects/optiq.html

Hinweis der Redaktion

  1. Hive CBO didn’t quite make it into Apache Hive 0.13. This talk: What is CBO? Why are we putting it in Hive? How did we do it? When is it released? And what next?
  2. 0. Converters convert a Hive Op. Graph to an Optiq representation. In Optiq we have RelNodes and RexNodes in place of Operators and ExprNodes. The conversion creates a ‘Logical’ plan. RedSinks are dropped; Physical traits like Partitioning/Bucketness is lost. The Plan Graph is the central data structure of the Planner. There is a Node for each Node in the input Plan. A Node represents a Set of equivalent Sub Graphs(Plans). Each Set is further divided into Subsets based on traits: traits capture physical attributes like sortedness/bucketness Rules comprise of a Match Graph Template and an onMatch action. Action generates a ‘better’ equivalent Plan. So Rule match actions populates Plan Graph Sets. Metadata Providers provide all Metadata information to the Planner: Schema, but also Cost Formulas like Selectivity and NDV calculations. RelNodes have Cost. The Cost model encapsulates Cost calculations. Rule Match Queue is a Queue of Rule Matches. Planner runs until the Queue is empty for a fixed number of iterations. The application of a RuleMatch adds to the Plan Graph and also adds new Rule Matches to the Queue. RuleMatches are ordered based on importance: which is based on RelNode cost and distance of Node in Plan from Root.