How to understand and analyze
Apache Hive query execution plan
for performance debugging
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pengcheng Xiong and Ashutosh Chauhan
Hortonworks Inc., Apache Hive
Community
{pxiong,ashutosh}@hortonworks.com
Goals:
• WHY the old-style Hive explain plan is hard to read
  • Compare the old-style explain with Postgres over a body of 500+ realistic SQL queries
• WHAT the new-style Hive explain plan is
  • Show the orchestration of tasks and operator trees, join sequences and algorithms, and operator execution costs
• HOW to performance-debug a real query by analyzing the new Hive explain plan
  • Identify potential improvements by changing the join sequence, join algorithm, etc.
  • Show the real improvement by running the query on a real cluster
• Integration/interaction with other systems/tools
• Future work
WHY old style Hive explain plan is hard to read
• M**** company’s schema and queries.
• Comparison of explain plans between Hive and Postgres for the 528
queries they can both execute.
• Hive: “explain”, Postgres: “explain verbose”
N = 528 queries                  Hive “old style”   Postgres
Average lines per explain plan   233.5              53.8
Average words per explain plan   1289               328.6
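To put the two averages side by side, the short sketch below divides the Hive figures by the Postgres figures. It uses only the numbers reported on this slide; the variable names are ours.

```python
# Averages reported on the slide for N = 528 queries.
hive_old = {"lines": 233.5, "words": 1289.0}
postgres = {"lines": 53.8, "words": 328.6}

# Ratio of old-style Hive to Postgres for each metric.
ratios = {metric: hive_old[metric] / postgres[metric] for metric in hive_old}

for metric, ratio in ratios.items():
    print(f"Hive old-style explain has {ratio:.1f}x the {metric} of Postgres")
```

On both metrics the old-style Hive plan comes out at roughly four times the size of the Postgres plan.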
The old-style Hive explain output is quite verbose. Is all of it necessary?
Consider this query:
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
High level plan comparison: Postgres
                                   QUERY PLAN
---------------------------------------------------------------------------------
Group  (cost=3.83..3.84 rows=2 width=18)
  Output: a11.pbtname
  Group Key: a11.pbtname
  ->  Sort  (cost=3.83..3.84 rows=2 width=18)
        Output: a11.pbtname
        Sort Key: a11.pbtname
        ->  Hash Join  (cost=2.62..3.83 rows=2 width=18)
              Output: a11.pbtname
              Hash Cond: (a11.quarter_id = a12.quarter_id)
              ->  Seq Scan on public.pmt_inventory a11  (cost=0.00..1.12 rows=12 width=22)
                    Output: a11.quarter_id, a11.pbtname
              ->  Hash  (cost=2.60..2.60 rows=2 width=4)
                    Output: a12.quarter_id
                    ->  Seq Scan on public.lu_month a12  (cost=0.00..2.60 rows=2 width=4)
                          Output: a12.quarter_id
                          Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
16 lines
High level plan comparison: Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 2 <- Map 1 (BROADCAST_EDGE)
Reducer 3 <- Map 2 (SIMPLE_EDGE)
DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reducer 3
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Too verbose, need a magnifier!
80+ lines
Map 2 Operator
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Data flows from top to bottom
Each operator has 0 or 1 child
Must scroll to another part of the plan to see what a referenced vertex is (for example, input vertex "Map 1" of the Map Join Operator)
Map 1 Operator
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
The actual table name (PMT_INVENTORY) is not mentioned anywhere, only the alias
How much of this information is really necessary for SQL users?
Back to Postgres
                                   QUERY PLAN
---------------------------------------------------------------------------------
Group  (cost=3.83..3.84 rows=2 width=18)
  Output: a11.pbtname
  Group Key: a11.pbtname
  ->  Sort  (cost=3.83..3.84 rows=2 width=18)
        Output: a11.pbtname
        Sort Key: a11.pbtname
        ->  Hash Join  (cost=2.62..3.83 rows=2 width=18)
              Output: a11.pbtname
              Hash Cond: (a11.quarter_id = a12.quarter_id)
              ->  Seq Scan on public.pmt_inventory a11  (cost=0.00..1.12 rows=12 width=22)
                    Output: a11.quarter_id, a11.pbtname
              ->  Hash  (cost=2.60..2.60 rows=2 width=4)
                    Output: a12.quarter_id
                    ->  Seq Scan on public.lu_month a12  (cost=0.00..2.60 rows=2 width=4)
                          Output: a12.quarter_id
                          Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Data flows bottom to top
Operators have multiple children when it makes sense
The join is done using a scan of pmt_inventory and a hash built from a scan of lu_month.
All this information is available without referring to a stage plan.
In other words, you don’t have to jump around in the plan.
Actual schema / table names visible
Cost mentioned once per operator
Cost monotonically increases as you go up
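The monotonic-cost observation can be checked mechanically. The sketch below transcribes the total cost (the upper bound of each `cost=startup..total` pair) from the Postgres plan into a small tree and verifies that no operator is cheaper than any of its inputs. The `PlanNode` class and `cost_is_monotonic` function are hypothetical helpers written for this illustration, not part of Postgres.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    name: str
    total_cost: float  # upper bound of Postgres' cost=startup..total
    children: list = field(default_factory=list)


# Tree transcribed from the Postgres plan in this deck.
plan = PlanNode("Group", 3.84, [
    PlanNode("Sort", 3.84, [
        PlanNode("Hash Join", 3.83, [
            PlanNode("Seq Scan pmt_inventory", 1.12),
            PlanNode("Hash", 2.60, [
                PlanNode("Seq Scan lu_month", 2.60),
            ]),
        ]),
    ]),
])


def cost_is_monotonic(node):
    """True if no operator is cheaper than any of its children (its inputs)."""
    return all(
        node.total_cost >= child.total_cost and cost_is_monotonic(child)
        for child in node.children
    )


print(cost_is_monotonic(plan))
```

Strictly speaking the cost is non-decreasing rather than increasing: Sort and Group both report 3.84.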
WHAT is new style Hive explain plan (HIVE-9780)
• Set hive.explain.user=true; (on by default) and run on Tez, LLAP, etc.
Stage-1
  Reducer 3
  File Output Operator [FS_14]
    Group By Operator [GBY_12] (rows=8 width=101)
      Output:["_col0"],keys:KEY._col0
    <-Map 2 [SIMPLE_EDGE]
      SHUFFLE [RS_11]
        PartitionCols:_col0
        Group By Operator [GBY_10] (rows=8 width=101)
          Output:["_col0"],keys:_col1
          Map Join Operator [MAPJOIN_19] (rows=33 width=101)
            Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
          <-Map 1 [BROADCAST_EDGE]
            BROADCAST [RS_6]
              PartitionCols:_col0
              Select Operator [SEL_2] (rows=12 width=105)
                Output:["_col0","_col1"]
                Filter Operator [FIL_17] (rows=12 width=105)
                  predicate:quarter_id is not null
                  TableScan [TS_0] (rows=12 width=105)
                    m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
          <-Select Operator [SEL_5] (rows=48 width=8)
              Output:["_col1"]
              Filter Operator [FIL_18] (rows=48 width=8)
                predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
                TableScan [TS_3] (rows=48 width=8)
                  m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Immediate Notes:
1. Much smaller
2. Can be read in order
Data flows bottom to top
Operators have multiple children when it makes sense
The join information is clear: pmt_inventory is broadcast to the task scanning lu_month, and a MapJoin is done there
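As a rough illustration of what a broadcast map join does, the sketch below builds a hash table on the small, broadcast side and streams the other side past it. It is a toy model, not Hive's implementation; the function name `map_join` and the sample rows are made up for this example.

```python
from collections import defaultdict


def map_join(small_rows, big_rows, key):
    """Inner hash join: build on the broadcast (small) side, probe with the streamed side."""
    # Build phase: each task receives the whole small table and hashes it by join key.
    hash_table = defaultdict(list)
    for row in small_rows:
        if row[key] is not None:  # mirrors the "quarter_id is not null" filter
            hash_table[row[key]].append(row)
    # Probe phase: stream the other side and look up matches; no shuffle of this side.
    for row in big_rows:
        for match in hash_table.get(row[key], []):
            yield {**match, **row}


# Toy rows, invented for illustration (not data from the deck).
pmt_inventory = [
    {"quarter_id": 20062, "pbtname": "A"},
    {"quarter_id": 20063, "pbtname": "B"},
    {"quarter_id": None, "pbtname": "C"},
]
lu_month = [
    {"quarter_id": 20062, "month_id": 200606},
    {"quarter_id": 20063, "month_id": 200607},
    {"quarter_id": 20064, "month_id": 200610},
]

# Apply the month filter from the query before joining.
filtered_months = [r for r in lu_month if r["month_id"] in (200607, 200606)]
joined = list(map_join(pmt_inventory, filtered_months, "quarter_id"))
print(sorted({row["pbtname"] for row in joined}))
```

The side being streamed never has to be repartitioned by the join key, which is why this join needs only a BROADCAST_EDGE from Map 1.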
Cost mentioned once per operator
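Because each operator carries its estimate once, in a fixed `(rows=... width=...)` form, the estimates are easy to pull out of the plan text. The sketch below is one way to do that with a regular expression; the pattern and the `operator_stats` helper are our own and are based only on the plan lines shown in this deck, so other Hive versions may format lines differently.

```python
import re

# Matches lines such as "Filter Operator [FIL_17] (rows=12 width=105)",
# optionally prefixed with "<-".
OPERATOR_RE = re.compile(
    r"^(?:<-)?(?P<name>[A-Za-z ]+?) \[(?P<id>[A-Z]+_\d+)\] "
    r"\(rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def operator_stats(plan_text):
    """Return {operator id: (name, estimated rows, row width)} for annotated operators."""
    stats = {}
    for line in plan_text.splitlines():
        match = OPERATOR_RE.match(line.strip())
        if match:
            stats[match["id"]] = (match["name"], int(match["rows"]), int(match["width"]))
    return stats


sample = """\
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
<-Map 1 [BROADCAST_EDGE]
Filter Operator [FIL_17] (rows=12 width=105)
TableScan [TS_0] (rows=12 width=105)
<-Select Operator [SEL_5] (rows=48 width=8)
"""

for op_id, (name, rows, width) in operator_stats(sample).items():
    print(f"{op_id}: {name}, rows={rows}, width={width}")
```

Vertex lines such as `<-Map 1 [BROADCAST_EDGE]` carry no estimate and are skipped.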
HOW to performance debug a real query (TPC-DS Q3, 1TB)
select
dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
from
date_dim dt, store_sales, item
where
dt.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and item.i_manufact_id = 436
and dt.d_moy = 12
group by dt.d_year , item.i_brand , item.i_brand_id
order by dt.d_year , sum_agg desc , brand_id
limit 10
Note: store_sales is partitioned by ss_sold_date_sk
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_15]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_14] (rows=1 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64
Select Operator [SEL_13] (rows=76515 width=128)
Output:["_col6","_col65","_col64","_col45"]
Filter Operator [FIL_23] (rows=76515 width=128)
predicate:((_col0 = _col53) and (_col32 = _col57))
Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128)
Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_34]
PartitionCols:i_item_sk
Filter Operator [FIL_33] (rows=434 width=111)
predicate:(i_item_sk is not null and (i_manufact_id = 436))
TableScan [TS_2] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_8]
PartitionCols:_col32
Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20)
Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_30]
PartitionCols:d_date_sk
Filter Operator [FIL_29] (rows=5619 width=12)
predicate:(d_date_sk is not null and (d_moy = 12))
TableScan [TS_0] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:ss_sold_date_sk
Filter Operator [FIL_31] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_1] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
The original plan runs in 163.33 s. Column pruning and predicate pushdown appear to be working fine.
However, the join sequence store_sales✖date_dim✖item is not good enough.
A better one is store_sales✖item✖date_dim, because the filter on item is far more selective:
Table      Cardinality   Cardinality after filter   Selectivity
date_dim   73K           5619                       7.6%
item       300K          434                        0.14%
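The selectivity column follows directly from the row counts on the TableScan and Filter operators. The sketch below recomputes it and orders the dimension tables from most to least selective; the "most selective dimension first" heuristic is a simplification used here for illustration, not a description of how Hive's CBO actually ranks joins.

```python
# Row counts taken from the TableScan and Filter operators in the plan.
tables = {
    "date_dim": {"rows": 73049, "rows_after_filter": 5619},
    "item": {"rows": 300000, "rows_after_filter": 434},
}


def selectivity(stats):
    """Fraction of rows that survive the filter."""
    return stats["rows_after_filter"] / stats["rows"]


# Join the most selective dimension first so the intermediate result shrinks early.
join_order = sorted(tables, key=lambda name: selectivity(tables[name]))

for name in join_order:
    # date_dim prints as 7.69%; the slide shows this value as 7.6%.
    print(f"{name}: {selectivity(tables[name]):.2%}")
print("store_sales x " + " x ".join(join_order))
```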
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_17]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_16] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_15] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112)
Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_38]
PartitionCols:_col0
Select Operator [SEL_37] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_36] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_12]
PartitionCols:_col2
Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112)
Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:_col0
Select Operator [SEL_31] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_30] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_35]
PartitionCols:_col0
Select Operator [SEL_34] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_33] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
With CBO on, the new plan runs in 143.97 s using the new join sequence store_sales✖item✖date_dim.
The input data size of one branch of each join is quite small, so a map join should be used rather than a merge join.
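The effect of the new join sequence shows up in the estimated output of the first join. The sketch below assumes the simplest foreign-key estimate, fact-table rows multiplied by the dimension filter's selectivity. That formula is our assumption, not something stated in the deck, but it lands within a couple of rows of the `rows=` values printed for the first merge join in the original plan (211562452) and in the reordered plan (3978894).

```python
STORE_SALES_ROWS = 2750387156  # TableScan on store_sales

# (rows after filter, total rows) for each dimension, from the plans.
dims = {"date_dim": (5619, 73049), "item": (434, 300000)}

# rows= reported by the first merge join when that dimension is joined first.
reported = {"date_dim": 211562452, "item": 3978894}


def first_join_estimate(dim):
    """Assumed estimate: fact rows scaled by the dimension filter's selectivity."""
    kept, total = dims[dim]
    return STORE_SALES_ROWS * kept / total


for dim in dims:
    est = first_join_estimate(dim)
    print(f"store_sales x {dim}: ~{est:,.0f} rows (plan reports {reported[dim]:,})")

reduction = reported["date_dim"] / reported["item"]
print(f"Joining item first shrinks the intermediate result about {reduction:.0f}x")
```

Fewer intermediate rows means less data to shuffle into the second join.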
SHUFFLE [RS_44]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_43] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_42] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_41] (rows=306061 width=112)
Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_37]
PartitionCols:_col0
Select Operator [SEL_36] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_35] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112)
Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_34]
PartitionCols:_col0
Select Operator [SEL_33] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_32] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_39] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_38] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
After increasing hive.auto.convert.join.noconditionaltask.size to 1,359,688,499, the plan uses map join operators. The new plan runs in 45.84 s.
store_sales is partitioned on ss_sold_date_sk, which is the join key with the date_dim table.
SHUFFLE [RS_55]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_54] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_53] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_52] (rows=306061 width=112)
Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:_col0
Select Operator [SEL_44] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_43] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809 width=12)
Output:["_col0"],keys:_col0
Select Operator [SEL_46] (rows=5619 width=12)
Output:["_col0"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112)
Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:_col0
Select Operator [SEL_41] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_40] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_49] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
With hive.tez.dynamic.partition.pruning=true, dynamic partitioning event operators appear in the plan (see HIVE-7826 for details). The new plan runs in 31.35 s.
At run time, the dynamic partitioning event operator sends the values needed for pruning to the application master, where splits are generated and tasks are submitted. Using these values, any unneeded partitions are stripped out dynamically while the query is running.
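The idea behind dynamic partition pruning can be sketched in a few lines: the dimension side produces the set of join-key values that survive its filter, and the fact side only reads partitions whose key is in that set. This is a toy model of the concept, not Hive or Tez code; the data and the function names `pruning_keys` and `prune` are invented for the example.

```python
# Toy data, invented for illustration: store_sales partitions keyed by
# ss_sold_date_sk, and a few date_dim rows.
store_sales_partitions = {
    1001: ["sale-a", "sale-b"],
    1002: ["sale-c"],
    1003: ["sale-d", "sale-e"],
}
date_dim = [
    {"d_date_sk": 1001, "d_moy": 11},
    {"d_date_sk": 1002, "d_moy": 12},
    {"d_date_sk": 1003, "d_moy": 12},
]


def pruning_keys(dim_rows):
    """Dimension side: distinct join keys that survive the d_moy = 12 filter.

    This plays the role of the values the event operator sends out.
    """
    return {row["d_date_sk"] for row in dim_rows if row["d_moy"] == 12}


def prune(partitions, keys):
    """Split-generation side: keep only partitions whose key can still join."""
    return {pk: rows for pk, rows in partitions.items() if pk in keys}


kept = prune(store_sales_partitions, pruning_keys(date_dim))
print(sorted(kept))
```

Partition 1001 is never read, because no date_dim row with d_moy = 12 points at it.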
Performance debugging summary (TPC-DS Q3, 1TB)
Chart: query execution time (s), on an axis from 0 to 180, for four configurations: Original, Join re-order, Join selection, Dynamic partition pruning. Continuous improvement.
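The gains can be quantified from the run times quoted on the preceding slides (163.33 s, 143.97 s, 45.84 s, 31.35 s). The sketch below computes the speedup of each step over the previous one and the overall speedup; only the timings come from the deck, the rest is ours.

```python
# Run times (seconds) quoted on the preceding slides.
timings = [
    ("Original", 163.33),
    ("Join re-order", 143.97),
    ("Join selection", 45.84),
    ("Dynamic partition pruning", 31.35),
]

# Speedup of each step relative to the step before it.
step_speedups = [prev[1] / cur[1] for prev, cur in zip(timings, timings[1:])]
overall_speedup = timings[0][1] / timings[-1][1]

for (step, seconds), speedup in zip(timings[1:], step_speedups):
    print(f"{step:<26} {seconds:>7.2f}s  {speedup:.2f}x faster than previous step")
print(f"Overall: {overall_speedup:.2f}x faster than the original plan")
```

The largest single gain comes from switching to map joins; the end result is a bit more than five times faster than the original plan.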
Integration with Apache Ambari
1. Type the query here
2. Click “explain”
3. explain plan will be shown
Runtime stats
• “set hive.tez.exec.print.summary=true;”
• Get more insights on query performance
• Can be used along with Tez vertex runtime stats
Summary
• Showed that the old-style Hive explain plan is hard to read
  • Verbose, with too much redundant information; hard to follow how data flows; operator cost is unclear
  • Compared with Postgres over a body of 500+ realistic SQL queries and identified candidate improvements
• Introduced the new-style Hive explain plan
  • Used a concrete example to help understand the explain: execution cost, join sequence and orchestration of the operator tree
• Used the new Hive explain plan to performance-debug TPC-DS Q3
  • Showed the improvement after join re-ordering, join selection, and dynamic partition pruning
• Integration/interaction with other systems/tools
Future work -- Some gaps remain after HIVE-9780
• Put the real schema, table and column names in the explain plan, e.g., no more _col0 etc.
  • This will help users understand the plan better
  • HIVE-8681: CBO: Column names are missing from join expression in Map join with CBO enabled
• Get an equivalent of “EXPLAIN ANALYZE”, such as operator-level runtime stats and warnings.
  • This will help users find the gap between estimated cost and real cost
  • HIVE-14362: Support explain analyze in Hive
SHUFFLE [RS_55]
PartitionCols:d_year, i_brand, i_brand_id
Group By Operator [GBY_54] (rows=9/10 width=116)
Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id
Select Operator [SEL_53] (rows=306061/324651 width=112)
Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112)
Conds:MAPJOIN_51.ss_sold_date_sk=RS_45.d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:d_date_sk
Select Operator [SEL_44] (rows=5619/6034 width=12)
Output:["d_date_sk","d_year"]
Filter Operator [FIL_43] (rows=5619/6034 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049/73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809/2324 width=12)
Output:["d_date_sk"],keys:d_date_sk
Select Operator [SEL_46] (rows=5619/6034 width=12)
Output:["d_date_sk"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112)
Conds:SEL_50.ss_item_sk=RS_42.i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price","ss_sold_date_sk","i_brand","i_brand_id"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:i_item_sk
Select Operator [SEL_41] (rows=434/453 width=111)
Output:["i_item_sk","i_brand_id","i_brand"]
Filter Operator [FIL_40] (rows=434/453 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000/300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11)
Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"]
Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156/2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
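The gap between estimated and actual row counts in a plan like the one above can also be extracted with a small script. The following is a minimal sketch in Python, not part of Hive. It assumes each operator line carries `rows=<estimated>/<actual>`, with the estimate first and the runtime count second. The function names `estimate_gaps` and `worst_misestimates` and the 10% tolerance are hypothetical choices made for illustration.

```python
import re

# Matches operator lines such as
#   "Group By Operator [GBY_47] (rows=2809/2324 width=12)"
# and captures the operator id, the first row count, and the second row count.
# Assumption: the first number is the estimate, the second the runtime count.
OPERATOR_ROWS = re.compile(
    r"\[(?P<op>[A-Z]+_\d+)\]\s*\(rows=(?P<est>\d+)/(?P<actual>\d+)\b"
)


def estimate_gaps(plan_text):
    """Return (operator id, estimated rows, actual rows, actual/estimated) tuples.

    Lines without an "estimated/actual" pair, e.g. "(rows=2809 width=12)",
    are skipped.
    """
    gaps = []
    for line in plan_text.splitlines():
        match = OPERATOR_ROWS.search(line)
        if match is None:
            continue
        est = int(match.group("est"))
        actual = int(match.group("actual"))
        ratio = actual / est if est else float("inf")
        gaps.append((match.group("op"), est, actual, ratio))
    return gaps


def worst_misestimates(plan_text, tolerance=0.10):
    """Keep operators whose actual/estimated ratio differs from 1 by more
    than `tolerance`, largest deviation first."""
    flagged = [g for g in estimate_gaps(plan_text) if abs(g[3] - 1.0) > tolerance]
    return sorted(flagged, key=lambda g: abs(g[3] - 1.0), reverse=True)


# A few operator lines copied from the plan above.
sample = """\
Group By Operator [GBY_54] (rows=9/10 width=116)
Select Operator [SEL_53] (rows=306061/324651 width=112)
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809/2324 width=12)
TableScan [TS_0] (rows=2750387156/2750387156 width=11)
"""

for op, est, actual, ratio in worst_misestimates(sample):
    print(f"{op}: estimated={est} actual={actual} actual/estimated={ratio:.2f}")
```

On the sample lines, only GBY_47 and GBY_54 exceed the 10% tolerance. The table scans and selects are estimated almost exactly, which suggests that in this plan the remaining estimation error sits in the group-by operators.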
Acknowledgement
• We thank the anonymous reviewers whose votes gave us this
opportunity to share our work.
• Some of the slides are borrowed or adapted from Carter
Shanklin’s and Rajesh Balamohan’s slides.
• We thank Gunther Hagleitner for all his support and input.
• We thank Sapin Amin for setting up the testing cluster.
Thank you! Questions?

How to understand and analyze Apache Hive query execution plan for performance debugging

  • 1. How to understand and analyze Apache Hive query execution plan for performance debugging © Hortonworks Inc. 2011 – 2015. All Rights Reserved Pengcheng Xiong and Ashutosh Chauhan Hortonworks Inc., Apache Hive Community {pxiong,ashutosh}@hortonworks.com
  • 2. Goals:
      • WHY the old-style Hive explain plan is hard to read
          • Compare the old-style explain with Postgres over a body of 500+ realistic SQL queries.
      • WHAT the new-style Hive explain plan is
          • Show orchestration of the tasks and operator trees, join sequences and algorithms, and operator execution costs.
      • HOW to performance-debug a real query by analyzing the new Hive explain plan
          • Identify potential improvements by changing the join sequence, join algorithm, etc.
          • Show the real improvement by running the query on a real cluster.
      • Integration/interaction with other systems/tools
      • Future work
  • 3. WHY old style Hive explain plan is hard to read
      • M**** company’s schema and queries.
      • Comparison of explain plans between Hive and Postgres for the 528 queries they can both execute.
      • Hive: “explain”, Postgres: “explain verbose”
      Averages per explain plan (N = 528 queries):
        Metric | Hive “old style” | Postgres
        Lines  | 233.5            | 53.8
        Words  | 1289             | 328.6
  • 4. We can see that the old-style Hive explain is quite verbose. Is that necessary? Consider this query:
      select a11.PBTNAME PBTNAME
      from PMT_INVENTORY a11
      join LU_MONTH a12
        on (a11.QUARTER_ID = a12.QUARTER_ID)
      where a12.MONTH_ID in (200607, 200606)
      group by a11.PBTNAME;
  • 5. High level plan comparison: Postgres
      QUERY PLAN
      ---------------------------------------------------------------------------------
      Group  (cost=3.83..3.84 rows=2 width=18)
        Output: a11.pbtname
        Group Key: a11.pbtname
        ->  Sort  (cost=3.83..3.84 rows=2 width=18)
              Output: a11.pbtname
              Sort Key: a11.pbtname
              ->  Hash Join  (cost=2.62..3.83 rows=2 width=18)
                    Output: a11.pbtname
                    Hash Cond: (a11.quarter_id = a12.quarter_id)
                    ->  Seq Scan on public.pmt_inventory a11  (cost=0.00..1.12 rows=12 width=22)
                          Output: a11.quarter_id, a11.pbtname
                    ->  Hash  (cost=2.60..2.60 rows=2 width=4)
                          Output: a12.quarter_id
                          ->  Seq Scan on public.lu_month a12  (cost=0.00..2.60 rows=2 width=4)
                                Output: a12.quarter_id
                                Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
      16 lines
  • 6. High level plan comparison: Hive
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Map 2 <- Map 1 (BROADCAST_EDGE)
              Reducer 3 <- Map 2 (SIMPLE_EDGE)
            DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946
            Vertices:
              Map 1
                  Map Operator Tree:
                      TableScan
                        alias: a11
                        filterExpr: quarter_id is not null (type: boolean)
                        Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: quarter_id is not null (type: boolean)
                          Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: quarter_id (type: int)
                            sort order: +
                            Map-reduce partition columns: quarter_id (type: int)
                            Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
                            value expressions: pbtname (type: string)
                  Execution mode: vectorized
              Map 2
                  Map Operator Tree:
                      TableScan
                        alias: a12
                        filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
                        Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
                          Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            keys:
                              0 quarter_id (type: int)
                              1 quarter_id (type: int)
                            outputColumnNames: _col1
                            input vertices:
                              0 Map 1
                            Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
                            HybridGraceHashJoin: true
                            Group By Operator
                              keys: _col1 (type: string)
                              mode: hash
                              outputColumnNames: _col0
                              Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: string)
                                sort order: +
                                Map-reduce partition columns: _col0 (type: string)
                                Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
                  Execution mode: vectorized
              Reducer 3
                  Reduce Operator Tree:
                    Group By Operator
                      keys: KEY._col0 (type: string)
                      mode: mergepartial
                      outputColumnNames: _col0
                      Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      Too verbose, need a magnifier! 80+ lines
  • 7. Map 2 Operator (the Map 2 vertex of the Hive plan on slide 6): data flows from top to bottom, and each operator has 0 or 1 child.
  • 8. Map 2 Operator (same excerpt as slide 7): must scroll to another part of the plan to see what this is.
  • 9. Map 1 Operator (the Map 1 vertex of the Hive plan on slide 6): the actual table name (PMT_INVENTORY) is not mentioned anywhere, only the alias.
  • 10. Map 1 Operator (same excerpt as slide 9): how much of this information is really necessary to SQL users?
  • 11. Back to Postgres (same plan as slide 5): data flows bottom to top.
  • 12. Back to Postgres (same plan as slide 5): operators have multiple children when it makes sense.
  • 13. Back to Postgres (same plan as slide 5): the join is done using a scan of pmt_inventory and a hash following a scan of lu_month. All this information is available without referring to a stage plan; in other words, you don’t have to jump around in the plan.
  • 14. Back to Postgres (same plan as slide 5): actual schema / table names are visible.
  • 15. Back to Postgres (same plan as slide 5): cost is mentioned once per operator, and cost monotonically increases as you go up.
  • 16. WHAT is new style Hive explain plan (HIVE-9780)
      • Set hive.explain.user=true; (by default). Use Tez, LLAP, etc.
      Stage-1
        Reducer 3
        File Output Operator [FS_14]
          Group By Operator [GBY_12] (rows=8 width=101)
            Output:["_col0"],keys:KEY._col0
          <-Map 2 [SIMPLE_EDGE]
            SHUFFLE [RS_11]
              PartitionCols:_col0
              Group By Operator [GBY_10] (rows=8 width=101)
                Output:["_col0"],keys:_col1
                Map Join Operator [MAPJOIN_19] (rows=33 width=101)
                  Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
                <-Map 1 [BROADCAST_EDGE]
                  BROADCAST [RS_6]
                    PartitionCols:_col0
                    Select Operator [SEL_2] (rows=12 width=105)
                      Output:["_col0","_col1"]
                      Filter Operator [FIL_17] (rows=12 width=105)
                        predicate:quarter_id is not null
                        TableScan [TS_0] (rows=12 width=105)
                          m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
                <-Select Operator [SEL_5] (rows=48 width=8)
                    Output:["_col1"]
                    Filter Operator [FIL_18] (rows=48 width=8)
                      predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
                      TableScan [TS_3] (rows=48 width=8)
                        m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
      Immediate notes:
        1. Much smaller
        2. Can be read in order
  • 17. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16): data flows bottom to top.
  • 18. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16): operators have multiple children when it makes sense.
  • 19. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16): the join’s information is clear: pmt_inventory is broadcast to lu_month and a MapJoin is done.
  • 20. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16): cost is mentioned once per operator.
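To reproduce the two plan styles compared above, the explain format can be switched per session. This is a minimal sketch, assuming a Hive release that includes HIVE-9780, Tez as the execution engine, and the example tables from slide 4. The hive.execution.engine line is an assumption about the setup and does not come from the slides.

```sql
-- Assumption: Hive with HIVE-9780 (hive.explain.user) running on Tez.
set hive.execution.engine=tez;

-- New-style, user-level plan: one compact operator tree.
set hive.explain.user=true;
explain
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
  on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;

-- Old-style plan: STAGE DEPENDENCIES and STAGE PLANS sections.
set hive.explain.user=false;
explain
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
  on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
```

Running both against the same query makes the difference in length and reading order easy to compare side by side.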
  • 21. HOW to performance debug a real query (TPC-DS Q3, 1TB)
      select dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
      from date_dim dt, store_sales, item
      where dt.d_date_sk = store_sales.ss_sold_date_sk
        and store_sales.ss_item_sk = item.i_item_sk
        and item.i_manufact_id = 436
        and dt.d_moy = 12
      group by dt.d_year, item.i_brand, item.i_brand_id
      order by dt.d_year, sum_agg desc, brand_id
      limit 10
      Note: store_sales is partitioned by ss_sold_date_sk.
  • 22. Original plan:
      <-Reducer 3 [SIMPLE_EDGE]
        SHUFFLE [RS_15]
          PartitionCols:_col0, _col1, _col2
          Group By Operator [GBY_14] (rows=1 width=116)
            Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64
            Select Operator [SEL_13] (rows=76515 width=128)
              Output:["_col6","_col65","_col64","_col45"]
              Filter Operator [FIL_23] (rows=76515 width=128)
                predicate:((_col0 = _col53) and (_col32 = _col57))
                Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128)
                  Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"]
                <-Map 7 [SIMPLE_EDGE] vectorized
                  SHUFFLE [RS_34]
                    PartitionCols:i_item_sk
                    Filter Operator [FIL_33] (rows=434 width=111)
                      predicate:(i_item_sk is not null and (i_manufact_id = 436))
                      TableScan [TS_2] (rows=300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
                <-Reducer 2 [SIMPLE_EDGE]
                  SHUFFLE [RS_8]
                    PartitionCols:_col32
                    Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20)
                      Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"]
                    <-Map 1 [SIMPLE_EDGE] vectorized
                      SHUFFLE [RS_30]
                        PartitionCols:d_date_sk
                        Filter Operator [FIL_29] (rows=5619 width=12)
                          predicate:(d_date_sk is not null and (d_moy = 12))
                          TableScan [TS_0] (rows=73049 width=12)
                            tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
                    <-Map 6 [SIMPLE_EDGE] vectorized
                      SHUFFLE [RS_32]
                        PartitionCols:ss_sold_date_sk
                        Filter Operator [FIL_31] (rows=2750387156 width=11)
                          predicate:ss_item_sk is not null
                          TableScan [TS_1] (rows=2750387156 width=11)
                            tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
      The original plan runs in 163.33s. Column pruning and predicate pushdown appear to be working fine. However, the join sequence store_sales✖date_dim✖item is not good enough. A better one is store_sales✖item✖date_dim, because the filter on item is far more selective than the filter on date_dim:
        Table    | Cardinality | Cardinality after filter | Selectivity
        date_dim | 73K         | 5619                     | 7.6%
        item     | 300K        | 434                      | 0.14%
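The cardinalities in the table above are optimizer estimates read from the plan. As a rough cross-check, assuming the same TPC-DS tables are queryable, the actual row counts behind each filter could be obtained with a sketch like the one below. The column aliases are illustrative, and the results depend on the data set.

```sql
-- Actual rows kept by the date_dim filter from TPC-DS Q3.
select count(*) as total_rows,
       sum(case when d_moy = 12 then 1 else 0 end) as rows_after_filter
from date_dim;

-- Actual rows kept by the item filter from TPC-DS Q3.
select count(*) as total_rows,
       sum(case when i_manufact_id = 436 then 1 else 0 end) as rows_after_filter
from item;
```

Dividing rows_after_filter by total_rows gives the selectivity. Joining the fact table with the more selective dimension first shrinks the intermediate result earlier.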
  • 23. Plan with CBO on:
      <-Reducer 3 [SIMPLE_EDGE]
        SHUFFLE [RS_17]
          PartitionCols:_col0, _col1, _col2
          Group By Operator [GBY_16] (rows=9 width=116)
            Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
            Select Operator [SEL_15] (rows=306061 width=112)
              Output:["_col8","_col4","_col5","_col1"]
              Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112)
                Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"]
              <-Map 7 [SIMPLE_EDGE] vectorized
                SHUFFLE [RS_38]
                  PartitionCols:_col0
                  Select Operator [SEL_37] (rows=5619 width=12)
                    Output:["_col0","_col1"]
                    Filter Operator [FIL_36] (rows=5619 width=12)
                      predicate:((d_moy = 12) and d_date_sk is not null)
                      TableScan [TS_6] (rows=73049 width=12)
                        tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
              <-Reducer 2 [SIMPLE_EDGE]
                SHUFFLE [RS_12]
                  PartitionCols:_col2
                  Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112)
                    Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"]
                  <-Map 1 [SIMPLE_EDGE] vectorized
                    SHUFFLE [RS_32]
                      PartitionCols:_col0
                      Select Operator [SEL_31] (rows=2750387156 width=11)
                        Output:["_col0","_col1","_col2"]
                        Filter Operator [FIL_30] (rows=2750387156 width=11)
                          predicate:ss_item_sk is not null
                          TableScan [TS_0] (rows=2750387156 width=11)
                            tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
                  <-Map 6 [SIMPLE_EDGE] vectorized
                    SHUFFLE [RS_35]
                      PartitionCols:_col0
                      Select Operator [SEL_34] (rows=434 width=111)
                        Output:["_col0","_col1","_col2"]
                        Filter Operator [FIL_33] (rows=434 width=111)
                          predicate:((i_manufact_id = 436) and i_item_sk is not null)
                          TableScan [TS_3] (rows=300000 width=111)
                            tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
      With CBO on, the new plan runs in 143.97s with the new join sequence store_sales✖item✖date_dim. The input data size of one branch of each join is pretty small, so the plan should use map join rather than merge join.
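The slide only says "CBO on". One way this could be done in a session is sketched below, assuming a Hive release with the cost-based optimizer available. Collecting column statistics first is an assumption about the preparation, consistent with the Col:COMPLETE markers in the plan, and is not a step stated in the slides.

```sql
-- Assumption: column statistics are gathered so the optimizer has
-- cardinality estimates for the dimension tables.
analyze table date_dim compute statistics for columns;
analyze table item compute statistics for columns;

-- Enable the cost-based optimizer and let it read column statistics.
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
```

After this, re-running explain on the query shows whether the join order has changed.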
  • 24. Plan with map joins:
      SHUFFLE [RS_44]
        PartitionCols:_col0, _col1, _col2
        Group By Operator [GBY_43] (rows=9 width=116)
          Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
          Select Operator [SEL_42] (rows=306061 width=112)
            Output:["_col8","_col4","_col5","_col1"]
            Map Join Operator [MAPJOIN_41] (rows=306061 width=112)
              Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
            <-Map 5 [BROADCAST_EDGE] vectorized
              BROADCAST [RS_37]
                PartitionCols:_col0
                Select Operator [SEL_36] (rows=5619 width=12)
                  Output:["_col0","_col1"]
                  Filter Operator [FIL_35] (rows=5619 width=12)
                    predicate:((d_moy = 12) and d_date_sk is not null)
                    TableScan [TS_6] (rows=73049 width=12)
                      tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
            <-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112)
                Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
              <-Map 4 [BROADCAST_EDGE] vectorized
                BROADCAST [RS_34]
                  PartitionCols:_col0
                  Select Operator [SEL_33] (rows=434 width=111)
                    Output:["_col0","_col1","_col2"]
                    Filter Operator [FIL_32] (rows=434 width=111)
                      predicate:((i_manufact_id = 436) and i_item_sk is not null)
                      TableScan [TS_3] (rows=300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
              <-Select Operator [SEL_39] (rows=2750387156 width=11)
                  Output:["_col0","_col1","_col2"]
                  Filter Operator [FIL_38] (rows=2750387156 width=11)
                    predicate:ss_item_sk is not null
                    TableScan [TS_0] (rows=2750387156 width=11)
                      tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
      After increasing hive.auto.convert.join.noconditionaltask.size to 1,359,688,499, the plan now uses map join operators. The new plan runs in 45.84s. Note that store_sales is partitioned on ss_sold_date_sk, which is its join key with the date_dim table.
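The threshold change described above can be applied per session. This is a minimal sketch using the value from the slide. The two hive.auto.convert.join switches are standard Hive settings that the slide does not list, so treat them as assumptions about the rest of the session configuration.

```sql
-- Assumption: automatic map-join conversion is enabled for the session.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;

-- Size threshold in bytes, using the value quoted on the slide.
set hive.auto.convert.join.noconditionaltask.size=1359688499;

-- Re-check the plan: the joins should now appear as Map Join Operators
-- fed by BROADCAST edges instead of Merge Join Operators.
explain
select dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
       sum(ss_ext_sales_price) sum_agg
from date_dim dt, store_sales, item
where dt.d_date_sk = store_sales.ss_sold_date_sk
  and store_sales.ss_item_sk = item.i_item_sk
  and item.i_manufact_id = 436
  and dt.d_moy = 12
group by dt.d_year, item.i_brand, item.i_brand_id
order by dt.d_year, sum_agg desc, brand_id
limit 10;
```

The threshold bounds the size of the small-side tables that may be loaded into memory, so it should stay below what the task containers can hold.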
  • 25. Plan with dynamic partition pruning:
      SHUFFLE [RS_55]
        PartitionCols:_col0, _col1, _col2
        Group By Operator [GBY_54] (rows=9 width=116)
          Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
          Select Operator [SEL_53] (rows=306061 width=112)
            Output:["_col8","_col4","_col5","_col1"]
            Map Join Operator [MAPJOIN_52] (rows=306061 width=112)
              Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
            <-Map 5 [BROADCAST_EDGE] vectorized
              BROADCAST [RS_45]
                PartitionCols:_col0
                Select Operator [SEL_44] (rows=5619 width=12)
                  Output:["_col0","_col1"]
                  Filter Operator [FIL_43] (rows=5619 width=12)
                    predicate:((d_moy = 12) and d_date_sk is not null)
                    TableScan [TS_6] (rows=73049 width=12)
                      tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
              Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
                Group By Operator [GBY_47] (rows=2809 width=12)
                  Output:["_col0"],keys:_col0
                  Select Operator [SEL_46] (rows=5619 width=12)
                    Output:["_col0"]
                     Please refer to the previous Select Operator [SEL_44]
            <-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112)
                Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
              <-Map 4 [BROADCAST_EDGE] vectorized
                BROADCAST [RS_42]
                  PartitionCols:_col0
                  Select Operator [SEL_41] (rows=434 width=111)
                    Output:["_col0","_col1","_col2"]
                    Filter Operator [FIL_40] (rows=434 width=111)
                      predicate:((i_manufact_id = 436) and i_item_sk is not null)
                      TableScan [TS_3] (rows=300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
              <-Select Operator [SEL_50] (rows=2750387156 width=11)
                  Output:["_col0","_col1","_col2"]
                  Filter Operator [FIL_49] (rows=2750387156 width=11)
                    predicate:ss_item_sk is not null
                    TableScan [TS_0] (rows=2750387156 width=11)
                      tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
      By setting hive.tez.dynamic.partition.pruning=true, we can see dynamic partitioning event operators in the plan. See more about this in HIVE-7826. The new plan runs in 31.35s. At run time, the dynamic partitioning event operator sends the values needed for pruning to the application master, where splits are generated and tasks are submitted. Using these values, unneeded partitions can be stripped out dynamically while the query is running.
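The setting named above is a session-level switch. This is a minimal sketch, assuming Hive on Tez with HIVE-7826 available. The simplified two-table query is an illustration only, not the query from the slides. It keeps just the join between store_sales and date_dim on the partition column.

```sql
-- Enable dynamic partition pruning for the session (Tez only).
set hive.tez.dynamic.partition.pruning=true;

-- Illustrative query: the filter on date_dim decides which
-- ss_sold_date_sk partitions of store_sales need to be read.
explain
select dt.d_year, sum(ss_ext_sales_price) sum_agg
from date_dim dt, store_sales
where dt.d_date_sk = store_sales.ss_sold_date_sk
  and dt.d_moy = 12
group by dt.d_year;
```

If pruning applies, the plan should contain a Dynamic Partitioning Event Operator under the date_dim branch, as in the plan on the slide.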
  • 26. Performance debugging summary (TPC-DS Q3, 1TB): continuous improvement. [Bar chart: query execution time (s), axis from 0 to 180, for four steps: Original, Join re-order, Join selection, Dynamic partition pruning. The run times reported on slides 22 to 25 are 163.33s, 143.97s, 45.84s, and 31.35s.]
  • 27. Integration with Apache Ambari: 1. Type the query. 2. Click “explain”. 3. The explain plan is shown.
  • 28. Runtime stats: use “set hive.tez.exec.print.summary=true;” to get more insights on query performance. It can be used along with Tez vertex runtime stats.
  • 29. Summary
      • Show that the old-style Hive explain plan is hard to read.
          • It is verbose with too much redundant information, it is hard to follow how data flows, and the cost of each operator is unclear.
          • Compare with Postgres over a body of 500+ realistic SQL queries and identify candidate points for improvement.
      • Introduce the new-style Hive explain plan.
          • Use a concrete example to help understand the explain: execution cost, join sequence, and orchestration of the operator tree.
      • Use the new Hive explain plan to performance-debug TPC-DS Q3.
          • Show the improvement after join re-ordering, join selection, and dynamic partition pruning.
      • Integration/interaction with other systems/tools
  • 30. Future work: some gaps remain after HIVE-9780
      • Put the real schema, table, and column names in the explain plan, e.g., no more _col0 etc.
          • This will help users to understand the plan better.
          • HIVE-8681: CBO: Column names are missing from join expression in Map join with CBO enabled
      • Get an equivalent of “EXPLAIN ANALYZE”, such as operator-level runtime stats and warnings.
          • This will help users to find the gap between estimated cost and real cost.
          • HIVE-14362: Support explain analyze in Hive
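For the second gap, a sketch of what usage could look like, assuming a Hive build in which HIVE-14362 has been implemented. The statement executes the query in order to collect actual row counts, so the small example query from slide 4 is used here.

```sql
-- Assumption: a Hive build that includes HIVE-14362 (EXPLAIN ANALYZE).
-- The query is actually run so that real row counts can be reported
-- next to the optimizer's estimates.
explain analyze
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
  on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
```

Operators whose actual row count is far from the estimate are the first places to look for stale statistics or a poor join choice.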
  • 31. SHUFFLE [RS_55] PartitionCols:d_year, i_brand, i_brand_id Group By Operator [GBY_54] (rows=9/10 width=116) Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id Select Operator [SEL_53] (rows=306061/324651 width=112) Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"] Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112) Conds:MAPJOIN_51.ss_sold_date_sk=RS_45.d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_45] PartitionCols:d_date_sk Select Operator [SEL_44] (rows=5619/6034 width=12) Output:["d_date_sk","d_year"] Filter Operator [FIL_43] (rows=5619/6034 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049/73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12) Group By Operator [GBY_47] (rows=2809/2324 width=12) Output:["d_date_sk"],keys:d_date_sk Select Operator [SEL_46] (rows=5619/6034 width=12) Output:["d_date_sk"] Please refer to the previous Select Operator [SEL_44] <-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112) Conds:SEL_50.ss_item_sk=RS_42.i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price","ss_sold_date_sk","i_brand","i_brand_id"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_42] PartitionCols:i_item_sk Select Operator [SEL_41] (rows=434/453 width=111) Output:["i_item_sk","i_brand_id","i_brand"] Filter Operator [FIL_40] (rows=434/453 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000/300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11) Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"] Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156/2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
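The plan above annotates each operator with two row counts, which reads as estimated/actual, the kind of output an “EXPLAIN ANALYZE” equivalent would produce. As a hedged sketch, assuming a Hive release that includes HIVE-14362, such a plan could be requested as follows. The query is an assumption: a two-table fragment built from the store_sales/item join and predicate shown in the plan, not a query from the slides.

```sql
-- Assumes a Hive release that includes HIVE-14362 (EXPLAIN ANALYZE).
USE tpcds_bin_partitioned_orc_1000;

-- Unlike plain EXPLAIN, this runs the query so that actual row counts
-- can be reported next to the planner's estimates for each operator.
-- Assumed query: the store_sales/item join from the plan, for illustration.
EXPLAIN ANALYZE
SELECT item.i_brand_id,
       item.i_brand,
       SUM(ss_ext_sales_price) AS sum_agg
FROM store_sales
JOIN item ON store_sales.ss_item_sk = item.i_item_sk
WHERE item.i_manufact_id = 436
GROUP BY item.i_brand_id, item.i_brand;
```

Operators where the two counts diverge widely are the first places to look when a join order or join algorithm seems wrong.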
  • 32. Acknowledgement • We thank the anonymous reviewers whose votes gave us this opportunity to share our work. • Some of the slides are borrowed or adapted from Carter Shanklin’s and Rajesh Balamohan’s slides. • We thank Gunther Hagleitner for all his support and input. • We thank Sapin Amin for setting up the testing cluster.
  • 33. Thank you! Questions?

Editor's notes

  1. Hive contributors have striven to improve Hive in terms of both performance and functionality. We assert that understanding and analyzing Apache Hive query execution plans is crucial for performance debugging. In this talk, we study why Apache Hive’s query execution plans are so difficult to analyze today. We identify a set of pain points reported by our Apache Hive performance engineers and development engineers as well as by real users/customers. We propose and show a new presentation data model that addresses these pain points. The three most critical parts of the presentation are (1) the estimated query execution cost, which is the planner's guess at how long it will take to run the query (measured in number of rows); (2) the orchestration of the operator tree across consecutive M/R (Tez) jobs; and (3) integration and extension support with other presentation tools, e.g., Apache Ambari.
  2. We can see that the old style Hive explain is quite verbose. Is that necessary?
  3. We pick a real query out of the 500 queries.
  4. 16 lines
  5. 80 lines
  6. Although the Reduce Output Operator is crucial, as it defines the boundary between a map task and a reduce task, we are wondering how much of the Reduce Output Operator's information is.... SQL users care more about relational operators than about MR operators.
  7. We can also see more details about this join.
  8. It seems that Postgres sets a good example for Hive to learn from.
  9. So this is the basic introduction to the new style Hive explain plan.
  10. This query involves 3 tables.
  11. splits
  12. For Query 87
  13. The next version of the Hive explain plan will look like this. We highlight in red the changes that we want to make.
  14. We thank the audience for their attention.
  15. I am ready to take any questions.