Sergey Kovalev: Solutions Architect, Big Data/High-performance Computation Expert at Altoros; Minsk
Talk: "Practical Steps to Improve Apache Hive Performance"
1. Use partitions whenever possible
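-- Partition columns are declared only in PARTITIONED BY, not in the column list
create table video (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
STORED AS ORC;
-- A fully dynamic partition insert needs nonstrict mode
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The partition column (uploadYear) must be the last column returned by the select
insert into table video PARTITION (uploadYear) select * from video_external;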
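The benefit of partitioning shows up when a query filters on the partition column, because Hive then reads only the matching partition directories. A minimal sketch, assuming the video table above has been loaded; the date value is purely illustrative and not taken from the talk:

```sql
-- List the partitions created by the dynamic-partition insert
SHOW PARTITIONS video;

-- Filtering on the partition column lets Hive skip every other partition.
-- The date below is a made-up example value.
SELECT title, viewCount
FROM video
WHERE uploadYear = DATE '2014-01-01';
```

A filter on a non-partition column such as viewCount would still scan every partition, so the partition column should match the most common filter.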
2. Use bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING
) CLUSTERED BY(channelId)
INTO 2 BUCKETS
STORED AS ORC;
create table channel (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) CLUSTERED BY(id)
INTO 2 BUCKETS
STORED AS ORC;
SELECT v.title FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
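Bucketing only pays off if the data is actually written into the declared buckets and the join is allowed to use them. The following is a sketch under assumptions, not the speaker's exact setup: it assumes video_external has a channelId column, and that the smaller table fits in memory so that a bucket map join applies.

```sql
-- Needed before Hive 2.0 so inserts honor the declared bucket count;
-- omit on Hive 2.0+, where bucketing is always enforced
SET hive.enforce.bucketing=true;

-- video_external appears earlier in the talk; its channelId column is assumed
INSERT INTO TABLE video
SELECT id, channelId, title, description FROM video_external;

-- Allow Hive to join matching buckets map-side instead of shuffling both tables
SET hive.optimize.bucketmapjoin=true;

SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```

Both tables are bucketed on the join key with the same bucket count, so each bucket of video only has to be matched against the corresponding bucket of channel.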
4. Use join optimization
Sort-merge-bucket (SMB) join:
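An SMB join requires both tables to be bucketed and sorted on the join key, with compatible bucket counts. A minimal sketch, assuming hypothetical sorted variants of the earlier tables (video_sorted and channel_sorted are illustrative names); on Hive versions before 2.0, hive.enforce.bucketing and hive.enforce.sorting would also need to be enabled when loading them:

```sql
-- Hypothetical sorted variants of the tables from the bucketing section
CREATE TABLE video_sorted (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING
) CLUSTERED BY (channelId) SORTED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

CREATE TABLE channel_sorted (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) CLUSTERED BY (id) SORTED BY (id) INTO 2 BUCKETS
STORED AS ORC;

-- Let Hive convert eligible joins into sort-merge-bucket joins
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT v.title
FROM video_sorted v JOIN channel_sorted ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```

Because matching buckets are already sorted, they can be merged in a single pass without loading either side into memory.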
5. Choose the right input format
Figure: row data vs. column store
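One common way to move from a row-oriented format to a columnar one is to copy a text table into ORC. This is a sketch only: video_text and video_orc are hypothetical names, and the Snappy codec is an assumed choice rather than a recommendation from the talk.

```sql
-- video_text is a hypothetical row-oriented (plain text) staging table
CREATE TABLE video_text (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Copy the rows into a columnar ORC table with compressed stripes
CREATE TABLE video_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM video_text;

-- Only the title and viewCount columns need to be read for this query
SELECT title FROM video_orc WHERE viewCount > 1000;
```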
6. Other optimizations
Avoid highly normalized table structures
Compress map/reduce output
For map output compression, execute SET mapred.compress.map.output=true;
For job output compression, execute SET mapred.output.compress=true;
Use parallel execution
SET hive.exec.parallel=true;
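The compression and parallelism settings above can also be expressed through Hive's own session properties. A hedged sketch; the Snappy codec and the thread count are assumptions chosen for illustration, not values from the talk:

```sql
-- Compress intermediate data passed between the stages of a query
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output of a query
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Run independent stages of one query concurrently
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;
```

Parallel execution only helps when a query plan contains stages that do not depend on each other, for example the branches of a UNION ALL.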
7. Use the 'explain' keyword to improve the query execution plan
EXPLAIN query...
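As an illustration, EXPLAIN can be applied to the join from the bucketing section to check which join strategy Hive picks. This is a sketch; the exact plan text depends on the Hive version and settings, so no output is shown.

```sql
-- Print the stage graph and operator tree without running the query
EXPLAIN
SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- EXTENDED adds details such as input paths and serialization info
EXPLAIN EXTENDED
SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```

Things worth checking in the plan are the number of stages and whether the join runs as a map join or as a shuffle (reduce-side) join.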
8. Stinger Initiative
Use cost-based optimization
Use vectorization
Transactions with ACID semantics
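The three features above are switched on through session settings and table properties. A minimal sketch, assuming Hive 0.14 or later and a metastore configured for transactions; video_acid and the updated id value are hypothetical:

```sql
-- Cost-based optimization relies on table and column statistics
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
ANALYZE TABLE channel COMPUTE STATISTICS;
ANALYZE TABLE channel COMPUTE STATISTICS FOR COLUMNS;

-- Vectorization processes rows in batches; it works with ORC tables
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- ACID tables must be bucketed, stored as ORC and marked transactional
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE video_acid (
  id STRING,
  title STRING,
  viewCount BIGINT
) CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level update; 'abc' is a made-up id
UPDATE video_acid SET viewCount = viewCount + 1 WHERE id = 'abc';
```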