Practical Steps to Improve Hive Query Performance
Sergey Kovalev
Software Engineer at Altoros
How Hive works
1. Use partitions whenever possible
Without partitions, all rows sit in a single file:
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013
Partitioned by uploadYear, each year gets its own directory:
/folder1/video_data/2012/file1
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
/folder1/video_data/2013/file1
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013
SELECT * FROM video WHERE uploadYear = '2013-04-08';
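CREATE TABLE video (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) PARTITIONED BY (uploadYear DATE)
STORED AS ORC;
-- A fully dynamic partition insert needs nonstrict mode.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- The partition column (uploadYear) must be the last column returned by the SELECT.
INSERT INTO TABLE video PARTITION (uploadYear) SELECT * FROM video_external;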
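To check that a query really benefits from the partitioning, one can list the partitions and look at which of them a query would read. This is a minimal sketch that assumes the video table above has already been loaded; the date literal is only an illustration.

```sql
-- List the partitions Hive has registered for the table.
SHOW PARTITIONS video;

-- EXPLAIN DEPENDENCY reports the input tables and partitions of a query.
-- With a filter on the partition column, only the matching partition
-- should appear in the output.
EXPLAIN DEPENDENCY
SELECT id, title
FROM video
WHERE uploadYear = '2013-04-08';
```

If every partition shows up in the output, the filter is probably not being applied to the partition column as intended.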
2. Use bucketing
CREATE TABLE video (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING
) CLUSTERED BY (channelId)
INTO 2 BUCKETS
STORED AS ORC;
CREATE TABLE channel (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) CLUSTERED BY (id)
INTO 2 BUCKETS
STORED AS ORC;
-- Both tables are bucketed on the join key, so matching buckets can be joined directly.
SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
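Bucketing only helps if the data is actually written into the buckets. The sketch below shows one way to load the two tables and to read a single bucket back. video_external appears earlier in the deck; channel_external is an assumed staging table with the same columns as channel.

```sql
-- Before Hive 2.0, inserts honor the bucket definition only when this is set.
-- From Hive 2.0 on, bucketing is always enforced and the property no longer exists.
SET hive.enforce.bucketing=true;

-- Hive hashes channelId to decide which of the 2 bucket files a row goes to.
INSERT OVERWRITE TABLE video
SELECT id, channelId, title, description
FROM video_external;

-- channel_external is a hypothetical staging table.
INSERT OVERWRITE TABLE channel
SELECT id, title, description, viewCount
FROM channel_external;

-- Read only the first of the two buckets, for example to sample the data.
SELECT id, title
FROM video TABLESAMPLE (BUCKET 1 OUT OF 2 ON channelId);
```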
Before bucketing, all video rows sit in one file:
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2012
4, title4, channelId4, description4, 2012
5, title5, channelId5, description5, 2013
6, title6, channelId6, description6, 2013
7, title7, channelId7, description7, 2013
8, title8, channelId8, description8, 2013
After bucketing by channelId into 2 buckets, the rows are split across two files:
/folder1/video_data/file1
2, title2, channelId2, description2, 2012
4, title4, channelId4, description4, 2012
6, title6, channelId6, description6, 2013
8, title8, channelId8, description8, 2013
/folder1/video_data/file2
1, title1, channelId1, description1, 2012
3, title3, channelId3, description3, 2012
5, title5, channelId5, description5, 2013
7, title7, channelId7, description7, 2013
Before bucketing, all channel rows sit in one file:
/folder1/channel_data/file1
id, title, description, viewCount
channelId1, title1, description1, viewCount1
channelId2, title2, description2, viewCount2
channelId3, title3, description3, viewCount3
channelId4, title4, description4, viewCount4
channelId5, title5, description5, viewCount5
channelId6, title6, description6, viewCount6
channelId7, title7, description7, viewCount7
channelId8, title8, description8, viewCount8
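After bucketing by id into 2 buckets, each channel lands in the bucket with the same number as its videos:
/folder1/channel_data/file1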
channelId2, title2, description2, viewCount2
channelId4, title4, description4, viewCount4
channelId6, title6, description6, viewCount6
channelId8, title8, description8, viewCount8
/folder1/channel_data/file2
channelId1, title1, description1, viewCount1
channelId3, title3, description3, viewCount3
channelId5, title5, description5, viewCount5
channelId7, title7, description7, viewCount7
3. Partitions + bucketing
CREATE TABLE video (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) PARTITIONED BY (uploadYear DATE)
CLUSTERED BY (channelId)
INTO 2 BUCKETS
STORED AS ORC;
Before partitioning and bucketing, all rows sit in one file:
/folder1/video_data/file1
id, title, channelId, viewCount, uploadYear
1, title1, channelId1, viewCount1, 2012
2, title2, channelId2, viewCount2, 2012
3, title3, channelId3, viewCount3, 2012
4, title4, channelId4, viewCount4, 2012
5, title5, channelId5, viewCount5, 2013
6, title6, channelId6, viewCount6, 2013
7, title7, channelId7, viewCount7, 2013
8, title8, channelId8, viewCount8, 2013
After partitioning by uploadYear and bucketing by channelId into 2 buckets, each year directory holds two bucket files:
/folder1/video_data/2012/file1
2, title2, channelId2, viewCount2, 2012
4, title4, channelId4, viewCount4, 2012
/folder1/video_data/2012/file2
1, title1, channelId1, viewCount1, 2012
3, title3, channelId3, viewCount3, 2012
/folder1/video_data/2013/file1
6, title6, channelId6, viewCount6, 2013
8, title8, channelId8, viewCount8, 2013
/folder1/video_data/2013/file2
5, title5, channelId5, viewCount5, 2013
7, title7, channelId7, viewCount7, 2013
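A layout like this could be produced and then queried roughly as follows. The sketch assumes the partitioned and bucketed video table and the bucketed channel table from the earlier slides, and a staging table video_external that has all the listed columns; the date literal is illustrative.

```sql
-- Allow Hive to create the uploadYear partitions from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column comes last in the SELECT list; within each
-- partition, rows are distributed into 2 bucket files by channelId.
INSERT OVERWRITE TABLE video PARTITION (uploadYear)
SELECT id, channelId, title, description, viewCount, uploadYear
FROM video_external;

-- The partition filter limits the scan to one directory, and the join
-- key matches the bucketing column of both tables.
SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE v.uploadYear = '2013-04-08'
  AND ch.viewCount > 1000;
```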
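4. Use join optimizations
Shuffle join (common join): both tables are shuffled by the join key and joined on the reduce side. It works for tables of any size but is the slowest option.
Map-side join: the smaller table is loaded into memory on every mapper, so the join needs no reduce phase.
Sort-merge-bucket (SMB) join: both tables are bucketed and sorted on the join key, so matching buckets are merged directly.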
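Which join Hive picks depends largely on session settings and on how the tables were created. The following is a sketch of the settings usually involved, not a tuned configuration. The table video_sorted is a hypothetical variant of the video table, added here because an SMB join needs the buckets to be sorted as well.

```sql
-- Map-side join: let Hive convert a join automatically when one side is small.
SET hive.auto.convert.join=true;
-- Size in bytes below which a table counts as small (25000000 is the usual default).
SET hive.mapjoin.smalltable.filesize=25000000;

-- SMB join: the tables must be bucketed AND sorted on the join key.
CREATE TABLE video_sorted (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING
) CLUSTERED BY (channelId) SORTED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

-- Ask the optimizer to use bucket map joins and sort-merge joins where possible.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;
```

The other side of the join would need a matching SORTED BY clause on its join key. When none of these conditions hold, Hive falls back to the shuffle join.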
5. Choose the right input format
[Figure: row-oriented data compared with a column store]
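The tables in this deck are all stored as ORC, a columnar format. Moving a text-format table to ORC can be done with a single statement, sketched below. video_orc is a hypothetical table name, video_external is assumed to be a text-format staging table, and SNAPPY is just one of the available codecs.

```sql
-- Copy the staging table into a columnar ORC table.
CREATE TABLE video_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM video_external;

-- A columnar format pays off most when a query touches only a few
-- columns, because the remaining columns are never read from disk.
SELECT title, viewCount
FROM video_orc
WHERE viewCount > 1000;
```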
6. Other optimizations
Avoid highly normalized table structures, which force extra joins at query time
Compress map/reduce output
For map output compression, execute:
SET mapred.compress.map.output=true;
For job output compression, execute:
SET mapred.output.compress=true;
Use parallel execution
SET hive.exec.parallel=true;
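On Hadoop 2 and later the mapred.* names above are deprecated aliases, and Hive has its own switches for compressing data between stages. The block below is a sketch of a session preamble using the newer names. The codec and the thread count are illustrative choices, not recommendations from the talk.

```sql
-- Compress data passed between the stages of a multi-stage Hive query.
SET hive.exec.compress.intermediate=true;
-- Compress the final query output.
SET hive.exec.compress.output=true;

-- Hadoop 2 names for the two MapReduce properties mentioned above.
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress=true;

-- Run independent stages of one query at the same time.
SET hive.exec.parallel=true;
-- Upper bound on the number of stages running in parallel (8 is the usual default).
SET hive.exec.parallel.thread.number=8;
```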
7. Use the 'explain' keyword to improve the query execution plan
EXPLAIN query...
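As an illustration, here is EXPLAIN applied to the join query from the bucketing slide. It assumes the video and channel tables defined there.

```sql
-- Print the stages and operators Hive plans to run, without executing the query.
EXPLAIN
SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- EXPLAIN EXTENDED adds further detail, such as the input paths of each stage.
EXPLAIN EXTENDED
SELECT v.title
FROM video v JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```

In the output, a map-side join typically shows up as a Map Join Operator, while a shuffle join appears as a Join Operator in a reduce stage, so the plan is a quick way to confirm which join was chosen.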
8. Stinger Initiative
Use cost-based optimization
Use vectorization
Transactions with ACID semantics
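The three features above are switched on through session settings and table properties. What follows is a rough sketch under several assumptions: it uses the channel table from the bucketing slide, video_acid is a hypothetical table, and the transaction settings shown are only the client-side part (the metastore also needs compaction configured).

```sql
-- Cost-based optimization relies on table and column statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
ANALYZE TABLE channel COMPUTE STATISTICS;
ANALYZE TABLE channel COMPUTE STATISTICS FOR COLUMNS;

-- Vectorization processes rows in batches instead of one at a time.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- ACID transactions need a transaction manager and a transactional ORC table.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
CREATE TABLE video_acid (
  id STRING,
  title STRING
) CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level changes are then possible (the bucketing column itself cannot be updated).
UPDATE video_acid SET title = 'new title' WHERE id = '1';
```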
8. Hive on Tez
8. Sub-Second Queries with Hive LLAP
A new approach that uses a hybrid execution engine, combining Tez with a new component called LLAP (Live Long and Process)
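Both Tez and LLAP are selected per session. The sketch below assumes a cluster where Tez is installed and, for the last two settings, where LLAP daemons are already running (Hive 2.0 or later).

```sql
-- Run queries on Tez instead of classic MapReduce.
SET hive.execution.engine=tez;

-- Send query fragments to the long-running LLAP daemons
-- instead of starting new containers for every query.
SET hive.execution.mode=llap;
-- Which parts of a query may run in LLAP: none, map, all, auto or only.
SET hive.llap.execution.mode=all;
```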
Questions?
