Hive query optimization infinity


Shashwat Shriparv

How to optimize Hive queries for better performance and execution.


dwivedishashwat@gmail.com
http://helpmetocode.blogspot.com

- Well-designed tables
- Partitioning
- Bucketing
- Well-written queries

Together these can improve your query speed and reduce processing cost.

Optimization on Table side
Partitioning Hive tables:
- Partitioning is a kind of horizontal slicing of data. The slices can be defined by a range, a single value, or a set of values.
- Imagine log files where each record includes a timestamp. If we partition by date, records for the same date are stored in the same partition.
- Examples:
  - Partition on date.
  - Partition on geographic location.
  - Partition on number range.

Defining a table partition
- Let's take an Apache log file as an example, where the web server writes a log entry each time a client visits.
- These logs contain date and time information, along with details about the browser and location (IP).
- So we can create a table in Hive, partition the log data by date, and create a sub-partition by location, which looks like:
CREATE TABLE logs (ts BIGINT, detail STRING) PARTITIONED BY (dt STRING, country STRING);

Directory Structure

/user/hive/warehouse/logs/
    dt=2010-01-01/
        country=GB/
            file1
            file2
        country=US/
            file3
    dt=2010-01-02/
        country=GB/
            file4
        country=US/
            file5
            file6
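
As a rough sketch of how such a layout comes about, the HiveQL below creates a partitioned table, loads one file into a partition, and queries it. The table and column names (logs, ts, detail, dt, country) are chosen to follow the directory listing above, and the local path /tmp/file1 is a hypothetical placeholder.

```sql
-- Partition columns are declared separately from the regular columns.
CREATE TABLE logs (ts BIGINT, detail STRING)
PARTITIONED BY (dt STRING, country STRING);

-- Each load names its target partition explicitly; Hive places the file
-- under .../logs/dt=2010-01-01/country=GB/.
-- '/tmp/file1' is a hypothetical local path used only for illustration.
LOAD DATA LOCAL INPATH '/tmp/file1'
INTO TABLE logs
PARTITION (dt = '2010-01-01', country = 'GB');

-- List the partitions Hive knows about for this table.
SHOW PARTITIONS logs;

-- Filtering on partition columns lets Hive read only the matching directories.
SELECT ts, detail
FROM logs
WHERE dt = '2010-01-01' AND country = 'GB';
```

Because dt and country are encoded in the directory names rather than stored in the data files, a filter on them narrows the set of files scanned before any rows are read.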

Hive Buckets
Bucketing Hive tables:
- Bucketing a Hive table can result in more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries.
- It makes sampling more efficient.
- If two tables are bucketed in the same way on the join column, a mapper processing a bucket of the left table knows that the matching rows in the right table are in the corresponding bucket, so it need only retrieve that bucket.
- Buckets may additionally be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge sort.
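
A minimal sketch of a bucketed, sorted table and a bucket-based sample might look like the following. The table names users and bucketed_users, their columns, and the bucket count of 4 are assumptions made for illustration only.

```sql
-- Hypothetical table: rows are hashed on id into 4 buckets,
-- and each bucket is kept sorted by id.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

-- Assumption: on Hive versions before 2.0 this setting is needed so that
-- inserts honor the declared bucketing; later versions always enforce it.
SET hive.enforce.bucketing = true;

-- Populate the bucketed table from a hypothetical unbucketed table "users".
INSERT OVERWRITE TABLE bucketed_users
SELECT id, name FROM users;

-- Sample roughly one quarter of the rows by reading a single bucket.
SELECT * FROM bucketed_users
TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```

Sampling on the bucketing column lets Hive read one bucket file instead of scanning the whole table, which is where the sampling benefit mentioned above comes from.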

Parallel execution of queries
- Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically make use of this parallelism.
- Queries or sub queries that do not depend on one another, such as the inputs to some join queries, can be executed in parallel.
- The following setting turns this mode on:

SET hive.exec.parallel=true; -- turns parallel execution on

[Figure: flow of a main query. Sub query 1, sub query 2, and sub query 3 are the inputs. Step 4 joins sub queries 1 and 2; step 5 joins that result with sub query 3 to give the final result.]

Misc
- In the flow above, sub queries 1, 2, and 3 do not depend on one another and can run in parallel. Join 4 combines the results of 1 and 2, and join 5 combines that result with 3 to produce the final query result.
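
To make the idea concrete, here is a small sketch of a query whose two sub queries are independent of each other and so are candidates for parallel execution. The tables src1 and src2 and the column key are borrowed from the join example in these slides; the aggregation itself is a made-up illustration.

```sql
-- Allow independent stages of one query to run at the same time.
SET hive.exec.parallel = true;
-- Upper bound on how many stages may run concurrently (8 is the usual default).
SET hive.exec.parallel.thread.number = 8;

-- The two aggregations do not depend on each other,
-- so their jobs can be scheduled in parallel before the join runs.
SELECT a.key, a.cnt AS cnt1, b.cnt AS cnt2
FROM (SELECT key, count(*) AS cnt FROM src1 GROUP BY key) a
JOIN (SELECT key, count(*) AS cnt FROM src2 GROUP BY key) b
  ON a.key = b.key;
```

Whether this actually shortens the run depends on free cluster capacity, since parallel stages compete for the same map and reduce slots.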

Since a map join is faster than a common join, it is better to run a map join whenever possible. Previously, Hive users needed to give a hint in the query to specify the small table.
For example, if src1 (aliased x) is the small table:
select /*+ MAPJOIN(x) */ * from src1 x join src2 y on x.key = y.key;
Newer Hive versions can convert a common join to a map join automatically, controlled by the hive.auto.convert.join setting.
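
As a hedged sketch, the automatic conversion can be configured per session with the settings below. The threshold shown is the commonly documented default and is given here only as an example value to tune.

```sql
-- Let Hive decide at plan time whether a join can run as a map join.
SET hive.auto.convert.join = true;

-- A table counts as "small" if its size is below this many bytes.
-- 25000000 (about 25 MB) is used here as an illustrative value.
SET hive.mapjoin.smalltable.filesize = 25000000;

-- No hint is needed; if src1 or src2 is under the threshold,
-- Hive loads it into memory on each mapper and skips the reduce phase.
SELECT *
FROM src1 x
JOIN src2 y ON x.key = y.key;
```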

Some examples

Which query is faster?
- select count(distinct column) from table;
- select count(*) from (select distinct column from table) t;

Answer
[Figure: the map (M) and reduce (R) tasks run for each of the two queries, each ending in a result.]

The second query is faster.

- In the first case:
  - The maps send each value to the reducer.
  - A single reducer counts them all (overhead).
- In the second case:
  - The maps split the values across many reducers.
  - Each reducer generates a list of distinct values.
  - The final job counts the size of each list.
- Note: a singleton reducer is not always good.
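
The two forms can be written out as follows. The table page_views and column user_id are hypothetical names chosen for illustration; the sub query is given an alias because Hive expects one in the FROM clause.

```sql
-- Form 1: every distinct value is funneled to a single reducer,
-- which deduplicates and counts on its own.
SELECT count(DISTINCT user_id)
FROM page_views;

-- Form 2: the inner query spreads deduplication across many reducers;
-- the outer query then only has to count the already-distinct rows.
SELECT count(*)
FROM (
  SELECT DISTINCT user_id
  FROM page_views
) t;
```

The second form costs an extra job, so the gain shows up mainly when the column has many distinct values; on small inputs the single-job form may be just as quick.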

Tips
- Hive does not know whether a query is bad.
- So use EXPLAIN on queries you suspect are bad, and even on those you don't.
- EXPLAIN reports the following:
  - Number of jobs
  - Number of map and reduce phases
  - What each job is sorting by
  - Which directories it will read
- EXPLAIN therefore helps to see the difference between two or more queries written for the same purpose.
- Job configuration and history can also be studied to understand query performance.
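
For instance, the plans of two equivalent queries can be compared side by side as sketched below. The table page_views and column user_id are hypothetical names used only for illustration.

```sql
-- Show the stages (jobs) Hive plans for the single-query form.
EXPLAIN
SELECT count(DISTINCT user_id)
FROM page_views;

-- Show the plan for the sub query form and compare the stage count
-- and the operators in each map and reduce phase.
EXPLAIN
SELECT count(*)
FROM (SELECT DISTINCT user_id FROM page_views) t;

-- EXPLAIN EXTENDED adds more detail, such as the input paths to be read.
EXPLAIN EXTENDED
SELECT count(DISTINCT user_id)
FROM page_views;
```

Reading the STAGE DEPENDENCIES section first gives the number of jobs and their order; the STAGE PLANS section then shows what each map and reduce phase does.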
