Hive - III
Table Partition, HQL
Why is partitioning the table important?
 Data is split into multiple partitions based on the values
of columns such as date, city, department, etc.
 Partitioning increases the efficiency of querying a
table.
 For example, our previous table tb_1 contains ID,
name, location and year. If we want to retrieve only
the data for the year 2010, the query has to scan the
whole table for the rows belonging to 2010.
However, if we partition the table by year, each year's data is
stored separately on disk, and whenever the table is queried for
the year 2010 Hive reads only the partition for year 2010
and ignores the rest of the partitions. Hence it improves
the query processing time.
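To picture how this works on disk (a sketch: the warehouse path shown is the Hive default and may differ in your setup, and the year values are just examples), each value of the partition column becomes its own subdirectory, so if tb_1 were partitioned by year a filter on year maps directly to one directory:
/user/hive/warehouse/tb_1/year=2009/
/user/hive/warehouse/tb_1/year=2010/
-- select * from tb_1 where year = '2010'
-- reads only the year=2010 directory and ignores the rest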
Rupak Roy
Create a partitioned table
hive> create table empPartitioned
(ID int, name string, location string)
Partitioned by (year string)
Row format delimited
Fields terminated by '#'
Lines terminated by '\n'
Stored as textfile;
#note: the column that will be used for partitioning the table must
not be defined in the table definition; it appears only in the
Partitioned by clause.
#Load the partitioned data
hive> load data inpath '/home/hduser/dataset/htable2008' overwrite
into table empPartitioned Partition(year = '2008');
hive> load data inpath '/home/hduser/dataset/htable2005' overwrite
into table empPartitioned Partition(year = '2005');
Rupak Roy
hive> Select * from empPartitioned;
hive> Select * from empPartitioned
where year = '2005';
hive> show partitions empPartitioned;
The second query will read only the partition with year
2005; all other partitions will be ignored.
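For reference, after the two loads from the previous slide the show partitions command would return something like this (a sketch; the listing simply reflects whichever partitions have been loaded):
hive> show partitions empPartitioned;
year=2005
year=2008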
Rupak Roy
Partitioned External Table
 We can also take advantage of external
tables for partitioned tables, and we don't
need to specify the 'Location' clause as we did for
plain external tables.
hive> create external table empPartitioned
(ID int, name string, location string)
Partitioned by (year string)
Row format delimited
Fields terminated by '#'
Lines terminated by '\n'
Stored as textfile;
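One detail worth keeping in mind (a sketch, not shown on the slide): for an external partitioned table the partitions still have to be registered before they can be queried, either with load data ... partition as earlier or by adding them explicitly. The directory path below is only a hypothetical example.
hive> alter table empPartitioned add partition (year = '2008');
-- or point a partition at an existing directory:
hive> alter table empPartitioned add partition (year = '2005')
location '/home/hduser/dataset/year=2005';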
Rupak Roy
Hive Query Language (HQL)
 HQL closely follows SQL, i.e. Structured Query Language, and can be used to query most
tables.
Example 1:
Select upper(name), TotalSales/100 as Average
From transactionaldata;
This will give us two columns: the name in capital letters and the computed
Average.
Example 2:
Select name, sellingprice - costprice as Profit
From transactionaldata
Where year = 2010
And sellingprice > 100;
#this will give us the profit for rows where the selling price is more than $100 for
the year 2010
Rupak Roy
We can also use the CAST() function to
convert one data type to another.
Example 3:
Select name, sellingprice, CAST(year as int)
from transactionaldata;
Example 4: select CONCAT(name, id), location
from transactionaldata
where date = 2005;
We can also perform all the standard SQL operations like inner
joins and outer joins in Hive.
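For instance, an inner join in HiveQL is written just like standard SQL. A minimal sketch, assuming a second table named departments and a dept_id column in both tables (neither is defined in these slides):
hive> select t.name, d.deptname
from transactionaldata t
join departments d
on (t.dept_id = d.dept_id);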
Rupak Roy
Hive in RC File
 We can save Hive data in different formats. We are
already familiar with the text format (stored as
textfile), as well as JSON, CSV, XML and so on. The text
format is convenient when it comes to sharing data with
other applications, but it is not very efficient in terms of
storage.
 The sequence file is another format that stores
data more compactly by using binary key-value pairs, but
the drawback is that it saves a complete row as a single
binary value. So whenever we query a single column,
Hive has to read the full row even though only one
column is requested.
 Let's understand this with the help of an example.
Rupak Roy
Create a table stored as a sequence file
Create table emp
(ID int, name string, location string)
Row format delimited
Fields terminated by '#'
Lines terminated by '\n'
Stored as SEQUENCEFILE;
------------------------------------------
Describe formatted emp;
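In the describe formatted output the storage format can be read off the InputFormat/OutputFormat lines; for a sequence-file table they should look roughly like this (an excerpt sketch, with the other fields omitted):
InputFormat:  org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat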
Rupak Roy
Row Vs Column Storage
 Row Oriented Storage:
Row-oriented storage is efficient when retrieving all the
columns of a record. For example, to fetch 2 complete records
from a table with 50 columns and many rows, it only has to scan
those 2 rows. But when it comes to reading only a few columns,
it still has to read each row in full. It best suits whole-row access.
ID Name Location Year
11 Bob IN 2005
22 Fara SG 2005
Rupak Roy
Row Vs Column Storage
 Column-Oriented Storage: the opposite of
row-oriented storage; it is best suited when it
comes to reading only a few columns.
ID Name Location Year
11 Bob IN 2005
22 Fara SG 2005
33 Niki JP 2005
44 Steve NZ 2005
Rupak Roy
Record Columnar File
 To address this limitation of row-oriented storage, the
RC (Record Columnar) file format was created.
 Like Hive itself, the RC file format was
developed by Facebook.
 An RC file stores data on disk in a record-columnar
way: rows are split horizontally into row groups, and within
each row group the values are stored column by column.
Row Group 1:
ID Name Location Year
11 Bob IN 2005
22 Fara SG 2005
33 Niki JP 2005
Row Group 2:
ID Name Location Year
44 Steve NZ 2005
55 Nina RU 2009
66 Ryan IN 2005
Rupak Roy
Create table empRC
(ID int, name string, location string)
Stored as RCFile;
----------------
Describe formatted empRC;
-----------------
Load the data into the table
Insert overwrite table empRC select * from emp;
-------------------
Now query the tables empRC and emp to observe
the difference in the time taken to process the request.
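For example, running the same single-column query against both tables makes the difference visible (a sketch: timings depend on the data size and cluster, and 'IN' is just an example value):
hive> select name from emp where location = 'IN';
hive> select name from empRC where location = 'IN';
-- the RCFile table can skip the unneeded columns within each row group,
-- while the sequence-file table has to read every row in full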
Rupak Roy
Next
 Apache HBase, a column-oriented, non-relational,
distributed database management system.
Rupak Roy
 Stay Tuned.
Rupak Roy