Apache Hive
Prashant Gupta
HIVE
Hive
• Data warehousing package built on top of Hadoop.
• Used for data analysis on structured data.
• Targeted towards users comfortable with SQL.
• Provides an SQL-like query language called HiveQL.
• Abstracts the complexity of Hadoop.
• No Java is required.
• Developed by Facebook.
Features of Hive
How is it Different from SQL
•The major difference is that a Hive query executes on a Hadoop infrastructure rather than on a traditional database.
•This allows Hive to handle huge data sets - data sets so large that high-end, expensive, traditional databases would fail.
•The internal execution of a Hive query is via a series of automatically generated MapReduce jobs.
When not to use Hive
• Semi-structured or completely unstructured data.
• Hive is not designed for online transaction processing.
• It is best for batch jobs over large sets of data.
• Latency for Hive queries is generally very high (on the order of minutes), even when data sets are very small (say a few hundred megabytes).
• It cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data.
Install Hive
•To install Hive
• untar the .gz file using tar -xvzf hive-0.13.0-bin.tar.gz
•To initialize the environment variables, export the following:
• export HADOOP_HOME=/home/usr/hadoop-0.20.2
(Specifies the location of the Hadoop installation directory.)
• export HIVE_HOME=/home/usr/hive-0.13.0-bin
(Specifies the location of the Hive installation directory.)
• export PATH=$PATH:$HIVE_HOME/bin
Hive configurations
• Hive's default configuration is stored in the hive-default.xml file in the conf directory.
• Hive comes configured to use Derby as the metastore.
Hive Modes
To start the Hive shell, type hive and press Enter.
• Hive in Local mode
No HDFS is required; all files run on the local file system.
hive> SET mapred.job.tracker=local;
• Hive in MapReduce(hadoop) mode
hive> SET mapred.job.tracker=master:9001;
Introducing data types
• The primitive data types in Hive include
integers, Boolean, floating point,
Date, Timestamp and Strings.
• The table below lists the size of each data type:
Type Size
-------------------------
TINYINT 1 byte
SMALLINT 2 bytes
INT 4 bytes
BIGINT 8 bytes
FLOAT 4 bytes (single-precision floating point)
DOUBLE 8 bytes (double-precision floating point)
BOOLEAN TRUE/FALSE value
STRING Max size is 2GB
• Complex data types: ARRAY, MAP, STRUCT
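For illustration only, here is a minimal sketch of a table using the complex types; the table name, column names and delimiters are assumptions made for this example, not part of the original slides:
CREATE TABLE employee_complex (
  name STRING,
  skills ARRAY<STRING>,
  dept_codes MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
-- Accessing complex-typed columns:
SELECT name, skills[0], dept_codes['HR'], address.city FROM employee_complex;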
Configuring Hive
• Hive is configured using an XML configuration file called
hive-site.xml, located in Hive’s conf directory.
• Execution engines
 Hive was originally written to use MapReduce as its execution engine,
and that is still the default.
 We can also use Apache Tez as the execution engine, and work is
underway to support Spark as well. Both Tez and Spark are general
directed acyclic graph (DAG) engines that offer more flexibility and
higher performance than MapReduce.
 It’s easy to switch the execution engine on a per-query basis, so you
can see the effect of a different engine on a particular query.
 Set Hive to use Tez: hive> SET hive.execution.engine=tez;
 The execution engine is controlled by the hive.execution.engine
property, which defaults to “mr” (for MapReduce).
Hive Architecture
Components
• Thrift Client
It is possible to interact with Hive from any
programming language that can use the Thrift server, e.g.
Python
Ruby
• JDBC Driver
Hive provides a pure Java JDBC driver for Java applications
to connect to Hive, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver
• ODBC Driver
An ODBC driver allows applications that support the ODBC
protocol to connect to Hive.
Components
• Metastore
 This is the central repository for Hive metadata.
 By default, Hive is configured to use Derby as the metastore. As a result of the
configuration, a metastore_db directory is created in each working folder.
• What are the problems with the default metastore
 Users cannot see the tables created by others if they do not use the same
metastore_db.
 Only one embedded Derby database can access the database files at any given
point of time.
 This results in only one open Hive session with a metastore; it is not possible to have
multiple sessions with Derby as the metastore.
Solution
 We can use a standalone database, either on the same machine or on a
remote machine, as the metastore; any JDBC-compliant database can be used.
 MySQL is a popular choice for the standalone metastore.
Configuring MySQL as metastore
 Install MySQL Admin/Client
 Create a Hadoop user and grant permissions to the user
 mysql -u root –p
 mysql> Create user 'hadoop'@'localhost' identified by 'hadoop';
 mysql> Grant ALL on *.* to 'hadoop'@'localhost' with GRANT option;
 Modify the following properties in hive-site.xml to use MySQL instead of Derby. This creates a
database in MySQL named Hive:
 name : javax.jdo.option.ConnectionURL
 value :
jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true
 name : javax.jdo.option.ConnectionDriverName
 value : com.mysql.jdbc.Driver
 name : javax.jdo.option.ConnectionUserName
 value : hadoop
 name : javax.jdo.option.ConnectionPassword
 value : hadoop
Hive Program Structure
• The Hive Shell
 The shell is the primary way that we will interact with Hive, by issuing
commands in HiveQL.
 HiveQL is heavily influenced by MySQL, so if you are familiar with
MySQL, you should feel at home using Hive.
 The command must be terminated with a semicolon to tell Hive to
execute it.
 HiveQL is generally case insensitive.
 The Tab key will autocomplete Hive keywords and functions.
• Hive can run in non-interactive mode.
 Use the -f option to run the commands in the specified file:
 hive -f script.hql
 For short scripts, you can use the -e option to specify the commands
inline, in which case the final semicolon is not required.
 hive -e 'SELECT * FROM dummy'
Ser-de
• A SerDe is a combination of a Serializer and a
Deserializer (hence, Ser-De).
• The Serializer takes a Java object that Hive
has been working with and turns it into something that
Hive can write to HDFS or another supported system.
• The Serializer is used when writing data, such as through an
INSERT-SELECT statement.
• The Deserializer interface takes a string or binary
representation of a record and translates it into a Java
object that Hive can manipulate.
• The Deserializer is used at query time to execute SELECT
statements.
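As an illustration (not from the original slides), a table can also name its SerDe explicitly. The sketch below uses the built-in LazySimpleSerDe, Hive's default SerDe for delimited text; the table name and delimiter are assumptions:
CREATE TABLE serde_demo (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ("field.delim" = ",")
STORED AS TEXTFILE;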
Hive Tables
A Hive table is logically made up of the data being stored in HDFS and the
associated metadata describing the layout of the data, which is kept in the metastore.
• Managed Table
 When you create a table in Hive and load data into a managed table, it is moved into
Hive’s warehouse directory.
 CREATE TABLE managed_table (dummy STRING);
 LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
• External Table
 Alternatively, you may create an external table, which tells Hive to refer to the data that
is at an existing location outside the warehouse directory.
 The location of the external data is specified at table creation time:
 CREATE EXTERNAL TABLE external_table (dummy STRING)
 LOCATION '/user/tom/external_table';
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
• When you drop an external table, Hive will leave the data untouched and
only delete the metadata.
• Hive does not do any transformation while loading data into tables. Load
operations are currently pure copy/move operations that move data files
into locations corresponding to Hive tables.
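A quick way to check whether an existing table is managed or external (a usage sketch; managed_table is the table created above):
hive> DESCRIBE FORMATTED managed_table;
-- The output includes a "Table Type:" row showing MANAGED_TABLE or EXTERNAL_TABLE.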
Storage Format
Text File
When you create a table with no ROW FORMAT or STORED AS
clauses, the default format is delimited text with one row per
line.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Storage Format
RC: Record Columnar File
The RC format was designed for clusters with MapReduce in
mind. It is a huge step up over standard text files. It’s a mature
format with ways to ingest into the cluster without ETL. It is
supported in several Hadoop system components.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
Storage Format
ORC: Optimized Row Columnar File
The ORC format is available from Hive 0.11 onwards. As the name
implies, it is more optimized than the RC format. If you want to
hold onto speed and compress the data as much as possible,
then ORC is best.
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
(Equivalently, STORED AS ORC can be used.)
Practice Session
• CREATE DATABASE|SCHEMA [IF NOT EXISTS]
<database name>
or
hive> CREATE SCHEMA testdb;
SHOW DATABASES;
DROP SCHEMA userdb;
• CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT
EXISTS] [db_name.] table_name
• [(col_name data_type [COMMENT col_comment],
...)] [COMMENT table_comment] [ROW FORMAT
row_format] [STORED AS file_format]
• Loading data
LOAD DATA [LOCAL] INPATH
'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE table_name;
Create Table
• Managed Table
CREATE TABLE Student (sno int, sname string, year
int) row format delimited fields terminated by ',';
• External Table
CREATE EXTERNAL TABLE Student(sno int, sname
string, year int) row format delimited fields
terminated by ',' LOCATION '/user/external_table';
Load Data to table
To load local files into the Hive table location
• LOAD DATA local INPATH
'/home/cloudera/SampleDataFile/student_ma
rks.csv' INTO table Student;
To load a file located in HDFS into the Hive
table location
• LOAD DATA INPATH
'/user/cloudera/Student_Year.csv' INTO table
Student;
Table Commands
• Insert Data
 INSERT OVERWRITE TABLE targettable
select col1, col2 from source (to overwrite data)
 INSERT INTO TABLE targettbl
select col1, col2 from source (to append data)
• Multitable insert
 From sourcetable
INSERT OVERWRITE TABLE table1
select col1,col2 where condition1
INSERT OVERWRITE TABLE table2
select col1,col2 where condition2
• Create table..as Select
 Create table table1 as select col1,col2 from source;
• Create a new table with existing schema like other table
 Create table newtable like existingtable;
Database Commands
• Displays the list of all created databases.
Show Databases;
• To Create new database with default properties.
Create Database DBName;
• Create database with a comment
Create Database DBName comment 'holds backup data';
• To Use Database
Use DBName;
• To View the database details
DESCRIBE DATABASE EXTENDED DbName;
Table Commands
• To list all tables
Show Tables;
• Displaying all contents of the table
select * from <table-name>;
select * from Student_Year where year = 2011;
• Display header information along with Data
set hive.cli.print.header=true;
• Using Group by
select year,count(sno) from Student_Year group by
year;
Table Commands
• SubQueries
 A subquery is a SELECT statement that is embedded in another SQL
statement.
 Hive has limited support for subqueries, permitting a subquery in the
FROM clause of a SELECT statement, or in the WHERE clause in certain
cases.
 The following query finds the average maximum temperature for
every year and weather station:
SELECT year, AVG(max_temperature)
FROM (
SELECT year, MAX(temperature) AS max_temperature
FROM records2
GROUP BY year
) mt
GROUP BY year;
Table Commands
Alter table
• To Add column
 ALTER TABLE student ADD COLUMNS (Year string);
• To Modify a column
 ALTER TABLE table_name CHANGE old_col_name new_col_name
new_data_type;
• To Change the table name
 Alter table Employee RENAME to emp;
• To Drop a partition
 ALTER TABLE MyTable DROP PARTITION (age=17);
• DROP TABLE
 DROP TABLE operatordetails;
• Describe Table Schema
 Desc Employee;
 Describe extended Employee; -- displays detailed information
View
• A view is a sort of “virtual table” that is defined by a SELECT
statement.
• Views may also be used to restrict users’ access to particular
subsets of tables that they are authorized to see.
• In Hive, a view is not materialized to disk when it is created; rather,
the view’s SELECT statement is executed when the statement that
refers to the view is run.
• Views are included in the output of the SHOW TABLES command,
and you can see more details about a particular view, including the
query used to define it, by issuing the DESCRIBE EXTENDED
view_name command.
 Create Views
CREATE VIEW view_name (id,name) AS SELECT * from users;
 Drop a view
Drop view viewName;
Joins
• Only equality joins, outer joins, and left semi
joins are supported in Hive.
• Hive does not support join conditions that are
not equality conditions, as it is very difficult to
express such conditions as a MapReduce job.
Also, more than two tables can be joined in
Hive.
Example-Join
• hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
• hive> SELECT * FROM items;
2 Tie
4 Coat
3 Hat
1 Scarf
Table Commands
• Using Join
 One of the nice things about using Hive, rather than raw MapReduce,
is that Hive makes performing commonly used operations very simple.
• We can perform an inner join on the two tables as follows:
 hive> SELECT sales.*, items.* FROM sales JOIN items ON (sales.id =
items.id);
 hive> SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);
• You can see how many MapReduce jobs Hive will use for any particular
query by prefixing it with the EXPLAIN keyword:
• For even more detail, prefix the query with EXPLAIN EXTENDED.
 EXPLAIN SELECT sales.*, items.* FROM sales JOIN items ON (sales.id
= items.id);
• Outer joins
Outer joins allow you to find non-matches in the
tables being joined.
hive> SELECT sales.*, items.* FROM sales LEFT
OUTER JOIN items ON (sales.id = items.id);
hive> SELECT sales.*, items.* FROM sales RIGHT
OUTER JOIN items ON (sales.id = items.id);
hive>SELECT sales.*, items.* FROM sales FULL
OUTER JOIN items ON (sales.id = items.id);
Table Commands
Map Side Join
• If all but one of the tables being joined are small, the join can be
performed as a map only job.
• The query does not need a reducer. For every mapper of table a, table b is read
completely. A restriction is that a FULL/RIGHT OUTER JOIN b cannot be
performed.
• SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
Partitioning in Hive
• Using partitions, you can make it faster to execute queries on slices of the data.
• A table can have one or more partition columns.
• A separate data directory is created for each distinct value combination in the partition columns.
Partitioning in Hive
• Partitions are defined at the time of creating a table, using the PARTITIONED BY clause.
Static Partition (Example-1)
CREATE TABLE student_partnew (name STRING,id
int,marks String) PARTITIONED BY (pyear STRING) row
format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv'
INTO TABLE student_partnew PARTITION (pyear='2011');
LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv'
INTO TABLE student_partnew PARTITION (pyear='2012');
LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv'
INTO TABLE student_partnew PARTITION (pyear='2013');
Partitioning in Hive
Static Partition (Example-2)
• CREATE TABLE student_New (id int,name string,marks
int,year int) row format delimited fields terminated by ',';
• LOAD DATA local INPATH
'/home/notroot/Sandeep/DataSamples/Student_new.csv'
INTO table Student_New;
• CREATE TABLE student_part (id int,name string,marks int)
PARTITIONED BY (year STRING);
• INSERT into TABLE student_part PARTITION(year='2012')
SELECT id,name,marks from student_new WHERE
year='2012';
SHOW Partition
• SHOW PARTITIONS month_part;
Partitioning in Hive
Dynamic Partition
• To enable dynamic partitions
 set hive.exec.dynamic.partition=true;
(To enable dynamic partitions, by default it is false)
 set hive.exec.dynamic.partition.mode=nonstrict;
(Allows all partition columns to be dynamic; for that we have to enable the
nonstrict mode.)
 set hive.exec.max.dynamic.partitions.pernode=300;
(The default value is 100; modify it according to the number of partitions
expected in your case.)
 set hive.exec.max.created.files=150000;
(The default value is 100000, but for larger tables the number of created files
can exceed the default, so we may have to increase it.)
Partitioning in Hive
• CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string,
oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status
string, EYEAR STRING, EMONTH STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
• LOAD DATA local INPATH
'/home/notroot/Sandeep/DataSamples/user_info.csv' INTO TABLE
Stage_oper_Month;
• CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string, oper_name
String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY
(opr_status string, eyear STRING, eMONTH STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month
PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date,
oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH
DISTRIBUTE BY opr_status, eyear, eMONTH;
• (Select from partition table)
 Select oper_id, oper_name, oper_dept from Fact_oper_Month where
eyear=2010 and emonth=1;
Bucketing Features
• Partitioning gives effective results when there are a limited number of
partitions and comparatively equal-sized partitions.
• To overcome this limitation of partitioning, Hive provides the Bucketing
concept, another technique for decomposing table data sets into more
manageable parts.
• Bucketing is based on (hash of the bucketed column) mod (total number of
buckets); see the sketch after this list.
• Use the CLUSTERED BY clause to divide the table into buckets.
• Bucketing can be done along with partitioning on Hive tables, and even
without partitioning.
• Bucketed tables will create almost equally distributed data file parts.
• To populate a bucketed table, we need to set the property
 set hive.enforce.bucketing = true;
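To see which bucket a given key would land in, the bucketing function can be mimicked in a query. A rough sketch, assuming a table bucketed into 10 buckets on oper_id and using the stage_oper_month table from the later example (hash() and pmod() are built-in Hive functions):
hive> SELECT oper_id, pmod(hash(oper_id), 10) AS bucket_no
    > FROM stage_oper_month LIMIT 5;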
Bucketing Advantages
• Bucketed tables offer more efficient sampling than non-bucketed
tables. With sampling, we can try out queries on a fraction of the data
for testing and debugging purposes when the original data sets are
very huge.
• As the data files are equal sized parts, map-side joins will be faster
on bucketed tables than non-bucketed tables. In Map-side join, a
mapper processing a bucket of the left table knows that the
matching rows in the right table will be in its corresponding bucket,
so it only retrieves that bucket (which is a small fraction of all the
data stored in the right table).
• Similar to partitioning, bucketed tables provide faster query
responses than non-bucketed tables.
Bucketing Example
• We can create bucketed tables with the help of the CLUSTERED BY clause and
an optional SORTED BY clause in the CREATE TABLE statement, and the DISTRIBUTE
BY clause in the INSERT statement.
• CREATE TABLE Month_bucketed (oper_id string, Creation_Date string,
oper_name String, oper_age int,oper_dept String, oper_dept_id int,
opr_status string, eyear string , emonth string) CLUSTERED BY(oper_id)
SORTED BY (oper_id,Creation_Date) INTO 10 BUCKETS ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
Similar to partitioned tables, we cannot directly load bucketed tables
with LOAD DATA (LOCAL) INPATH command, rather we need to use INSERT
OVERWRITE TABLE … SELECT …FROM clause from another table to populate
the bucketed tables.
• INSERT OVERWRITE TABLE Month_bucketed SELECT oper_id,
Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id,
opr_status, EYEAR, EMONTH FROM stage_oper_month DISTRIBUTE BY
oper_id sort by oper_id, Creation_Date;
Partitioning with Bucketing
• CREATE TABLE Month_Part_bucketed (oper_id string,
Creation_Date string, oper_name String, oper_age int,oper_dept
String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear
STRING, eMONTH STRING) CLUSTERED BY(oper_id) SORTED BY
(oper_id,Creation_Date) INTO 12 BUCKETS ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month stg INSERT OVERWRITE TABLE
Month_Part_bucketed PARTITION(opr_status, eyear, eMONTH)
SELECT stg.oper_id, stg.Creation_Date, stg.oper_name,
stg.oper_age, stg.oper_dept, stg.oper_dept_id, stg.opr_status,
stg.EYEAR, stg.EMONTH DISTRIBUTE BY opr_status, eyear,
eMONTH;
Note: Unlike partitioned columns (which are not included in table
columns definition), Bucketed columns are included in table
definition as shown in above code
for oper_id and creation_date columns.
Table Sampling in Hive
Table Sampling in Hive is nothing but extracting a small fraction of data from
the original large data sets. It is similar to the LIMIT operator in Hive.
Difference between LIMIT and TABLESAMPLE in Hive:
 In many cases a LIMIT clause executes the entire query, and then only returns
limited results.
 But sampling will only select a portion of data to perform the query.
To see the performance difference between bucketed and non-bucketed
tables:
 Query-1: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept
FROM month_bucketed TABLESAMPLE(BUCKET 12 OUT OF 12 ON oper_id);
 Query-2: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept
FROM stage_oper_month limit 18;
Note: Query-1 should always perform faster than Query-2.
To perform random sampling with Hive
 SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM
month_bucketed TABLESAMPLE (1 percent);
Hive UDF
• A UDF is a piece of Java code which must satisfy the following two properties:
• A UDF must implement at least one evaluate() method.
• A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
Sample UDF
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}
• hive> add jar my_jar.jar;
• hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
• hive> select empid , my_lower(empname) from employee;
Hive UDAF
• A UDAF works on multiple input rows and creates a single output
row. Aggregate functions include such functions as COUNT and
MAX.
• An aggregate function is more difficult to write than a regular UDF.
• A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF
• It must contain one or more nested static classes implementing
org.apache.hadoop.hive.ql.exec.UDAFEvaluator
An evaluator must implement five methods
• init()
 The init() method initializes the evaluator and resets its internal
state.
 In MaximumIntUDAFEvaluator, we set the IntWritable object
holding the final result to null.
Hive UDAF
• iterate()
 The iterate() method is called every time there is a new value to be
aggregated. The evaluator should update its internal state with the result
of performing the aggregation. The arguments that iterate() takes
correspond to those in the Hive function from which it was called.
 In this example, there is only one argument. The value is first checked to
see whether it is null, and if it is, it is ignored. Otherwise, the result
instance variable is set either to value’s integer value (if this is the first
value that has been seen) or to the larger of the current result and value
(if one or more values have already been seen). We return true to indicate
that the input value was valid.
• terminatePartial()
 The terminatePartial() method is called when Hive wants a result for the
partial aggregation. The method must return an object that encapsulates
the state of the aggregation.
 In this case, an IntWritable suffices because it encapsulates either the
maximum value seen or null if no values have been processed.
Hive UDAF
• merge()
 The merge() method is called when Hive decides to combine one
partial aggregation with another. The method takes a single object,
whose type must correspond to the return type of the
terminatePartial() method.
 In this example, the merge() method can simply delegate to the
iterate() method because the partial aggregation is represented in the
same way as a value being aggregated. This is not generally the
case(we’ll see a more general example later), and the method should
implement the logic to combine the evaluator’s state with the state of
the partial aggregation.
• terminate()
 The terminate() method is called when the final result of the
aggregation is needed. The evaluator should return its state as a value.
 In this case, we return the result instance variable.
Hive UDAF
package com.hadoopbook.hive;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;
public class HiveUDAFSample extends UDAF {
  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;

    public void init() {
      result = null;
    }

    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      return result;
    }

    public boolean merge(IntWritable other) {
      return iterate(other);
    }

    public IntWritable terminate() {
      return result;
    }
  }
}
Hive UDAF
To Use UDAF in hive;
• hive> add jar my_jar.jar;
• hive> CREATE TEMPORARY FUNCTION maximum AS
'com.hadoopbook.hive.HiveUDAFSample';
• hive>SELECT maximum(salary) FROM employee;
Performance Tuning
Partitioning Tables:
• Hive partitioning is an effective method to improve
query performance on larger tables. Partitioning allows
you to store data in separate sub-directories under the
table location. It greatly helps queries that filter on
the partition key(s). The selection of the partition key is
always a sensitive decision; it should always be a
low-cardinality attribute. For example, if your data is
associated with the time dimension, then date could be a
good partition key. Similarly, if the data is associated
with location, like a country or state, then it is a good
idea to have hierarchical partitions like country/state.
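A rough sketch of such a hierarchical partition layout; the table and column names below are assumptions for illustration, not from the slides:
CREATE TABLE sales_by_region (order_id STRING, amount DOUBLE)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Each (country, state) combination gets its own sub-directory under the
-- table location, e.g. .../sales_by_region/country=IN/state=KA/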
Performance Tuning
De-normalizing data:
• Normalization is a standard process used to model
your data tables with certain rules to deal with
redundancy of data and anomalies. In simpler words, if
you normalize your data sets, you end up creating
multiple relational tables which can be joined at the
run time to produce the results. Joins are expensive
and difficult operations to perform and are one of the
common reasons for performance issues. Because of
that, it’s a good idea to avoid highly normalized table
structures because they require join queries to derive
the desired metrics.
Performance Tuning
Compress map/reduce output:
• Compression techniques significantly reduce the intermediate data
volume, which internally reduces the amount of data transfers
between mappers and reducers. All this generally occurs over the
network. Compression can be applied on the mapper and reducer
output individually. Keep in mind that gzip compressed files are not
splittable. That means this should be applied with caution. A
compressed file size should not be larger than a few hundred
megabytes. Otherwise it can potentially lead to an imbalanced job.
• Other compression codec options include Snappy, LZO, bzip2, etc.
• For map output compression, set mapred.compress.map.output to
true.
• For job output compression, set mapred.output.compress to true.
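For example, these properties can be set per session from the Hive shell; the Snappy codec class below is shown as one possible choice (an assumption, not prescribed by the slides):
hive> SET mapred.compress.map.output=true;
hive> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET mapred.output.compress=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;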
Performance Tuning
Map join:
• Map joins are really efficient if the table on the
other side of a join is small enough to fit in
memory. Hive supports a parameter,
hive.auto.convert.join, which when set to
"true" lets Hive convert a join to a map join
automatically. When relying on this behaviour, be
sure auto conversion is enabled in the Hive
environment.
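A minimal usage sketch; the size threshold shown is only an illustrative value (tables smaller than hive.mapjoin.smalltable.filesize, in bytes, are candidates for automatic map join):
hive> SET hive.auto.convert.join=true;
hive> SET hive.mapjoin.smalltable.filesize=25000000;
hive> SELECT a.key, a.value FROM a JOIN b ON a.key = b.key;
-- Converted to a map-only join automatically if b is small enough.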
Performance Tuning
Bucketing:
• Bucketing improves the join performance if the bucket key and join
keys are common. Bucketing in Hive distributes the data in different
buckets based on the hash results on the bucket key. It also reduces
the I/O scans during the join process if the process is happening on
the same keys (columns).
• Additionally it’s important to ensure the bucketing flag is set (SET
hive.enforce.bucketing=true;) every time before writing data to the
bucketed table. To leverage the bucketing in the join operation we
should SET hive.optimize.bucketmapjoin=true. This setting hints to
Hive to do bucket level join during the map stage join. It also
reduces the scan cycles to find a particular key because bucketing
ensures that the key is present in a certain bucket.
Performance Tuning
Parallel execution:
• Hive queries are internally translated into a
number of MapReduce jobs, but having
multiple MapReduce jobs is not enough; the real
advantage comes from their parallel execution,
and simply writing a query does not achieve this.
• SELECT table1.a FROM
table1 JOIN table2 ON (table1.a =table2.a )
join table3 ON (table3.a=table1.a)
join table4 ON (table4.b=table3.b);
• Output: Execution time : 800 sec
Checking the execution plan for this query shows:
• Total Map-Reduce Jobs: 2.
• Serially launched and run.
Performance Tuning
Parallel execution:
• To achieve this, we rewrote the query in a
way that segregates it into independent units which
Hive can run as independent MapReduce jobs
executing in parallel. Following is what we did to our query:
• SELECT r1.a FROM
(SELECT table1.a FROM table1 JOIN table2 ON table1.a
=table2.a ) r1
JOIN
(SELECT table3.a FROM table3 JOIN table4 ON table3.b
=table4.b ) r2
ON (r1.a =r2.a) ;
• Output: Same results, but execution time: 464 secs
Observations:
• Total Map-Reduce Jobs: 5.
• Jobs are launched and run in parallel.
• Decrease in query execution time (around 50% in our case)
Points to Note:
• The hive.exec.parallel parameter needs to be set to true.
• To control how many jobs at most can be executed in
parallel, set the hive.exec.parallel.thread.number parameter.
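For example, from the Hive shell (the thread count of 8 is only an illustrative value):
hive> SET hive.exec.parallel=true;
hive> SET hive.exec.parallel.thread.number=8;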
Thank You
• Question?
• Feedback?
explorehadoop@gmail.com

More Related Content

What's hot

Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Edureka!
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentationpuneet yadav
 

What's hot (20)

Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache hive
Apache hiveApache hive
Apache hive
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Unit 5-apache hive
Unit 5-apache hiveUnit 5-apache hive
Unit 5-apache hive
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Hive
HiveHive
Hive
 

Viewers also liked

Big data analytics -hive
Big data analytics -hiveBig data analytics -hive
Big data analytics -hivekarthika karthi
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)Stéphane Fréchette
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4moai kids
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
HiveハンズオンSatoshi Noto
 
Programming Hive Reading #3
Programming Hive Reading #3Programming Hive Reading #3
Programming Hive Reading #3moai kids
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object ModelZheng Shao
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 

Viewers also liked (8)

Big data analytics -hive
Big data analytics -hiveBig data analytics -hive
Big data analytics -hive
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
Hiveハンズオン
 
Programming Hive Reading #3
Programming Hive Reading #3Programming Hive Reading #3
Programming Hive Reading #3
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object Model
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 

Similar to 6.hive

Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystemmashoodsyed66
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive AnalyticsManish Chopra
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxBhavanaHotchandani
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Apache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersApache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersYash Sharma
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxvishwasgarade1
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingearnwithme2522
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopHortonworks
 

Similar to 6.hive (20)

03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive Analytics
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptx
 
מיכאל
מיכאלמיכאל
מיכאל
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Apache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersApache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCoders
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Hive
HiveHive
Hive
 
Hive training
Hive trainingHive training
Hive training
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 

More from Prashant Gupta

More from Prashant Gupta (8)

Spark core
Spark coreSpark core
Spark core
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql Database
 
Sonar Tool - JAVA code analysis
Sonar Tool - JAVA code analysisSonar Tool - JAVA code analysis
Sonar Tool - JAVA code analysis
 

Recently uploaded

CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 

Recently uploaded (20)

CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

6.hive

  • 3. Hive • Data warehousing package built on top of hadoop. • Used for data analysis on structured data. • Targeted towards users comfortable with SQL. • It is similar to SQL and called HiveQL. • Abstracts complexity of hadoop. • No Java is required. • Developed by Facebook.
  • 4. Features of Hive How is it Different from SQL •The major difference is that a Hive query executes on a Hadoop infrastructure rather than a traditional database. •This allows Hive to handle huge data sets - data sets so large that high-end, expensive, traditional databases would fail. •The internal execution of a Hive query is via a series of automatically generated Map Reduce jobs
  • 5. When not to use Hive • Semi-structured or complete unstructured data. • Hive is not designed for online transaction processing. • It is best for batch jobs over large sets of data. • Latency for Hive queries is generally very high in minutes, even when data sets are very small (say a few hundred megabytes). • It cannot be compared with systems such as oracle where analyses are conducted on a significantly smaller amount of data.
  • 6. Install Hive •To install hive • untar the .gz file using tar –xvzf hive-0.13.0-bin.tar.gz •To initialize the environment variables, export the following: • export HADOOP_HOME=/home/usr/hadoop-0.20.2 (Specifies the location of the installation directory of hadoop.) • export HIVE_HOME=/home/usr/hive-0.13.0-bin (Specifies the location of the hive to the environment variable.) • export PATH=$PATH:$HIVE_HOME/bin Hive configurations • Hive default configuration is stored in hive-default.xml file in the conf directory • Hive comes configured to use derby as the metastore
  • 7. Hive Modes To start the hive shell, type hive and Enter. • Hive in Local mode No HDFS is required, All files run on local file system. hive> SET mapred.job.tracker=local • Hive in MapReduce(hadoop) mode hive> SET mapred.job.tracker=master:9001;
  • 8. Introducing data types • The primitive data types in hive include Integers, Boolean, Floating point, Date,Timestamp and Strings. • The below table lists the size of data types: Type Size ------------------------- TINYINT 1 byte SMALLINT 2 byte INT 4 byte BIGINT 8 byte FLOAT 4 byte (single precision floating point numbers) DOUBLE 8 byte (double precision floating point numbers) BOOLEAN TRUE/FALSE value STRING Max size is 2GB. • Complex data Type : Array ,Map ,Structs
  • 9. Configuring Hive • Hive is configured using an XML configuration file called hivesite.xml and is located in Hive’s conf directory. • Execution engines  Hive was originally written to use MapReduce as its execution engine, and that is still the default.  We can use Apache Tez as its execution engine, and also work is underway to support Spark, too. Both Tez and Spark are general directed acyclic graph (DAG) engines that offer more flexibility and higher performance than MapReduce.  It’s easy to switch the execution engine on a per-query basis, so you can see the effect of a different engine on a particular query.  Set Hive to use Tez: hive> SET hive.execution.engine=tez;  The execution engine is controlled by the hive.execution.engine property, which defaults to “mr” (for MapReduce).
  • 11. Components • Thrift Client It is possible to interact with hive by using any programming language that usages Thrift server. For e.g. Python Ruby • JDBC Driver Hive provides a pure java JDBC driver for java application to connect to hive , defined in the class org.hadoop.hive.jdbc.HiveDriver • ODBC Driver An ODBC driver allows application that supports ODBC protocol
• 12. Components
• Metastore
 This is the central repository for Hive metadata.
 By default, Hive is configured to use Derby as the metastore. With this configuration, a metastore_db directory is created in each working folder.
• Problems with the default metastore
 Users cannot see the tables created by others unless they use the same metastore_db.
 Only one embedded Derby database can access the database files at any given point in time, so only one Hive session can be open against the metastore; multiple concurrent sessions are not possible with Derby as the metastore.
• Solution
 We can use a standalone database, either on the same machine or on a remote machine, as the metastore; any JDBC-compliant database can be used.
 MySQL is a popular choice for the standalone metastore.
• 13. Configuring MySQL as the metastore
 Install the MySQL server and client.
 Create a hadoop user in MySQL and grant permissions to it:
  mysql -u root -p
  mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
  mysql> GRANT ALL ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;
 Modify the following properties in hive-site.xml to use MySQL instead of Derby. This creates a database named Hive in MySQL:
  javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true
  javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
  javax.jdo.option.ConnectionUserName = hadoop
  javax.jdo.option.ConnectionPassword = hadoop
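As a minimal sketch, the same four properties go into hive-site.xml roughly as follows (host, port, database name, and credentials mirror the slide and are assumptions for a local MySQL install):

<!-- hive-site.xml: values mirror the slide; adjust host and credentials for your environment -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>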
• 14. Hive Program Structure
• The Hive Shell
 The shell is the primary way we interact with Hive, by issuing commands in HiveQL.
 HiveQL is heavily influenced by MySQL, so if you are familiar with MySQL you should feel at home using Hive.
 Each command must be terminated with a semicolon to tell Hive to execute it.
 HiveQL is generally case insensitive.
 The Tab key will autocomplete Hive keywords and functions.
• Hive can also run in non-interactive mode.
 Use the -f option to run the commands in a specified file: hive -f script.hql
 For short scripts, use the -e option to specify the commands inline, in which case the final semicolon is not required: hive -e 'SELECT * FROM dummy'
• 15. SerDe
• A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).
• The Serializer takes a Java object that Hive has been working with and turns it into something Hive can write to HDFS or another supported system. It is used when writing data, for example through an INSERT ... SELECT statement.
• The Deserializer takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. It is used at query time to execute SELECT statements. A short example of declaring a SerDe on a table follows below.
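As an illustrative sketch (the table name is an assumption), a specific SerDe is declared with the ROW FORMAT SERDE clause; newer Hive releases ship a CSV SerDe, for example:

-- OpenCSVSerde (bundled with newer Hive releases) parses quoted CSV; note it exposes every column as STRING
CREATE TABLE csv_student (sno STRING, sname STRING, year STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;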
• 16. Hive Tables
A Hive table is logically made up of the data stored in HDFS and the associated metadata describing the layout of that data, which is kept in the metastore.
• Managed Table
 When you create a table in Hive and load data into a managed table, the data is moved into Hive’s warehouse directory.
 CREATE TABLE managed_table (dummy STRING);
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External Table
 Alternatively, you may create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory.
 The location of the external data is specified at table creation time:
 CREATE EXTERNAL TABLE external_table (dummy STRING)
 LOCATION '/user/tom/external_table';
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
• When you drop an external table, Hive leaves the data untouched and only deletes the metadata (see the sketch below).
• Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
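A small sketch using the two tables above; the managed/external difference shows up when the tables are inspected and dropped:

DESCRIBE FORMATTED managed_table;    -- "Table Type" shows MANAGED_TABLE
DESCRIBE FORMATTED external_table;   -- "Table Type" shows EXTERNAL_TABLE
DROP TABLE managed_table;            -- deletes the metadata and the data under the warehouse directory
DROP TABLE external_table;           -- deletes the metadata only; the files under /user/tom/external_table remain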
• 17. Storage Format: Text File
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is delimited text with one row per line.
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
STORED AS
 INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
• 18. Storage Format: RC (Record Columnar File)
The RC format was designed for clusters with MapReduce in mind. It is a huge step up over standard text files: it is a mature format, data can be ingested into the cluster without ETL, and it is supported by several Hadoop ecosystem components.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
• 19. Storage Format: ORC (Optimized Row Columnar File)
The ORC format is available from Hive 0.11 onwards. As the name implies, it is more optimized than the RC format. If you want to hold onto speed and compress the data as much as possible, then ORC is best.
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
(Hive 0.11+ also accepts the shorthand STORED AS ORC, which implies the classes above.)
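A minimal sketch of an ORC-backed table (the table name is illustrative; the orc.compress table property selects the codec):

CREATE TABLE student_orc (sno INT, sname STRING, year INT)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');   -- ZLIB is the default; SNAPPY trades some compression for speed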
• 21.
• CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
 e.g. hive> CREATE SCHEMA testdb;
• SHOW DATABASES;
• DROP SCHEMA userdb;
• 22.
• CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
 [(col_name data_type [COMMENT col_comment], ...)]
 [COMMENT table_comment]
 [ROW FORMAT row_format]
 [STORED AS file_format]
• Loading data:
 LOAD DATA [LOCAL] INPATH 'hdfs_file_or_directory_path' INTO TABLE table_name
• 23. Create Table
• Managed Table
 CREATE TABLE Student (sno int, sname string, year int)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• External Table
 CREATE EXTERNAL TABLE Student (sno int, sname string, year int)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION '/user/external_table';
• 24. Load Data to Table
• To load a local file into the Hive table location:
 LOAD DATA LOCAL INPATH '/home/cloudera/SampleDataFile/student_marks.csv' INTO TABLE Student;
• To load a file already in HDFS into the Hive table location:
 LOAD DATA INPATH '/user/cloudera/Student_Year.csv' INTO TABLE Student;
• 25. Table Commands
• Insert data
 INSERT OVERWRITE TABLE targettable SELECT col1, col2 FROM source;  (overwrites existing data)
 INSERT INTO TABLE targettbl SELECT col1, col2 FROM source;  (appends data)
• Multi-table insert
 FROM sourcetable
 INSERT OVERWRITE TABLE table1 SELECT col1, col2 WHERE condition1
 INSERT OVERWRITE TABLE table2 SELECT col1, col2 WHERE condition2;
• CREATE TABLE ... AS SELECT
 CREATE TABLE table1 AS SELECT col1, col2 FROM source;
• Create a new table with the same schema as an existing table
 CREATE TABLE newtable LIKE existingtable;
(A concrete sketch using the Student_Year table follows below.)
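As an illustrative sketch, assuming Student_Year has the columns (sno, sname, year) used elsewhere in these slides; the per-year table names are assumptions:

-- copy only the schema of the source table
CREATE TABLE student_2011 LIKE Student_Year;
CREATE TABLE student_2012 LIKE Student_Year;
-- a single scan of the source feeds both targets
FROM Student_Year
INSERT OVERWRITE TABLE student_2011 SELECT * WHERE year = 2011
INSERT OVERWRITE TABLE student_2012 SELECT * WHERE year = 2012;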
• 26. Database Commands
• List all databases: SHOW DATABASES;
• Create a new database with default properties: CREATE DATABASE DBName;
• Create a database with a comment: CREATE DATABASE DBName COMMENT 'holds backup data';
• Switch to a database: USE DBName;
• View the database details: DESCRIBE DATABASE EXTENDED DBName;
• 27. Table Commands
• List all tables: SHOW TABLES;
• Display the contents of a table:
 SELECT * FROM <table-name>;
 SELECT * FROM Student_Year WHERE year = 2011;
• Display header information along with the data: SET hive.cli.print.header=true;
• Using GROUP BY: SELECT year, COUNT(sno) FROM Student_Year GROUP BY year;
• 28. Table Commands
• Subqueries
 A subquery is a SELECT statement that is embedded in another SQL statement.
 Hive has limited support for subqueries, permitting a subquery in the FROM clause of a SELECT statement, or in the WHERE clause in certain cases.
 The following query finds the average maximum temperature for every year and weather station:
 SELECT station, year, AVG(max_temperature)
 FROM (
   SELECT station, year, MAX(temperature) AS max_temperature
   FROM records2
   GROUP BY station, year
 ) mt
 GROUP BY station, year;
• 29. Table Commands: ALTER TABLE
• Add a column:
 ALTER TABLE student ADD COLUMNS (Year string);
• Modify a column:
 ALTER TABLE table_name CHANGE old_col_name new_col_name new_data_type;
• Rename a table:
 ALTER TABLE Employee RENAME TO emp;
• Drop a partition:
 ALTER TABLE MyTable DROP PARTITION (age=17);
• Drop a table:
 DROP TABLE operatordetails;
• Describe the table schema:
 DESC Employee;
 DESCRIBE EXTENDED Employee;  -- displays detailed information
• 30. View
• A view is a sort of “virtual table” that is defined by a SELECT statement.
• Views may also be used to restrict users’ access to particular subsets of tables that they are authorized to see (see the sketch below).
• In Hive, a view is not materialized to disk when it is created; rather, the view’s SELECT statement is executed when a statement that refers to the view is run.
• Views are included in the output of the SHOW TABLES command, and you can see more details about a particular view, including the query used to define it, by issuing the DESCRIBE EXTENDED view_name command.
• Create a view: CREATE VIEW view_name (id, name) AS SELECT * FROM users;
• Drop a view: DROP VIEW viewName;
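A small sketch of using a view to expose only a slice of the Student table from the earlier slides (the view name and filter are illustrative):

-- users who query this view see only 2013 rows and only two columns
CREATE VIEW student_2013_v AS
SELECT sno, sname FROM Student WHERE year = 2013;

SELECT * FROM student_2013_v;        -- runs the view's SELECT at query time
DESCRIBE EXTENDED student_2013_v;    -- shows the query used to define the view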
• 31. Joins
• Only equality joins, outer joins, and left semi joins are supported in Hive.
• Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job.
• More than two tables can be joined in Hive.
• 33. Example: Join
• hive> SELECT * FROM sales;
 Joe 2
 Hank 4
 Ali 0
 Eve 3
 Hank 2
• hive> SELECT * FROM items;
 2 Tie
 4 Coat
 3 Hat
 1 Scarf
• 34. Table Commands
• Using Join
 One of the nice things about using Hive, rather than raw MapReduce, is that Hive makes performing commonly used operations very simple.
• We can perform an inner join on the two tables as follows:
 hive> SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);
 hive> SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);
• You can see how many MapReduce jobs Hive will use for any particular query by prefixing it with the EXPLAIN keyword; for even more detail, prefix the query with EXPLAIN EXTENDED.
 EXPLAIN SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);
• 35. Table Commands
• Outer joins
 Outer joins allow you to find non-matches in the tables being joined.
 hive> SELECT sales.*, items.* FROM sales LEFT OUTER JOIN items ON (sales.id = items.id);
 hive> SELECT sales.*, items.* FROM sales RIGHT OUTER JOIN items ON (sales.id = items.id);
 hive> SELECT sales.*, items.* FROM sales FULL OUTER JOIN items ON (sales.id = items.id);
• 36. Map-Side Join
• If all but one of the tables being joined are small, the join can be performed as a map-only job; the query does not need a reducer.
• For every mapper of a, table b is read completely. A restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
• SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
• 37. Partitioning in Hive
• Using partitions, you can make it faster to execute queries on slices of the data.
• A table can have one or more partition columns.
• A separate data directory is created for each distinct value combination in the partition columns.
• 38. Partitioning in Hive
• Partitions are defined at table creation time using the PARTITIONED BY clause.
Static Partition (Example 1)
CREATE TABLE student_partnew (name STRING, id int, marks String)
PARTITIONED BY (pyear STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv' INTO TABLE student_partnew PARTITION (pyear='2011');
LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv' INTO TABLE student_partnew PARTITION (pyear='2012');
LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv' INTO TABLE student_partnew PARTITION (pyear='2013');
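After these loads, each partition value maps to its own sub-directory under the table directory. A quick sketch of what you would see (the warehouse path is the usual default and may differ in your setup):

hive> SHOW PARTITIONS student_partnew;
pyear=2011
pyear=2012
pyear=2013
-- on HDFS (assuming the default warehouse location):
-- /user/hive/warehouse/student_partnew/pyear=2011/std_2011.csv
-- /user/hive/warehouse/student_partnew/pyear=2012/std_2012.csv
-- /user/hive/warehouse/student_partnew/pyear=2013/std_2013.csv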
• 39. Partitioning in Hive
Static Partition (Example 2)
• CREATE TABLE student_New (id int, name string, marks int, year int)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• LOAD DATA LOCAL INPATH '/home/notroot/Sandeep/DataSamples/Student_new.csv' INTO TABLE Student_New;
• CREATE TABLE student_part (id int, name string, marks int)
 PARTITIONED BY (pyear STRING);
• INSERT INTO TABLE student_part PARTITION (pyear='2012')
 SELECT id, name, marks FROM student_new WHERE year=2012;
Show partitions
• SHOW PARTITIONS student_part;
• 40. Partitioning in Hive
Dynamic Partitions
• To enable dynamic partitioning:
 set hive.exec.dynamic.partition=true;  (enables dynamic partitions; by default it is false)
 set hive.exec.dynamic.partition.mode=nonstrict;  (nonstrict mode allows all partition columns to be determined dynamically; strict mode requires at least one static partition column)
 set hive.exec.max.dynamic.partitions.pernode=300;  (the default value is 100; adjust it according to the number of partitions expected in your case)
 set hive.exec.max.created.files=150000;  (the default value is 100000, but larger tables can exceed it, so the value may need to be increased)
  • 41. Partitioning in Hive • CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, EYEAR STRING, EMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; • LOAD DATA local INPATH '/home/notroot/Sandeep/DataSamples/user_info.csv'INTO TABLE Stage_oper_Month; • CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; • FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH; • (Select from partition table)  Select oper_id, oper_name, oper_dept from Fact_oper_Month where eyear=2010 and emonth=1;
• 42. Bucketing Features
• Partitioning gives effective results when there are a limited number of partitions of comparatively equal size.
• To overcome this limitation of partitioning, Hive provides bucketing, another technique for decomposing table data sets into more manageable parts.
• The bucket for a row is chosen by (hashing function on the bucketed column) mod (total number of buckets).
• Use the CLUSTERED BY clause to divide the table into buckets.
• Bucketing can be done along with partitioning on Hive tables, or without partitioning.
• Bucketed tables will create almost equally distributed data file parts.
• To populate a bucketed table, we need to set the property: set hive.enforce.bucketing = true;
• 43. Bucketing Advantages
• Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging when the original data sets are very large.
• As the data files are of roughly equal size, map-side joins are faster on bucketed tables than on non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
• Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables.
• 44. Bucketing Example
• We create bucketed tables with the CLUSTERED BY clause and an optional SORTED BY clause in the CREATE TABLE statement, and use a DISTRIBUTE BY clause in the insert statement that populates them.
• CREATE TABLE Month_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, eyear string, emonth string)
 CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 10 BUCKETS
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• Similar to partitioned tables, we cannot directly load bucketed tables with the LOAD DATA (LOCAL) INPATH command; instead we need to use an INSERT OVERWRITE TABLE ... SELECT ... FROM clause reading from another table to populate them.
• INSERT OVERWRITE TABLE Month_bucketed
 SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH
 FROM stage_oper_month
 DISTRIBUTE BY oper_id SORT BY oper_id, Creation_Date;
• 45. Partitioning with Bucketing
• CREATE TABLE Month_Part_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int)
 PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING)
 CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 12 BUCKETS
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month stg
 INSERT OVERWRITE TABLE Month_Part_bucketed PARTITION (opr_status, eyear, eMONTH)
 SELECT stg.oper_id, stg.Creation_Date, stg.oper_name, stg.oper_age, stg.oper_dept, stg.oper_dept_id, stg.opr_status, stg.EYEAR, stg.EMONTH
 DISTRIBUTE BY opr_status, eyear, eMONTH;
Note: Unlike partition columns (which are not included in the table column definitions), bucketed columns are included in the table definition, as shown above for the oper_id and Creation_Date columns.
• 46. Table Sampling in Hive
Table sampling in Hive is extracting a small fraction of data from the original large data set. It is similar to the LIMIT operator in Hive.
Difference between LIMIT and TABLESAMPLE in Hive:
 In many cases a LIMIT clause executes the entire query and then returns only the limited results.
 Sampling, by contrast, selects only a portion of the data to run the query against.
To see the performance difference between bucketed and non-bucketed tables:
 Query 1: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (BUCKET 12 OUT OF 12 ON oper_id);
 Query 2: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM stage_oper_month LIMIT 18;
Note: Query 1 should always perform faster than Query 2.
To perform random sampling with Hive:
 SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (1 PERCENT);
• 47. Hive UDF
• A UDF is Java code that must satisfy the following two properties:
 The UDF must implement at least one evaluate() method.
 The UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
Sample UDF:
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
To use the UDF:
• hive> add jar my_jar.jar;
• hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
• hive> select empid, my_lower(empname) from employee;
• 48. Hive UDAF
• A UDAF works on multiple input rows and creates a single output row. Aggregate functions include functions such as COUNT and MAX.
• An aggregate function is more difficult to write than a regular UDF.
• A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF and must contain one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
An evaluator must implement five methods:
• init()
 The init() method initializes the evaluator and resets its internal state.
 In MaximumIntUDAFEvaluator, we set the IntWritable object holding the final result to null.
• 49. Hive UDAF
• iterate()
 The iterate() method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation. The arguments that iterate() takes correspond to those in the Hive function from which it was called.
 In this example, there is only one argument. The value is first checked to see whether it is null, and if it is, it is ignored. Otherwise, the result instance variable is set either to value's integer value (if this is the first value that has been seen) or to the larger of the current result and value (if one or more values have already been seen). We return true to indicate that the input value was valid.
• terminatePartial()
 The terminatePartial() method is called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
 In this case, an IntWritable suffices because it encapsulates either the maximum value seen so far or null if no values have been processed.
• 50. Hive UDAF
• merge()
 The merge() method is called when Hive decides to combine one partial aggregation with another. The method takes a single object, whose type must correspond to the return type of the terminatePartial() method.
 In this example, the merge() method can simply delegate to the iterate() method because the partial aggregation is represented in the same way as a value being aggregated. This is not generally the case (we'll see a more general example later), and the method should implement the logic to combine the evaluator's state with the state of the partial aggregation.
• terminate()
 The terminate() method is called when the final result of the aggregation is needed. The evaluator should return its state as a value.
 In this case, we return the result instance variable.
• 51-52. Hive UDAF (sample code)
package com.hadoopbook.hive;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;
public class HiveUDAFSample extends UDAF {
  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;
    public void init() {
      result = null;
    }
    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }
    public IntWritable terminatePartial() {
      return result;
    }
    public boolean merge(IntWritable other) {
      return iterate(other);
    }
    public IntWritable terminate() {
      return result;
    }
  }
}
• 53. Hive UDAF
To use the UDAF in Hive:
• hive> add jar my_jar.jar;
• hive> CREATE TEMPORARY FUNCTION maximum AS 'com.hadoopbook.hive.HiveUDAFSample';
• hive> SELECT maximum(salary) FROM employee;
• 54. Performance Tuning
Partitioning tables:
• Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location, which greatly helps queries that filter on the partition key(s). Although the selection of the partition key is always a sensitive decision, it should always be a low-cardinality attribute; e.g. if your data is associated with a time dimension, then date is a good partition key. Similarly, if the data is associated with location, such as country or state, it is a good idea to use hierarchical partitions like country/state (see the sketch below).
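A hedged sketch of the hierarchical country/state partitioning described above (the table and column names are illustrative):

CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
PARTITIONED BY (country STRING, state STRING)   -- low-cardinality, hierarchical partition keys
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- queries that filter on the partition keys only read the matching sub-directories
SELECT COUNT(*) FROM page_views WHERE country = 'US' AND state = 'CA';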
  • 55. Performance Tuning De-normalizing data: • Normalization is a standard process used to model your data tables with certain rules to deal with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at the run time to produce the results. Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues. Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics.
• 56. Performance Tuning
Compress map/reduce output:
• Compression techniques significantly reduce the intermediate data volume, which in turn reduces the amount of data transferred between mappers and reducers, most of which happens over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so this should be applied with caution; a compressed file should not be larger than a few hundred megabytes, otherwise it can potentially lead to an imbalanced job.
• Other compression codec options include Snappy, LZO, bzip2, etc.
• For map output compression, set mapred.compress.map.output to true.
• For job output compression, set mapred.output.compress to true.
(A sketch of the corresponding SET commands follows below.)
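As a sketch, the properties named above can be set per session; the codec classes shown are assumptions about what is installed on the cluster:

-- compress intermediate map output
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compress the final job output
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;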
• 57. Performance Tuning
Map join:
• Map joins are really efficient if the table on one side of the join is small enough to fit in memory. Hive supports the parameter hive.auto.convert.join which, when set to true, lets Hive try to perform the map join automatically. When relying on this parameter, be sure auto-conversion is enabled in the Hive environment (see the sketch below).
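A minimal sketch; the small-table size threshold shown is the property's usual default and is an assumption here:

SET hive.auto.convert.join=true;                 -- let Hive convert eligible joins to map joins
SET hive.mapjoin.smalltable.filesize=25000000;   -- size in bytes below which a table is treated as small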
• 58. Performance Tuning
Bucketing:
• Bucketing improves join performance if the bucket key and the join keys are the same. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key. It also reduces the I/O scans during the join process if the join happens on the same keys (columns).
• Additionally, it is important to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to the bucketed table. To leverage bucketing in the join operation we should SET hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a bucket-level join during the map-stage join. It also reduces the scan cycles needed to find a particular key, because bucketing ensures that the key is present in a certain bucket.
• 59. Performance Tuning
Parallel execution:
• Hive queries are translated internally into a number of MapReduce jobs, but having multiple MapReduce jobs is not enough; the real advantage comes from running them in parallel, and simply writing the query does not achieve this.
• SELECT table1.a FROM table1 JOIN table2 ON (table1.a = table2.a) JOIN table3 ON (table3.a = table1.a) JOIN table4 ON (table4.b = table3.b);
• Output: execution time 800 sec. Checking the execution plan shows:
 Total MapReduce jobs: 2.
 The jobs are launched and run serially.
• 60. Performance Tuning
Parallel execution:
• To achieve parallelism, we rewrote the query to segregate it into independent units that Hive can run as independent MapReduce jobs in parallel:
• SELECT r1.a FROM (SELECT table1.a FROM table1 JOIN table2 ON table1.a = table2.a) r1 JOIN (SELECT table3.a FROM table3 JOIN table4 ON table3.b = table4.b) r2 ON (r1.a = r2.a);
• Output: same results, but execution time 464 sec. Observations:
 Total MapReduce jobs: 5.
 The jobs are launched and run in parallel.
 Query execution time decreased (around 50% in our case).
Points to note:
• The hive.exec.parallel parameter needs to be set to true.
• To control how many jobs at most can be executed in parallel, set the hive.exec.parallel.thread.number parameter (see the sketch below).
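A minimal sketch of the two settings; the thread count of 8 is Hive's usual default and is only an assumption here:

SET hive.exec.parallel=true;               -- allow independent stages of a query to run concurrently
SET hive.exec.parallel.thread.number=8;    -- run at most 8 jobs in parallel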
  • 61. Thank You • Question? • Feedback? explorehadoop@gmail.com

Editor's Notes

1. Arrays: Arrays in Hive are used the same way they are used in Java. Syntax: ARRAY<data_type>
Maps: Maps in Hive are similar to Java Maps. Syntax: MAP<primitive_type, data_type>
Structs: Structs in Hive group named fields of possibly different types, each optionally with a comment. Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
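A hedged sketch of a table that uses all three complex types (the table name, columns, and delimiters are illustrative):

CREATE TABLE emp_complex (
  name STRING,
  skills ARRAY<STRING>,
  scores MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';

-- element access: array index, map key, struct field
SELECT skills[0], scores['hive'], address.city FROM emp_complex;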
2. hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
  3. A restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.