Hive is a data warehousing infrastructure built on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL and lets users familiar with SQL run ad-hoc queries, summarization, and data analysis easily. At the same time, HiveQL allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that the built-in capabilities of the language may not support.
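One way to plug in a custom mapper is HiveQL's TRANSFORM clause, which streams rows through an external script. The sketch below is only an illustration: the script `my_mapper.py`, its path, and the table `page_views` with its columns are hypothetical names, not taken from these slides. The script is assumed to read tab-separated lines on stdin and write tab-separated lines on stdout.

```sql
-- Ship the (hypothetical) script to the cluster nodes
ADD FILE /tmp/my_mapper.py;

-- Stream two columns through the script and name/type its output columns
SELECT TRANSFORM (user_id, url)
       USING 'python my_mapper.py'
       AS (user_id STRING, domain STRING)
FROM page_views;
```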
2. HIVE Introduction
•A cost-effective, data-warehouse-style solution for Hadoop
•Hadoop base
–Cost-effective, very large scale, and flexible data management
•Familiar to a huge existing base of SQL users
•Easy to learn
•No need to write Java data access programs
3. HIVE Introduction
•SQL-like ad-hoc query, aggregation, and analysis of huge volumes of data
–SQL-like query language called HiveQL
•Hadoop base for cost-effective data management
–Map/reduce for execution
–Hadoop Distributed File System (HDFS) for storage
•JDBC/ODBC access
•Extensible
4. HIVE Introduction
•Schema on read vs. schema on write: schema on read improves flexibility
–Traditional databases enforce the schema at load time (schema on write)
–Hive enforces the schema when a query is issued (schema on read)
•Not designed for online transaction processing
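Schema on read can be sketched as follows. The table name `raw_ratings`, its columns, and the directory are hypothetical; the directory is assumed to already contain comma-separated text files. Creating the table does not read or validate the files. With the default delimited text format, a field that cannot be read as the declared type is expected to come back as NULL at query time.

```sql
-- Only metadata is recorded here; the files are not checked
CREATE EXTERNAL TABLE raw_ratings (movie_id INT, rating FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_ratings';

-- The schema is applied now, while the query reads the files
SELECT movie_id, rating
FROM raw_ratings
WHERE rating IS NULL;   -- rows whose rating field is missing or not numeric
```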
7. Data model – Database and table
•Location in HDFS
–Hive data is stored in HDFS under /user/hive/warehouse (default)
•Database
–Namespace that groups together related tables and other data units
–Each database is a parent folder in the Hive-specific directory in HDFS
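A quick way to see where a database lives is DESCRIBE DATABASE. In the sketch below, the `archive` database and its path are hypothetical and only show that the default location can be overridden.

```sql
CREATE DATABASE movies;

-- Prints the database's HDFS location, by default a movies.db
-- directory under the warehouse root
DESCRIBE DATABASE movies;

-- Optionally place a database somewhere else (hypothetical path)
CREATE DATABASE archive LOCATION '/data/archive_db';
```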
8. Data model – Database and table
•TABLE
–A collection of related columns
–Can be filtered, projected, joined, etc.
–Column types
–Primitives
•TINYINT,INT,BIGINT,BOOLEAN,DOUBLE,STRING
–Array of primitives
–Map of primitives (key value pairs)
–Structure made up of elements of different data types
•Accessed using dot notation
•Example:
CREATE TABLE complex_data_type (
  fruits ARRAY<STRING>,
  pass_list MAP<STRING,STRING>,
  car STRUCT<color:STRING, wheel_size:FLOAT>);
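The complex columns above can be read with index, key, and dot notation. This is a minimal sketch against the `complex_data_type` table; the map key 'alice' is a made-up value.

```sql
SELECT fruits[0]          AS first_fruit,  -- array element, zero-based index
       pass_list['alice'] AS alice_pass,   -- map value looked up by key
       car.color          AS car_color     -- struct field via dot notation
FROM complex_data_type;
```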
9. HiveQL is not SQL
•Not 100% ANSI-Compliant SQL
•Join predicates only support the equality operator
•No “insert into”
–Can’t insert into an existing table or data partition
–Only supports “insert overwrite”, so an insert always overwrites the existing data in the whole table or partition
•No “update” or “delete”
•No access control language supported
•Incomplete support for correlated subquery
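The two most visible restrictions can be sketched with the `movies` table used later in these slides. The tables `reviews` (movie_id, score) and `top_movies` (id, name, rating) are hypothetical and assumed to exist already.

```sql
-- Join with an equality predicate
SELECT m.name, r.score
FROM movies m
JOIN reviews r ON (m.id = r.movie_id);

-- INSERT OVERWRITE replaces everything currently stored in top_movies
INSERT OVERWRITE TABLE top_movies
SELECT id, name, rating
FROM movies
WHERE rating >= 4.0;
```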
10. Hive Benefits
•Bridges the gap between low-level Java programming for Hadoop and SQL
•ODBC/JDBC interfaces enable access from many commercial business intelligence and ETL tools
•Leverages Hadoop and supports partitioning for scalability and performance
•Extensible (UDF, SerDe, etc.)
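Partitioning can be sketched as below. The table `movies_by_year` is a hypothetical name; it is filled from the `movies` table defined later in these slides. Each partition value gets its own directory, so a filter on the partition column lets Hive skip the other directories.

```sql
CREATE TABLE movies_by_year (id INT, name STRING, rating FLOAT)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Rewrites only the year=2010 partition
INSERT OVERWRITE TABLE movies_by_year PARTITION (year = 2010)
SELECT id, name, rating
FROM movies
WHERE year = 2010;

-- Reads only the year=2010 partition
SELECT name FROM movies_by_year WHERE year = 2010;
```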
11. Datatypes in HIVE
Primitive data types:
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
13. DATABASES in HIVE
A database is a catalog or namespace of tables, used to avoid table name collisions.
SYNTAX:
hive> CREATE DATABASE movies;
You can see the databases that already exist as follows:
hive> SHOW DATABASES;
Setting a database as your working database:
hive> USE movies;
If not specified, the default database is used.
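How a database avoids name collisions can be sketched as follows. The `music` database and the `titles` tables are hypothetical names used only for illustration.

```sql
CREATE DATABASE IF NOT EXISTS movies;
CREATE DATABASE IF NOT EXISTS music;

-- The same table name can exist once per database
CREATE TABLE movies.titles (id INT, name STRING);
CREATE TABLE music.titles  (id INT, name STRING);

-- Qualify the table with its database, or switch with USE
SELECT name FROM music.titles;
USE movies;
SELECT name FROM titles;   -- resolves to movies.titles
```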
15. Creating Tables
Table creation SYNTAX:
hive> CREATE TABLE movies (id INT, name STRING,
year INT, rating FLOAT, duration FLOAT)
row format delimited fields terminated by '
Showing tables in a database SYNTAX:
hive> SHOW TABLES;
Showing details about the table SYNTAX:
hive> DESCRIBE movies;
Deleting a table SYNTAX:
hive> DROP TABLE movies;
16. Managed vs External table
The tables we have created so far are called managed tables (internal tables); Hive controls the life cycle of their data.
Managed tables are less convenient for sharing with other tools.
We can instead define an external table that points to the data but doesn’t take ownership of it.
SYNTAX:
CREATE EXTERNAL TABLE movies(......)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/movies';
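The ownership difference shows up when a table is dropped. In this sketch the table name `movies_ext` is hypothetical, and the column list is borrowed from the managed `movies` table on the earlier slide.

```sql
CREATE EXTERNAL TABLE movies_ext (id INT, name STRING, year INT,
                                  rating FLOAT, duration FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/movies';

-- Removes only the table definition from the metastore;
-- the files under /data/movies stay in place for other tools.
-- Dropping a managed table would delete its data as well.
DROP TABLE movies_ext;
```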
17. Alter Table
ALTER TABLE changes table properties: it modifies metadata about the table, not the data itself.
Renaming:
ALTER TABLE movies RENAME TO cinemas;
Adding Columns:
ALTER TABLE movies ADD COLUMNS (language string);
Renaming a column and changing its position:
ALTER TABLE movies
CHANGE COLUMN name names string AFTER year;
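Because only metadata changes, rows that were already stored are not rewritten. A minimal sketch, assuming `movies` is the delimited text table created earlier and already holds data:

```sql
ALTER TABLE movies ADD COLUMNS (language STRING);

-- The new column appears at the end of the schema
DESCRIBE movies;

-- Existing files have no such field, so old rows are expected
-- to show NULL for language
SELECT name, language FROM movies;
```

The same applies to CHANGE COLUMN ... AFTER: the fields inside existing files keep their original order, so reordering columns on a table that already holds data can misalign values.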
18. Loading data
Hive has no row-level insert, update, or delete operations; the only way to put data into a table is to use one of the “bulk” load operations.
From HDFS SYNTAX:
load data inpath '/user/divya/dataset/movie.csv'
into table movies;
From local system SYNTAX:
load data LOCAL inpath '/home/divya/dataset/movie.into table movies;
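By default a load adds files to whatever the table already holds. The OVERWRITE keyword replaces the existing contents instead. The file path in this sketch is hypothetical.

```sql
-- Replace the table's current files with the new file
LOAD DATA INPATH '/user/divya/dataset/movie_new.csv'
OVERWRITE INTO TABLE movies;

-- Quick sanity check after loading
SELECT COUNT(*) FROM movies;
```

Loading from an HDFS path moves the file into the table's directory, while LOCAL copies it from the local file system.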