A well-illustrated overview of Apache Hive: its definition, architecture and workflow, plus the data types available in Apache Hive.
Let me know if anything else is required. Happy to help.
Ping me: google #bobrupakroy.
2. Apache Hive
• Hive is a data warehouse platform built on top of Hadoop to provide data summarization, querying and analysis.
• Using Hadoop's MapReduce directly for query analysis is a bit complicated, so to overcome this Apache Hive came into the picture with its easy, structured, SQL-like query language that gets translated into MapReduce jobs for query analysis on Big Data.
• This SQL-like language for Hive has a name: Hive Query Language, or HiveQL (HQL) for short.
• Internally, a compiler translates HiveQL statements into a directed acyclic graph (DAG) of MapReduce jobs, which are submitted to Hadoop for execution.
• Finally, Hive gives us a mechanism to project structure onto data stored in HDFS so that the data can be queried efficiently.
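To see how familiar the SQL-like syntax is, here is a minimal sketch of a HiveQL query; the table and column names (employees, dept, salary) are hypothetical. Hive compiles a statement like this into MapReduce jobs behind the scenes.

```sql
-- Average salary per department; Hive turns this into MapReduce jobs.
-- Table and columns are made up for illustration.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept;
```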
Rupak Roy
3. Hive Architecture
• First, we have the following user interfaces:
CLI: Command Line Interface
Web UI: Web User Interface
Thrift Server: an optional service that allows remote clients to submit requests to Hive and execute jobs, from a variety of programming languages, over protocols similar to JDBC or ODBC.
ODBC and JDBC drivers: allow applications to connect to Hive. Both use Thrift to communicate with the Hive ecosystem.
• Second comes the Driver: it receives and forwards HiveQL statements and stores the metadata generated during their execution. It comprises 3 important components:
Compiler: compiles HiveQL into a directed acyclic graph (DAG) of Hadoop MapReduce jobs.
Optimizer: applies various transformations to produce an optimized DAG.
Executor: executes the tasks by interacting with the Hadoop ecosystem.
• Lastly, the Metastore: it stores metadata, i.e. information about table formats, schemas and locations. The metadata helps the driver keep track of the data distributed over the cluster.
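You can peek at what the compiler and optimizer produce with Hive's built-in EXPLAIN statement, which prints the plan (the stages that become MapReduce jobs) without running the query. The table name below is hypothetical.

```sql
-- Show the execution plan the driver would run for this query.
EXPLAIN
SELECT dept, COUNT(*)
FROM employees
GROUP BY dept;
```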
5. • HiveQL doesn't strictly follow the full SQL-92 standard.
• SQL for typical databases supports inserts, updates and deletes of individual rows; in HiveQL we can't do row-level updates and deletes, because the data is saved in HDFS, and we are aware that we cannot update data inside a file in HDFS.
• The alternative solution is to load a fresh file (or overwrite the table) using HiveQL.
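The "load a fresh file" alternative can be sketched in HiveQL like this; the path and table names are hypothetical.

```sql
-- Replace a table's data wholesale instead of updating rows in place.
LOAD DATA INPATH '/user/hive/new_data' OVERWRITE INTO TABLE employees;

-- Or rewrite the table from the result of a query:
INSERT OVERWRITE TABLE employees
SELECT * FROM employees_staging;
```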
6. Data Types of Apache Hive
• Some of the primitive data types:
1) Numeric Types:
TINYINT: 1-byte, ranges from -128 to 127
SMALLINT: 2-byte, ranges from -32,768 to 32,767
INT: 4-byte
BIGINT: 8-byte
FLOAT: 4-byte
DOUBLE: 8-byte
DECIMAL
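As a quick sketch of these numeric types in use, here is a hypothetical table definition (the table and column names are made up):

```sql
CREATE TABLE sales (
  item_id  INT,            -- 4-byte integer
  quantity SMALLINT,       -- 2-byte integer, fits -32,768 to 32,767
  price    DECIMAL(10, 2), -- exact decimal, e.g. for currency
  discount FLOAT,          -- 4-byte floating point
  total    DOUBLE          -- 8-byte floating point
);
```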
7. Data Types of Apache Hive
2. Date/Time Types
3. String Types: STRING, CHAR, VARCHAR
4. Others, like BOOLEAN and BINARY
Complex Data Types:
1) Arrays: a collection of elements of the same type.
Example: store data like array = (i, am, …);
then the elements can be accessed by index:
array[0] gives i
array[1] gives am
array[2] gives the third element, and so on.
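In HiveQL the array example above might look like the following sketch; the table and column names are hypothetical.

```sql
-- A column holding an array of strings.
CREATE TABLE messages (
  words ARRAY<STRING>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ',';

-- Elements are accessed by zero-based index:
SELECT words[0], words[1] FROM messages;
```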
8. Data Types of Apache Hive
2) STRUCTs: a collection of elements of different
data types.
Example: student STRUCT<name:STRING,
course:STRING, ID:SMALLINT>
(SMALLINT rather than TINYINT, since the sample ID 782 is outside the TINYINT range of -128 to 127.)
Now if we save the name as Emma, the course as MBA
and the ID as 782,
then we can access each field with dot
notation:
student.name, which will give us "Emma"
student.course, which will give us "MBA"
student.ID, which will give us 782
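The STRUCT example as a HiveQL sketch (table name is hypothetical; SMALLINT is used for the ID because the sample value 782 does not fit in a TINYINT):

```sql
-- A column holding a struct with fields of different types.
CREATE TABLE students (
  student STRUCT<name:STRING, course:STRING, id:SMALLINT>
);

-- Fields are accessed with dot notation:
SELECT student.name, student.course, student.id FROM students;
```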
9. Data Types of Apache Hive
• MAP: a collection of elements in key-value pairs.
The keys and values can be of different data
types (the key must be a primitive type).
Syntax: MAP<primitive_type, data_type>
Example: student MAP<STRING, SMALLINT>
Now if we save the STRING key as "Emma"
and the SMALLINT value as 782,
then the value can be accessed with
student["Emma"], which will give 782
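The MAP example as a HiveQL sketch; the table and column names are hypothetical, and SMALLINT is used so that the sample value 782 fits (TINYINT tops out at 127):

```sql
-- A column holding string-keyed small-integer values.
CREATE TABLE students (
  score MAP<STRING, SMALLINT>
);

-- Values are looked up by key:
SELECT score['Emma'] FROM students;
```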
10. Next
• We will learn how to install Hive, how to import
and export data, and more.