A well-illustrated overview of Apache Hive: its definition, architecture and workflow, plus the data types available in Apache Hive.
Let me know if anything else is required. Happy to help.
Ping me: google #bobrupakroy.
2. Apache Hive
• Hive is a data warehouse platform built on top of Hadoop to provide data summarization, querying and analysis.
• Using Hadoop's MapReduce directly for query analysis is a bit complicated, so to overcome this Apache Hive came into the picture with its easy, structured, SQL-like query language that gets translated into MapReduce jobs for query analysis on Big Data.
• This SQL-like language for Hive has a name: Hive Query Language, or HiveQL (HQL) for short.
• Internally, a compiler translates HiveQL statements into a directed acyclic graph (DAG) of MapReduce jobs, which are submitted to Hadoop for execution.
• Finally, Hive gives us a mechanism to project structure onto data stored in HDFS so that the data can be queried efficiently.
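To see how familiar the SQL-like syntax is, here is a minimal sketch of a HiveQL query; the table and column names (employees, dept, salary) are hypothetical. Hive compiles a statement like this into MapReduce jobs behind the scenes.

```sql
-- Average salary per department; Hive turns this into MapReduce jobs.
-- Table and columns are made up for illustration.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept;
```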
Rupak Roy
3. Hive Architecture
• First, we have the following user interfaces:
CLI: Command Line Interface
Web UI: Web User Interface
Thrift Server: an optional service that allows remote clients to submit requests to Hive and execute jobs, from a variety of programming languages, over protocols similar to JDBC or ODBC.
ODBC and JDBC drivers: allow applications to connect to Hive. Both use Thrift to communicate with the Hive ecosystem.
• Second comes the Driver: it receives and forwards HiveQL statements and stores the metadata generated during their execution. It comprises 3 important components:
Compiler: compiles HiveQL into a directed acyclic graph (DAG) of Hadoop MapReduce jobs.
Optimizer: applies various transformations to produce an optimized DAG.
Executor: executes the tasks by interacting with the Hadoop ecosystem.
• Lastly, the Metastore: it stores metadata, i.e. information about table formats, schemas and locations. The metadata helps the driver keep track of the data distributed over the cluster.
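You can peek at what the compiler and optimizer produce with Hive's built-in EXPLAIN statement, which prints the plan (the stages that become MapReduce jobs) without running the query. The table name below is hypothetical.

```sql
-- Show the execution plan the driver would run for this query.
EXPLAIN
SELECT dept, COUNT(*)
FROM employees
GROUP BY dept;
```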
5. • HiveQL doesn't strictly follow the full SQL-92 standard.
• SQL for typical databases supports inserts, updates and deletes of individual rows; in HiveQL we can't do row-level updates and deletes, because the data is saved in HDFS, and we are aware that we cannot update data inside a file in HDFS.
• The alternative solution is to load a fresh file (or overwrite the table) using HiveQL.
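The "load a fresh file" alternative can be sketched in HiveQL like this; the path and table names are hypothetical.

```sql
-- Replace a table's data wholesale instead of updating rows in place.
LOAD DATA INPATH '/user/hive/new_data' OVERWRITE INTO TABLE employees;

-- Or rewrite the table from the result of a query:
INSERT OVERWRITE TABLE employees
SELECT * FROM employees_staging;
```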
6. Data Types of Apache Hive
• Some of the primitive data types:
1) Numeric Types:
TINYINT: 1-byte, ranges from -128 to 127
SMALLINT: 2-byte, ranges from -32,768 to 32,767
INT: 4-byte
BIGINT: 8-byte
FLOAT: 4-byte
DOUBLE: 8-byte
DECIMAL
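As a quick sketch of these numeric types in use, here is a hypothetical table definition (the table and column names are made up):

```sql
CREATE TABLE sales (
  item_id  INT,            -- 4-byte integer
  quantity SMALLINT,       -- 2-byte integer, fits -32,768 to 32,767
  price    DECIMAL(10, 2), -- exact decimal, e.g. for currency
  discount FLOAT,          -- 4-byte floating point
  total    DOUBLE          -- 8-byte floating point
);
```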
7. Data Types of Apache Hive
2. Date/Time Types
3. String Types: STRING, CHAR, VARCHAR
4. Others, like BOOLEAN and BINARY
Complex Data Types:
1) Arrays: a collection of elements of the same type.
Example: store data like array = (i, am, …);
then the elements can be accessed by index:
array[0] gives i
array[1] gives am
array[2] gives the third element, and so on.
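In HiveQL the array example above might look like the following sketch; the table and column names are hypothetical.

```sql
-- A column holding an array of strings.
CREATE TABLE messages (
  words ARRAY<STRING>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ',';

-- Elements are accessed by zero-based index:
SELECT words[0], words[1] FROM messages;
```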
8. Data Types of Apache Hive
2) STRUCTs: a collection of elements of different
data types.
Example: student STRUCT<name:STRING,
course:STRING, ID:SMALLINT>
(SMALLINT rather than TINYINT, since the sample ID 782 is outside the TINYINT range of -128 to 127.)
Now if we save the name as Emma, the course as MBA
and the ID as 782,
then we can access each field with dot
notation:
student.name, which will give us "Emma"
student.course, which will give us "MBA"
student.ID, which will give us 782
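The STRUCT example as a HiveQL sketch (table name is hypothetical; SMALLINT is used for the ID because the sample value 782 does not fit in a TINYINT):

```sql
-- A column holding a struct with fields of different types.
CREATE TABLE students (
  student STRUCT<name:STRING, course:STRING, id:SMALLINT>
);

-- Fields are accessed with dot notation:
SELECT student.name, student.course, student.id FROM students;
```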
9. Data Types of Apache Hive
• MAP: a collection of elements in key-value pairs.
The keys and values can be of different data
types (the key must be a primitive type).
Syntax: MAP<primitive_type, data_type>
Example: student MAP<STRING, SMALLINT>
Now if we save the STRING key as "Emma"
and the SMALLINT value as 782,
then the value can be accessed with
student["Emma"], which will give 782
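The MAP example as a HiveQL sketch; the table and column names are hypothetical, and SMALLINT is used so that the sample value 782 fits (TINYINT tops out at 127):

```sql
-- A column holding string-keyed small-integer values.
CREATE TABLE students (
  score MAP<STRING, SMALLINT>
);

-- Values are looked up by key:
SELECT score['Emma'] FROM students;
```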
10. Next
• We will learn how to install Hive, how to import
and export data, and more.