Apache Hive
Apache Hive & Hadoop
The de facto standard for SQL queries in Hadoop
Since its incubation in 2008, Apache Hive has been considered the de facto
standard for interactive SQL queries over petabytes of data in Hadoop.
With the completion of the Stinger Initiative, and the next phase of
Stinger.next, the Apache community has greatly improved Hive's speed, scale
and SQL semantics. Hive easily integrates with other critical data center
technologies using a familiar JDBC interface.
What Hive Does
Hadoop was built to organize and store massive amounts of data of all shapes,
sizes and formats. Because of Hadoop's "schema on read" architecture, a
Hadoop cluster is a perfect reservoir of heterogeneous data, structured and
unstructured, from a multitude of sources.
Data analysts use Hive to explore, structure and analyze that data, then turn it
into actionable business insight.
How Hive Works
The tables in Hive are similar to tables in a relational database, and data units
are organized in a taxonomy from larger to more granular units. Databases are
made up of tables, which are made up of partitions. Data can be accessed via
a simple query language, and Hive supports overwriting or appending data.
Within a particular database, data in the tables is serialized and each table has
a corresponding Hadoop Distributed File System (HDFS) directory. Each table
can be sub-divided into partitions that determine how data is distributed
within sub-directories of the table directory. Data within partitions can be
further broken down into buckets.
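As a hedged illustration of this layout (the table and column names below are hypothetical, not part of the exercise), a partitioned and bucketed table could be declared like this:

```sql
-- Hypothetical example: each distinct value of the partition column
-- (year) becomes an HDFS sub-directory under the table directory,
-- and rows within each partition are hashed into 4 bucket files.
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (year INT)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
```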
Hive supports all the common primitive data types such as BIGINT, BINARY,
BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING,
TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data
types to form complex data types, such as structs, maps and arrays.
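For illustration only (this hypothetical table is not part of the exercise), primitive types can be nested inside the complex types like so:

```sql
-- Hypothetical table mixing primitive and complex types
CREATE TABLE employees (
  name    STRING,
  salary  DOUBLE,
  skills  ARRAY<STRING>,                  -- e.g. ['sql', 'java']
  scores  MAP<STRING, INT>,               -- e.g. {'2014': 90}
  address STRUCT<city:STRING, zip:STRING>
);

-- Complex fields are read with [] indexing, map keys, and dot notation:
SELECT skills[0], scores['2014'], address.city FROM employees;
```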
As part of learning the Hive query language, we completed one sample
exercise on Hive.
The sample exercise has been taken from the following website:
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
Hive provides a platform to run SQL queries. This is more familiar to
programmers with a SQL background, since SQL queries are simple to
comprehend and understand.
Downloading the data :
The CSV files can be downloaded from the following zip archive:
http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip
Uploading the data:
Although the link from where we are solving the problem is an easy guide, we
would be using Hue to upload the files for execution of the program.
Steps to upload:
- Start the VM in VirtualBox and open an SSH terminal
- Log on to the address http://127.0.0.1:8000/
- Click on the file browser option, followed by view, then click on Hue
- Upload the 2 files, batting.csv and master.csv
Let's look into the program:
create table temp_batting (col_value STRING);
This would create the table to store the data. Remember, we have to save the
query with a name (create) and also press Execute once it is done.
Next we would load the batting.csv into the temp_batting table we created.
Refer to the snapshot below:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE
temp_batting;
As in SQL, for looking at the data we would write the SELECT command. Here
also we follow suit:
SELECT * FROM temp_batting;
The output is given below:
create table batting (player_id STRING, year INT, runs INT);
Through a regex operation we would copy the data from temp_batting
into batting. For this we need to create the table batting with 3 fields:
player_id, year and runs. After this we would group the data by year to
highlight the maximum runs obtained in each year.
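The regex-based copy described above is not spelled out in this write-up; following the Hortonworks tutorial this exercise is based on, it can be sketched with Hive's regexp_extract function, which pulls the first, second, and third comma-separated fields out of each raw line:

```sql
-- Parse player_id, year, and runs out of the raw CSV line stored in
-- temp_batting, and overwrite the batting table with the result.
INSERT OVERWRITE TABLE batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){3}', 1) run
FROM temp_batting;
```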
SELECT year, max(runs) FROM batting GROUP BY year;
In the last step we would find the player_id with the maximum runs. For this,
refer to the query below...
SELECT a.year, a.player_id, a.runs FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);
This was a simple yet engaging exercise to learn about HQL. An interesting
part we found: the batting.csv file vanishes from the view of the file browser.
Refer below...