3. HiveQL: Views
A view is a kind of “virtual table” that is defined by a SELECT statement.
A view allows a query to be saved and treated like a table. It is a logical construct: unlike a table, it does not
store any data.
Example: a query with a nested subquery:
FROM ( SELECT * FROM people JOIN cart
ON (cart.people_id=people.id) WHERE firstname='john'
) a SELECT a.lastname WHERE a.id=3;
Creating a view from the nested subquery:
CREATE VIEW shorter_join AS
SELECT * FROM people JOIN cart
ON (cart.people_id=people.id) WHERE firstname='john';
The final query, written against the view:
SELECT lastname FROM shorter_join WHERE id=3;
4. HiveQL: Indexes
The goal of Hive indexing is to improve the speed of query lookup on certain columns of a table.
Creating an Index:
CREATE INDEX employees_index
ON TABLE employees (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
IN TABLE employees_index_table
PARTITIONED BY (country, name)
COMMENT 'Employees indexed by country and name.';
Rebuilding the Index:
ALTER INDEX employees_index ON employees PARTITION (country = 'US') REBUILD;
5. Partitions
Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a
partition column, such as a date, country, or state.
Imagine logfiles where each record includes a timestamp. If we partitioned by date, then records for the same
date would be stored in the same partition.
A table may be partitioned in multiple dimensions, for example by date and then by country:
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Load data into a partitioned table:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
SHOW PARTITIONS logs;
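To make the partition idea concrete, the following stand-alone Java sketch builds the directory in which a partition's files would be stored, assuming the usual column=value subdirectory layout and assuming the table lives under /user/hive/warehouse/logs (the actual warehouse location depends on the installation). The class and method names are hypothetical and are not part of Hive; real Hive also escapes special characters in partition values, which this sketch ignores.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class PartitionPathSketch {
    // Appends one "column=value" subdirectory per partition column,
    // in the order the columns were declared in PARTITIONED BY.
    static String partitionDir(String tableDir, Map<String, String> spec) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableDir);
        for (Map.Entry<String, String> e : spec.entrySet()) {
            path.add(e.getKey() + "=" + e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order: dt first, then country.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("dt", "2001-01-01");
        spec.put("country", "GB");
        // Assumed table directory; adjust to the local warehouse path.
        System.out.println(partitionDir("/user/hive/warehouse/logs", spec));
    }
}
```

Because each partition is a separate directory, a query that filters on dt or country only needs to read the matching directories.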
6. Buckets
Partitions offer a convenient way to segregate data and to optimize queries. However, not all data sets lead to
sensible partitioning.
Bucketing is another technique for decomposing data sets into more manageable parts.
For example, suppose a table using the date dt as the top-level partition and the user_id as the second-level
partition leads to too many small partitions. Instead, we can partition by dt only and bucket the rows by user_id:
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 10 BUCKETS;
Sorted bucketing:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
Inserting data into a bucketed table:
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
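How a row ends up in a particular bucket can be sketched in a few lines of stand-alone Java. This sketch assumes the classic scheme in which the bucket number is the hash of the bucketing column modulo the number of buckets, and that the hash of an INT is the integer itself; newer Hive versions may use a different hash function, so treat the exact numbers as illustrative. The class and method names are hypothetical.

```java
final class BucketSketch {
    // Returns the bucket index for an INT bucketing column.
    static int bucketFor(int id, int numBuckets) {
        int hash = id; // assumption: the hash of an INT is its own value
        // Clear the sign bit so negative ids still map to a valid bucket.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int[] ids = {0, 2, 3, 4};
        for (int id : ids) {
            // 4 buckets, as in the bucketed_users table above
            System.out.println("id " + id + " -> bucket " + bucketFor(id, 4));
        }
    }
}
```

Under these assumptions, ids 0 and 4 share bucket 0, while ids 2 and 3 go to buckets 2 and 3.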
7. Hive UDFs
UDFs (user-defined functions) have to be written in Java, the language that Hive itself is written in. For other
languages, consider using a SELECT TRANSFORM query, which allows you to stream data through a user-defined script.
A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical
functions and string functions, are of this type.
A UDAF works on multiple input rows and creates a single output row. Aggregate functions include such functions
as COUNT and MAX.
A UDTF operates on a single row and produces multiple rows—a table—as output.
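The difference between the three kinds of function is mainly one of input and output shape. The plain Java sketch below mirrors those shapes only; it does not use the Hive API, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FunctionKindsSketch {
    // UDF shape: one input value in, one output value out.
    static String upper(String s) {
        return s.toUpperCase();
    }

    // UDAF shape: many input values in, one output value out (like MAX).
    static int max(List<Integer> values) {
        return Collections.max(values);
    }

    // UDTF shape: one input value in, several output rows out.
    static List<String> explode(String csv) {
        return Arrays.asList(csv.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upper("bee"));
        System.out.println(max(Arrays.asList(3, 9, 4)));
        System.out.println(explode("a,b,c"));
    }
}
```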
A UDF must satisfy the following two properties:
A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
A UDF must implement at least one evaluate() method.
CREATE FUNCTION strip AS 'com.hadoopbook.hive.Strip' USING JAR '/path/to/hive-examples.jar';
hive> SELECT strip(' bee ') FROM dummy;
bee
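The source of the Strip class is not shown here, so the following is only a guess at what its evaluate() method might do, written as stand-alone Java so it runs without Hive on the classpath. In a real UDF the class would extend org.apache.hadoop.hive.ql.exec.UDF and would typically be packaged into the JAR named in CREATE FUNCTION; the class name StripSketch and the use of String.trim() are assumptions.

```java
// Stand-alone sketch; a real Hive UDF would be declared as
//   public class Strip extends org.apache.hadoop.hive.ql.exec.UDF
final class StripSketch {
    // Hive looks for a method named evaluate(); this one removes
    // leading and trailing whitespace from its argument.
    public String evaluate(String str) {
        if (str == null) {
            return null; // pass SQL NULL through unchanged
        }
        return str.trim();
    }

    public static void main(String[] args) {
        // Brackets make the stripped whitespace visible.
        System.out.println("[" + new StripSketch().evaluate(" bee ") + "]");
    }
}
```

Returning null for a null argument keeps the function consistent with how NULL values usually propagate through SQL expressions.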