HIVE
Bucharest Java User Group
July 3, 2014
whoami
• Developer with SQL Server team since 2001
• Apache contributor
• Hive
• Hadoop core (security)
• stackoverflow user 105929s
• @rusanu
What is HIVE
• Data warehouse for querying and managing large datasets
• A query engine that uses Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook
VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework
• ASF top-level project since September 2010
What is Hadoop
Hadoop Core
• Distributed execution engine
• MapReduce
• YARN
• TEZ
• Distributed File System HDFS
• Tools for administering the execution engine and HDFS
• Libraries for writing MapReduce jobs
Hadoop Ecosystem
• HBase (BigTable)
• Pig (scripting query language)
• Hive (SQL)
• Storm (Stream Processing)
• Flume (Data Aggregator)
• Sqoop (RDBMS bulk data transfer)
• Oozie (workflow scheduling)
• Mahout (machine learning)
• Falcon (data lifecycle)
• Spark, Cassandra etc (not based on Hadoop)
hadoopecosystemtable.github.io
How does Hadoop work
• JOB: binary code (Java JAR), configuration XML, any additional file(s)
• The job gets uploaded into the cluster file system (usually HDFS)
• SPLIT: a fragment of data (file) to be processed
• The input data is broken into several splits
• TASK: execution of the job JAR to process a split
• Scheduler attempts to execute the task near the data split
• MAP: takes unsorted, unclustered data and outputs clustered data
• SHUFFLE: takes clustered data and produces sorted data
• REDUCE: takes sorted data and produces desired output
• Synergies
• Processing locality: execute the code near the data storage, avoid data transfer
• Algorithms scalability:
• Map phase can scale out because it assumes no sorting and no clustering
• Reduce phase: algorithms are easy to write when data is guaranteed sorted and clustered
• Execution reliability (monitoring, retry, preemptive execution etc)
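The MAP, SHUFFLE and REDUCE phases described above can be illustrated with a small in-memory word count. This is only a sketch in plain Java, not Hadoop API code: the class and method names (`MapReduceSketch`, `map`, `shuffle`, `reduce`) are made up for the illustration, and everything runs in one JVM instead of as distributed tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of MAP -> SHUFFLE -> REDUCE; hypothetical names, no Hadoop classes.
final class MapReduceSketch {

    // MAP: called once per split; emits (word, 1) pairs in input order.
    static List<String[]> map(String split) {
        List<String[]> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new String[] {word, "1"});
            }
        }
        return out;
    }

    // SHUFFLE: groups map output by key; the TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<String[]> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        return grouped;
    }

    // REDUCE: sees each key once, together with all of its values.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<String[]> mapOutput = new ArrayList<>();
        for (String split : splits) {
            mapOutput.addAll(map(split)); // on a cluster each split would be a separate task
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hbase=1, hive=3, pig=2}
        System.out.println(wordCount(Arrays.asList("hive pig hive", "pig hbase hive")));
    }
}
```

The map function needs no knowledge of other splits, which is what lets the map phase scale out, while the reduce function can stay trivial because its input arrives grouped and sorted.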
MapReduce
How does Hive work
• SQL submitted via CLI or Hiveserver(2)
• Metadata describing tables stored in RDBMS
• Driver compiles/optimizes execution plan
• Plan and execution engine submitted to Hadoop as job
• MR invokes Hive execution engine which executes plan
[Diagram: Hive and Hadoop components. Hive side: CLI and Hiveserver2 (clients: shell, JDBC, ODBC, Beeswax), Driver (compiles, optimizes), Metastore (RDBMS), HCatalog. Hadoop side: Job Tracker, MapReduce tasks and their splits, HDFS.]
Hive Query execution
• Compilation/Optimization results in an AST containing operators, e.g.:
• FetchOperator: scans source data (the input split)
• SelectOperator: projects column values, computes
• GroupByOperator: aggregate functions (SUM, COUNT etc)
• JoinOperator: joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• Hive Driver dispatches jobs
• Multiple stages can result in multiple jobs
• Task execution picks up the plan and starts iterating the plan
• MR emits values (rows) into the topmost operator (Fetch)
• Rows propagate down the tree
• ReduceSinkOperator emits map output for shuffle
• Each operator implements both a map side and a reduce side algorithm
• Executes the one appropriate for the current task
• MR does the shuffle, many operators rely on it as part of their algorithm
• E.g. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output and the driver submits a new job to continue with the next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)
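The idea of rows being pushed into the top of an operator tree and propagating down can be sketched with a few lines of plain Java. The classes below (`FilterOp`, `SelectOp`, `SinkOp`) are hypothetical stand-ins that only mimic the shape of such a pipeline; they are not Hive's operator classes and ignore serialization, vectorization and the map/reduce split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical mini operator tree; names only resemble Hive's operators.
final class OperatorTreeSketch {

    // An operator receives a row, does its work and forwards the result to its child.
    abstract static class Operator {
        Operator child;

        abstract void process(Object[] row);

        void forward(Object[] row) {
            if (child != null) {
                child.process(row);
            }
        }
    }

    // Drops rows that do not satisfy the predicate.
    static final class FilterOp extends Operator {
        final Predicate<Object[]> predicate;

        FilterOp(Predicate<Object[]> predicate) { this.predicate = predicate; }

        void process(Object[] row) {
            if (predicate.test(row)) {
                forward(row);
            }
        }
    }

    // Projects each row to a new shape.
    static final class SelectOp extends Operator {
        final Function<Object[], Object[]> projection;

        SelectOp(Function<Object[], Object[]> projection) { this.projection = projection; }

        void process(Object[] row) { forward(projection.apply(row)); }
    }

    // Collects rows at the bottom of the tree (stand-in for emitting task output).
    static final class SinkOp extends Operator {
        final List<Object[]> rows = new ArrayList<>();

        void process(Object[] row) { rows.add(row); }
    }

    // Roughly: SELECT name FROM input WHERE amount > 10, with rows as {name, amount}.
    static List<Object[]> run(List<Object[]> input) {
        FilterOp filter = new FilterOp(r -> ((Integer) r[1]) > 10);
        SelectOp select = new SelectOp(r -> new Object[] {r[0]});
        SinkOp sink = new SinkOp();
        filter.child = select;
        select.child = sink;
        for (Object[] row : input) {
            filter.process(row); // the framework pushes each row into the topmost operator
        }
        return sink.rows;
    }

    public static void main(String[] args) {
        List<Object[]> input = new ArrayList<>();
        input.add(new Object[] {"a", 5});
        input.add(new Object[] {"b", 20});
        for (Object[] row : run(input)) {
            System.out.println(row[0]); // prints b
        }
    }
}
```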
Interacting with Hive
• hive from shell prompt launches CLI
• Run SQL command interactively
• Can execute a batch of commands from a file
• Results displayed in console
• hiveserver2 is a daemon
• JDBC and ODBC drivers for applications to connect to it
• Queries submitted via JDBC/ODBC
• Query results as JDBC/ODBC resultsets
• Other applications embed the Hive driver, e.g. Beeswax
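For a Java application, connecting to hiveserver2 goes through standard JDBC. The following is a minimal sketch under several assumptions that are not in the slides: a HiveServer2 instance listening on localhost at port 10000, the hive-jdbc driver on the classpath, an unsecured setup, and a placeholder user name `hive`. The helper `jdbcUrl` is a hypothetical convenience method.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class HiveJdbcSketch {

    // HiveServer2 JDBC URLs have the form jdbc:hive2://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Assumed endpoint and credentials; adjust for a real cluster.
        String url = jdbcUrl("localhost", 10000, "default");
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one table name per row
            }
        } catch (SQLException e) {
            // Also reached when no driver is on the classpath or no server is running.
            System.err.println("Could not query Hive: " + e.getMessage());
        }
    }
}
```

Older driver versions may additionally need the driver class `org.apache.hive.jdbc.HiveDriver` to be loaded explicitly before the connection is opened.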
Hive QL
• The dialect of SQL supported by Hive
• More similar to MySQL dialect than ANSI-SQL
• Drive toward ANSI-92 compliance (syntax, data types)
• Query language: SELECT
• DDL: CREATE/ALTER/DROP DATABASE/TABLE/PARTITION
• DML: Only bulk insert operations
• LOAD
• INSERT
• HIVE-5317 Implement insert, update, and delete in Hive with full ACID support
Supported data types
• Numeric
• tinyint, smallint, int, bigint
• float, double
• decimal(precision, scale)
• Date/Time
• timestamp
• date
• Character types
• string
• char(size)
• varchar(size)
• Misc. types
• boolean
• binary
• Complex types
• ARRAY<type>
• MAP<type, type>
• STRUCT<name:type, name:type>
• UNIONTYPE<type, type, type>
Storage Formats
• Text
• ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
• Gzip or Bzip2 is automatically detected
• SEQUENCEFILE (default map-reduce output)
• ORC Files
• Columnar, Compressed
• Certain features only enabled on ORC
• Parquet
• Columnar, Compressed
• Arbitrary SerDe (Serializer Deserializer)
DDL/Databases/Tables
• CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
• CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...])
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
• EXTERNAL tables are not owned by Hive (DROP TABLE leaves the files in place)
• Partitioning, Bucketing, Skew control allow precise control of file size (important for processing to achieve balanced MR splits)
• ALTER TABLE … EXCHANGE PARTITION allows for fast (metadata only) move of data.
• ALTER TABLE … ADD PARTITION adds to Hive metadata a partition already existing on disk
• MSCK REPAIR TABLE … scans on-disk files to discover partitions and synchronizes Hive metadata
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Data Load
• LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
• File format must match table format (no transformations)
• INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2
...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;
• OVERWRITE replaces the data in the table (TRUNCATE + INSERT)
• INTO appends the data (leaves existing data intact)
• Dynamic Partitioning
• Creates new partitions based on data
• https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert
• INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
• Writes a file without creating Hive table
Hive SELECT syntax
[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[HAVING having_condition]
[LIMIT number]
SELECT features
• REGEX column specifications
• SELECT `(ds|hr)?+.+` FROM sales
• Virtual columns
• INPUT__FILE__NAME
• BLOCK__OFFSET__INSIDE__FILE
• Sampling
• SELECT … FROM source TABLESAMPLE (BUCKET 3 OUT OF 32 ON rand());
• SELECT … FROM source TABLESAMPLE (1 PERCENT);
• SELECT … FROM source TABLESAMPLE (10M);
• SELECT … FROM source TABLESAMPLE (100 ROWS);
Clustering and Distribution
• ORDER BY
• In strict mode it must be followed by LIMIT, because a single final reducer is required to sort all output
• SORT BY
• Only guarantees the order of rows within each reducer
• With multiple final reducers the result is only partially ordered
• DISTRIBUTE BY
• Specifies how to distribute the rows to reducers, but does not require order
• CLUSTER BY
• Syntactic sugar for DISTRIBUTE BY and SORT BY on the same columns
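The difference between a total order and the partial order produced by SORT BY can be simulated in a few lines. This is a sketch only: the routing by `hashCode` modulo the reducer count is an illustrative assumption, not a statement about Hive's exact partitioning function, and `SortBySketch` is a made-up name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SortBySketch {

    // DISTRIBUTE BY: route each row to a reducer by hashing its key (illustrative hash).
    static List<List<Integer>> distribute(List<Integer> rows, int reducers) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Integer row : rows) {
            buckets.get(Math.floorMod(row.hashCode(), reducers)).add(row);
        }
        return buckets;
    }

    // SORT BY: every reducer sorts only its own rows; outputs are simply concatenated.
    static List<Integer> sortBy(List<Integer> rows, int reducers) {
        List<Integer> output = new ArrayList<>();
        for (List<Integer> bucket : distribute(rows, reducers)) {
            Collections.sort(bucket);
            output.addAll(bucket);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(5, 2, 8, 1, 4, 7);
        System.out.println(sortBy(rows, 2)); // prints [2, 4, 8, 1, 5, 7]
        System.out.println(sortBy(rows, 1)); // prints [1, 2, 4, 5, 7, 8]
    }
}
```

With two reducers each half is sorted but the concatenation is not; with a single reducer the output is totally ordered, which is the ORDER BY case and the reason it does not scale.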
Subqueries
• In FROM clause
• SELECT … FROM (SELECT … FROM …) AS alias …
• In WHERE clause
• SELECT … FROM … WHERE EXISTS (SELECT …)
• SELECT … FROM … WHERE col IN (SELECT …)
• Must appear on the right-hand side in expressions
• IN/NOT IN must project exactly one column
• EXISTS/NOT EXISTS must contain correlated predicates
• Otherwise they’re JOINs
• Reference to parent query is only supported in WHERE clause subqueries
• References of course required for correlated sub-queries
Common Table Expressions (CTE)
• Supported for SELECT and INSERT
• Do not support recursive syntax
• with q1 as (
select key, value from src where key = '5')
from q1
insert overwrite table s1
select *;
Lateral Views
• Aka CROSS APPLY
• Apply a table function to every row
• SELECT … FROM table
LATERAL VIEW explode(column) exTable AS exCol;
• OUTER clause to include rows for which the function generates nothing
• Similar to ANSI-SQL OUTER APPLY
• Built-in table functions (UDTF):
• explode(ARRAY)
• explode(MAP)
• inline(STRUCT)
• json_tuple(json, k1, k2,…)
• Returns the values of k1, k2 from the JSON as columns
• parse_url_tuple(url, part, part, …)
• Returns URL host, path, query:key
• posexplode(ARRAY)
• explode + index
• stack(n, v1, v2, …, vk)
• n rows, each with k/n columns
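What LATERAL VIEW with explode does to a table can be emulated on in-memory data. The sketch below assumes a table of (id, items) rows held in a map and renders output rows as comma-separated strings; `LateralViewSketch` and its method are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LateralViewSketch {

    // Emulates: SELECT id, exCol FROM t LATERAL VIEW [OUTER] explode(items) exTable AS exCol
    static List<String> lateralViewExplode(Map<String, List<String>> table, boolean outer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : table.entrySet()) {
            List<String> items = row.getValue();
            if (items.isEmpty()) {
                if (outer) {
                    out.add(row.getKey() + ",NULL"); // OUTER keeps the row, exCol is NULL
                }
                continue; // without OUTER the source row disappears
            }
            for (String item : items) {
                out.add(row.getKey() + "," + item); // one output row per array element
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("r1", Arrays.asList("a", "b"));
        table.put("r2", Collections.<String>emptyList());
        System.out.println(lateralViewExplode(table, false)); // prints [r1,a, r1,b]
        System.out.println(lateralViewExplode(table, true));  // prints [r1,a, r1,b, r2,NULL]
    }
}
```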
Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates
• PARTITION BY
• SELECT SUM(a) OVER (PARTITION BY b)
• ORDER BY
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c)
• window specification
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
• WINDOW clause
• SELECT SUM(b) OVER w
FROM t
WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
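The semantics of a ROWS window frame can be spelled out as a small loop. The sketch below computes SUM(a) over a frame of `preceding` rows before and `following` rows after the current row, for a single partition whose rows are assumed to be already ordered; the naive nested loop is chosen for clarity, not efficiency, and the names are hypothetical.

```java
import java.util.Arrays;

final class WindowSumSketch {

    // SUM(a) OVER (ORDER BY c ROWS BETWEEN preceding PRECEDING AND following FOLLOWING)
    // for one partition; 'a' holds the values in ORDER BY order.
    static long[] slidingSum(long[] a, int preceding, int following) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            int from = Math.max(0, i - preceding);            // frame is clipped at partition start
            int to = Math.min(a.length - 1, i + following);   // and at partition end
            long sum = 0;
            for (int j = from; j <= to; j++) {
                sum += a[j];
            }
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [3, 6, 9, 12, 9]
        System.out.println(Arrays.toString(slidingSum(new long[] {1, 2, 3, 4, 5}, 1, 1)));
    }
}
```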
GROUPING SETS, CUBE, ROLLUP
• GROUPING SETS
• Logical equivalent of running the same query with different GROUP BY clauses and then UNIONing the results
• SELECT SUM(a) … GROUP BY a,b GROUPING SETS (a, (a,b))
is equivalent to:
SELECT SUM(a) … GROUP BY a
UNION
SELECT SUM(a) … GROUP BY a,b;
• GROUP BY … WITH CUBE
• Equivalent of adding all possible GROUPING SETS
• GROUP BY a,b,c WITH CUBE
is equivalent to:
GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ())
• GROUP BY … WITH ROLLUP
• Equivalent of adding all the GROUPING SETS that lead with the GROUP BY columns
• GROUP BY a,b,c WITH ROLLUP
is equivalent to:
GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a), ())
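The expansion of WITH CUBE and WITH ROLLUP into grouping sets is mechanical and can be generated. The sketch below enumerates the sets for a list of GROUP BY columns; the empty list stands for the grand-total grouping set `()`. `GroupingSetsSketch` is a hypothetical helper, not part of Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GroupingSetsSketch {

    // WITH ROLLUP: every prefix of the GROUP BY columns, from the full list down to ().
    static List<List<String>> rollup(List<String> cols) {
        List<List<String>> sets = new ArrayList<>();
        for (int n = cols.size(); n >= 0; n--) {
            sets.add(new ArrayList<>(cols.subList(0, n)));
        }
        return sets;
    }

    // WITH CUBE: every subset of the GROUP BY columns, 2^n grouping sets in total.
    static List<List<String>> cube(List<String> cols) {
        List<List<String>> sets = new ArrayList<>();
        int n = cols.size();
        for (int mask = (1 << n) - 1; mask >= 0; mask--) {
            List<String> set = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << (n - 1 - i))) != 0) { // highest bit maps to the first column
                    set.add(cols.get(i));
                }
            }
            sets.add(set);
        }
        return sets;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("a", "b", "c");
        System.out.println(rollup(cols)); // prints [[a, b, c], [a, b], [a], []]
        System.out.println(cube(cols).size()); // prints 8
    }
}
```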
XPath functions
• xpath_...(xml_string, xpath_expression_string)
• xpath_long returns a long
• xpath_short returns a short
• xpath_string returns a string
• …
• xpath(xml, xpath) returns an array of strings
• SELECT xpath(col, '//configuration/property[name="foo"]/value')
User Defined Functions
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF: returns the lower-cased input, or NULL for NULL input.
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
CREATE FUNCTION myLower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///path/to/jar';
• Aggregate functions also possible, but more complicated
• Must track map side vs. reduce side and 'merge' the intermediate results
• https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
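Why aggregates must carry mergeable intermediate state is easiest to see with an average. The sketch below is plain Java and does not implement any Hive interface; the method names `iterate`, `merge` and `terminate` are borrowed only to suggest the phases, and `AvgAggregationSketch` is a hypothetical class. The partial state keeps a sum and a count, because averaging the per-task averages would give a wrong result.

```java
final class AvgAggregationSketch {

    // Intermediate state shipped from the map side to the reduce side.
    static final class Partial {
        double sum;
        long count;
    }

    // Map side: fold one input value into the running state.
    static void iterate(Partial state, double value) {
        state.sum += value;
        state.count++;
    }

    // Reduce side: combine a partial result produced by another task.
    static void merge(Partial state, Partial other) {
        state.sum += other.sum;
        state.count += other.count;
    }

    // Final result; null when no rows were seen, mirroring SQL AVG over an empty input.
    static Double terminate(Partial state) {
        return state.count == 0 ? null : state.sum / state.count;
    }

    public static void main(String[] args) {
        Partial task1 = new Partial();
        iterate(task1, 1.0);
        iterate(task1, 2.0);
        Partial task2 = new Partial();
        iterate(task2, 3.0);

        Partial reducer = new Partial();
        merge(reducer, task1);
        merge(reducer, task2);
        System.out.println(terminate(reducer)); // prints 2.0
    }
}
```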
TRANSFORM
• Plug custom scripts into query execution
• SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)
• FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.key, map_output.value
USING 'reduce_script'
AS date, count;
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
Hive Indexes
• Indexes aimed at reducing data for range scans
• Fewer splits, fewer map tasks, less IO
• Relies on predicate pushdown
• Order guarantee can simplify certain algorithms
• GROUP BY aggregations can use streaming aggregates vs. hash aggregates
• Hive does not need/use indexes for ‘seek’ like OLTP RDBMSs
• Indexes are in almost every respect just another table with same data
• Query Optimizer uses rewrite rules to leverage indexes
• Indexes are not automatically maintained on LOAD/INSERT
• https://cwiki.apache.org/confluence/display/Hive/IndexDev
JOIN optimizations
• Difficult problem in MR
• Naïve join relies on MR shuffle to partition the data
• Reducers can implement the JOIN simply by merging their input, since it is already sorted
• Costs a size-of-data copy through the MR shuffle
• MapJoin
• If there is one big table (facts) and several small tables (dimensions)
• Read all the small tables, hash them
• serialize the hash into HDFS distributed cache
• Done by driver as stage-0, before launching the actual query
• The MapJoinOperator loads the small tables in memory
• JOIN can be performed on-the-fly, on the map side, avoiding big shuffle
• Requires live RAM, task JVM memory settings must allow for enough memory
• Sort Merge Bucket (SMB) join
• Between big tables that are bucketed by the same key
• And the bucketing key is also the join key
• Map task scans buckets from multiple tables in parallel
• MR only knows about one of them
• For the rest the SMBJoinOperator simulates a MR environment to scan them
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
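The MapJoin idea, hashing the small table once and probing it while scanning the big table, can be sketched as a classic in-memory hash join. This is an illustration in plain Java with made-up names and toy data (a dimension of id and name, fact rows of dimension key and amount); it leaves out the distributed cache and the serialization step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapJoinSketch {

    // Build phase: hash the small (dimension) table on its join key.
    static Map<Integer, String> buildHashTable(int[] keys, String[] names) {
        Map<Integer, String> table = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            table.put(keys[i], names[i]);
        }
        return table;
    }

    // Probe phase: would run inside each map task over its split of the big table.
    // Each fact row is {dimensionKey, amount}; no shuffle is needed.
    static List<String> probe(int[][] factSplit, Map<Integer, String> dimension) {
        List<String> joined = new ArrayList<>();
        for (int[] fact : factSplit) {
            String name = dimension.get(fact[0]);
            if (name != null) {
                joined.add(name + "," + fact[1]); // inner join: unmatched fact rows are dropped
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> dim = buildHashTable(new int[] {1, 2}, new String[] {"books", "music"});
        int[][] facts = {{1, 10}, {3, 5}, {2, 7}, {1, 4}};
        System.out.println(probe(facts, dim)); // prints [books,10, music,7, books,4]
    }
}
```

The hash table must fit in the memory of every map task, which is the RAM requirement mentioned above.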
Partitioning in Hive
• CREATE TABLE …. PARTITIONED BY (…)
• Separate data directory created for each distinct combination of
partitioning column values
• Can result in many small files if abused
• Use org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
• Use also Bucketing
• CREATE TABLE …
PARTITIONED BY (…)
CLUSTERED BY (…) SORTED BY (…) INTO … BUCKETS
• Bucketing helps many queries
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
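How partition values turn into directories, and how a row is assigned to a bucket, can be sketched as two small functions. The table directory `/warehouse/sales` and the column names are hypothetical, and the bucket function (hash of the key modulo the bucket count) is a simplifying assumption for integer keys rather than a description of Hive's exact hashing for every type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PartitionLayoutSketch {

    // Each partition is a directory named column=value, nested in PARTITIONED BY order.
    static String partitionPath(String tableDir, Map<String, String> partitionValues) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : partitionValues.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    // Bucket number for a CLUSTERED BY key (illustrative hash: the int value itself).
    static int bucketFor(int key, int numBuckets) {
        return Math.floorMod(Integer.hashCode(key), numBuckets);
    }

    public static void main(String[] args) {
        Map<String, String> partition = new LinkedHashMap<>(); // keeps column order
        partition.put("ds", "2014-07-03");
        partition.put("hr", "12");
        // prints /warehouse/sales/ds=2014-07-03/hr=12
        System.out.println(partitionPath("/warehouse/sales", partition));
        System.out.println(bucketFor(35, 32)); // prints 3
    }
}
```

Every distinct combination of partition values adds a directory, which is why high-cardinality partition columns multiply the number of files, while the bucket count stays fixed per partition.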
How to get started with Hive
• HDInsight 3.1 comes with Hive 0.13
• Hortonworks Sandbox (VM) has Hive 0.13
• Cloudera CDH 5 VM comes with Hive 0.12
• Build it yourself 
• https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation
• Mailing list: user@hive.apache.org
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Kürzlich hochgeladen (20)

20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 

Hive @ Bucharest Java User Group

  • 1. HIVE Bucharest Java User Group July 3, 2014
  • 2. whoami • Developer with SQL Server team since 2001 • Apache contributor • Hive • Hadoop core (security) • stackoverflow user 105929s • @rusanu
  • 3. What is HIVE • Data warehouse for querying and managing large datasets • A query engine that uses Hadoop MapReduce for execution • A SQL abstraction for creating MapReduce algorithms • SQL interface to HDFS data • Developed at Facebook (VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework) • ASF top-level project since September 2010
  • 4. What is Hadoop Hadoop Core • Distributed execution engine • MapReduce • YARN • TEZ • Distributed File System HDFS • Tools for administering the execution engine and HDFS • Libraries for writing MapReduce jobs Hadoop Ecosystem • HBase (BigTable) • Pig (scripting query language) • Hive (SQL) • Storm (Stream Processing) • Flume (Data Aggregator) • Sqoop (RDBMS bulk data transfer) • Oozie (workflow scheduling) • Mahout (machine learning) • Falcon (data lifecycle) • Spark, Cassandra etc (not based on Hadoop) hadoopecosystemtable.github.io
  • 5. How does Hadoop work • JOB: binary code (Java JAR), configuration XML, any additional file(s) • The job gets uploaded into the cluster file system (usually HDFS) • SPLIT: a fragment of data (file) to be processed • The input data is broken into several splits • TASK: execution of the job JAR to process a split • Scheduler attempts to execute the task near the data split • MAP: takes unsorted, unclustered data and outputs clustered data • SHUFFLE: takes clustered data and produces sorted data • REDUCE: takes sorted data and produces desired output • Synergies • Processing locality: execute the code near the data storage, avoid data transfer • Algorithm scalability: • Map phase can scale out because it assumes no sorting and no clustering • Reduce phase: algorithms are easy to write when data is guaranteed sorted and clustered • Execution reliability (monitoring, retry, preemptive execution etc)
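The three phases on this slide can be illustrated with a small in-memory word count. This is only a sketch of the data flow, not the Hadoop API: the class `MapReduceSketch` and its methods are hypothetical names, each input string stands in for a split, and everything runs in one JVM instead of on a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of MAP -> SHUFFLE -> REDUCE (hypothetical names, not the Hadoop API).
final class MapReduceSketch {

    // MAP: processes one split independently and emits (word, 1) pairs in input order.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // SHUFFLE: groups all map output by key; the TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // REDUCE: sees each key once, with all of its values, and produces the final count.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
                result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    // Runs the three phases over a list of splits.
    static TreeMap<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }
}
```

Because `map` looks at one split at a time and assumes nothing about ordering, it could run on any number of machines; `reduce` stays simple because the shuffle has already grouped and sorted the keys.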
  • 7. How does Hive work • SQL submitted via CLI or Hiveserver(2) • Metadata describing tables stored in RDBMS • Driver compiles/optimizes execution plan • Plan and execution engine submitted to Hadoop as job • MR invokes Hive execution engine which executes plan [Diagram labels: Hive side: Shell, CLI, Hiveserver2, ODBC, JDBC, Beeswax, Driver (compiles, optimizes), Metastore, RDBMS, HCatalog; Hadoop side: Job Tracker, MapReduce, Task, Split, HDFS]
  • 8. Hive Query execution • Compilation/Optimization results in an AST containing operators, e.g.: • FetchOperator: scans source data (the input split) • SelectOperator: projects column values, computes expressions • GroupByOperator: aggregate functions (SUM, COUNT etc) • JoinOperator: joins • The plan forms a DAG of MR jobs • The plan tree is serialized (Kryo) • Hive Driver dispatches jobs • Multiple stages can result in multiple jobs • Task execution picks up the plan and starts iterating the plan • MR emits values (rows) into the topmost operator (Fetch) • Rows propagate down the tree • ReduceSinkOperator emits map output for shuffle • Each operator implements both a map side and a reduce side algorithm • Executes the one appropriate for the current task • MR does the shuffle, many operators rely on it as part of their algorithm • E.g. SortOperator, GroupByOperator • Multi-stage queries create intermediate output and the driver submits a new job to continue with the next stage • TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later) • Vectorized execution mode emits batches of rows (1024 rows)
  • 9. Interacting with Hive • hive from shell prompt launches CLI • Run SQL command interactively • Can execute a batch of commands from a file • Results displayed in console • hiveserver2 is a daemon • JDBC and ODBC drivers for applications to connect to it • Queries submitted via JDBC/ODBC • Query results as JDBC/ODBC resultsets • Other applications embed Hive driver eg. beeswax
  • 10. Hive QL • The dialect of SQL supported by Hive • More similar to MySQL dialect than ANSI-SQL • Drive toward ANSI-92 compliance (syntax, data types) • Query language: SELECT • DDL: CREATE/ALTER/DROP DATABASE/TABLE/PARTITION • DML: Only bulk insert operations • LOAD • INSERT • HIVE-5317 Implement insert, update, and delete in Hive with full ACID support
  • 11. Supported data types • Numeric • tinyint, smallint, int, bigint • float, double • decimal(precision, scale) • Date/Time • timestamp • date • Character types • string • char(size) • varchar(size) • Misc. types • boolean • binary • Complex types • ARRAY<type> • MAP<type, type> • STRUCT<name:type, name:type> • UNIONTYPE<type, type, type>
  • 12. Storage Formats • Text • ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'; • Gzip or Bzip2 is automatically detected • SEQUENCEFILE (default map-reduce output) • ORC Files • Columnar, Compressed • Certain features only enabled on ORC • Parquet • Columnar, Compressed • Arbitrary SerDe (Serializer Deserializer)
  • 13. DDL/Databases/Tables • CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES (property_name=property_value, ...)]; • CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] [AS select_statement] • EXTERNAL tables are not owned by Hive (DROP TABLE leaves the files in place) • Partitioning, Bucketing, Skew control allow precise control of file size (important for processing to achieve balanced MR splits) • ALTER TABLE … EXCHANGE PARTITION allows for fast (metadata only) move of data. • ALTER TABLE … ADD PARTITION adds to Hive metadata a partition already existing on disk • MSCK REPAIR TABLE … scans on-disk files to discover partitions and synchronizes Hive metadata • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
  • 14. Data Load • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] • File format must match table format (no transformations) • INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement; INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement; • OVERWRITE replaces the data in the table (TRUNCATE + INSERT) • INTO appends the data (leaves existing data intact) • Dynamic Partitioning • Creates new partitions based on data • https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert • INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] SELECT ... FROM ... • Writes a file without creating Hive table
  • 15. Hive SELECT syntax [WITH CommonTableExpression (, CommonTableExpression)*] SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [HAVING having_condition] [LIMIT number]
  • 16. SELECT features • REGEX column specifications • SELECT `(ds|hr)?+.+` FROM sales • Virtual columns • INPUT_FILE_NAME • BLOCK_OFFSET_INSIDE_FILE • Sampling • SELECT … FROM source TABLESAMPLE (BUCKET 3 OUT OF 32 ON rand()); • SELECT … FROM source TABLESAMPLE (1 PERCENT); • SELECT … FROM source TABLESAMPLE (10M); • SELECT … FROM source TABLESAMPLE (100 ROWS);
  • 17. Clustering and Distribution • ORDER BY • In strict mode must be followed by LIMIT as a last single reducer is required to sort all output • SORT BY • Only guarantees order of rows up to the last reducer • If multiple last reducers then only partially ordered result • DISTRIBUTE BY • Specifies how to distribute the rows to reducers, but does not require order • CLUSTER BY • Syntactic sugar for SORT BY and DISTRIBUTE BY
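The difference between a total order and a per-reducer order can be seen in a small simulation. The sketch below is an illustration under assumptions, not Hive code: `DistributeSortSketch` is a hypothetical class, rows are plain integers, and the routing rule (hash of the key modulo the reducer count) merely stands in for whatever partitioning the engine applies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulates DISTRIBUTE BY + SORT BY over integer keys (hypothetical names, not Hive code).
final class DistributeSortSketch {

    // DISTRIBUTE BY: route each key to a reducer by hash. SORT BY: sort inside each reducer only.
    static List<List<Integer>> distributeAndSort(List<Integer> keys, int reducers) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Integer key : keys) {
            buckets.get(Math.floorMod(key.hashCode(), reducers)).add(key);
        }
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);
        }
        return buckets;
    }

    // Concatenates the reducer outputs in reducer order, as a client reading all output files would.
    static List<Integer> concatenate(List<List<Integer>> reducerOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> output : reducerOutputs) {
            all.addAll(output);
        }
        return all;
    }
}
```

With two reducers the keys 5, 2, 8, 1, 4, 7 come back as two sorted runs rather than one sorted list; with a single reducer the result is totally ordered, which is the situation ORDER BY forces.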
  • 18. Subqueries • In FROM clause • SELECT … FROM (SELECT ….FROM …) AS alias … • In WHERE clause • SELECT … FROM …. WHERE EXISTS (SELECT … ) • SELECT … FROM …. WHERE col IN (SELECT …) • Must appear on the right-hand side in expressions • IN/NOT IN must project exactly one column • EXISTS/NOT EXISTS must contain correlated predicates • Otherwise they’re JOINs • Reference to parent query is only supported in WHERE clause subqueries • References of course required for correlated sub-queries
  • 19. Common Table Expressions (CTE) • Supported for SELECT and INSERT • Do not support recursive syntax • with q1 as ( select key, value from src where key = '5') from q1 insert overwrite table s1 select *;
  • 20. Lateral Views • Aka CROSS APPLY • Apply a table function to every row • SELECT … FROM table LATERAL VIEW explode(column) exTable AS exCol; • OUTER clause to include rows for which the function generates nothing • Similar to ANSI-SQL OUTER APPLY • Built-in table functions (UDTF): • explode(ARRAY) • explode(MAP) • inline(STRUCT) • json_tuple(json, k1, k2,…) • Returns k1, k2 from json as rows • parse_url(url, part, part, …) • Returns URL host, path, query:key • posexplode(ARRAY) • explode + index • stack(n, v1, v2, …, vk) • n rows, each with k/n columns
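What LATERAL VIEW explode does to each row, and what the OUTER clause adds, can be mimicked in a few lines. This is a sketch with hypothetical names (`LateralViewSketch`, `lateralViewExplode`); each input row is modeled as an id paired with an array column, and output rows are rendered as "id,item" strings purely for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Mimics LATERAL VIEW [OUTER] explode(arrayColumn) on in-memory rows (hypothetical names).
final class LateralViewSketch {

    // Each input row is (id, array). One output row is produced per array element.
    // With outer = true, a row whose array is empty still yields one row with a null item.
    static List<String> lateralViewExplode(List<Map.Entry<String, List<String>>> rows, boolean outer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows) {
            if (row.getValue().isEmpty()) {
                if (outer) {
                    out.add(row.getKey() + ",null");
                }
                continue;
            }
            for (String item : row.getValue()) {
                out.add(row.getKey() + "," + item);
            }
        }
        return out;
    }
}
```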
  • 21. Windowing and analytical functions • LEAD, LAG, FIRST_VALUE, LAST_VALUE • RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE • OVER clause for aggregates • PARTITION BY • SELECT SUM(a) OVER (PARTITION BY b) • ORDER BY • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c) • window specification • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) • WINDOW clause • SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
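A ROWS frame is easiest to understand by computing one by hand. The sketch below evaluates SUM over a frame of `preceding` rows before and `following` rows after the current row, for a single partition whose values are already in ORDER BY order. `WindowFrameSketch` and `slidingSum` are hypothetical names, and the quadratic loop favors clarity over speed.

```java
// Evaluates SUM(x) OVER (ORDER BY ... ROWS BETWEEN p PRECEDING AND f FOLLOWING)
// for one partition that is already sorted (hypothetical names, illustration only).
final class WindowFrameSketch {

    static long[] slidingSum(long[] ordered, int preceding, int following) {
        long[] out = new long[ordered.length];
        for (int i = 0; i < ordered.length; i++) {
            // The frame is clipped at the partition boundaries.
            int from = Math.max(0, i - preceding);
            int to = Math.min(ordered.length - 1, i + following);
            long sum = 0;
            for (int j = from; j <= to; j++) {
                sum += ordered[j];
            }
            out[i] = sum;
        }
        return out;
    }
}
```

For the values 1, 2, 3, 4, 5 with one row preceding and one following, the frame sums are 3, 6, 9, 12, 9: the first and last rows see a shorter frame because the partition ends there.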
  • 22. GROUPING SETS, CUBE, ROLLUP • GROUPING SETS • Logical equivalent of having the same query run with different GROUP BY and then UNION the results • SELECT SUM(a) … GROUP BY a,b GROUPING SETS (a, (a,b)) is equivalent to SELECT SUM(a) … GROUP BY a UNION SELECT SUM(a) … GROUP BY a,b; • GROUP BY … WITH CUBE • Equivalent of adding all possible GROUPING SETS • GROUP BY a,b,c WITH CUBE is equivalent to GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ()) • GROUP BY … WITH ROLLUP • Equivalent of adding all the GROUPING SETS that lead with the GROUP BY columns • GROUP BY a,b,c WITH ROLLUP is equivalent to GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a), ())
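The expansion of CUBE and ROLLUP into explicit grouping sets is mechanical and can be generated. In this sketch (hypothetical class `GroupingSetsSketch`), `rollup` yields every leading prefix of the column list and `cube` yields every subset; both include the empty grand-total set `()`, which the Hive language manual lists for both forms.

```java
import java.util.ArrayList;
import java.util.List;

// Generates the GROUPING SETS that WITH ROLLUP / WITH CUBE stand for (hypothetical names).
final class GroupingSetsSketch {

    // ROLLUP: every leading prefix of the GROUP BY columns, longest first, ending with ().
    static List<List<String>> rollup(List<String> columns) {
        List<List<String>> sets = new ArrayList<>();
        for (int n = columns.size(); n >= 0; n--) {
            sets.add(new ArrayList<>(columns.subList(0, n)));
        }
        return sets;
    }

    // CUBE: every subset of the GROUP BY columns, enumerated with a bit mask.
    static List<List<String>> cube(List<String> columns) {
        List<List<String>> sets = new ArrayList<>();
        int total = 1 << columns.size();
        for (int mask = total - 1; mask >= 0; mask--) {
            List<String> set = new ArrayList<>();
            for (int i = 0; i < columns.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    set.add(columns.get(i));
                }
            }
            sets.add(set);
        }
        return sets;
    }
}
```

For n columns ROLLUP produces n + 1 grouping sets while CUBE produces 2^n, which is why CUBE over many columns gets expensive quickly.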
  • 23. XPath functions • xpath_...(xml_string, xpath_expression_string) • xpath_long returns a long • xpath_short returns a short • xpath_string returns a string • … • xpath(xml, xpath) returns an array of strings • SELECT xpath(col, '//configuration/property[name="foo"]/value')
  • 24. User Defined Functions package com.example.hive.udf; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.Text; public final class Lower extends UDF { public Text evaluate(final Text s) { if (s == null) { return null; } return new Text(s.toString().toLowerCase()); } } CREATE FUNCTION myLower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///path/to/jar'; • Aggregate functions also possible, but more complicated • Must track map side vs. reduce side and 'merge' the intermediate results • https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
  • 25. TRANSFORM • Plug custom scripts into query execution • SELECT TRANSFORM(stuff) USING 'script' AS (thing1 INT, thing2 INT) • FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' CLUSTER BY key) map_output INSERT OVERWRITE TABLE pv_users_reduced REDUCE map_output.key, map_output.value USING 'reduce_script' AS date, count; • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
  • 26. Hive Indexes • Indexes aimed at reducing data for range scans • Fewer splits, fewer map tasks, less IO • Relies on Predicate Push Down • Order guarantee can simplify certain algorithms • GROUP BY aggregations can use streaming aggregates vs. hash aggregates • Hive does not need/use indexes for 'seek' like OLTP RDBMSs • Indexes are in almost every respect just another table with same data • Query Optimizer uses rewrite rules to leverage indexes • Indexes are not automatically maintained on LOAD/INSERT • https://cwiki.apache.org/confluence/display/Hive/IndexDev
  • 27. JOIN optimizations • Difficult problem in MR • Naïve join relies on MR shuffle to partition the data • Reducers can implement JOIN easily, simply by merging the input, as it is sorted • This is a size-of-data copy through the MR shuffle • MapJoin • If there is one big table (facts) and several small tables (dimensions) • Read all the small tables, hash them • Serialize the hash into HDFS distributed cache • Done by driver as stage-0, before launching the actual query • The MapJoinOperator loads the small tables in memory • JOIN can be performed on-the-fly, on the map side, avoiding big shuffle • Requires live RAM, task JVM memory settings must allow for enough memory • Sort Merge Bucket (SMB) join • Between big tables that are bucketed by the same key • And the bucketing key is also the join key • Map task scans buckets from multiple tables in parallel • MR only knows about one of them • For the rest the SMBJoinOperator simulates an MR environment to scan them • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
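The MapJoin idea (hash the small table, then stream the big one past it) is an ordinary hash join. The following is a minimal sketch, not Hive's MapJoinOperator: `MapJoinSketch` is a hypothetical class, rows are string arrays of the form {key, value}, the dimension table is assumed to have unique keys, and the join is an inner join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hash join in the style of a map-side join (hypothetical names, illustration only).
final class MapJoinSketch {

    // Build phase: load the small (dimension) table into an in-memory hash keyed by join key.
    // Assumes keys are unique in the small table.
    static Map<String, String> buildHash(List<String[]> smallTable) {
        Map<String, String> hash = new HashMap<>();
        for (String[] row : smallTable) {
            hash.put(row[0], row[1]);
        }
        return hash;
    }

    // Probe phase: stream the big (fact) table row by row and look each key up in the hash.
    // No shuffle or sort of the big table is needed; unmatched rows are dropped (inner join).
    static List<String> probe(List<String[]> bigTable, Map<String, String> hash) {
        List<String> joined = new ArrayList<>();
        for (String[] row : bigTable) {
            String match = hash.get(row[0]);
            if (match != null) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }
}
```

The memory caveat on the slide follows directly from the build phase: the whole small table has to fit in the task's heap, while the big table is only ever read one row at a time.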
  • 28. Partitioning in Hive • CREATE TABLE …. PARTITIONED BY (…) • Separate data directory created for each distinct combination of partitioning column values • Can result in many small files if abused • Use org.apache.hadoop.hive.ql.io.CombineHiveInputFormat • Use also Bucketing • CREATE TABLE … PARTITIONED BY (…) CLUSTERED BY (…) SORTED BY (…) INTO … BUCKETS • Bucketing helps many queries • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
  • 29. How to get started with Hive • HDInsight 3.1 comes with Hive 0.13 • Hortonworks Sandbox (VM) has Hive 0.13 • Cloudera CDH 5 VM comes with Hive 0.12 • Build it yourself • https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation • Mailing list: user@hive.apache.org