An Introduction to Hive:
Components and Query Language


Jeff Hammerbacher
Chief Scientist and VP of Product
October 30, 2008
Hive Components
A Leaky Database
▪ Hadoop
 ▪ HDFS
 ▪ MapReduce (bundles Resource Manager and Job Scheduler)
▪ Hive
 ▪ Logical data partitioning
 ▪ Metastore (command line and web interfaces)
 ▪ Query Language
 ▪ Libraries to handle different serialization formats (SerDes)
 ▪ JDBC interface
Related Work
Glaringly Incomplete
▪ Gamma, Bubba, Volcano, etc.
▪ Google: Sawzall
▪ Yahoo: Pig
▪ IBM Research: JAQL
▪ Microsoft: SCOPE
▪ Greenplum: YAML MapReduce
▪ Aster Data: In-Database MapReduce
▪ Business.com: CloudBase
Hive Resources
▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
 ▪ Currently the best place to get the Hive distribution

▪ Wiki page: http://wiki.apache.org/hadoop/Hive
 ▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
 ▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
 ▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
 ▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap

▪ Mailing list: hive-users@publists.facebook.com

▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455
Running Hive
Quickstart
▪ <install Hadoop>
▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
 ▪ (Replace 0.19 with 0.17 if you’re still on 0.17)
▪ tar xvzf dist.tar.gz
▪ cd dist
▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
 ▪ Or: edit hadoop.bin.path and hadoop.config.dir in conf/hive-default.xml
▪ bin/hive

▪ hive>
Running Hive
Configuration Details
▪ conf/hive-default.xml
 ▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
 ▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
 ▪ hive.exec.scratchdir: HDFS directory where execution information is written
 ▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
 ▪ The rest of the properties relate to the Metastore
▪ conf/hive-log4j.properties
 ▪ Will put data into /tmp/{user.name}/hive.log by default
▪ conf/jpox.properties
 ▪ JPOX is a Java object persistence library used by the Metastore
Populating Hive
MovieLens Data
▪   <cd into your hive directory>
▪   wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
▪   tar xvzf ml-data.tar__0.gz
▪   CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    ▪   The first query can take ten seconds or more, as the Metastore needs to be created
▪   To confirm our table has been created:
    ▪   SHOW TABLES;
    ▪   DESCRIBE u_data;
▪   LOAD DATA LOCAL INPATH 'ml-data/u.data'
    OVERWRITE INTO TABLE u_data;
▪   SELECT COUNT(1) FROM u_data;
    ▪   Should fire off 2 MapReduce jobs and ultimately return a count of 100,000
Hive Query Language
Utility Statements
▪   SHOW TABLES [table_name | table_name_pattern]

▪   DESCRIBE [EXTENDED] table_name
    [PARTITION (partition_col = partition_col_value, ...)]

▪   EXPLAIN [EXTENDED] query_statement

▪   SET [EXTENDED]

    ▪   “SET property_name=property_value” to modify a value
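A hypothetical session exercising these utility statements (the `u_data` table comes from the earlier MovieLens example; the property name in the SET line is a standard Hadoop property, shown purely for illustration):

```sql
SHOW TABLES;                          -- list all tables
DESCRIBE u_data;                      -- column names and types
DESCRIBE EXTENDED u_data;             -- also show SerDe and storage details
EXPLAIN SELECT COUNT(1) FROM u_data;  -- print the plan without running the query
SET mapred.reduce.tasks=1;            -- modify a property for this session
```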
Hive Query Language
CREATE TABLE Syntax
▪   CREATE [EXTERNAL] TABLE table_name (col_name data_type [col_comment], ...)
    [PARTITIONED BY (col_name data_type [col_comment], ...)]
    [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
    [ROW FORMAT row_format]
    [STORED AS file_format]
    [LOCATION hdfs_path]

▪   PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
▪   CLUSTERED columns are real columns, hash partitioned into num_buckets folders
▪   ROW FORMAT can be used to specify a delimited data set or a custom deserializer
▪   Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
▪   “DROP TABLE table_name” can reverse this operation
    ▪   NB: Currently, DROP TABLE will delete both data and metadata
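A sketch combining the clauses above. The table, columns, and partition key (`page_views`, `ds`) are hypothetical, not part of the MovieLens example:

```sql
-- Tab-delimited text table, partitioned by date (a virtual column)
-- and hash-bucketed on userid into 32 folders.
CREATE TABLE page_views (userid INT, url STRING, referrer STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```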
Hive Query Language
CREATE TABLE Syntax, Part Two
▪   data_type: primitive_type | array_type | map_type
▪   primitive_type:
    ▪   TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
    ▪   DATE | DATETIME | TIMESTAMP
▪   array_type: ARRAY < primitive_type >
▪   map_type: MAP < primitive_type, primitive_type >
▪   row_format:
    ▪   DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
    ▪   SERIALIZER serde_name [WITH PROPERTIES property_name=property_value,
        property_name=property_value, ...]
▪   file_format: SEQUENCEFILE | TEXTFILE
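A sketch of a table using the complex types and the full delimited row format; the table and its columns are hypothetical:

```sql
-- ARRAY and MAP columns need their own delimiters: commas separate
-- collection items, colons separate map keys from values.
CREATE TABLE user_profile (
  userid INT,
  interests ARRAY<STRING>,
  properties MAP<STRING, STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS SEQUENCEFILE;
```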
Hive Query Language
ALTER TABLE Syntax
▪   ALTER TABLE table_name RENAME TO new_table_name;
▪   ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
▪   ALTER TABLE table_name DROP partition_spec, partition_spec, ...;


▪   Future work:
     ▪   Support for removing or renaming columns
     ▪   Support for altering serialization format
Hive Query Language
LOAD DATA Syntax
▪   LOAD DATA [LOCAL] INPATH '/path/to/file'
    [OVERWRITE] INTO TABLE table_name
    [PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)]

▪   You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)

▪   If you don’t specify OVERWRITE, data will be appended to the existing table
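A sketch of loading into a single partition; the file path, table, and `ds` partition column are hypothetical:

```sql
-- Copy a local file into one partition, replacing that partition's contents.
LOAD DATA LOCAL INPATH '/tmp/page_views_2008-10-30.txt'
OVERWRITE INTO TABLE page_views
PARTITION (ds = '2008-10-30');
```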
Hive Query Language
SELECT Syntax
▪   [insert_clause]
    SELECT [ALL|DISTINCT] select_list
    FROM [table_source|join_source]
    [WHERE where_condition]
    [GROUP BY col_list]
    [ORDER BY col_list]
    [CLUSTER BY col_list]

▪   insert_clause: INSERT OVERWRITE destination

▪   destination:

    ▪   LOCAL DIRECTORY '/local/path'
    ▪   DIRECTORY '/hdfs/path'
    ▪   TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
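A sketch of a SELECT with an insert_clause targeting an HDFS directory (the output path is hypothetical; `u_data` is the MovieLens table from earlier):

```sql
-- Write the distinct reviewers to files under an HDFS directory.
INSERT OVERWRITE DIRECTORY '/user/hive/output/reviewers'
SELECT DISTINCT a.userid
FROM u_data a;
```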
Hive Query Language
SELECT Syntax
▪   join_source: table_source join_clause table_source join_clause table_source ...

▪   join_clause

    ▪   [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)


▪   Currently, only outer equi-joins are supported in Hive.

▪   There are two join algorithms

    ▪   Map-side merge join
    ▪   Reduce-side merge join
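A sketch of the join syntax above, joining the MovieLens ratings against a hypothetical `users` table (its `age` column is invented for illustration):

```sql
-- Left outer equi-join: keep every rating, attach user info where present.
SELECT a.userid, b.age, a.rating
FROM u_data a
LEFT OUTER JOIN users b ON (a.userid = b.userid);
```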
Hive Query Language
Building a Histogram of Review Counts
▪   CREATE TABLE review_counts (userid INT, review_count INT);
▪   INSERT OVERWRITE TABLE review_counts
    SELECT a.userid, COUNT(1) AS review_count
    FROM u_data a
    GROUP BY a.userid;
▪   SELECT b.review_count, COUNT(1)
    FROM review_counts b
    GROUP BY b.review_count;
▪   Notes:
    ▪   No INSERT OVERWRITE for second query means output is dumped to the shell
    ▪   Hive does not currently support CREATE TABLE AS
        ▪   We have to create the table and then INSERT into it
    ▪   Hive does not currently support subqueries
        ▪   We have to write two queries
Hive Query Language
Running Custom MapReduce
▪   Put the following into weekday_mapper.py:
    ▪   import sys
        import datetime

        for line in sys.stdin:
            line = line.strip()
            userid, movieid, rating, unixtime = line.split('\t')
            weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
            print ','.join([userid, movieid, rating, str(weekday)])
▪   CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
▪   FROM u_data a
    INSERT OVERWRITE TABLE u_data_new
    SELECT
     TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime)
     AS (userid, movieid, rating, weekday)
     USING 'python /full/path/to/weekday_mapper.py'
Hive Query Language
Programmatic Access
▪ The Hive shell can take a file with queries to be executed
▪ bin/hive -f /path/to/query/file

▪ You can also run a Hive query straight from the command line
▪ bin/hive -e 'quoted query string'

▪ A simple JDBC interface is available for experimentation as well
▪ https://issues.apache.org/jira/browse/HADOOP-4101
Hive Components
Metastore
▪ Currently uses an embedded Derby database for persistence
▪ While Derby is in place, you’ll need to put it into Server Mode to
  have more than one concurrent Hive user
▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
▪ Next release will use MySQL as the default persistent data store
▪ The goal is to have the persistent store be pluggable
▪ You can view the Thrift IDL for the metastore online
▪   https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift
Hive Components
Query Processing
▪ Compiler
 ▪ Parser
 ▪ Type Checking
 ▪ Semantic Analysis
 ▪ Plan Generation
 ▪ Task Generation
▪ Execution Engine
 ▪ Plan
 ▪ Operators
 ▪ UDFs and UDAFs
Future Directions
▪   Query Optimization
    ▪   Support for Statistics
        ▪   These stats are needed to make optimization decisions
    ▪   Join Optimizations
        ▪   Map-side joins, semi-join techniques, etc., to perform joins faster
    ▪   Predicate Pushdown Optimizations
        ▪   Pushing predicates just above the table scan for certain situations in joins as well as ensuring that
            only required columns are sent across map/reduce boundaries
    ▪   Group By Optimizations
        ▪   Various optimizations to make group by faster
    ▪   Optimizations to reduce the number of map files created by filter operations
        ▪   Filters run with a large number of mappers produce many small files, which slows down
            subsequent operations.
Future Directions
▪   MapReduce Integration
    ▪   Schema-less MapReduce
        ▪   TRANSFORM needs a schema while MapReduce is schema-less.
    ▪   Improvements to TRANSFORM
        ▪   Make this more intuitive to MapReduce developers - evaluate some other keywords, etc.


▪   User Experience
    ▪   Create a web interface
    ▪   Error reporting improvements for parse errors
    ▪   Add “help” command to the CLI
    ▪   JDBC driver to enable traditional database tools to be used with Hive
Future Directions
▪   Integrating Dynamic SerDe with the DDL
    ▪   This allows users to create typed tables, along with list and map types, from the DDL


▪   Transformations in LOAD DATA
    ▪   LOAD DATA currently does not transform the input data if it is not in the format expected by the
        destination table.


▪   Explode and Collect Operators
    ▪   Explode and collect operators to convert collections to individual items and vice versa.


▪   Propagating sort properties to destination tables
    ▪   If a query produces sorted output, we want to capture that in the destination table's metadata
        so that downstream optimizations can be enabled.
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
