2. Motivation: Data Sharing is Hard
This is analyst Joe, he uses Hive to build reports and answer ad-hoc queries.
This is programmer Bob, he uses Pig to crunch data.
Joe, I need
today’s data
Ok
Hmm, is it done yet? Where is it? What format did
you use to store it today? Is it compressed? And
can you help me load it into Hive? I can never
remember all the parameters I have to pass to that
alter table command.
Dude, we need HCatalog
3. More Motivation: Each tool requires its own Translator
[Diagram: today each tool needs its own per-format adapter — a Loader for Pig, a SerDe for Hive, and an InputFormat for MapReduce — for every storage format (RCFile, columnar, custom). With HCatalog, Pig uses HCatLoader, Hive uses HCatSerDe, and MapReduce uses HCatInputFormat, all backed by a single StorageDriver per format (an RCFile StorageDriver or a custom StorageDriver).]
4. End User Example
raw = load '/rawevents/20100819/data' using MyLoader()
as (ts:long, user:chararray, url:chararray);
botless = filter raw by NotABot(user);
…
store output into '/processedevents/20100819/data';
Processedevents consumers must be manually informed by the producer that data is
available, or must poll HDFS (which is bad for the NameNode)
raw = load 'rawevents' using HCatLoader();
botless = filter raw by date == '20100819' and NotABot(user);
…
store output into 'processedevents'
using HCatStorage('date=20100819');
Processedevents consumers will be notified by HCatalog when data is available and
can then start their jobs
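Hive consumers can read the same table directly. A minimal sketch, assuming the processedevents table keeps the user and url columns from the raw Pig schema above:

```sql
-- The partition filter prunes the scan to the 20100819 partition only;
-- no knowledge of file paths, formats, or compression is needed.
SELECT user, url
FROM processedevents
WHERE date = '20100819';
```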
5. Command Line for DDL
• Uses Hive SQL
• Create, drop, alter table
• CREATE TABLE employee (
emp_id INT,
emp_name STRING,
emp_start_date STRING,
emp_gender STRING)
PARTITIONED BY (
emp_country STRING,
emp_state STRING)
STORED AS RCFILE
TBLPROPERTIES(
'hcat.isd'='RCFileInputDriver',
'hcat.osd'='RCFileOutputDriver');
6. Manages Data Format and Schema Changes
• Allows columns to be appended to tables in new partitions
− no need to change existing data
− fields not present in old data will be read as null
− must do 'alter table add column' first
• Allows storage format changes
− no need to change existing data, HCatalog will handle reading each
partition in the appropriate format
− all new partitions will be written in current format
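The column-append workflow above might look like this in Hive SQL, using the employee table from the DDL example; the emp_title column is illustrative:

```sql
-- Add a new column to the table's schema. Existing partitions are
-- untouched; reads of old data return NULL for the new field.
ALTER TABLE employee ADD COLUMNS (emp_title STRING);
```

New partitions written after this statement will carry the extra column.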
7. Security
• Uses underlying storage permissions to determine
authorization
− Currently only works with HDFS based storage
− If user can read from the HDFS directory, then he can read the table
− If user can write to the HDFS directory, then he can write to the table
− If the user can write to the database level directory, he can create and
drop tables
− Allows users to define which group to create table as so table access
can be controlled by Unix group
• Authentication done via Kerberos
10. Project Status
• HCatalog was accepted to the Apache Incubator last March
• 0.2 released in October, includes:
− Read/write from Pig
− Read/write from MapReduce
− Read/write from Hive
− StorageDrivers for RCFile
− Notification via JMS when data is available
− Store to multiple partitions simultaneously
− Import/Export tools
11. HCatalog 0.3
• Plan to release mid-December
• Adds a Binary type (to Hive and HCatalog)
• Storage drivers for JSON and text
• Improved integration with Hive for custom storage formats
• Web services interface
12. Future Plans
• Support for HBase and other data sources for storage
• RCFile compression improvements
• High Availability for Thrift server
• Data management interfaces for archivers, cleaners, etc.
• Additional metadata storage:
− statistics
− lineage/provenance
− user tags
Current situation: different data type models and notions of schema. If you're using all three tools, you must write or obtain an InputFormat/OutputFormat, a Load/Store function, and a SerDe for any new format. For Pig and MapReduce, you must understand where the file is located, what its schema is, how it is compressed, and what storage format was used.
Vision: a shared data type model and schema. Write or obtain one storage driver and it works with all tools. No need to know where the data is located, what its schema is, how it is compressed, or what format was used.
The MapReduce version would look the same. The input changes from a file to a table; partitioning of the data moves from the load statement to the filter clause; the schema is now provided to Pig. If the data creator changes the file format tomorrow, or the admin moves the files from one path to another, the first script has to be rewritten and retested, while there are no changes in the second.