3. Apache HCatalog
• Incubator project at Apache.org
• Good adoption
• Will likely merge with the Hive project, as it adds important functionality to the metastore
• Allows a "schema-on-read" approach to Big Data in HDFS
• The seat of much innovation in data management
• Used by platform partners to enhance Hadoop integration
• Will likely be used to enhance existing enterprise data-management products and to create new ones
© Hortonworks Inc. 2012
4. Great Tooling Options
MapReduce
• Early adopters
• Non-relational algorithms
• Performance-sensitive applications
Pig
• ETL
• Data modeling
• Iterative algorithms
Hive
• Analysis
• Connectors to BI tools
Strength: the right tool for the right application
Weakness: hard to share data between the tools
5. HCatalog Changes the Game
Apache HCatalog provides flexible metadata
services across tools and external access
• Consistency of metadata and data models across the enterprise (MapReduce, Pig, HBase, Hive, external systems)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via a REST API
[Diagram: HCatalog provides shared table and schema management. Before: raw Hadoop data, inconsistent and unknown formats, tool-specific access. After: table access, aligned metadata, and a REST API open up the platform.]
6. Options == Complexity
Feature       | MapReduce       | Pig                                            | Hive
Record format | Key-value pairs | Tuple                                          | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Encoded in app  | Declared in script or read by loader           | Read from metadata
Data location | Encoded in app  | Declared in script                             | Read from metadata
Data format   | Encoded in app  | Declared in script                             | Read from metadata
• Pig and MapReduce users need to know a lot to write their apps
• When the data's schema, location, or format changes, Pig and MapReduce apps must be rewritten, retested, and redeployed
• Hive users must load data produced by Pig/MapReduce users before they can access it
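The difference between "schema encoded in app" and "schema read from metadata" can be sketched in plain Python (a toy illustration, not HCatalog code; the table name, field names, and `METADATA` dict are invented for the example):

```python
# Toy contrast: schema baked into the app vs. schema read from metadata.

def parse_hardcoded(line):
    # Schema is encoded in the app: column order, count, and delimiter are fixed here.
    url, user = line.split("\t")
    return {"url": url, "user": user}

# Metadata-driven: the schema lives outside the app, as in HCatalog's metastore.
METADATA = {"rawevents": {"columns": ["url", "user"], "delimiter": "\t"}}

def parse_from_metadata(table, line):
    schema = METADATA[table]
    return dict(zip(schema["columns"], line.split(schema["delimiter"])))

print(parse_from_metadata("rawevents", "example.com\talice"))
# If a column is added, only METADATA changes; parse_hardcoded must be edited,
# retested, and redeployed -- the fragility the bullets above describe.
```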
7. HCatalog == Simple, Consistent
Feature       | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
Record format | Record                                    | Tuple                                          | Record
Data model    | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Read from metadata                        | Read from metadata                             | Read from metadata
Data location | Read from metadata                        | Read from metadata                             | Read from metadata
Data format   | Read from metadata                        | Read from metadata                             | Read from metadata
• Pig/MapReduce users can read the schema from metadata
• Pig/MapReduce users are insulated from schema, location, and format changes
• All users have access to other users' data as soon as it is committed
10. Pig Example
Assume you want to count how many times each of your users went to each of your URLs.

Without HCatalog:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog:

raw = load 'rawevents' using HCatLoader();  -- no need to know the file location or declare a schema
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';  -- partition filter
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();
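For illustration only, the script's logic (drop bot traffic, group by (url, user), count) can be mirrored in plain Python; `is_bot` is an invented stand-in for the `myudfs.NotABot` UDF:

```python
from collections import Counter

# Invented stand-in for myudfs.NotABot: treat user names starting with "bot" as bots.
def is_bot(user):
    return user.startswith("bot")

def count_visits(events):
    """events: iterable of (url, user) pairs, mirroring the 'rawevents' table."""
    botless = [(url, user) for url, user in events if not is_bot(user)]
    return Counter(botless)  # same grouping + COUNT as the Pig script

events = [("a.com", "alice"), ("a.com", "alice"), ("a.com", "bot7"), ("b.com", "eve")]
print(count_visits(events))  # Counter({('a.com', 'alice'): 2, ('b.com', 'eve'): 1})
```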
11. Working with HCatalog in MapReduce
Setting input (specify the database and table to read from, and which partitions to read via the filter):

HCatInputFormat.setInput(job,
    InputJobInfo.create(dbname, tableName, filter));

Setting output (specify which partition to write):

HCatOutputFormat.setOutput(job,
    OutputJobInfo.create(dbname, tableName, partitionSpec));

Obtaining the schema:

schema = HCatInputFormat.getOutputSchema();

The key is unused; the value is an HCatRecord, whose fields are accessed by name:

String url = value.get("url", schema);
output.set("cnt", schema, cnt);
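The shape of that job can be sketched in plain Python (a toy map/reduce, not Hadoop; records are dicts keyed by column name to mirror HCatRecord's access-by-name):

```python
from collections import defaultdict

def mapper(record):
    # Emit ((url, user), 1) for each input record, fields accessed by name.
    yield (record["url"], record["user"]), 1

def reducer(key, counts):
    yield {"url": key[0], "user": key[1], "cnt": sum(counts)}

def run(records):
    groups = defaultdict(list)
    for rec in records:                 # map phase
        for key, val in mapper(rec):
            groups[key].append(val)     # shuffle: group values by key
    out = []
    for key, counts in groups.items():  # reduce phase
        out.extend(reducer(key, counts))
    return out

rows = [{"url": "a.com", "user": "alice"}, {"url": "a.com", "user": "alice"}]
print(run(rows))  # [{'url': 'a.com', 'user': 'alice', 'cnt': 2}]
```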
12. Managing Metadata
• If you are a Hive user, you can use your Hive metastore with no modifications
• If not, you can use the HCatalog command-line tool to issue Hive DDL (Data Definition Language) commands:

> /usr/bin/hcat -e "create table rawevents (url string,
user string) partitioned by (ds string);"

• Starting in Pig 0.11, you will be able to issue DDL commands from Pig
13. Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
[Diagram: REST clients calling Hadoop/HCatalog, e.g. get a list of all tables in the default database, create new table "rawevents", describe table "rawevents".]
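As a sketch of what those calls look like, the snippet below only constructs Templeton request URLs (no network access); the host, port, and `user.name` value are placeholders, and the paths follow Templeton's v1 DDL API:

```python
# Sketch of Templeton (WebHCat) REST calls, shown as URL construction only.
# "hcat-server", port 50111, and "hcatuser" are placeholder assumptions.
BASE = "http://hcat-server:50111/templeton/v1"

def list_tables_url(db="default", user="hcatuser"):
    # GET this URL to list all tables in the given database.
    return f"{BASE}/ddl/database/{db}/table?user.name={user}"

def table_url(table, db="default", user="hcatuser"):
    # GET describes the table; PUT (with a JSON body) creates it; DELETE drops it.
    return f"{BASE}/ddl/database/{db}/table/{table}?user.name={user}"

print(list_tables_url())
print(table_url("rawevents"))
```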
14. HCatalog is in Hortonworks Data Platform…

[Diagram: the Hortonworks Data Platform (HDP) stack. Hadoop Core: HDFS, MapReduce, WebHDFS, YARN (in 2.0). Data Services: HCatalog, Pig, Hive, HBase, Sqoop, Flume. Operational Services: Ambari, Oozie. Platform Services for enterprise readiness: high availability, disaster recovery, snapshots, security, etc. HDP runs on OS, cloud, VM, or appliance.]
15. Key 2013 "Enterprise Hadoop" Initiatives
Invest in:
• Hive / "Stinger": interactive query
• Ambari: manage & operate
• HBase: online data
• "Knox": secure access
• "Herd": data integration
• "Continuum": business continuity
• Platform Services: DR, snapshots, …
• Data Services: in support of Refine, Explore, Enrich
• Operational Services: manageability, security, …
17. Top BI Vendors Support Hive Today
18. Goal: Enhance Hive for BI Use Cases
[Diagram: BI use cases spanning batch to interactive: enterprise reports, parameterized reports, dashboards / scorecards, data mining, visualization. The goal is more SQL and better performance across the spectrum.]
19. Differing Needs For Scale / Interaction
Interactivity is key at the interactive end; scalability and reliability are key at the batch end. Data size grows from left to right.

Interactive (5s – 1m):
• Parameterized reports
• Drilldown
• Visualization
• Exploration

Non-Interactive (1m – 1h):
• Data preparation
• Incremental batch processing
• Dashboards / scorecards

Batch (1h+):
• Operational batch processing
• Enterprise reports
• Data mining
20. Stinger: Make Hive Best for All Needs
[Diagram: the same interactive / non-interactive / batch spectrum as on the previous slide.]

Improve Latency & Throughput:
• Query engine improvements
• New "Optimized RCFile" column store
• Next-gen runtime (eliminates MapReduce latency)

Extend Deep Analytical Ability:
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
21. Analytic Function Use Cases
• OVER
– Rankings, top 10, bottom 10
– Running balances
– Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
– Trend identification
– Sessionization
– Forecasting / prediction
• Distributions
– Histograms and bucketing
• Good for enterprise reports, dashboards, data mining, and business processing
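As a toy illustration in plain Python (not HiveQL), a running balance is a cumulative sum over an ordered window, and LAG pairs each row with its predecessor; the sample amounts are invented:

```python
from itertools import accumulate

# Running balance, like SUM(amount) OVER (ORDER BY day ROWS UNBOUNDED PRECEDING).
amounts = [100, -30, 50, -20]
running = list(accumulate(amounts))
print(running)  # [100, 70, 120, 100]

# LAG(amount, 1): each row sees the previous row's value (None for the first row).
lagged = [None] + amounts[:-1]
deltas = [a - b for a, b in zip(amounts[1:], amounts[:-1])]  # trend identification
print(lagged)  # [None, 100, -30, 50]
print(deltas)  # [-130, 80, -70]
```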
22. Stinger 2013 Roadmap Summary
• HDP 1.x (aka Hadoop 1.x …)
– Additional SQL types
– SQL analytic functions (OVER, subqueries in WHERE, etc.)
– Modern optimized column store (ORC file)
– Hive query enhancements
  – Startup time, star joins, optimized M/R DAGs, vectorization, etc.
• HDP 2.x (aka Hadoop 2.x …)
– Features in HDP 1.3 & 1.4
– Next-gen runtime that eliminates startup time
– Persistent function registry
– Other features