Data Discovery on Hadoop - Realizing the Full Potential of your Data
1. Data Discovery on Hadoop -
Realizing the Full Potential of Your Data
Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014
2014 Hadoop Summit, San Jose, California
2. Introduction
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms
Cloud Engineering Group
Manages Hadoop products team at Yahoo!
Responsible for Product Management, Strategy and Customer Engagements
Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
M.B.A. from UCLA and M.S. from Rensselaer (RPI)
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms
Cloud Engineering Group
Developer in the Hive-HCatalog team, and active contributor to Apache Hive
Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo
Loves mining the trove of Hadoop logs for usage patterns and insights
Bachelor's degree from Anna University
701 First Avenue, Sunnyvale, CA 94089 USA
@thiruvel
3. Agenda
1 The Data Management Challenge
2 Apache HCatalog to Rescue
3 Data Registration and Discovery
4 Opening Up Adhoc Access to Data
5 Summary and Q&A
4. Hadoop Grid as the Source of Truth for Data
TV
PC
Phone
Tablet
Pushed Data
Pulled Data
Web Crawl
Social
Email
3rd Party Content
Data
Advertising
Content
User Profiles /
No-SQL
Serving Stores
Serving
Data Highway
Feeds
Hadoop Grid
BI, Reporting, Adhoc Analytics
ILLUSTRATIVE
5. Growth in HDFS
[Chart: raw HDFS storage (in PB) and number of servers by year, 2006 to 2014]
34,000 servers
478 PB
1.25 billion files and directories
Note: Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014
6. Processing and Analyzing Data with Hadoop…Then
HDFS
MapReduce (YARN)
Pig Hive
Java MR
APIs
InputFormat/ OutputFormat
Load / Store SerDe
MetaStore
Client
Hive
MetaStore
Hadoop
Streaming
Oozie
7. Processing and Analyzing Data with HBase…Then
HDFS
HBase
Pig Hive Java MR APIs
TableInputFormat/
TableOutputFormat
HBaseStorage MetaStore
Client
Hive
MetaStore
HBaseStorage
Handler
Oozie
8. Hadoop Jobs on the Platform Today
Job Distribution (May 1 – May 26, 2014)
All Jobs: 100% (21.5 M)
Pig: 45%
Oozie Launcher: 31%
Java MR: 10%
Hive: 9%
GDM: 4%
Streaming, distcp, Spark: 1%
9. Challenges in Managing Data on Multi-tenant Platforms
Data Producers
Platform Services
Data Consumers
Data shared across tools such as MR,
Pig, and Hive
Schema and semantics knowledge
across the company
Support for schema evolution and
downstream change communication
Fine-grained access controls (row /
column) vs. all or nothing
Clear ownership of data
Data lineage and integrity
Audits and compliance (e.g. SOX)
Retention, duplication, and waste
Data Economy Challenges
Apache
HCatalog
&
Data Discovery
10. Apache HCatalog in the Technology Stack at Yahoo
Compute
Services
Storage
Infrastructure
Services
Hive Pig Oozie HDFS Proxy GDM
YARN MapReduce
HDFS HBase
Zookeeper
Support
Shop
Monitoring Starling
Messaging
Service
HCatalog
Storm Spark Tez
12. Data Model
Database
(namespace)
Table
(schema)
Table
(schema)
Partitions
Partitions
Buckets
Buckets
Skewed Unskewed
Optional
per table
Partitions, buckets, and skews facilitate faster, more direct access to data
Note on Buckets
It is hard to guess the right number of buckets, which can also change over time, and it is hard to coordinate and align bucket counts for joins
The community is working on dynamic bucketing that would have the same benefit without the need for static partitioning
13. Sample Table Registration
Select project database
USE xyz;
Create table
CREATE EXTERNAL TABLE search (
bcookie string COMMENT 'Standard browser cookie',
time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
uid string COMMENT 'User id',
ip string COMMENT '...',
pg_spaceid string COMMENT '...',
...)
PARTITIONED BY (
locale string COMMENT 'Country of origin',
datestamp string COMMENT 'Date in YYYYMMDD format')
STORED AS ORC
LOCATION '/projects/search/...';
Add partitions manually (if you choose to)
ALTER TABLE search ADD PARTITION ( locale='US', datestamp='20130201')
LOCATION '/projects/search/...';
All your company’s data (metadata) can be registered with HCatalog irrespective of the
tool used.
14. Getting Data into HCatalog – DML and DDL
LOAD Files into tables
Load operations are copy/move operations from HDFS or local filesystem that move datafiles into locations
corresponding to HCat tables. File format must agree with the table format.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)];
INSERT data from a query into tables
Query results can be inserted into tables or file system directories by using the insert clause.
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM
from_statement;
HCat also supports multiple inserts in the same statement or dynamic partition inserts.
ALTER TABLE ADD PARTITIONS
You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory
inside of which data files reside. If new partitions are directly added to HDFS, HCat will not be aware of
these.
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
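Because HCat does not notice partitions that are written straight to HDFS, a registration step has to issue the ALTER TABLE statement for each new directory. The following is a minimal sketch of how such a statement could be generated, assuming directories use key=value names for each partition column. The helper names and the example path are hypothetical and not part of HCatalog, and no quoting or escaping of values is attempted.

```python
def partition_spec_from_path(path):
    """Collect key=value path components into an ordered partition spec.

    Assumption: partition directories are named <column>=<value>.
    """
    spec = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            spec[key] = value
    return spec


def add_partition_ddl(table, path):
    """Build an ALTER TABLE ... ADD PARTITION statement for one directory."""
    spec = partition_spec_from_path(path)
    if not spec:
        raise ValueError("no key=value components found in " + path)
    cols = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return f"ALTER TABLE {table} ADD PARTITION ({cols}) LOCATION '{path}';"


# Hypothetical directory layout, used only for illustration.
stmt = add_partition_ddl("search", "/data/search/locale=US/datestamp=20130201")
print(stmt)
```

The generated statement would still need to be run through Hive or HCatalog; the sketch only shows how the partition spec and location line up.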
15. Getting Data into HCatalog – HCat APIs
Pig
HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used
with Pig scripts to write data to HCatalog-managed tables.
A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY $FILTER;
C = foreach B generate foo, bar;
store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer
('$OUTPUT_PARTITION');
MapReduce
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables.
HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables.
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
16. HCatalog Integration with Data Mgmt. Platform (GDM)
HCatalog
MetaStore
Cluster 1 - Colo 1
HDFS
Cluster 2 – Colo 2
HDFS
Grid Data
Management
Feed Acquisition
Feed
Replication
HCatalog
MetaStore
Feed datasets
as partitioned
external tables
Growl extracts
schema for
backfill
HCatClient.
addPartitions(…)
Mark
LOAD_DONE
HCatClient.
addPartitions(…)
Mark
LOAD_DONE
Partitions are dropped with
(HCatClient.dropPartitions(…))
after retention expiration with a
drop_partition notification
add_partition
event notification
add_partition
event notification
17. HCatalog Notification
Namespace: E.g. “hcat.thebestcluster”
JMS Topic: E.g. “<dbname>.<tablename>”
Sample JMS Notification
{
"timestamp" : 1360272556,
"eventType" : "ADD_PARTITION",
"server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
"servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
"db" : "xyz",
"table" : "search",
"partitions": [
{ "locale" : "US", "datestamp" : "20140602" },
{ "locale" : "UK", "datestamp" : "20140602" },
{ "locale" : "IN", "datestamp" : "20140602" }
]
}
HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database,
add_table, add_partition, drop_partition, drop_table, and drop_database
Notifications can be extended for schema change notifications (proposed)
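A subscriber mostly needs to turn a notification body into the list of partitions that became available. The sketch below parses a trimmed copy of the sample message using only the standard library; the function names are hypothetical, and how the message is actually received from the JMS topic is left out.

```python
import json

# Trimmed copy of the sample ADD_PARTITION notification from the slide.
SAMPLE_MESSAGE = """
{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"},
    {"locale": "IN", "datestamp": "20140602"}
  ]
}
"""


def topic_name(db, table):
    """Topic naming shown on the slide: <dbname>.<tablename>."""
    return f"{db}.{table}"


def added_partitions(message_text):
    """Return (topic, partition spec) pairs for an ADD_PARTITION message."""
    message = json.loads(message_text)
    if message.get("eventType") != "ADD_PARTITION":
        return []
    topic = topic_name(message["db"], message["table"])
    return [(topic, spec) for spec in message.get("partitions", [])]


events = added_partitions(SAMPLE_MESSAGE)
for topic, spec in events:
    print(topic, spec)
```

Messages of any other event type are ignored here; a real consumer would probably dispatch on eventType instead.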
HCat
Client
HCat
MetaStore
ActiveMQ
Server
Register Channel Publish to listener channels
Subscribers
18. Oozie, HCatalog, and Messaging Integration
Oozie
Message
Bus
HCatalog
3. Push notification
<New Partition>
2. Register Topic
4. Notify New Partition
Data
Producer
HDFS
Produce data (distcp, pig, M/R..)
/data/click/2014/06/02
1. Query/Poll Partition
Start workflow
Update metadata
(ALTER TABLE click ADD PARTITION(data='2014/06/02')
location 'hdfs://data/click/2014/06/02')
19. Data Discovery with HCatalog
HCatalog instances become a unifying metastore for all data at
Yahoo
Discovery is about
o Browsing / inspecting metadata
o Searching for datasets
It helps to solve
o Schema knowledge across the company
o Schema evolution
o Lineage
o Ownerships
o Data type – dev or prod
20. Data Discovery Physical View
Global View of
All Data in HCatalog
DC1-C1
DC1-C2
DCn-Cn
.
.
.
DC2-C1
DC2-C2
DCm-Cm
.
.
.
Discovery UI
Data Center 1 Data Center 2
HCat REST (Templeton)
HCat REST (Templeton)
HCat REST (Templeton)
HCat REST (Templeton)
HCat REST (Templeton)
HCat REST (Templeton)
ILLUSTRATIVE
21. Data Discovery Features
Browsing
o Tables / Databases
o Schema, format, properties
o Partitions and metadata about each partition
Searches for tables
o Table name (regex) or Comments
o Column name or comments
o Ownership, File format
o Location
o Properties (Dev/Prod)
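To make the table search concrete, here is a minimal in-memory sketch of matching a regular expression against table names or comments. The table names and comments are the illustrative ones from the Discovery UI slide; the data structure and function name are hypothetical, and the actual search was planned as a Thrift interface in HCatalog rather than client-side filtering.

```python
import re

# Illustrative tables from the Discovery UI slide (audience_db).
TABLES = [
    {"name": "page_clicks", "comment": "Hourly clickstream table"},
    {"name": "ad_clicks", "comment": "Hourly ad clicks table"},
    {"name": "user_info", "comment": "User registration info"},
    {"name": "session_info", "comment": "Session feed info"},
    {"name": "audience_info", "comment": "Primary audience table"},
]


def search_tables(tables, pattern):
    """Return names of tables whose name or comment matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        table["name"]
        for table in tables
        if regex.search(table["name"]) or regex.search(table["comment"])
    ]


print(search_tables(TABLES, r"click"))
print(search_tables(TABLES, r"_info$"))
print(search_tables(TABLES, r"registration"))
```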
22. Discovery UI
Search Tables Search
The Best Cluster
audience_db
tumblr_db
user_db
adv_warehouse
flickr_db
page_clicks Hourly clickstream table
ad_clicks Hourly ad clicks table
user_info User registration info
session_info Session feed info
audience_info Primary audience table
GLOBAL HCATALOG DASHBOARD
Available Databases
Available Tables (audience_db)
Search the HCat tables
Browse the DBs by cluster
Search results or browse db results
1 2 Next 1 2 Next
ILLUSTRATIVE
23. Table Display UI
ILLUSTRATIVE
GLOBAL HCATALOG DASHBOARD
HCat Instance The Best Cluster
Database audience_db
Table page_clicks
Owner Awesome Yahoo
Schema
…more table information and properties (e.g. data format etc.)
Partitions
…list of partitions
Column Type Description
bcookie string Standard browser cookie
timestamp string DD-MON-YYYY HH:MI:SS (AM/PM)
uid string User id
.
.
.
24. Data Discovery Design Approach
A single web interface connects to all HCatalog instances (same and
cross-colo)
Select an appropriate HCat instance and browse all metadata
o Each HCatalog instance runs a webserver (Templeton/ WebHCat) to read
metadata
o All reads audited
o ACLs apply
Search functionality will be added to Templeton and HCatalog
o New Thrift interface to support search
o All searches audited
o ACLs apply
Long term design
o Read and Write HCatalog instances
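As a rough illustration of how one web interface could address many HCatalog instances, the sketch below builds the WebHCat (Templeton) URL that describes a table on a given instance. The host names are hypothetical, 50111 is the WebHCat default port, and the user.name query parameter applies to non-secure setups; secure clusters would authenticate differently. No request is actually sent.

```python
from urllib.parse import quote, urlencode

# Hypothetical instance names mapped to hypothetical WebHCat hosts.
INSTANCES = {
    "DC1-C1": "hcat-dc1-c1.example.com",
    "DC2-C1": "hcat-dc2-c1.example.com",
}


def describe_table_url(instance, db, table, user, port=50111):
    """Build the WebHCat 'describe table' URL for one HCatalog instance."""
    host = INSTANCES[instance]
    path = f"/templeton/v1/ddl/database/{quote(db)}/table/{quote(table)}"
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


url = describe_table_url("DC1-C1", "audience_db", "page_clicks", "alice")
print(url)
```

A discovery front end could issue a GET against such a URL for the selected instance and render the returned JSON.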
25. Data Discovery Going Forward
Lineage
o Source datasets
o Derived datasets
Data Quality
o Statistics help in heuristics instead of running a job
Table 1 /
Partition 1
HBase
ORC Table
Partition 1
Dimension
Table
Statistics/
Agg. Table
Daily Stats
Table
Copied by
distcp / external
registrar
Hourly
ILLUSTRATIVE
26. Data Discovery Going Forward (cont’d)
ILLUSTRATIVE
Schema
Column Type Description
bcookie string Standard browser cookie
timestamp string DD-MON-YYYY HH:MI:SS (AM/PM)
uid string User id
File Format
ORC
Table Properties
Compression
Type
zlib
External
User ‘awesome_yahoo’
added ‘foo string’ to the
table on May 29, 2014 at
‘1:10 AM’
User ‘me_too’ added table
properties
‘orc.compress=ZLIB’ on
May 30, 2014 at ‘9:00 AM’
User ‘me_too’ changed the
file format from ‘RCFile’ to
‘ORC’ on Jun 1, 2014 at
‘10:30 AM’
.
.
.
.
.
.
27. HCatalog is Part of a Broader Solution Set
Product: Role in the Grid Stack
Hive
o Data warehousing software that facilitates querying and managing large datasets in HDFS
o Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL
HiveServer2
o Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC
o Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access
HCatalog
o Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data
o Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability
Starling
o Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
o 1 TB of raw logs processed / day, 24 TB of processed data
28. Deployment Layout
Tez and MapReduce
on YARN
+
HDFS
Oracle
DBMS
LoadBalancer
HCatalog
Thrift
HS2
ODBC/JDBC
Launcher Gateway
LoadBalancer
Data Out Client
Client/ CLI
HiveQL
M/R Jobs
Pig M/R
Cloud
Messaging
ActiveMQ
notifications
HiveServer2
Hadoop
Hive
HCatalog
29. Hive for Both Batch and Interactive Adhoc Analytics
Tez
Computation expressed as a dataflow graph
with reusable primitives
No intermediate outputs to HDFS
Built on top of YARN
Hive generates Tez plans for lower latency
Query Engine Improvements
Cost-based optimizations
In-memory joins
Caching hot tables
Vectorized processing
Better Columnar Store
ORCFile with predicate pushdown
Built for both speed and storage efficiency
Tez Service
Always-on pool of AMs / container re-use
Improved Latency and Throughput
Analytics Functions
SQL 2003 Compliant
OVER with PARTITION BY and ORDER BY
Wide variety of windowing functions:
o RANK
o LEAD/LAG
o ROW_NUMBER
o FIRST_VALUE
o LAST_VALUE
o Many more
Aligns well with BI ecosystem
Improving SQL Coverage
Non-correlated sub-queries using IN in
WHERE
Expanded SQL types including DATETIME,
VARCHAR, etc.
Extended Analytical Ability
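The OVER clause with PARTITION BY and ORDER BY can be tried out without a cluster. The sketch below runs RANK and LAG through Python's built-in sqlite3 module, which is a stand-in for Hive here: the SQL follows the same SQL 2003 windowing syntax, but the engine is SQLite (3.25 or newer is needed for window functions), and the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_clicks (locale TEXT, datestamp TEXT, clicks INTEGER)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO page_clicks VALUES (?, ?, ?)",
    [
        ("US", "20140601", 100),
        ("US", "20140602", 150),
        ("UK", "20140601", 80),
        ("UK", "20140602", 60),
    ],
)

QUERY = """
SELECT locale,
       datestamp,
       clicks,
       RANK() OVER (PARTITION BY locale ORDER BY clicks DESC) AS rank_in_locale,
       LAG(clicks) OVER (PARTITION BY locale ORDER BY datestamp) AS previous_clicks
FROM page_clicks
ORDER BY locale, datestamp
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each row carries its rank within its locale and the previous day's click count, with no self-join needed.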
30. HiveServer2 as ODBC / JDBC Endpoint
Gateway that Hive clients
can talk to
Supports concurrent clients
User/ global
session/configuration
information
Support for secure clusters
and encryption
DoAs support allows Hive
queries to run as the
requester
31. Data to Desktop (D2D) – BI and Reporting on ODBC
HiveServer2
Hive
Hadoop
Desktop Web
Intelligence Server
Metadata Database
Grid ODBC driver
32. DataOut – Data to Any Off-Grid Destination on JDBC
HiveSplit HiveSplit
HiveServer2M
S
FS/DB
S
FS/DB
HiveSplit
S
FS/DB
Execute Query
Prepare Splits
Fetch Splits
Legend:
M – Master, S – Slave, FS/ DB – Filesystem/ Database
DataOut is an efficient
method of moving data off
the grid
Advantages:
o API based on well-known
JDBC interface
o Works with HCatalog / Hive
o Agnostic to the underlying
storage format
o Parts of the whole data can
be pulled in parallel
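The execute, prepare splits, fetch splits flow can be pictured with a toy simulation. This is not DataOut's API: the function names are hypothetical, an in-memory list stands in for the query result, and threads stand in for the workers that would each pull one split over JDBC.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_splits(rows, num_splits):
    """Master step: divide the result set into num_splits interleaved slices."""
    return [rows[i::num_splits] for i in range(num_splits)]


def fetch_split(split):
    """Worker step: stand-in for pulling one split to a filesystem or database."""
    return list(split)


def data_out(rows, num_splits=3):
    """Prepare splits once, then fetch them in parallel."""
    splits = prepare_splits(rows, num_splits)
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        return list(pool.map(fetch_split, splits))


parts = data_out(list(range(10)))
print(parts)
```

The point of the sketch is only that each worker handles a disjoint part of the result, so the parts together cover the whole data set.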
33. SQL-based Authorization for Controlled Access
SQL-compliant authorization model (Users, Roles, Privileges, Objects)
Fine-grain authorization and access control patterns (row and column in
conjunction with views)
Can be used in conjunction with storage-based authorization
Privileges
Objects consist of databases, tables, and views
Privileges are GRANTed on objects
o SELECT: read access to an object
o INSERT: write (insert) access to an object
o UPDATE: write (update) access to an object
o DELETE: delete access for an object
o ALL PRIVILEGES: all privileges
Access Control
Roles can be associated with objects
Privileges are associated with roles
CREATE, DROP, and SET ROLE statements manipulate roles and membership
SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
PUBLIC role includes all users
Prevents undesirable operations on objects by unauthorized users
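The relationship between users, roles, privileges, and objects can be sketched as a small in-memory model. This is a toy illustration of the concepts on the slide, not Hive's implementation; the class, role, user, and object names are all hypothetical.

```python
PUBLIC = "PUBLIC"  # role that implicitly includes every user


class Authorizer:
    def __init__(self):
        self.members = {}  # role name -> set of user names
        self.grants = {}   # (role name, object name) -> set of privileges

    def add_to_role(self, user, role):
        self.members.setdefault(role, set()).add(user)

    def grant(self, privilege, obj, role):
        self.grants.setdefault((role, obj), set()).add(privilege)

    def roles_of(self, user):
        explicit = {role for role, users in self.members.items() if user in users}
        return explicit | {PUBLIC}

    def is_allowed(self, user, privilege, obj):
        """A user is allowed if any of their roles holds the privilege."""
        for role in self.roles_of(user):
            held = self.grants.get((role, obj), set())
            if privilege in held or "ALL PRIVILEGES" in held:
                return True
        return False


auth = Authorizer()
auth.add_to_role("alice", "analysts")
auth.grant("SELECT", "audience_db.page_clicks", "analysts")
# A view exposing only some rows and columns could be opened to everyone.
auth.grant("SELECT", "audience_db.page_clicks_view", PUBLIC)

print(auth.is_allowed("alice", "SELECT", "audience_db.page_clicks"))
print(auth.is_allowed("bob", "SELECT", "audience_db.page_clicks"))
print(auth.is_allowed("bob", "SELECT", "audience_db.page_clicks_view"))
```

Granting on a view rather than the base table is how the row and column restrictions mentioned on the slide would typically be expressed.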
34. Starling (Log Warehouse) for Historical Analysis and Trends
Cluster 1 Cluster 2 Cluster 3 Cluster N
Oozie
HCatalog HDFS
Hive
Starling
Dashboard
Discovery
Portal
Query
Server
Source
Clusters
Warehouse
Clusters
35. SQL on Hadoop the Fastest Growing Product on Grid
[Chart: all grid jobs (in millions) and Hive jobs (% of all jobs) by month, Mar-13 to May-14]
2.5 million queries
36. In Summary
✔ Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
✔ Schema and semantics knowledge across the company: Data Discovery
✔ Support for schema evolution and downstream change communication: Apache HCatalog
✔ Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
✔ Clear ownership of data: Data Discovery
✔ Data lineage and integrity: Data Discovery / Starling
✔ Audits and compliance (e.g. SOX): Data Discovery / Starling
✔ Retention, duplication, and waste: Data Discovery / Starling
37. Acknowledgements
1 Apache Hive (and HiveServer2, HCatalog) Community
http://hive.apache.org/people.html
2 HCatalog and Hive Development Team at Yahoo
Olga Natkovich Annie Lin Fangyue Wang
Chris Drome Jin Sun Selina Zhang
Mithun Radhakrishnan Viraj Bhat
3 Oozie Development Team
Rohini Palaniswamy Ryota Egashira Purshotam Shah
Mona Chitnis Michelle Chiang
4 Grid Data Management (GDM) Team
Mark Holderbaugh Aaron Gresch Lawrence Prem Kumar
Scott Preece Yan Braun
5 Service Engineering and Data Operations
Rob Realini David Kuder Chuck Sheldon
Rajiv Chittajallu Vineeth Vadrevu Andy Rhee
6 Product Management
Sid Shaik Amrit Lal Kimsukh Kundu
(30 sec)
Welcome to Data Discovery on Hadoop. We will explain our approach to realizing the full potential, or value, of the data in your organization, whether you are a Hadoop-driven business or want to become one.
(30 sec) – 1 min
Before we begin, let us introduce ourselves. My name is Sumeet Singh, I am a Sr. Director of Product Management and I head the PM functions for Hadoop at Yahoo. Thiruvel is a Principal Engineer in the Hive team at Yahoo who works across Hive, HCatalog, HiveServer2 and Starling at Yahoo. With that, let’s get into the details.
(1 min) – 2 min
Let me walk you through the agenda we have.
We will explain the challenges with data management, particularly as it relates to Hadoop at Yahoo.
We will then introduce HCatalog and explain why it is a great solution for the challenges we face in data management.
We will then describe how to get all your data into HCatalog, and the specifics of data discovery.
Once your data is in a central repository and you can discover it, we will explain ways you can open it up by exercising controlled access to that data so that the entire organization can benefit from data.
We will then summarize and open it up for Q&A.
(1 min) – 3 min
Yahoo products and properties across devices generate a lot of data that is of immense value to us in driving new and interesting user experiences across devices. All that data comes to Hadoop, which acts as the single source of truth for all data at Yahoo.
A wide variety of other data also gets pulled into the Hadoop Grid from various sources, as shown. The idea is to consolidate data from disparate sources all over the company in one place so that it can be (a) shared, (b) enriched, (c) de-duped, and (d) kept up to date.
That data, once processed, is applied back or served as value to our consumers in the form of personalized experiences across our products and properties, and of course it is used for reporting and analytics. All of this is done while keeping web-scale economics and cost in mind.
(30 sec) – 3.5 mins
As a result, our infrastructure, and HDFS in particular, continues to grow and as of last month accounts for almost 480 PB across Hadoop and HBase clusters. There are 1.25 billion files and directories on this infrastructure as of last month (the NameNode keeps track of them along with blocks). I am not sure you want to know about all 1.25 billion files, but you should have the ability to if you wanted to. This talk is really all about that.
(30 sec) – 4 mins
All that data in HDFS gets processed and analyzed through a variety of tools such as MapReduce, Pig, and Hive. Most of our users use Oozie to automate the scheduling of these jobs. Pig and MR have the schema, format, and location encoded in the app or the script. Hive, on the other hand, introduced an additional component, the metastore, so that this information is read from metadata rather than encoded in the script.
(30 sec) – 4.5
HBase also provided these tools with access to its data through table input and output formats and storage handlers, but the story largely stayed the same as with Hadoop.
(30 secs – 5 mins)
Just to put things in perspective, this is how the Hadoop platform and the data in HDFS get used, in terms of the jobs that run on the platform to process data. A wide variety of tools on Hadoop, such as Pig, MapReduce, and Hive, all read and write data on HDFS. Pig, MR, and Hive continue to dominate the job mix at Yahoo, with Oozie scheduling most of those jobs, which is why the Oozie launcher numbers are so high in the total job mix.
(1.5 min) – 6.5 mins
Just so you understand the data economy, I like that term, producers with ETL produce the data on the platform that is then consumed off of the platform by downstream consumers for analytics or serving. And, managing a platform of this magnitude and the data volume at scale of course has its challenges.
Sharing of data, schema and semantics knowledge across the company, schema evolution and change awareness, access controls, lineage, integrity or data quality, audits and compliance, and finally reducing HDFS waste.
We believe that HCatalog and Data Discovery solve almost all of these, making it possible to take full advantage of the company's data for research, insights, driving product performance, and coming up with new user experiences.
(30 secs) – 10.5 min
Registrations are generally External tables as we are getting legacy HDFS data into Hive. The data is not managed by Hive. It is useful for sharing data, e.g. that created by Pig but queried with Hive without giving ownership to Hive
Also useful when data is already processed and in a usable state in HDFS. Dropping tables does not delete the data. Manually clean up after dropping tables / partitions.
Partitions can be added manually or through automation with data movement tools that I will describe in just a second.
(1 min) – 11.5 min
Getting data registered using HCat DML (external tables). New data can be internal from then on, etc.
DML: LOAD and INSERT from a query
DDL: ADD PARTITIONS
(1 min) – 12.5 mins
Explicit data-paths: When data organization on HDFS changes, scripts need modification.
Explicit file/record format: Prone to change. Needs script change.
Explicit schema during consumption: When schema evolves, this needs change.
(1.5 mins) – 14 mins
Talk about what GDM is
Approaches here:
How GDM HCat registration is accomplished
New partitions and old partitions backfill
Add JIRA numbers for the work Yahoo has done.
Feed registrations and partition availability publication
Extracting schema info from existing HDFS files (e.g. using growl)
(1 min) – 15 mins
Partition-message consumption
Federation-layer atop ActiveMQ
Arranges ActiveMQ servers into “namespaces”
Manages access-control for messages sent to topics in a namespace.
(1 min) – 16 mins
GDM Acquisition copies data onto one cluster.
Oozie consumes by polling (say) every hour for daily-feed, in directory.
Load on the name-node. 2000 GDM feeds. 5 minute frequencies.
Latency: Worst-case, max poll interval.
Notion of Done: _DONE_ files vs. existence of data in directory. Source of truth.
When data is available: Launch appropriate workflow.
Name-node is hammered.
Data-consumption latency must be balanced against #1. Worst-case latency == poll_freq.
Explain completeness of a dataset-instance (which differs from the notion of a partition) -> Partition-set support in HCatalog.
Problems with using empty file-markers
Further NN pressure.
Oozie notified on partition availability via JMS messages, to trigger workflows immediately
Oozie and HCatalog interoperate via Cloud Messaging System (CMS) for messaging
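The polling numbers in these notes can be worked through quickly. The sketch below assumes each of the 2000 feeds is polled once per interval and that each poll costs one NameNode request; both are simplifying assumptions, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def polls_per_day(num_feeds, poll_interval_minutes):
    """NameNode requests per day if every feed is polled once per interval."""
    return num_feeds * (MINUTES_PER_DAY // poll_interval_minutes)


def worst_case_latency_minutes(poll_interval_minutes):
    """Data landing just after a poll waits a full interval to be seen."""
    return poll_interval_minutes


daily_polls = polls_per_day(2000, 5)
print(daily_polls)                      # 2000 feeds * 288 polls each
print(worst_case_latency_minutes(5))
```

Shortening the interval lowers the worst-case latency but raises the NameNode load proportionally, which is the trade-off that push notifications over JMS avoid.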
(1.5 min) – 17.5 mins
Thiruvel
(1 min) – 18.5 mins
Thiruvel
(1.5 mins) – 20 mins
Thiruvel
(1 min) – 21 mins
Thiruvel
(1 min) – 22 mins
Thiruvel
(2 mins) – 24 mins
Thiruvel
(1.5 mins) – 25.5 mins
Thiruvel
(1 min) – 26.5 mins
Thiruvel
(1 min) – 27.5 mins
(1 min) – 28.5 mins
(1 min) – 29.5 mins
Based on expressing a computation as a dataflow graph with reusable primitives (e.g. sort, merge etc.)
Hive SQL can be expressed as a single job (no interruptions for efficient pipeline)
No intermediate outputs to HDFS (speed and network/disk usage savings)
Vectorization allows Hive to process a batch of rows together
MapReduce query startup is very expensive. Job and task launch latencies can add up to 5 to 30 seconds, not good for short queries. Container pre-allocation or warm containers (container pre-launch) eliminates task launch overhead to serve queries
CBO: Hive has table and column level statistics. Used to determine parallelism, join selection