The Yahoo! Hadoop grid uses a managed service to pull data into the clusters. When it comes to getting data out of the clusters, however, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and the underlying storage format of the data, and brings several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
2. Moving Data Out of Hadoop Clusters Today
Yahoo! Presentation, Confidential
[Diagram – Clients | Multi-tenant Hadoop Clusters | Managed Data-loading: a client's machine (HTTP client) and a launcher/gateway reach the clusters over SSH and HTTPS; the clusters expose HDFSProxy¹, an HTTP proxy with an HTTP server, M/R on YARN, and HDFS over Hadoop RPC; managed data-loading pulls from filers through a custom proxy and HTTP servers over HTTPS into HDFS via M/R on YARN, with DistCp between clusters.]
¹ Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP
3. Typical Data Out Scenario
§ Data (to be pulled out) is stored in a predefined directory structure as files
§ Client determines (through a custom interface) whether a particular data feed of interest is committed
§ If committed, the client gets the list of files first, and then pulls them out (file by file) through HDFSProxy
[Diagram: data flows from HDFS through HDFSProxy (via cURL) to a filer as delimited files, then through SQLLDR into an Oracle DB – external table to temp table, then an INSERT into the main table; the client consults a custom interface for commit status.]
4. Pros and Cons of the Data Out Approach
Pros
§ Security of DB passwords – passwords are not stored on the grid
§ Compression – cross-colo network bandwidth is expensive, and compression is not possible with JDBC drivers
§ Encryption – data leaving the grids has to be encrypted as it may cross colos
§ ACLs – DB hosts are not accessible from grid nodes, hence the proxy
Cons
§ Directory structure – has to be predefined and known to downstream consumers of the data
§ Data discovery – availability of data for consumption requires polling or other hooks
§ Overhead – use of DONE files
§ Maintenance – separate schema files and schema file formats
The introduction of HCatalog and JMS notifications solves these problems
5. Hadoop – One Platform, Many Tools
[Diagram: MapReduce reads and writes HDFS directly through InputFormat/OutputFormat; Pig does the same through its Load/Store functions; Hive goes through its Metastore client, SerDe, and InputFormat/OutputFormat, backed by the Metastore and HDFS.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
§ MapReduce/Pig – pipelines, iterative processing, research
§ Hive (data warehouse) – BI tools, analysis
6. HCatalog – Opening Up the Hive Metastore
[Diagram: MapReduce and Pig now access data through HCatInputFormat/HCatOutputFormat (Pig via HCatLoader/HCatStorer), which layer on the Metastore client, SerDe, and InputFormat/OutputFormat shared with Hive; external systems reach the metastore over REST.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
7. HCatalog Value Proposition
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
§ Centralized metadata service for Hadoop
§ Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows for sharing of data
§ Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
§ Abstracts out the file storage format and data location
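As an illustration of these abstractions, a consumer addresses data by database, table, and partition in HiveQL rather than by file path (the table, columns, and partition value here are hypothetical):

```sql
-- Hypothetical partitioned table; the storage format and file layout stay hidden.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;

-- Consumers select by partition, not by directory or file name.
SELECT * FROM page_views WHERE dt = '2013-06-26';
```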
8. HiveServer2 with HCatalog
[Diagram: a data-out client connects over ODBC/JDBC to HiveServer2, which runs its jobs as the requesting user (doAs) against HDFS and the HCatalog server (metastore). Hive jobs (CLI) and HCat jobs (Pig, M/R) register data with the metastore, which produces JMS notifications through a messaging service (ActiveMQ); the data-out client consumes these notifications.]
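The notification flow in this architecture can be sketched as a simple producer/consumer exchange. The snippet below is a toy Python stand-in for the ActiveMQ/JMS topic, not the actual Yahoo! implementation; the event shape and function names are made up for illustration:

```python
import queue

# Stand-in for the JMS topic that the HCatalog metastore publishes to.
topic = queue.Queue()

def metastore_add_partition(table, partition):
    """Producer side: registering a partition emits a notification."""
    topic.put({"event": "ADD_PARTITION", "table": table, "partition": partition})

def data_out_client(events):
    """Consumer side: react as soon as a partition is announced."""
    pulled = []
    while not events.empty():
        msg = events.get()
        if msg["event"] == "ADD_PARTITION":
            pulled.append((msg["table"], msg["partition"]))  # would trigger a fetch
    return pulled

metastore_add_partition("page_views", "dt=2013-06-26")
print(data_out_client(topic))  # [('page_views', 'dt=2013-06-26')]
```

This is what removes the polling and DONE-file overhead listed earlier: availability is pushed to the consumer instead of discovered by scanning directories.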
9. Issues Solved
✔ Directory structure – has to be predefined and known to downstream consumers of data
✔ Data discovery – availability of data for consumption requires polling or other hooks
✔ Overhead – use of DONE files
✔ Maintenance – separate schema files and schema file formats
10. DataOut Motivation
§ Many ways to load and manage data on the grid
§ HCatalog/Hive
§ Pig
§ Hadoop MR
§ Sqoop
§ GDM
§ Fewer ways of getting data off the cluster
§ Sqoop
§ HDFSProxy
§ HDFS copy to local file system
§ distcp between clusters
§ Challenges
§ Underlying file format
§ Size of data
§ SLA
11. DataOut Overview
§ What is DataOut?
§ An efficient method of moving data off the grid
§ An API that exposes a programmatic interface
§ What are the advantages of DataOut?
§ API based on the well-known JDBC API
§ Works with HCatalog/Hive
§ Agnostic to the underlying storage format
§ Portions of the data can be pulled in parallel
§ What are the limitations of DataOut?
§ Queries must be of the SELECT * FROM variety
13. How DataOut Works
[Diagram: the master (M) connects to HiveServer2 to execute the query and prepare the splits; each slave (S) then fetches its HiveSplit directly from the filesystem/database (FS/DB) in parallel.]
Legend: M – master, S – slave, FS/DB – filesystem/database
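The master/slave split flow can be illustrated with a small Python sketch: a plain thread pool stands in for the DataOut master and its fetch jobs, and the split contents are fabricated row ranges rather than real Hive data:

```python
from concurrent.futures import ThreadPoolExecutor

# Master: execute the query once and prepare the splits (here, fake row ranges).
splits = [(i * 250, (i + 1) * 250) for i in range(4)]

def fetch_split(split):
    """Slave: fetch one split directly from the filesystem/database."""
    lo, hi = split
    return list(range(lo, hi))  # stand-in for the rows in this split

# Slaves fetch their splits in parallel; the master synchronizes on the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = [r for part in pool.map(fetch_split, splits) for r in part]

print(len(rows))  # 1000
```

The point of the design is that only split preparation goes through HiveServer2; the bulk row transfer fans out to the slaves, which is what lets parts of the data be pulled in parallel.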
14. Code to Prepare the HiveSplits

    DataOut dataout = new DataOut();
    HiveConnection c = dataout.getConnection();
    Statement s = c.createGenerateSplitStatement();
    ResultSet rs = s.executeQuery(sql);
    while (rs.next()) {
        HiveSplit split = (HiveSplit) rs.getObject(1);
        /* Launch job to fetch the split data. */
    }
    /* Synchronize on fetch jobs. */
    rs.close();
    s.close();
    c.close();
15. Code to Retrieve the HiveSplits

    DataOut dataout = new DataOut();
    HiveConnection c = dataout.getConnection();
    PreparedStatement ps = c.prepareFetchSplitStatement(split);
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
        /* Process row data. */
    }
    rs.close();
    ps.close();
    c.close();
    /* Communicate with master process. */
17. HS2 Performance – Single Client Connection
18. HS2 Performance – Five Concurrent Clients
19. HS2 Performance Summary
§ Throughput scales linearly
§ Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
§ Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s
§ Throughput is affected by fetch size
§ Sweet spot around ~200 rows per fetch
§ Average row size may affect this number (pending further testing)
§ HiveServer2 is capable of handling multiple clients
§ Throughput of 10GB in ~20 minutes with five client connections
§ Drop-off in throughput is expected and reasonable
§ 5x increase in concurrent connections = 2x increase in transfer time
§ Goal of 50GB in 5min
§ Achievable with ~10 HiveServer2 instances streaming data
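The instance estimate on the last bullet is consistent with the single-client numbers above. A back-of-the-envelope check, assuming each HiveServer2 instance sustains roughly its measured single-client rate:

```python
# Single-client measurement from above: 10 GB in ~500 s.
per_instance_gbps = 10 / 500        # GB/s per HiveServer2 instance

# Goal: 50 GB in 5 minutes.
required_gbps = 50 / (5 * 60)       # GB/s overall

instances = required_gbps / per_instance_gbps
print(round(instances, 1))  # 8.3 -> roughly 10 instances, with some headroom
```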