HCatalog/ Hive DataOut
Bay Area Hadoop User Group Meetup
May 15, 2013
Moving Data Out of Hadoop Clusters Today
Yahoo! Presentation, Confidential
[Diagram: three columns labeled Clients, Multi-tenant Hadoop Clusters, and Managed Data-loading. Components shown: HTTP client on the client's machine, HTTP servers, launcher/gateway, filers, HDFS Proxy (1), HTTP proxy, custom proxy, HDFS with M/R on YARN, DistCp, and SQLLDR. Connections are labeled HTTPS, SSH, and Hadoop RPC.]
(1) Similar to the HttpFS gateway (Hoop) in Hadoop 2.0: Hadoop HDFS over HTTP
Typical Data Out Scenario
• Data (to be pulled out) is stored in a predefined directory structure as files
• Client determines (through a custom interface) if a particular data feed of interest is committed or not
• If committed, client gets the list of files first, and then pulls them out (file-by-file) through HDFSProxy
[Diagram labels: HDFS; HDFSProxy; custom interface; cURL; delimited files; filer; data copy; Oracle DB; temp table; INSERT; external table; main table.]
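The three steps above (check commit status, list files, pull file by file) can be sketched as follows. This is a minimal illustration, not the actual client: `isCommitted`, `listFiles`, and `fetchFile` are hypothetical stand-ins for the custom interface and the HDFSProxy requests, and the feed name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class FeedPullSketch {
    /**
     * Pulls every file of a feed, but only once the feed is committed.
     * All three callbacks are hypothetical stand-ins (assumptions), not real APIs.
     */
    static List<String> pullFeed(String feed,
                                 Predicate<String> isCommitted,
                                 Function<String, List<String>> listFiles,
                                 Function<String, String> fetchFile) {
        List<String> contents = new ArrayList<>();
        if (!isCommitted.test(feed)) {
            return contents; // not committed yet: the caller has to poll again later
        }
        for (String path : listFiles.apply(feed)) {
            contents.add(fetchFile.apply(path)); // one request per file
        }
        return contents;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the grid: one committed feed with two files.
        Map<String, List<String>> files =
                Map.of("clicks/20130515", List.of("part-0", "part-1"));
        List<String> pulled = pullFeed("clicks/20130515",
                files::containsKey,
                files::get,
                path -> "data of " + path);
        System.out.println(pulled);
    }
}
```

The early return on an uncommitted feed is where the polling cost listed under the cons comes from: the client has no way to learn about the commit other than asking again.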
Pros and Cons of the Data Out Approach
Pros
• Security of DB passwords – password is not stored in the grid
• Compression – cross-colo network bandwidth is expensive, and compression is not possible with JDBC drivers
• Encryption – data leaving the grids has to be encrypted because it may travel cross-colo
• ACLs – DB hosts are not accessible from grid nodes, hence the proxy
Cons
• Directory structure – has to be predefined and known to downstream consumers of data
• Data discovery – availability of data for consumption requires polling or other hooks
• Overhead – use of DONE files
• Maintenance – separate schema files and schema file formats
The introduction of HCatalog and JMS notifications solves these problems
Hadoop – One Platform, Many Tools
[Diagram labels: Metastore; HDFS; Hive with Metastore Client, InputFormat/OutputFormat, and SerDe; MapReduce with InputFormat/OutputFormat; Pig with Load/Store; HCatLoader/HCatStorer.]
MapReduce/Pig
• Pipelines
• Iterative Processing
• Research
Data Warehouse (Hive)
• BI Tools
• Analysis
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
HCatalog – Opening Up the Hive Metastore
[Diagram labels: Metastore; HDFS; Hive with Metastore Client, InputFormat/OutputFormat, and SerDe; MapReduce and Pig with HCatInputFormat/HCatOutputFormat; external system via REST.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
HCatalog Value Proposition
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
• Centralized metadata service for Hadoop
• Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows for sharing of data
• Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
• Abstracts out the file storage format and data location
HiveServer2 with HCatalog
[Diagram labels: HDFS; HiveServer2 (ODBC/JDBC); Data Out client (JDBC); HCatalog server (Metastore); messaging service (ActiveMQ); HiveServer2 jobs; Hive jobs (CLI); HCat jobs (Pig, M/R); doAs(user); JMS notification (producer); notification (consumer).]
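The notification path on this slide (a producer publishing to a messaging service, a consumer reacting to it) replaces polling with waiting on an event. The sketch below only illustrates that pattern: a `BlockingQueue` stands in for the JMS topic, and the `PartitionAdded` event, its fields, and the table and partition names are hypothetical, not the HCatalog message format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class NotificationSketch {
    /** Hypothetical event: a partition of a table became available. */
    static final class PartitionAdded {
        final String table;
        final String partition;

        PartitionAdded(String table, String partition) {
            this.table = table;
            this.partition = partition;
        }
    }

    /**
     * Consumer side: block on the queue instead of polling directories.
     * Returns the partitions announced for the given table and stops once
     * the queue has stayed empty for timeoutMillis.
     */
    static List<String> consume(BlockingQueue<PartitionAdded> topic,
                                String table, long timeoutMillis) {
        List<String> ready = new ArrayList<>();
        try {
            PartitionAdded event;
            while ((event = topic.poll(timeoutMillis, TimeUnit.MILLISECONDS)) != null) {
                if (event.table.equals(table)) {
                    ready.add(event.partition); // a real client would start its pull here
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return ready;
    }

    public static void main(String[] args) {
        BlockingQueue<PartitionAdded> topic = new LinkedBlockingQueue<>();
        // Producer side: on the slide this role belongs to the HCatalog server.
        topic.add(new PartitionAdded("clicks", "dt=20130515"));
        topic.add(new PartitionAdded("views", "dt=20130515"));
        System.out.println(consume(topic, "clicks", 100)); // prints [dt=20130515]
    }
}
```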
Issues Solved
✔ Directory structure – has to be predefined and known to downstream consumers of data
✔ Data discovery – availability of data for consumption requires polling or other hooks
✔ Overhead – Use of DONE files
✔ Maintenance – Separate schema files and schema file formats
DataOut Motivation
• Many ways to load and manage data on the grid
    ◦ HCatalog/Hive
    ◦ Pig
    ◦ Hadoop MR
    ◦ Sqoop
    ◦ GDM
• Fewer ways of getting data off the cluster
    ◦ Sqoop
    ◦ HDFSProxy
    ◦ HDFS copy to local file system
    ◦ distcp between clusters
• Challenges
    ◦ Underlying file format
    ◦ Size of data
    ◦ SLA
DataOut Overview
• What is DataOut?
    ◦ Efficient method of moving data off the grid
    ◦ API exposes a programmatic interface
• What are the advantages of DataOut?
    ◦ API based on well-known JDBC API
    ◦ Works with HCatalog/Hive
    ◦ Agnostic to the underlying storage format
    ◦ Parts of the whole data can be pulled in parallel
• What are the limitations of DataOut?
    ◦ Queries must be SELECT * FROM type queries
DataOut Deployment
[Diagram: a DataOut client exchanges queries and data with multiple HiveServer2 (HS2) instances, which sit on top of HDFS.]
How DataOut Works
[Diagram: a master (M) and HiveServer2 with the steps Execute Query and Prepare Splits; three slaves (S), each with one HiveSplit and an FS/DB, with the step Fetch Splits.]
Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
Code to Prepare the HiveSplits
    DataOut dataout = new DataOut();
    HiveConnection c = dataout.getConnection();

    Statement s = c.createGenerateSplitStatement();
    ResultSet rs = s.executeQuery(sql);

    while (rs.next()) {
        HiveSplit split = (HiveSplit) rs.getObject(1);
        /* Launch job to fetch the split data. */
    }

    /* Synchronize on fetch jobs. */

    rs.close();
    s.close();
    c.close();
Code to Retrieve the HiveSplits
    DataOut dataout = new DataOut();
    HiveConnection c = dataout.getConnection();

    PreparedStatement ps = c.prepareFetchSplitStatement(split);
    ResultSet rs = ps.executeQuery();

    while (rs.next()) {
        /* Process row data. */
    }

    rs.close();
    ps.close();
    c.close();

    /* Communicate with master process. */
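The two snippets leave "launch job to fetch the split data" and "synchronize on fetch jobs" as comments. A minimal sketch of that master/worker pattern follows, under clearly stated assumptions: a local thread pool stands in for separate fetch jobs, and `DemoSplit` is an in-memory stand-in for `HiveSplit`. Neither `DemoSplit` nor `SplitFetchSketch` is part of DataOut.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a HiveSplit: a slice of an in-memory row list. */
final class DemoSplit {
    final int id;
    final List<String> rows;

    DemoSplit(int id, List<String> rows) {
        this.id = id;
        this.rows = rows;
    }
}

final class SplitFetchSketch {
    /** Master step: cut the full row list into splits of at most splitSize rows. */
    static List<DemoSplit> prepareSplits(List<String> allRows, int splitSize) {
        List<DemoSplit> splits = new ArrayList<>();
        for (int start = 0, id = 0; start < allRows.size(); start += splitSize, id++) {
            int end = Math.min(start + splitSize, allRows.size());
            splits.add(new DemoSplit(id, new ArrayList<>(allRows.subList(start, end))));
        }
        return splits;
    }

    /** Worker step: "fetch" one split; a real worker would iterate a ResultSet. */
    static int fetchSplit(DemoSplit split) {
        int processed = 0;
        for (String row : split.rows) {
            processed++; // process row data here
        }
        return processed;
    }

    /** Launches one fetch task per split, then synchronizes on all of them. */
    static int fetchAll(List<DemoSplit> splits, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (DemoSplit split : splits) {
                Callable<Integer> task = () -> fetchSplit(split);
                pending.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> job : pending) {
                try {
                    total += job.get(); // blocks until that fetch job is done
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("fetch job failed", e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("row-" + i);
        }
        List<DemoSplit> splits = prepareSplits(rows, 4);
        // prints "3 splits, 10 rows fetched"
        System.out.println(splits.size() + " splits, " + fetchAll(splits, 3) + " rows fetched");
    }
}
```

Waiting on each `Future` is the in-process equivalent of the workers reporting back to the master; with separate processes that step would need some other channel.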
DataOut Demo
HS2 Performance – Single Client Connection
HS2 Performance – Five Concurrent Clients
HS2 Performance Summary
• Throughput scales linearly
    ◦ Single client: 1 GB in 60 s, 5 GB in 250 s, 10 GB in 500 s
    ◦ Multiple clients: 1 GB in 120 s, 5 GB in 600 s, 10 GB in 1200 s
• Throughput is affected by fetch size
    ◦ Sweet spot around 200 rows
    ◦ Average row size may affect this number (pending further testing)
• HiveServer2 is capable of handling multiple clients
    ◦ Throughput of 10 GB in ~20 minutes with five client connections
• Drop-off in throughput is expected and reasonable
    ◦ 5x increase in concurrent connections = about 2x increase in transfer time
• Goal of 50 GB in 5 min
    ◦ Achievable with ~10 HiveServer2 instances streaming data
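The "~10 instances" estimate can be checked against the single-client figures above. The sketch assumes, beyond what the slides state, that each HiveServer2 instance sustains the measured single-client rate and that instances scale independently of each other. `ThroughputSketch` is an illustrative name.

```java
final class ThroughputSketch {
    /** Throughput in MB/s for a transfer of `gigabytes` GB in `seconds` s (1 GB = 1024 MB). */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1024.0 / seconds;
    }

    /**
     * Instances needed to move goalGb in goalSeconds when one instance moves
     * measuredGb in measuredSeconds. Integer ceiling of
     * (goalGb / goalSeconds) / (measuredGb / measuredSeconds).
     */
    static long instancesNeeded(long goalGb, long goalSeconds,
                                long measuredGb, long measuredSeconds) {
        long numerator = goalGb * measuredSeconds;
        long denominator = goalSeconds * measuredGb;
        return (numerator + denominator - 1) / denominator;
    }

    public static void main(String[] args) {
        // Goal: 50 GB in 5 min (300 s).
        // Using the 1 GB in 60 s measurement: prints 10
        System.out.println(instancesNeeded(50, 300, 1, 60));
        // Using the 10 GB in 500 s measurement: prints 9
        System.out.println(instancesNeeded(50, 300, 10, 500));
    }
}
```

Both single-client measurements land at 9 to 10 instances, which is consistent with the slide's estimate as long as the per-instance rate holds under concurrent load.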
HUG Meetup 2013: HCatalog / Hive Data Out

 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 

HUG Meetup 2013: HCatalog / Hive Data Out

  • 1. HCatalog/Hive DataOut. Bay Area Hadoop User Group Meetup, May 15, 2013
  • 2. Moving Data Out of Hadoop Clusters Today (Yahoo! Presentation, Confidential)
      [Diagram, three columns: Clients | Multi-tenant Hadoop Clusters | Managed Data-loading]
      Components: Client's Machine, HTTP Client, HTTP Server, Launcher/Gateway, HDFS Proxy (1), HTTP Proxy, Custom Proxy, Filers, HDFS, M/R on YARN, DistCp
      Protocols: Hadoop RPC, SSH, HTTPS
      (1) Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP
  • 3. Typical Data Out Scenario
      - Data (to be pulled out) is stored as files in a predefined directory structure
      - The client determines (through a custom interface) whether a particular data feed of interest is committed
      - If committed, the client first gets the list of files, then pulls them out file by file through HDFSProxy
      [Diagram labels: HDFS, HDFS Proxy, Custom Interface, cURL, Filer, delimited files, SQLLDR, Temp Table, Ext. Table, INSERT, data copy, Main Table, Oracle DB]
  • 4. Pros and Cons of the Data Out Approach
      Pros
      - Security of DB passwords – password not stored in the grid
      - Compression – cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers
      - Encryption – data out of the grids has to be encrypted as it may be cross-colo
      - ACLs – DB hosts are not accessible from grid nodes, and hence the proxy
      Cons
      - Directory structure – has to be predefined and known to downstream consumers of data
      - Data discovery – availability of data for consumption requires polling or other hooks
      - Overhead – use of DONE files
      - Maintenance – separate schema files and schema file formats
      The introduction of HCatalog and JMS notifications solves the problem
  • 5. Hadoop – One Platform, Many Tools
      [Diagram labels: Metastore, HDFS, Hive, Metastore Client, InputFormat/OutputFormat, SerDe, MapReduce, Pig, Load/Store]
      - MapReduce/Pig: pipelines, iterative processing, research
      - Data Warehouse (Hive): BI tools, analysis
      Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • 6. HCatalog – Opening Up the Hive Metastore
      [Diagram labels: Metastore, HDFS, Metastore Client, InputFormat/OutputFormat, SerDe, HCatInputFormat/HCatOutputFormat, HCatLoader/HCatStorer, MapReduce, Pig, Hive, REST, External System]
      Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • 7. HCatalog Value Proposition
      - Centralized metadata service for Hadoop
      - Facilitates interoperability among tools such as Pig, Hive, and M/R; allows for sharing of data
      - Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
      - Abstracts out the file storage format and data location
      Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • 8. HiveServer2 with HCatalog
      [Diagram labels: HDFS, ODBC, HiveServer2 (ODBC/JDBC), Data Out Client (JDBC), HCatalog Server (Metastore), Messaging Service (ActiveMQ), HiveServer2 Jobs, Hive Jobs (CLI), HCat Jobs (Pig, M/R), doAs(user), JMS notification (Producer), Notification (Consumer)]
  • 9. Issues Solved
      ✔ Directory structure – has to be predefined and known to downstream consumers of data
      ✔ Data discovery – availability of data for consumption requires polling or other hooks
      ✔ Overhead – use of DONE files
      ✔ Maintenance – separate schema files and schema file formats
  • 10. DataOut Motivation
      - Many ways to load and manage data on the grid
          - HCatalog/Hive
          - Pig
          - Hadoop MR
          - Sqoop
          - GDM
      - Fewer ways of getting data off the cluster
          - Sqoop
          - HDFSProxy
          - HDFS copy to local file system
          - distcp between clusters
      - Challenges
          - Underlying file format
          - Size of data
          - SLA
  • 11. DataOut Overview
      - What is DataOut?
          - Efficient method of moving data off the grid
          - API exposes a programmatic interface
      - What are the advantages of DataOut?
          - API based on the well-known JDBC API
          - Works with HCatalog/Hive
          - Agnostic to the underlying storage format
          - Parts of the whole data can be pulled in parallel
      - What are the limitations of DataOut?
          - Queries must be SELECT * FROM type queries
  • 12. DataOut Deployment
      [Diagram labels: DataOut Client, multiple HS2 instances (HS2, HS2, …, HS2, HS2), HDFS; flows: Query, Data]
  • 13. How DataOut Works
      [Diagram labels: HiveServer2; one master (M); three slaves (S), each with a HiveSplit and an FS/DB target]
      Steps: Execute Query, Prepare Splits, Fetch Splits
      Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
  • 14. Code to Prepare the HiveSplits
        DataOut dataout = new DataOut();
        HiveConnection c = dataout.getConnection();

        Statement s = c.createGenerateSplitStatement();
        ResultSet rs = s.executeQuery(sql);

        while (rs.next()) {
            HiveSplit split = (HiveSplit) rs.getObject(1);
            /* Launch job to fetch the split data. */
        }

        /* Synchronize on fetch jobs. */

        rs.close();
        s.close();
        c.close();
  • 15. Code to Retrieve the HiveSplits
        DataOut dataout = new DataOut();
        HiveConnection c = dataout.getConnection();

        PreparedStatement ps = c.prepareFetchSplitStatement(split);
        ResultSet rs = ps.executeQuery();

        while (rs.next()) {
            /* Process row data. */
        }

        rs.close();
        ps.close();
        c.close();

        /* Communicate with master process. */
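The two slides leave "launch job to fetch the split data" and "synchronize on fetch jobs" as comments. One way to picture that master/worker pattern is a thread pool inside a single JVM, sketched below. This is only an illustration under stated assumptions: the type parameter `S` stands in for `HiveSplit`, the `fetcher` callback stands in for the worker code from slide 15, and the class and method names are hypothetical rather than part of the DataOut API. The slides also suggest workers may be separate processes that report back to a master, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

public class ParallelSplitFetch {

    /**
     * Runs one fetch task per split on a fixed-size pool and waits for all of
     * them to finish. S is a stand-in for HiveSplit; fetcher is a stand-in for
     * the worker code that opens its own connection, iterates over the rows of
     * the split, and returns how many rows it processed.
     */
    public static <S> long fetchAll(List<S> splits, ToLongFunction<S> fetcher, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // "Launch job to fetch the split data": one task per split.
            List<Callable<Long>> tasks = new ArrayList<>();
            for (S split : splits) {
                tasks.add(() -> fetcher.applyAsLong(split));
            }
            // "Synchronize on fetch jobs": invokeAll blocks until every task is done.
            long totalRows = 0;
            for (Future<Long> result : pool.invokeAll(tasks)) {
                totalRows += result.get();
            }
            return totalRows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical splits; each fake fetch pretends to read 100 rows.
        List<String> splits = Arrays.asList("split-0", "split-1", "split-2");
        long rows = fetchAll(splits, split -> 100L, 2);
        System.out.println("rows fetched: " + rows);
    }
}
```

Each task would open its own connection in a real deployment, mirroring slide 15, so that splits can be served by different HiveServer2 instances.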
  • 17. HS2 Performance – Single Client Connection [Chart]
  • 18. HS2 Performance – Five Concurrent Clients [Chart]
  • 19. HS2 Performance Summary
      - Throughput scales linearly
          - Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
          - Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s
      - Throughput is affected by fetch size
          - Sweet spot around ~200 rows
          - Average row size may affect this number (pending further testing)
      - HiveServer2 is capable of handling multiple clients
          - Throughput of 10GB in ~20 minutes with five client connections
          - Drop-off in throughput is expected and reasonable
          - 5x increase in concurrent connections = 2x increase in transfer time
      - Goal of 50GB in 5min
          - Achievable with ~10 HiveServer2 instances streaming data
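The slide does not show how the "~10 HiveServer2 instances" figure was derived. One plausible reading, sketched below, takes the single-client timings as the sustained rate of one instance and assumes instances scale independently. Both are assumptions of this sketch, and the class and method names are hypothetical.

```java
public class Hs2Sizing {

    /**
     * Smallest number of instances needed to move targetGb within
     * targetSeconds, assuming each instance sustains the rate of a measured
     * sample (sampleGb transferred in sampleSeconds). The ratio
     * requiredRate / perInstanceRate is written in cross-multiplied form.
     */
    public static int instancesNeeded(double sampleGb, double sampleSeconds,
                                      double targetGb, double targetSeconds) {
        double ratio = (targetGb * sampleSeconds) / (sampleGb * targetSeconds);
        return (int) Math.ceil(ratio);
    }

    public static void main(String[] args) {
        // Goal from the slide: 50 GB in 5 minutes (300 s).
        // Single-client samples from the slide: 1 GB in 60 s, 10 GB in 500 s.
        System.out.println(instancesNeeded(1, 60, 50, 300));   // prints 10
        System.out.println(instancesNeeded(10, 500, 50, 300)); // prints 9
    }
}
```

Under these assumptions the estimate lands at 9 to 10 instances, in line with the slide's figure. The five-client measurements would give a different per-instance rate, so this should be read as a rough check rather than the presenters' calculation.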