SlideShare ist ein Scribd-Unternehmen logo
1 von 37
Downloaden Sie, um offline zu lesen
HCatalog 
© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Motivation 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Inside a Data-Driven Company 
■ Analysts use multiple tools for processing data 
● Java MapReduce, Hive and Pig and more 
■ Analysts create many (derivative) datasets 
● Different formats e.g. CSV, JSON, Avro, ORC 
● Files in HDFS 
● Tables in Hive We simply pick a right 
tool and format for 
each application! 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Pig Analysts 
■ To process data, they need to remember 
● where a dataset is located 
● what format it has 
● what the schema is 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Hive Analysts 
■ They store popular datasets in (external) Hive tables 
■ To analyze datasets generated by Pig analysts, they need to 
know 
● where a dataset is located 
● what format it has 
● what the schema is 
● how to load it into Hive table/partition 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Changes, Changes And Changes 
Let’s start using more 
efficient format since 
today! 
NO WAY! 
We would have to re-write, re-test and 
re-deploy our applications! 
This means a lot of engineering work 
for us! 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
MR, Pig, Hive And Data Storage 
■ Hive reads data location, format and schema from metadata 
● Managed by Hive Metastore 
■ MapReduce encodes them in the application code 
■ Pig specifies them in the script 
● Schema can be provided by the Loader 
Conclusion 
■ MapReduce and Pig are sensitive to metadata changes! 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Data Availability 
Jeff: Is your dataset already generated? 
Tom: Check in HDFS! 
Jeff: What is the location? 
Tom: /data/streams/20140101 
Jeff: hdfs dfs -ls /data/streams/20140101 
Not yet…. :( 
Tom: Check it later! 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog 
■ Aims to solve these problems! 
■ First of all 
● It knows the location, format and schema of our datasets! 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog With Pig 
■ Traditional way 
raw = load ‘/data/streams/20140101’ using MyLoader() 
as (time:chararray, host:chararray, userId:int, songId:int, duration:int); 
… 
store output into ‘/data/users/20140101’ using MyStorer(); 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog With Pig 
■ Traditional way 
raw = load ‘/data/streams/20140101’ using MyLoader() 
as (time:chararray, host:chararray, userId:int, songId:int, duration:int); 
… 
store output into ‘/data/users/20140101’ using MyStorer(); 
■ HCatalog way 
raw = load ‘streams’ using HCatLoader(); 
… 
store output into users using HCatStorer(‘date=20140101’); 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Interacting with HCatalog 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent. 
Demo
Demo 
1. Upload a dataset in HDFS 
2. Add the dataset HCatalog 
3. Access the dataset with Hive and Pig 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog CLI 
1. Upload a dataset in HDFS 
$ hdfs dfs -put streams /data 
Streams: timestamp, host, userId, songId, duration 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog CLI 
2. Add the dataset HCatalog 
■ A file with HCatalog DDL can be prepared 
CREATE EXTERNAL TABLE streams(time string, host string, userId int, songId int, duration int) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' 
STORED AS TEXTFILE 
LOCATION '/data/streams'; 
■ And executed by hcat -f 
$ hcat -f streams.hcatalog 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog CLI 
2. Add the dataset HCatalog 
■ Describe the dataset 
$ hcat -e "describe streams" 
OK 
time string None 
host string None 
userid int None 
songid int None 
duration int None 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog CLI 
3. Access the dataset with Pig 
raw_streams = LOAD 'streams' USING org.apache.hcatalog.pig.HCatLoader(); 
all_count = FOREACH (GROUP raw_streams ALL) GENERATE COUNT(raw_streams); 
DUMP all_count; 
$ pig -useHCatalog streams.pig 
… 
(93294) 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog CLI 
3. Access the dataset with Hive 
$ hive -e "select count(*) from streams" 
OK 
93294 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Goals 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Three Main Goals 
1. Provide an abstraction on top of datasets stored in HDFS 
● Just use the name of the dataset, not the path 
2. Enable data discovery 
● Store all datasets (and their properties) in HCatalog 
● Integrate with Web UI 
3. Provide notifications for data availability 
● Process new data immediately when it appears 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Supported Formats 
and Projects 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Supported Projects And Formats 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Custom Formats 
■ A custom format can be supported 
● But InputFormat, OutputFormat, and Hive SerDe must 
be provided 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Pig Interface - HCatLoader 
■ Consists of HCatLoader and HCatStorer 
■ HCatLoader read data from a dataset 
● Indicate which partitions to scan by following the load 
statement with a partition filter statement 
raw = load 'streams' using HCatLoader(); 
valid = filter raw by date = '20140101' and isValid(duration); 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Pig Interface - HCatStorer 
■ HCatStorer writes data to a dataset 
● A specification of partition keys can be also provided 
● Possible to write to a single partition or multiple partitions 
store valid into 'streams_valid' using HCatStorer 
('date=20110924'); 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
MapReduce Interface 
■ Consists of HCatInputFormat and HCatOutputFormat 
■ HCatInputFormat accepts a dataset to read data from 
● Optionally, indicate which partitions to scan 
■ HCatOutputFormat accepts a dataset to write to 
● Optionally, indicated with partition to write to 
● Possible to write to a single partition or multiple partitions 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Hive Interface 
■ There is no Hive-specific interface 
● Hive can read information from HCatalog directly 
■ Actually, HCatalog is now a submodule of Hive 
Conclusion 
■ HCatalog enables non-Hive projects to access Hive tables 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Components 
■ Hive Metastore to store information about datasets 
● A table per dataset is created (the same as in Hive) 
■ hcat CLI 
● Create and drop tables, specify table parameters, etc 
■ Programming interfaces 
● For MapReduce and Pig 
● New ones can be implemented e.g. for Crunch 
■ WebHCat server 
● More about it later 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Features 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Data Discovery 
■ A nice web UI can be build on top of HCatalog 
● You have all Hive tables there for free! 
● See Yahoo!’s illustrative example below 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent. 
Pictures 
come from 
Yahoo’s 
presentation 
at Hadoop 
Summit San 
Jose 2014
Properties Of Datasets 
■ Can store data life-cycle management information 
● Cleaning, archiving and replication tools can learn which 
datasets are eligible for their services 
ALTER TABLE intermediate.featured-streams SET TBLPROPERTIES ('retention' = 30); 
SHOW TBLPROPERTIES intermediate.featured-streams; 
SHOW TBLPROPERTIES intermediate.featured-streams("retention"); 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Notifications 
■ Provides notifications for certain events 
● e.g. Oozie or custom Java code can wait for those events 
and immediately schedule tasks depending on them 
■ Multiple events can trigger notifications 
● A database, a table or a partition is added 
● A set of partitions is added 
● A database, a table or a partition is dropped 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
Evolution Of Data 
■ Allows data producers to change how they write data 
● No need to re-write existing data 
● HCatalog can read old and new data 
● Data consumers don’t have to change their applications 
■ Data location, format and schema can be changed 
● In case of schema, new fields can be added 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
HCatalog Beyond HDFS 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
WebHCat Server 
■ Provides a REST-like web API for HCatalog 
● Send requests to get information about datasets 
curl -s 'http://hostname: 
port/templeton/v1/ddl/database/db_name/table/table_name?user. 
name=adam' 
● Send requests to run Pig or Hive scripts 
■ Previously called Templeton 
© Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
There Is More! 
© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
■ Data engineers, architects and instructors 
■ +4 years of experience in Apache Hadoop 
● Working with hundreds-node Hadoop clusters 
● Developing Hadoop applications in many cool frameworks 
● Delivering Hadoop trainings for +2 years 
■ Passionate about data 
● Speaking at conferences and meetups 
● Blogging and reviewing books 
© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Hortonworks
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopGruter
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hiveReza Ameri
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoHyunsik Choi
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Takrim Ul Islam Laskar
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 

Was ist angesagt? (20)

Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012
 
6.hive
6.hive6.hive
6.hive
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
 
Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Introduction to Hive
Introduction to HiveIntroduction to Hive
Introduction to Hive
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Apache hive
Apache hiveApache hive
Apache hive
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Pptx present
Pptx presentPptx present
Pptx present
 
Sqoop
SqoopSqoop
Sqoop
 

Andere mochten auch

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
Apache Flume
Apache FlumeApache Flume
Apache FlumeGetInData
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
Streaming analytics better than batch when and why - (Big Data Tech 2017)
Streaming analytics better than batch   when and why - (Big Data Tech 2017)Streaming analytics better than batch   when and why - (Big Data Tech 2017)
Streaming analytics better than batch when and why - (Big Data Tech 2017)GetInData
 
The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
 
Apache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on HadoopApache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on HadoopEdward Yoon
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Prasan Samtani
 
Jak utworzyć Company Page na Linkedin
Jak utworzyć Company Page na LinkedinJak utworzyć Company Page na Linkedin
Jak utworzyć Company Page na LinkedinMarcin Czajka
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataDataWorks Summit
 
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)Adam Kawa
 
Apache Flume NG
Apache Flume NGApache Flume NG
Apache Flume NGhuguk
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGAdam Kawa
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeArvind Prabhakar
 

Andere mochten auch (20)

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At Spotify
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Streaming analytics better than batch when and why - (Big Data Tech 2017)
Streaming analytics better than batch   when and why - (Big Data Tech 2017)Streaming analytics better than batch   when and why - (Big Data Tech 2017)
Streaming analytics better than batch when and why - (Big Data Tech 2017)
 
The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and Pain
 
Apache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on HadoopApache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)
 
Jak utworzyć Company Page na Linkedin
Jak utworzyć Company Page na LinkedinJak utworzyć Company Page na Linkedin
Jak utworzyć Company Page na Linkedin
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
 
Hadoop.TW : Now and Future
Hadoop.TW : Now and FutureHadoop.TW : Now and Future
Hadoop.TW : Now and Future
 
Apache Flume NG
Apache Flume NGApache Flume NG
Apache Flume NG
 
HDFS Federation
HDFS FederationHDFS Federation
HDFS Federation
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUG
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 
Apache Avro and You
Apache Avro and YouApache Avro and You
Apache Avro and You
 

Ähnlich wie HCatalog

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0SpringPeople
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Kevin Crocker
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gatestrihug
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxinfinix8
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop EcosystemSlim Bouguerra
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
 
H cat berlinbuzzwords2012
H cat berlinbuzzwords2012H cat berlinbuzzwords2012
H cat berlinbuzzwords2012Hortonworks
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valleymarkgrover
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopHortonworks
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSDataWorks Summit
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014hadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 

Ähnlich wie HCatalog (20)

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gates
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
H cat berlinbuzzwords2012
H cat berlinbuzzwords2012H cat berlinbuzzwords2012
H cat berlinbuzzwords2012
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFS
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
InternReport
InternReportInternReport
InternReport
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 

Mehr von GetInData

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...GetInData
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczGetInData
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competitionGetInData
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? GetInData
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierGetInData
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformGetInData
 
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataModel serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataGetInData
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...GetInData
 
MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...GetInData
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataGetInData
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...GetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataGetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...GetInData
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...GetInData
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...GetInData
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...GetInData
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...GetInData
 

Mehr von GetInData (20)

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competition
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team?
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easier
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
 
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataModel serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
 
MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
 

Kürzlich hochgeladen

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Kürzlich hochgeladen (20)

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

HCatalog

  • 1. HCatalog © Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
  • 2. Motivation © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 3. Inside a Data-Driven Company ■ Analysts use multiple tools for processing data ● Java MapReduce, Hive and Pig and more ■ Analysts create many (derivative) datasets ● Different formats e.g. CSV, JSON, Avro, ORC ● Files in HDFS ● Tables in Hive We simply pick a right tool and format for each application! © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 4. Pig Analysts ■ To process data, they need to remember ● where a dataset is located ● what format it has ● what the schema is © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 5. Hive Analysts ■ They store popular datasets in (external) Hive tables ■ To analyze datasets generated by Pig analysts, they need to know ● where a dataset is located ● what format it has ● what the schema is ● how to load it into Hive table/partition © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 6. Changes, Changes And Changes Let’s start using more efficient format since today! NO WAY! We would have to re-write, re-test and re-deploy our applications! This means a lot of engineering work for us! © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 7. MR, Pig, Hive And Data Storage ■ Hive reads data location, format and schema from metadata ● Managed by Hive Metastore ■ MapReduce encodes them in the application code ■ Pig specifies them in the script ● Schema can be provided by the Loader Conclusion ■ MapReduce and Pig are sensitive to metadata changes! © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 8. Data Availability Jeff: Is your dataset already generated? Tom: Check in HDFS! Jeff: What is the location? Tom: /data/streams/20140101 Jeff: hdfs dfs -ls /data/streams/20140101 Not yet…. :( Tom: Check it later! © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 9. HCatalog ■ Aims to solve these problems! ■ First of all ● It knows the location, format and schema of our datasets! © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 10. HCatalog With Pig ■ Traditional way raw = load ‘/data/streams/20140101’ using MyLoader() as (time:chararray, host:chararray, userId:int, songId:int, duration:int); … store output into ‘/data/users/20140101’ using MyStorer(); © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 11. HCatalog With Pig ■ Traditional way raw = load ‘/data/streams/20140101’ using MyLoader() as (time:chararray, host:chararray, userId:int, songId:int, duration:int); … store output into ‘/data/users/20140101’ using MyStorer(); ■ HCatalog way raw = load ‘streams’ using HCatLoader(); … store output into users using HCatStorer(‘date=20140101’); © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 12. Interacting with HCatalog © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent. Demo
  • 13. Demo 1. Upload a dataset in HDFS 2. Add the dataset HCatalog 3. Access the dataset with Hive and Pig © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 14. HCatalog CLI 1. Upload a dataset in HDFS $ hdfs dfs -put streams /data Streams: timestamp, host, userId, songId, duration © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 15. HCatalog CLI 2. Add the dataset HCatalog ■ A file with HCatalog DDL can be prepared CREATE EXTERNAL TABLE streams(time string, host string, userId int, songId int, duration int) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION '/data/streams'; ■ And executed by hcat -f $ hcat -f streams.hcatalog © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 16. HCatalog CLI 2. Add the dataset HCatalog ■ Describe the dataset $ hcat -e "describe streams" OK time string None host string None userid int None songid int None duration int None © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 17. HCatalog CLI 3. Access the dataset with Pig raw_streams = LOAD 'streams' USING org.apache.hcatalog.pig.HCatLoader(); all_count = FOREACH (GROUP raw_streams ALL) GENERATE COUNT(raw_streams); DUMP all_count; $ pig -useHCatalog streams.pig … (93294) © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 18. HCatalog CLI 3. Access the dataset with Hive $ hive -e "select count(*) from streams" OK 93294 © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 19. Goals © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 20. Three Main Goals 1. Provide an abstraction on top of datasets stored in HDFS ● Just use the name of the dataset, not the path 2. Enable data discovery ● Store all datasets (and their properties) in HCatalog ● Integrate with Web UI 3. Provide notifications for data availability ● Process new data immediately when it appears © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 21. Supported Formats and Projects © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 22. Supported Projects And Formats © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 23. Custom Formats ■ A custom format can be supported ● But InputFormat, OutputFormat, and Hive SerDe must be provided © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 24. Pig Interface - HCatLoader ■ Consists of HCatLoader and HCatStorer ■ HCatLoader read data from a dataset ● Indicate which partitions to scan by following the load statement with a partition filter statement raw = load 'streams' using HCatLoader(); valid = filter raw by date = '20140101' and isValid(duration); © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 25. Pig Interface - HCatStorer ■ HCatStorer writes data to a dataset ● A specification of partition keys can be also provided ● Possible to write to a single partition or multiple partitions store valid into 'streams_valid' using HCatStorer ('date=20110924'); © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 26. MapReduce Interface ■ Consists of HCatInputFormat and HCatOutputFormat ■ HCatInputFormat accepts a dataset to read data from ● Optionally, indicate which partitions to scan ■ HCatOutputFormat accepts a dataset to write to ● Optionally, indicated with partition to write to ● Possible to write to a single partition or multiple partitions © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 27. Hive Interface ■ There is no Hive-specific interface ● Hive can read information from HCatalog directly ■ Actually, HCatalog is now a submodule of Hive Conclusion ■ HCatalog enables non-Hive projects to access Hive tables © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 28. Components ■ Hive Metastore to store information about datasets ● A table per dataset is created (the same as in Hive) ■ hcat CLI ● Create and drop tables, specify table parameters, etc ■ Programming interfaces ● For MapReduce and Pig ● New ones can be implemented e.g. for Crunch ■ WebHCat server ● More about it later © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 29. Features © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 30. Data Discovery ■ A nice web UI can be build on top of HCatalog ● You have all Hive tables there for free! ● See Yahoo!’s illustrative example below © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent. Pictures come from Yahoo’s presentation at Hadoop Summit San Jose 2014
  • 31. Properties Of Datasets ■ Can store data life-cycle management information ● Cleaning, archiving and replication tools can learn which datasets are eligible for their services ALTER TABLE intermediate.featured-streams SET TBLPROPERTIES ('retention' = 30); SHOW TBLPROPERTIES intermediate.featured-streams; SHOW TBLPROPERTIES intermediate.featured-streams("retention"); © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 32. Notifications ■ Provides notifications for certain events ● e.g. Oozie or custom Java code can wait for those events and immediately schedule tasks depending on them ■ Multiple events can trigger notifications ● A database, a table or a partition is added ● A set of partitions is added ● A database, a table or a partition is dropped © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 33. Evolution Of Data ■ Allows data producers to change how they write data ● No need to re-write existing data ● HCatalog can read old and new data ● Data consumers don’t have to change their applications ■ Data location, format and schema can be changed ● In case of schema, new fields can be added © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 34. HCatalog Beyond HDFS © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 35. WebHCat Server ■ Provides a REST-like web API for HCatalog ● Send requests to get information about datasets curl -s 'http://hostname: port/templeton/v1/ddl/database/db_name/table/table_name?user. name=adam' ● Send requests to run Pig or Hive scripts ■ Previously called Templeton © Copyright 2014. All rights reserved. Not to be reproduced without prior written consent.
  • 36. There Is More! © Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
  • 37. ■ Data engineers, architects and instructors ■ +4 years of experience in Apache Hadoop ● Working with hundreds-node Hadoop clusters ● Developing Hadoop applications in many cool frameworks ● Delivering Hadoop trainings for +2 years ■ Passionate about data ● Speaking at conferences and meetups ● Blogging and reviewing books © Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.