Big Data Warehousing: Pig vs. Hive Comparison

Big Data Warehousing Meetup

Today’s Topic: Exploring Big Data
Analytics Techniques with Datameer

Sponsored By:

WELCOME!
Joe Caserta
Founder & President, Caserta Concepts

Agenda
7:00 Networking
Grab a slice of pizza and a drink...

7:15 Joe Caserta: Welcome
President, Caserta Concepts; Author, Data Warehouse ETL Toolkit
About the Meetup and about Caserta Concepts

7:30 Elliott Cordo: Pig and Hive
Principal Consultant, Caserta Concepts
Walkthrough of these powerful native Hadoop tools

7:50 Adam Gugliciello: Datameer
Solutions Engineer, Datameer

8:10 - More Networking
9:00 Tell us what you’re up to…

About BDW Meetup
• Big Data is a complex, rapidly
changing landscape

• We want to share our stories and
hear about yours

• Great networking opportunity for like-minded data nerds

• Opportunities to collaborate on
exciting projects

• Next BDW Meetup: April 22.
• Topic: Intro to NoSQL Databases

About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems

Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education

Founded in 2001

• President: Joe Caserta, industry thought leader,
consultant, educator and co-author, The Data
Warehouse ETL Toolkit (Wiley, 2004)

Client Portfolio
Finance & Insurance

Retail/eCommerce & Manufacturing

Education & Services

Expertise & Offerings
Strategic Roadmap/Assessment/Consulting

Big Data Analytics

Data Warehousing/ETL/Data Integration

BI/Visualization/Analytics

Master Data Management

Opportunities
Does this word cloud excite you?

Speak with us about our open positions: jobs@casertaconcepts.com

Contacts

Joe Caserta
President & Founder, Caserta Concepts
P: (855) 755-2246 x227
E: joe@casertaconcepts.com

Erik Laurence
VP Marketing, Caserta Concepts
P: (855) 755-2246 x528
E: erik@casertaconcepts.com

Elliott Cordo
Principal Consultant, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com

info@casertaconcepts.com
1 (855) 755-2246
www.casertaconcepts.com

ANALYZING DATA: PIG AND HIVE
Elliott Cordo
Principal Consultant, Caserta Concepts

Big Data Analysis
• Let’s review some tools for analyzing and processing Big
Data

• We will go over some simple use cases – point out what is
interesting about them

• Develop a point of view of what each one is well suited for.

Big Data Analysis – Map Reduce?
Distributed programming framework – Divide and Conquer!
• Master divides work into digestible chunks and distributes them to worker nodes -> MAP
• Work from the nodes is then collected by the master and combined to form an answer -> REDUCE

Powerful tool for solving interesting computational problems at scale
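
The divide-and-conquer flow described on this slide can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's API: the sample records, the function names, and the in-process "workers" are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample input: (user_id, occupation) pairs invented for illustration.
records = [
    (1, "student"), (2, "engineer"), (3, "student"),
    (4, "writer"), (5, "student"), (6, "engineer"),
]

def split_into_chunks(items, n_chunks):
    """Master step: divide the work into roughly equal chunks."""
    return [items[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """MAP: a worker emits (key, 1) for every record in its chunk."""
    return [(occupation, 1) for _, occupation in chunk]

def shuffle(mapped_chunks):
    """Collect the values emitted for each key across all workers."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """REDUCE: combine the values for each key into one answer."""
    return {key: sum(values) for key, values in grouped.items()}

def count_by_occupation(items, n_chunks=3):
    chunks = split_into_chunks(items, n_chunks)
    # On a real cluster each chunk would be mapped on a different worker node.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))

print(count_by_occupation(records))
```

The result does not depend on how many chunks the work is split into, which is what lets the framework spread the map step across as many nodes as are available.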

HELP
• We are doing low-level language coding to perform low-
level operations

• For productivity we need higher level tools!

• We will get help from a few animals!

[Diagram: N1, N2, N3, N4, N5 over the Hadoop Distributed File System (HDFS)]

HIVE
• The Hadoop “Data Warehouse”

• HiveQL is a SQL-like interface that lets you lay a “relational-db like” structure on top of non-relational or unstructured data
  • Flat files, JSON, web logs
  • HBase, Cassandra, other NoSQL stores like MongoDB

• Thanks to ODBC/JDBC drivers some conventional BI
tools can interact with Hive

• Ability to integrate custom programming, mappers,
reducers

HIVE
But don’t get too excited!
• Hive is not a database, especially in terms of optimizations.

• SQL is translated into MapReduce jobs; expect even simple queries to take around a minute or more. Start query, go get coffee.

• But now that expectations have been set, it’s still a very useful tool

HIVE DDL – Create and load a table
Looks like a typical table declaration, except we are specifying the ingested file format.

hive> create table user_movie_ratings(
> user_id int,
> movie_id int,
> rating int,
> time_unix_ts string)
> row format delimited
> fields terminated by '\t'
> stored as textfile;
OK
Time taken: 0.395 seconds

hive> load data inpath '/user/hive/staging/data/u.data' overwrite into table
user_movie_ratings;
Loading data to table default.user_movie_ratings
Deleted hdfs://localhost:54310/user/hive/warehouse/user_movie_ratings
Table default.user_movie_ratings stats: [num_partitions: 0, num_files: 1, num_rows: 0,
total_size: 1979173, raw_data_size: 0]
OK

HIVE DDL – Create an external table
This time we don’t want Hive to own this data’s lifecycle.

hive> create external table user (
> user_id int,
> age int,
> gender string,
> occupation string,
> postal_code int )
> row format delimited fields terminated by '|'
> location '/user/hive/staging/user';
OK

HIVE – YAY SQL!
hive> select occupation, count(1)
> from user_movie_ratings m
> join user u on u.user_id=m.user_id
> group by occupation;

Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Total MapReduce CPU Time Spent: 47 seconds 170 msec
OK

administrator 7479
artist 2308
doctor 540
educator 9442
engineer 8175
entertainment 2095
….
retired 1609
salesman 856
scientist 2058
student 21957
technician 3506
writer 5536

Hmmm..
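
To try the shape of this query without a cluster, the sketch below runs the same join and group-by against an in-memory SQLite database from Python. SQLite stands in for Hive here purely for illustration; the sample rows and the helper names `build_sample_db` and `ratings_per_occupation` are invented, and the `order by` is added only to make the output deterministic.

```python
import sqlite3

def build_sample_db():
    """Create the two tables from the slides with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript('''
        create table "user" (user_id int, age int, gender text,
                             occupation text, postal_code text);
        create table user_movie_ratings (user_id int, movie_id int,
                                         rating int, time_unix_ts text);
    ''')
    conn.executemany(
        'insert into "user" values (?, ?, ?, ?, ?)',
        [(1, 30, "F", "engineer", "10001"),
         (2, 22, "M", "student", "10002"),
         (3, 41, "F", "student", "10003")],
    )
    conn.executemany(
        "insert into user_movie_ratings values (?, ?, ?, ?)",
        [(1, 10, 4, "0"), (1, 11, 5, "0"), (2, 10, 3, "0"),
         (3, 12, 2, "0"), (3, 10, 5, "0"), (3, 11, 1, "0")],
    )
    return conn

def ratings_per_occupation(conn):
    """Same join and group-by as the HiveQL query on the slide."""
    query = '''
        select occupation, count(1)
        from user_movie_ratings m
        join "user" u on u.user_id = m.user_id
        group by occupation
        order by occupation
    '''
    return conn.execute(query).fetchall()

conn = build_sample_db()
print(ratings_per_occupation(conn))
```

Note that `count(1)` counts joined rating rows, not users: the sample data has two students, yet the student count is 4 because those two users contributed four ratings between them.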

PIG
• Powerful High Level Programming Language

• SQL-ish, small learning curve for SQL and procedural
programmers

• Excellent for data transformation, ETL

• Not meant to be an ad-hoc query tool, happy with doing
grunt work

• Plenty of supported file formats and databases, plus the ability to create custom UDFs

PIG Example
grunt> lens_users = load '/user/movie_lens/u.user' using PigStorage('|') as
(user_id:int, age:int, gender:chararray, occupation:chararray, postal_code:int);

grunt> lens_data = load '/user/movie_lens/u.data' using PigStorage('\t') as
(user_id:int, movie_id:int, rating:int, time_unix_ts:chararray);

grunt> joined = join lens_users by user_id, lens_data by user_id;

grunt> grouped = group joined by (occupation);

grunt> results = FOREACH grouped GENERATE COUNT_STAR(joined),*;

grunt> store results into '/user/movie_lens_user_summary';

Interesting: we are doing our aggregate functions after grouping.

PIG - Results
Grouping in PIG is a fair deviation from SQL -> original elements are preserved in a bag
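
The difference from SQL can be mimicked in plain Python. The sketch below is not Pig; `group_by` and `count_star` are hypothetical helpers written for this example, and the joined rows are invented. It only shows the shape of the data: grouping yields each key together with a bag of the original rows, and the aggregate is computed afterwards over that bag.

```python
from collections import defaultdict

# Hypothetical joined rows: (user_id, occupation, movie_id, rating).
joined = [
    (1, "engineer", 10, 4),
    (2, "student", 10, 3),
    (3, "student", 12, 2),
    (1, "engineer", 11, 5),
]

def group_by(rows, key_index):
    """Mimic Pig's GROUP: return (group_key, bag_of_original_rows) pairs."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return sorted(bags.items())

def count_star(grouped):
    """Mimic FOREACH ... GENERATE COUNT_STAR(bag), *: count, key, and the bag."""
    return [(len(bag), key, bag) for key, bag in grouped]

grouped = group_by(joined, key_index=1)
results = count_star(grouped)
for row in results:
    print(row)
```

Each output row still carries the full bag of original rows next to its count, whereas a SQL `group by` would return only the key and the aggregate.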

Summary
Hive:
• Helpful for ETL
• Very good for Ad-Hoc Analysis - Not necessarily suited
for front end users but definitely helpful for data analysts
• Directly leverages SQL expertise!!

PIG:
• Great for ETL
• Powerful transformation and processing capabilities
• SQL-like, but different in many ways; will take some time to master.
