3. Pig History
• Born from Yahoo Research, then incubated at Apache
• Built to avoid low-level Map/Reduce programming, without Hive/SQL queries
• Committers from: Yahoo, Hortonworks, LinkedIn, Salesforce, IBM, Twitter, Netflix, and others
• Alan Gates on Pig
4. Pig
• An engine for executing programs on top of Hadoop
• It provides a language, Pig Latin, to specify these programs
5. HDP: Enterprise Hadoop Platform
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and complete platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
[Diagram: the HDP stack. Operational services (Oozie, Ambari, Falcon*) and data services (load & extract via Sqoop, Flume, NFS, and WebHDFS; Knox*; Hive & HCatalog; Pig; HBase) sit on top of core Hadoop (HDFS, YARN, MapReduce, Tez), deployable on OS/VM, cloud, or appliance. Enterprise readiness: high availability, disaster recovery, rolling upgrades, security, and snapshots.]
6. Why use Pig?
• Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited sites by users aged 18 to 25.
8. In Pig Latin

Users = load 'input/users' using PigStorage(',')
        as (name:chararray, age:int);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'input/pages' using PigStorage(',')
        as (user:chararray, url:chararray);
Jnd   = join Fltrd by name, Pages by user;
Grpd  = group Jnd by url;
Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd  = order Smmd by clicks desc;
Top5  = limit Srtd 5;
store Top5 into 'output/top5sites' using PigStorage(',');

9 lines of code, 15 minutes to write
170 lines of MapReduce code reduced to 9 lines of Pig Latin
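For readers more comfortable tracing the logic in a general-purpose language, here is a rough Python equivalent of the pipeline above, run over small in-memory lists (the sample data is invented for illustration):

```python
from collections import Counter

# Invented sample data standing in for input/users and input/pages
users = [("alice", 22), ("bob", 17), ("carol", 24)]          # (name, age)
pages = [("alice", "a.com"), ("carol", "a.com"),
         ("carol", "b.com"), ("bob", "a.com")]               # (user, url)

# filter Users by age >= 18 and age <= 25
fltrd = {name for name, age in users if 18 <= age <= 25}

# join Fltrd by name, Pages by user; group by url; count clicks
clicks = Counter(url for user, url in pages if user in fltrd)

# order by clicks desc; limit 5
top5 = clicks.most_common(5)
print(top5)   # [('a.com', 2), ('b.com', 1)]
```

The Pig version does the same work, but scales it across a Hadoop cluster instead of one process's memory.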
9. Essence of Pig
• Map-Reduce is too low a level; SQL is too high
• Pig Latin is a language intended to sit between the two
– Provides standard relational transforms (join, sort, etc.)
– Schemas are optional: used when available, and can be defined at runtime
– User Defined Functions are first-class citizens
11. Pig Elements
• Pig Latin
– High-level scripting language
– Requires no metadata or schema
– Statements are translated into a series of MapReduce jobs
• Grunt
– Interactive shell
• Piggybank
– Shared repository for User Defined Functions (UDFs)
12. Pig Latin Data Flow
• LOAD (HDFS/HCat): read the data to be manipulated from the file system
• TRANSFORM (Pig): manipulate the data
• DUMP or STORE (HDFS/HCat): output the data to the screen, or store it for processing
In code:
• VARIABLE1 = LOAD [somedata]
• VARIABLE2 = [TRANSFORM operation]
• STORE VARIABLE2 INTO '[some location]'
13. Pig Relations
Pig Latin statements work with relations.
1. A bag is an unordered collection of tuples (which can be of different sizes).
2. A tuple is an ordered set of fields.
3. A field is a piece of data.
[Diagram: a bag contains tuples; each tuple contains fields, e.g. Field 1, Field 2, Field 3.]
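As a rough analogy only (this is Python, not Pig syntax), the nesting of field, tuple, and bag can be modeled with ordinary Python values:

```python
# A field is one piece of data; a tuple is an ordered set of fields;
# a bag is an unordered collection of tuples (tuples may differ in size).
tuple_a = ("alice", 22)               # a tuple with two fields
tuple_b = ("bob", 30, 94016)          # tuples in the same bag can differ in size
bag = {tuple_a, tuple_b}              # a set models the bag's lack of ordering

print(len(bag))       # 2
print(tuple_a[1])     # fields are addressed by position: 22
```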
14. FILTER, GROUP, FOREACH, ORDER

logevents = LOAD 'input/my.log'
            AS (date:chararray, level:chararray, code:int, message:chararray);
severe = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped = GROUP severe BY code;

e1 = LOAD 'pig/input/File1' USING PigStorage(',')
     AS (name:chararray, age:int, zip:int, salary:double);
f = FOREACH e1 GENERATE age, salary;
g = ORDER f BY age;
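The FOREACH, ORDER, and GROUP operators above behave like ordinary projection, sort, and group-by over collections of tuples. A rough Python sketch, with invented sample rows standing in for e1:

```python
from itertools import groupby

# Invented sample rows: (name, age, zip, salary)
e1 = [("ann", 40, 95014, 90.0), ("bob", 25, 10001, 60.0), ("cal", 25, 60601, 70.0)]

# f = FOREACH e1 GENERATE age, salary  -> project two columns
f = [(age, salary) for (_name, age, _zip, salary) in e1]

# g = ORDER f BY age  -> sort by the age field
g = sorted(f, key=lambda t: t[0])

# GROUP g BY age  -> each key paired with the bag of matching tuples
grouped = {k: list(v) for k, v in groupby(g, key=lambda t: t[0])}

print(g)            # [(25, 60.0), (25, 70.0), (40, 90.0)]
print(grouped[25])  # [(25, 60.0), (25, 70.0)]
```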
15. JOIN, GROUP, LIMIT

employees = LOAD '[somefile]'
            AS (name:chararray, age:int, zip:int, salary:double);
agegroup = GROUP employees BY age;
h = LIMIT agegroup 100;

e1 = LOAD '[somefile]' USING PigStorage(',')
     AS (name:chararray, age:int, zip:int, salary:double);
e2 = LOAD '[somefile]' USING PigStorage(',')
     AS (name:chararray, phone:chararray);
e3 = JOIN e1 BY name, e2 BY name;
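JOIN pairs tuples from the two relations whose key fields are equal (an inner join). Roughly, with invented sample rows:

```python
# Invented sample rows for the two relations
e1 = [("ann", 40, 95014, 90.0), ("bob", 25, 10001, 60.0)]  # (name, age, zip, salary)
e2 = [("ann", "555-0100"), ("cal", "555-0199")]            # (name, phone)

# e3 = JOIN e1 BY name, e2 BY name  -> keep pairs with matching names
e3 = [a + b for a in e1 for b in e2 if a[0] == b[0]]

print(e3)   # only "ann" appears in both relations
```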
18. Hive vs Pig
Pig and Hive work well together, and many businesses use both.
Hive is a good choice:
• when you want to query the data
• when you need an answer to specific questions
• if you are familiar with SQL
Pig is a good choice:
• for ETL (Extract -> Transform -> Load)
• for preparing data for easier analysis
• when you have a long series of steps to perform
19. Tool Comparison
© Hortonworks 2012

Feature       | MapReduce       | Pig                                           | Hive
--------------|-----------------|-----------------------------------------------|------------------------------------------
Record format | Key-value pairs | Tuple                                         | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists, char, varchar, decimal, ...
Schema        | Encoded in app  | Declared in script or read by loader          | Read from metadata
Data location | Encoded in app  | Declared in script                            | Read from metadata
Data format   | Encoded in app  | Declared in script                            | Read from metadata
20. T-SQL vs Hadoop Ecosystem

Feature                | T-SQL | Pig           | Hive
-----------------------|-------|---------------|--------
Query data             | Yes   | Yes (in bulk) | Yes
Local variables        | Yes   | Yes           | No
Conditional logic      | Yes   | Limited       | Limited
Procedural programming | Yes   | No            | No
UDFs                   | No    | Yes           | Yes
21. HCatalog: Data Sharing is Hard
This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.
Joe: "Bob, I need today's data."
Bob: "Ok."
Joe: "Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command."
Bob: "Dude, we need HCatalog."
(Photo credit: totalAldo via Flickr)
22. Pig Example
Assume you want to count how many times each of your users went to each of your URLs:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';

© Hortonworks 2013
23. Pig Example
The same count, using HCatalog:

raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();

• No need to know the file location
• No need to declare the schema
• The ds == '20120530' predicate is a partition filter
24. Tools With HCatalog

Feature       | MapReduce + HCatalog                     | Pig + HCatalog                                | Hive
--------------|------------------------------------------|-----------------------------------------------|------------------------------------------
Record format | Record                                   | Tuple                                         | Record
Data model    | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema        | Read from metadata                       | Read from metadata                            | Read from metadata
Data location | Read from metadata                       | Read from metadata                            | Read from metadata
Data format   | Read from metadata                       | Read from metadata                            | Read from metadata

• Pig/MR users can read schema from metadata
• Pig/MR users are insulated from schema, location, and format changes
• All users have access to other users' data as soon as it is committed
26. Data & Metadata REST Services APIs
[Diagram: existing and new applications reach HDFS, HBase, external stores, MapReduce, Pig, Hive, and HCatalog through the WebHDFS and WebHCat RESTful web services.]
WebHDFS & WebHCat provide a RESTful API as a "front door" for Hadoop:
• Opens the door to languages other than Java
• Thin clients via web services vs. fat clients in a gateway
• Insulation from interface changes release to release
Opens Hadoop to integration with existing and new applications.
27. RESTful API Access for Pig
• Code example:

curl -s -d user.name=hue \
     -d execute='<pig script>' \
     'http://localhost:50111/templeton/v1/pig'

• RestSharp (restsharp.org/)
– Simple REST and HTTP API client for .NET
28. WebHCat REST API
Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

{
  "tables": ["counted", "processed"],
  "database": "default"
}

• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Create the rawevents table:

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:

{
  "table": "rawevents",
  "database": "default"
}

Describing the table then returns its metadata:

{
  "columns": [{ "name": "url", "type": "string" },
              { "name": "user", "type": "string" }],
  "database": "default",
  "table": "rawevents"
}
30. Hive-on-MR vs. Hive-on-Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: on MapReduce, the query runs as a chain of separate MR jobs (SELECT b.id; JOIN(a, b); SELECT c.price; JOIN(a, c); then GROUP BY a.state with COUNT(*) and AVERAGE(c.price)), writing intermediate results to HDFS between jobs. On Tez, the same stages run as a single DAG of map and reduce tasks with no intermediate HDFS writes.]

Tez avoids unneeded writes to HDFS.
31. Pig on Tez - Design
[Diagram: a Pig Latin script is parsed into a logical plan, which LogToPhyTranslationVisitor translates into a physical plan; from there, TezCompiler produces a Tez plan for the Tez execution engine, while MRCompiler produces an MR plan for the MR execution engine.]
33. User Defined Functions
• Ultimate in extensibility and portability
• Custom processing in:
– Java
– Python
– JavaScript
– Ruby
• Integration with MapReduce phases:
– Map
– Combine
– Reduce
34. User Defined Functions

public class MyUDF extends EvalFunc<DataBag>
    implements Algebraic {
  …
}

• Algebraic functions
• 3-phase execution
– Map: called once for each tuple
– Combiner: called zero or more times for each map result
– Reduce
35. User Defined Functions

public class MyUDF extends EvalFunc<DataBag>
    implements Accumulator {
  …
}

• Accumulator functions
• Incremental processing of data
• Called in both map and reduce phases
36. User Defined Functions

public class MyUDF extends FilterFunc {
  …
}

• Filter functions
• Return a boolean based on processing of the tuple
• Called in both map and reduce phases
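Since Pig can also register UDFs written in Python (via Jython), the filter-function idea can be sketched as a plain function that takes field values from a tuple and returns a boolean. The function name is_severe and its threshold are made up for this example:

```python
# Hypothetical filter-style UDF: returns a boolean per input tuple.
# In a Pig script this would be registered and used roughly as:
#   REGISTER 'udfs.py' USING jython AS udfs;
#   severe = FILTER logevents BY udfs.is_severe(level, code);
def is_severe(level, code):
    """Keep only severe log events with a server-error status code."""
    return level == 'severe' and code is not None and code >= 500

print(is_severe('severe', 503))   # True
print(is_severe('info', 503))     # False
```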