1. ETL with Apache Pig
By
Arjun Shah
Under the guidance of
Dr Duc Thanh Tran
2. Agenda
• What is Pig?
• Introduction to Pig Latin
• Installation of Pig
• Getting Started with Pig
• Examples
3. What is Pig?
• Pig is a dataflow language
• The language is called Pig Latin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are compiled into MapReduce jobs
and executed on the Hadoop cluster
• Built for Hadoop
• Originally developed at Yahoo!
• Huge contributions from Hortonworks and Twitter
4. What Pig Does
• Pig was designed for performing a long series of
data operations, making it ideal for three
categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
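An ETL pipeline of the first kind can be sketched in a few lines of Pig Latin (the file names and schema here are hypothetical):

```pig
-- Extract: load raw, comma-delimited log data
raw = LOAD 'logs.csv' USING PigStorage(',')
      AS (user:chararray, url:chararray, ts:long);

-- Transform: keep only valid rows and project the fields we need
clean = FILTER raw BY url IS NOT NULL;
hits  = FOREACH clean GENERATE user, url;

-- Load: write the transformed data back to HDFS
STORE hits INTO 'clean_logs';
```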
5. Features of Pig
• Joining datasets
• Grouping data
• Referring to elements by position rather than name ($0, $1, etc)
• Loading non-delimited data using custom load/store functions (writing a custom reader and writer)
• Creation of user-defined functions (UDF), written in Java
• And more..
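The first two features, joining and grouping, can be sketched in Pig Latin (file names and schemas here are hypothetical):

```pig
-- Hypothetical tab-delimited inputs: employees.txt and departments.txt
emps  = LOAD 'employees.txt' AS (name:chararray, dept_id:int);
depts = LOAD 'departments.txt' AS (dept_id:int, dept_name:chararray);

-- Join the two datasets on dept_id
joined = JOIN emps BY dept_id, depts BY dept_id;

-- Group the joined data by department name and count employees
grouped = GROUP joined BY depts::dept_name;
counts  = FOREACH grouped GENERATE group, COUNT(joined);
```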
6. Pig: Install
• There are some prerequisites that one needs to
follow for installing pig. They are:
• JAVA_HOME should be set up
• Hadoop should be installed (Single node
cluster)
• Useful link:
http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html
14. Pig: Configure
• The user can run Pig in two modes:
• Local mode (pig -x local) - With access to a single
machine, all files are installed and run using the
local host and file system.
• Hadoop (MapReduce) mode (pig -x mapreduce) - This is
the default mode, which requires access to a Hadoop cluster
• The user can run Pig in either mode using the “pig”
command or the “java” command.
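The two modes can be invoked from the command line as follows (a sketch; the script name is hypothetical and pig is assumed to be on the PATH):

```pig
-- Local mode: runs against the local file system
-- $ pig -x local script.pig

-- Hadoop (MapReduce) mode: the default, runs against the cluster
-- $ pig script.pig
-- $ pig -x mapreduce script.pig   (equivalent)
```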
15. Pig: Run
• Script: Pig can run a script file that contains Pig commands.
• For example,
% pig script.pig
• Runs the commands in the local file "script.pig".
• Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on a
command line.
• Grunt: Grunt is an interactive shell for running Pig commands.
• Grunt is started when no file is specified for Pig to run, and the -e option is not used.
• Note: It is also possible to run Pig scripts from within Grunt using run and exec.
• Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs
from Java.
• There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig
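Running a script from within Grunt, as noted above, uses run or exec; a sketch (the script name is hypothetical):

```pig
-- exec runs the script in a separate context (its aliases are not
-- visible afterwards); run executes the script as if its statements
-- were typed at the prompt, so its aliases remain in the session
grunt> exec myscript.pig
grunt> run myscript.pig
```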
16. Pig Latin: Loading Data
• LOAD
- Reads data from the file system
• Syntax
- LOAD 'input' [USING function] [AS schema];
- Eg, A = LOAD 'input' USING PigStorage('\t') AS
(name:chararray, age:int, gpa:float);
17. Schema
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
-name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int,
gpa:float);
-name is now a String (chararray), age is integer
and gpa is float
18. Describing Schema
• Describe
• Provides the schema of a relation
• Syntax
• DESCRIBE [alias];
• If schema is not provided, describe will say “Schema for alias unknown”
• grunt> A = load 'data' as (a:int, b: long, c: float);
• grunt> describe A;
• A: {a: int, b: long, c: float}
• grunt> B = load 'somemoredata';
• grunt> describe B;
• Schema for B unknown.
19. Dump and Store
• Dump writes the output to the console
• grunt> A = load 'data';
• grunt> DUMP A; //This will print contents of A on the console
• Store writes output to an HDFS location
• grunt> A = load 'data';
• grunt> STORE A INTO '/user/username/output'; //This will
write contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered (lazy evaluation)
20. Referencing Fields
• Fields are referred to by positional notation OR by name (alias)
• Positional notation is generated by the system
• Starts with $0
• Names are assigned by you using schemas. Eg, A = load
'data' as (name:chararray, age:int);
• With positional notation, fields can be accessed as
• A = load 'data';
• B = foreach A generate $0, $1; //1st & 2nd column
21. Limit
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
• grunt> A = load 'data';
• grunt> B = LIMIT A 10;
• grunt> DUMP B; --Prints only 10 rows
22. Foreach.. Generate
• Used for data transformations and projections
• Syntax
• alias = FOREACH { block | nested_block };
• nested_block usage later in the deck
• grunt> A = load 'data' as (a1,a2,a3);
• grunt> B = FOREACH A GENERATE *;
• grunt> DUMP B;
• (1,2,3)
• (4,2,1)
• grunt> C = FOREACH A GENERATE a1, a3;
• grunt> DUMP C;
• (1,3)
• (4,1)
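A nested_block wraps statements in braces inside a FOREACH and ends with a GENERATE; a sketch (the relation and field names here are hypothetical):

```pig
-- Count distinct users per URL using a nested block
grpd = GROUP logs BY url;
cnts = FOREACH grpd {
    users = DISTINCT logs.user;
    GENERATE group, COUNT(users) AS distinct_users;
};
```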
23. Filter
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• Example, to filter for 'marcbenioff'
• A = LOAD 'sfdcemployees' USING PigStorage(',') as
(name:chararray, employeesince:int, age:int);
• B = FILTER A BY name == 'marcbenioff';
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND (NOT(name ==
'marcbenioff'));
24. Group By
• Groups data in one or more relations (similar to SQL GROUP BY)
• Syntax:
• alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL
n];
• Eg, to group by employee start year at Salesforce
• A = LOAD 'sfdcemployees' USING PigStorage(',') as (name:chararray,
employeesince:int, age:int);
• B = GROUP A BY employeesince;
• You can also group all rows together
• B = GROUP A ALL;
• Or group by multiple fields
• B = GROUP A BY (age, employeesince);
25. Demo: Sample Data (employee.txt)
• Example contents of 'employee.txt', a tab-delimited text file
• 1 Peter 234000000 none
• 2 Peter_01 234000000 none
• 124163 Jacob 10000 cloud
• 124164 Arthur 1000000 setlabs
• 124165 Robert 1000000 setlabs
• 124166 Ram 450000 es
• 124167 Madhusudhan 450000 e&r
• 124168 Alex 6500000 e&r
• 124169 Bob 50000 cloud
26. Demo: Employees with salary > 100,000 (1 lakh)
• Loading data from employee.txt into the empls relation, with a schema
empls = LOAD 'employee.txt' AS (id:int, name:chararray, salary:double,
dept:chararray);
• Filtering the data as required
rich = FILTER empls BY $2 > 100000;
• Sorting
sortd = ORDER rich BY salary DESC;
• Storing the final results
STORE sortd INTO 'rich_employees.txt';
• Or alternatively we can dump the record on the screen
DUMP sortd;
------------------------------------------------------------------
• Group by salary
grp = GROUP empls BY salary;
• Get count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) as emp_cnt;
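A further step on the same data, sketched here, would compute the average salary per department using the built-in AVG function:

```pig
-- Group by department and compute average salary per group
by_dept = GROUP empls BY dept;
avg_sal = FOREACH by_dept GENERATE group AS dept,
          AVG(empls.salary) AS avg_salary;
DUMP avg_sal;
```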
29. More PigLatin (1/2)
• Load using PigStorage
• empls = LOAD 'employee.txt' USING
PigStorage('\t') AS (id:int, name:chararray,
salary:double, dept:chararray);
• Store using PigStorage
• STORE sortd INTO 'rich_employees.txt' USING
PigStorage('\t');
30. More PigLatin (2/2)
• To view the schema of a relation
• DESCRIBE empls;
• To view step-by-step execution of a series of
statements
• ILLUSTRATE empls;
• To view the execution plan of a relation
• EXPLAIN empls;
47. PigLatin: UDF
• Pig provides extensive support for user-defined
functions (UDFs) as a way to specify custom
processing. Functions can be a part of almost
every operator in Pig
• All UDF names are case sensitive
48. UDF: Types
• Eval Functions (EvalFunc)
• Ex: StringConcat (built-in) : Generates the concatenation of the first two fields
of a tuple.
• Aggregate Functions (EvalFunc & Algebraic)
• Ex: COUNT, AVG ( both built-in)
• Filter Functions (FilterFunc)
• Ex: IsEmpty (built-in)
• Load/Store Functions (LoadFunc/ StoreFunc)
• Ex: PigStorage (built-in)
• Note: URL for built-in functions:
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
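Using a Java UDF from Pig Latin follows a register/define/apply pattern; a sketch (the jar and class names here are hypothetical):

```pig
-- Register the jar containing the compiled UDF
REGISTER myudfs.jar;

-- Optionally define a short alias for the fully qualified class name
DEFINE Upper com.example.pig.Upper();

-- Apply the UDF like any built-in function (names are case sensitive)
A = LOAD 'data' AS (name:chararray);
B = FOREACH A GENERATE Upper(name);
```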
49. Summary
• Pig can be used to run ETL jobs on Hadoop. It
saves you from writing MapReduce code in Java,
and its syntax will look familiar to SQL users.
Nonetheless, it is important to take some time to
learn Pig and to understand its advantages and
limitations. Who knows, maybe pigs can fly after
all.