Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
2. PIG Latin
• Pig Latin is a data flow language used for exploring large data sets.
• Rapid development
• No Java is required.
• It is a high-level platform for creating MapReduce programs used
with Hadoop.
• Pig was originally developed at Yahoo Research around 2006 to give
researchers an ad-hoc way of creating and executing MapReduce
jobs on very large data sets. In 2007, it was moved into the
Apache Software Foundation.
• Like actual pigs, who eat almost anything, the Pig programming
language is designed to handle any kind of data—hence the name!
3. What Pig Does
Pig was designed for performing a long series of data operations,
making it ideal for three categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
4. Features of PIG
• Provides support for data types (long, float, chararray, etc.), schemas,
and functions
• Is extensible and supports User Defined Functions
• Schema not mandatory, but used when available
• Provides common operations like JOIN, GROUP, FILTER, SORT
5. When not to use PIG
• Really nasty data formats or completely unstructured data:
– Video Files
– Audio Files
– Image Files
– Raw human readable text
• PIG can be slower than hand-tuned MapReduce code.
• When you need finer-grained control to optimize your code.
8. Install PIG
•To install pig
• untar the .gz file using tar -xvzf pig-0.13.0-bin.tar.gz
•To initialize the environment variables, export the following:
• export PIG_HADOOP_VERSION=20
(Specifies the version of hadoop that is running)
• export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2
(Specifies the installation directory of hadoop to the environment
variable HADOOP_HOME. Typically defined as /home/user-
name/hadoop-version)
• export PIG_CLASSPATH=$HADOOP_HOME/conf
(Specifies the class path for pig)
• export PATH=$PATH:/home/user-name/pig-0.13.0/bin
(for setting the PATH variable)
• export JAVA_HOME=/usr
(Specifies the java home to the environment variable.)
9. PIG Modes
• Pig in Local mode
– No HDFS is required; all files are read from the local file system.
– Command: pig -x local
• Pig in MapReduce(hadoop) mode
– To run PIG scripts in MR mode, ensure you have access to
HDFS. By default, PIG starts in MapReduce mode.
– Command: pig -x mapreduce or simply pig
10. PIG Program Structure
• Grunt Shell or Interactive mode
– Grunt is an interactive shell for running PIG commands.
• PIG Scripts or Batch mode
– PIG can run a script file that contains PIG commands.
– E.g. pig script.pig
11. Introducing data types
• Data type is a data storage format that can contain a specific type or
range of values.
– Scalar types
• Sample: int, long, double, chararray, bytearray
– Complex types
• Sample: Tuple, Bag, Map (an Atom is a single scalar value)
12. • User can declare data types at load time as below.
– A = LOAD 'test.data' using PigStorage(',') AS (sno:chararray,
name:chararray, marks:long);
• If a data type is not declared but the script treats a value as a certain type,
Pig will assume it is of that type and cast it.
– A = LOAD 'test.data' using PigStorage(',') AS (sno, name,
marks);
– B = FOREACH A GENERATE marks * 100; -- marks cast to long
14. Data types continued…
Relation can be defined as follows:
• A field/Atom is a piece of data.
Ex:12.5 or hello world
• A tuple is an ordered set of fields.
EX: Tuple (12.5,hello world,-2)
It’s most often used as a row in a relation.
It’s represented by fields separated by commas, enclosed by
parentheses.
15. • A bag is a collection of tuples.
Bag {(12.5,hello world,-2),(2.87,bye world,10)}
A bag is an unordered collection of tuples.
A bag is represented by tuples separated by commas, all
enclosed in curly braces.
• Map [key value]
A map is a set of key/value pairs.
Keys must be unique and be a string (chararray).
The value can be any type.
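As a rough analogy (an illustration only, not how Pig stores data internally), the complex types above can be pictured with ordinary Python values:

```python
# A rough Python analogy for Pig's data model (illustration only):
pig_tuple = (12.5, "hello world", -2)        # tuple: an ordered set of fields
pig_bag = [(12.5, "hello world", -2),        # bag: an unordered collection of
           (2.87, "bye world", 10)]          # tuples (a list is used here)
pig_map = {"name": "John", "gpa": 4.0}       # map: chararray keys, any value type

print(pig_tuple[1])    # fields are accessed by position -> hello world
print(len(pig_bag))    # a bag holds any number of tuples -> 2
print(pig_map["gpa"])  # map values are looked up by key -> 4.0
```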
16. In short…
Relations, Bags, Tuples, Fields
Pig Latin statements work with relations. A relation can be defined as
follows:
• A relation is a bag (more specifically, an outer bag).
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
17. PIG Latin Statements
• A Pig Latin statement is an operator that takes a relation as input
and produces another relation as output.
• This definition applies to all Pig Latin operators except the LOAD and
STORE commands, which read data from and write data to the file
system.
• In PIG, when a data element is null it means it is unknown. Data of
any type can be null.
• Pig Latin statements can span multiple lines and must end with a
semicolon ( ; ).
18. PIG The programming language
• Pig Latin statements are generally organized in the following
manner:
– A LOAD statement reads data from the file system.
– A series of "transformation" statements process the data.
– A STORE statement writes output to the file system;
OR
– A DUMP statement displays output to the screen
19. MULTIQUERY EXECUTION
•Because DUMP is a diagnostic tool, it will always trigger execution.
However, the STORE command is different.
• In interactive mode, STORE acts like DUMP and will always trigger
execution (this includes the run command), but in batch mode it will not
(this includes the exec command).
•The reason for this is efficiency. In batch mode, Pig will parse the
whole script to see whether there are any optimizations that could be
made to limit the amount of data to be written to or read from disk.
20. Consider the following simple example:
• A = LOAD 'input/pig/multiquery/A';
• B = FILTER A BY $1 == 'banana';
• C = FILTER A BY $1 != 'banana';
• STORE B INTO 'output/b';
• STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice,
Pig can run this script as a single MapReduce job by reading A once
and writing two output files from the job, one for each of B and C. This
feature is called multiquery execution.
24. Logical vs. Physical Plan
When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct
and adds it to the logical plan, but it does not load the data from the file
(or even check whether the file exists).
The point is that it makes no sense to start any processing until the
whole flow is defined. Similarly, Pig validates the GROUP and
FOREACH…GENERATE statements, and adds them to the logical
plan without executing them. The trigger for Pig to start execution is the
DUMP statement. At that point, the logical plan is compiled into a
physical plan and executed.
26. Create a sample file
John,18,4.0
Mary,19,3.8
Bill,20,3.9
Joe,18,3.8
Save it as “student.txt”
Move it to HDFS by using the below command.
hadoop fs -put <local path - filename> <hdfs path>
27. LOAD/DUMP/STORE
A = load 'student.txt' using PigStorage(',') AS
(name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
store A into '/hdfspath';
28. Group
Groups the data in one relation.
B = GROUP A BY age;
DUMP B;
(18,{(John,18,4.0),(Joe,18,3.8)})
(19,{(Mary,19,3.8)})
(20,{(Bill,20,3.9)})
29. Foreach…Generate
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
(18,2)
(19,1)
(20,1)
C = FOREACH B GENERATE $0, $1.name;
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
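The GROUP / FOREACH…GENERATE pipeline above can be sketched in plain Python (an illustration of the semantics, not of how Pig executes it):

```python
from collections import defaultdict

# Rows of the sample student relation: (name, age, gpa)
rows = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

# B = GROUP A BY age: each group key maps to the bag of whole tuples
groups = defaultdict(list)
for row in rows:
    groups[row[1]].append(row)

# C = FOREACH B GENERATE group, COUNT(A)
counts = {age: len(bag) for age, bag in groups.items()}
print(counts)  # {18: 2, 19: 1, 20: 1}

# C = FOREACH B GENERATE $0, $1.name: project just the names within each bag
names = {age: [t[0] for t in bag] for age, bag in groups.items()}
print(names[18])  # ['John', 'Joe']
```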
30. Create Sample File
FileA.txt
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
Move it to HDFS by using the below command.
hadoop fs -put <localpath> <hdfspath>
31. Create another Sample File
FileB.txt
2 4
8 9
1 3
2 7
2 9
4 6
4 9
Move it to HDFS by using the below command.
hadoop fs -put <localpath> <hdfspath>
32. Filter
Definition: Selects tuples from a relation based on some condition.
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don’t want.
Examples
A = LOAD 'data' using PigStorage(',') AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
33. Co-Group
Definition: The GROUP and COGROUP operators are identical. For
readability GROUP is used in statements involving one relation and
COGROUP is used in statements involving two or more relations.
X = COGROUP A BY $0, B BY $0;
(1, {(1, 2, 3)}, {(1, 3)})
(2, {}, {(2, 4), (2, 7), (2, 9)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6),(4, 9)})
(7, {(7, 2, 5)}, {})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
•To keep only the groups for which all inputs have at least one tuple:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
FileA
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
FileB.txt
2 4
8 9
1 3
2 7
2 9
4 6
4 9
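The COGROUP above can be sketched in plain Python (a sketch of the semantics using the FileA/FileB rows, not of Pig's execution):

```python
# FileA rows (f0, f1, f2) and FileB rows (f0, f1) from the slides
file_a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
file_b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# X = COGROUP A BY $0, B BY $0: one output row per key, with a bag per input
keys = sorted({t[0] for t in file_a} | {t[0] for t in file_b})
cogrouped = [(k,
              [t for t in file_a if t[0] == k],   # bag of A tuples for this key
              [t for t in file_b if t[0] == k])   # bag of B tuples for this key
             for k in keys]

# ... INNER, ... INNER: keep only keys where every input contributed a tuple
inner = [row for row in cogrouped if row[1] and row[2]]
print([row[0] for row in inner])  # [1, 4, 8]
```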
34. Flatten Operator
• FLATTEN un-nests tuples as well as bags.
• For tuples, FLATTEN substitutes the fields of a tuple in place of the tuple.
• For example, consider a relation (a, (b, c)).
• GENERATE $0, FLATTEN($1)
– yields (a, b, c).
• For bags, FLATTEN substitutes the bag with new tuples.
• For example, consider a bag ({(b,c),(d,e)}).
• GENERATE FLATTEN($0)
– will end up with two tuples, (b,c) and (d,e).
• When we remove a level of nesting in a bag, sometimes we cause a cross product to
happen.
• For example, consider a relation (a, {(b,c), (d,e)}).
• GENERATE $0, FLATTEN($1)
– will create new tuples: (a, b, c) and (a, d, e).
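The two FLATTEN behaviors can be sketched in plain Python (an illustration of the semantics only):

```python
# FLATTEN on a tuple splices the tuple's fields into the enclosing row
row = ("a", ("b", "c"))
flattened = (row[0],) + row[1]
print(flattened)  # ('a', 'b', 'c')

# FLATTEN on a bag emits one output row per tuple in the bag; any other
# generated fields are crossed with each of those tuples
row2 = ("a", [("b", "c"), ("d", "e")])
crossed = [(row2[0],) + t for t in row2[1]]
print(crossed)  # [('a', 'b', 'c'), ('a', 'd', 'e')]
```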
35. JOIN
Definition: Performs a join of two or more relations based on common field values.
Syntax:
X= JOIN A BY $0, B BY $0;
which is equivalent to:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
Y = FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
The result is:
(1, 2, 3, 1, 3)
(4, 2, 1, 4, 6)
(4, 3, 3, 4, 6)
(4, 2, 1, 4, 9)
(4, 3, 3, 4, 9)
(8, 3, 4, 8, 9)
(8, 4, 3, 8, 9)
The intermediate COGROUP result (relation X) is:
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
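The join result above can be reproduced with a plain Python sketch of the semantics (not Pig's actual execution strategy):

```python
file_a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
file_b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# X = JOIN A BY $0, B BY $0: concatenate every pair of tuples whose
# join keys match (keys 2 and 7 have no partner, so they drop out)
joined = [a + b for a in file_a for b in file_b if a[0] == b[0]]
print(len(joined))  # 7 result rows, as on the slide
```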
37. CROSS
•Computes the cross product of two or more relations.
Example: X = CROSS A, B;
(1, 2, 3, 2, 4)
(1, 2, 3, 8, 9)
(1, 2, 3, 1, 3)
(1, 2, 3, 2, 7)
(1, 2, 3, 2, 9)
(1, 2, 3, 4, 6)
(1, 2, 3, 4, 9)
(4, 2, 1, 2, 4)
(4, 2, 1, 8, 9)
38. SPLIT
Partitions a relation into two or more relations.
Example: A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
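The SPLIT above amounts to evaluating each condition over every tuple, which can be sketched in plain Python; note a tuple can land in several outputs (or in none):

```python
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
x = [r for r in rows if r[0] < 7]
y = [r for r in rows if r[1] == 5]
z = [r for r in rows if r[2] < 6 or r[2] > 6]
print(x)  # [(1, 2, 3), (4, 5, 6)]
print(y)  # [(4, 5, 6)]
print(z)  # [(1, 2, 3), (7, 8, 9)]
```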
39. Some more commands
• To select a few columns from one dataset
– S1 = foreach A generate a1, a2;
• Simple calculation on dataset
– K = foreach A generate $1, $2, $1*$2;
• To display only 100 records
– B = limit A 100;
• To see the structure/Schema
– Describe A;
• To Union two datasets
– C = UNION A,B;
40. Word Count Program
Create a basic wordsample.txt file and move it to
HDFS.
x = load '/wordsample.txt';
y = foreach x generate flatten(TOKENIZE((chararray) $0))
as word;
z = group y by word;
counter = foreach z generate group, COUNT(y);
store counter into '/NewPigData/WordCount';
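The same pipeline can be sketched in plain Python to show what each Pig statement contributes (tokenize and flatten, then group and count):

```python
from collections import Counter

# Each input line becomes one row; TOKENIZE + FLATTEN turn it into
# one row per word, which GROUP / COUNT then tally up
lines = ["hello world", "hello pig latin", "pig pig"]
words = [w for line in lines for w in line.split()]
counts = Counter(words)
print(counts["pig"])    # 3
print(counts["hello"])  # 2
```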
42. Another Example
i/p: webcount
en google.com 70 2012
en yahoo.com 60 2013
us google.com 80 2012
en google.com 40 2014
us google.com 80 2012
records = LOAD 'webcount' using PigStorage('\t') as (country:chararray,
name:chararray, pagecount:int, year:int);
filtered_records = filter records by country == 'en';
grouped_records = group filtered_records by name;
results = foreach grouped_records generate group,
SUM(filtered_records.pagecount);
sorted_result = order results by $1 desc;
store sorted_result into '/some_external_HDFS_location/data'; -- Hive
external table path
43. Find Maximum Score
i/p: CricketScore.txt
a = load '/user/cloudera/SampleDataFile/CricketScore.txt'
using PigStorage('\t');
b = foreach a generate $0, $1;
c = group b by $0;
d = foreach c generate group, MAX(b.$1);
dump d;
44. Sorting Data
Relations are unordered in Pig.
Consider a relation A:
• grunt> DUMP A;
• (2,3)
• (1,2)
• (2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If you want to impose an order on the output, you can use the ORDER
operator to sort a relation by one or more fields.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
• (1,2)
• (2,4)
• (2,3)
Any further processing on a sorted relation is not guaranteed to retain its order.
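The mixed-direction ORDER above can be sketched in plain Python (negating the numeric secondary key gives the descending order):

```python
rows = [(2, 3), (1, 2), (2, 4)]

# B = ORDER A BY $0, $1 DESC: first field ascending, second descending
ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
print(ordered)  # [(1, 2), (2, 4), (2, 3)]
```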
45. Using Hive tables with HCatalog
• HCatalog (which is a component of Hive) provides
access to Hive's metastore, so that Pig queries can
reference Hive table schemas by name.
• For example, after running through An Example to load
data into a Hive table called records, Pig can access the
table’s schema and data as follows:
• pig -useHCatalog
• grunt> records = LOAD 'School_db.student_tbl'
USING org.apache.hcatalog.pig.HCatLoader();
• grunt> DESCRIBE records;
• grunt> DUMP records;
46. PIG UDFs
Pig provides extensive support for user defined functions (UDFs) to
specify custom processing.
REGISTER - Registers a JAR file with the PIG runtime.
REGISTER myudfs.jar;
-- the JAR file should be available on the local file system
A = LOAD 'student_data' using PigStorage(',') AS (name: chararray,
age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
47. UDF Sample Program
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        }
        catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
• (Pig’s Java UDF extends functionalities of EvalFunc)
48. Diagnostic operator
• DESCRIBE: Prints a relation’s schema.
• EXPLAIN: Prints the logical and physical plans.
• ILLUSTRATE: Shows a sample execution of the logical
plan, using a generated subset of the input.
49. Performance Tuning
Pig does not (yet) determine when a field is no longer needed and drop the field from the
row. For example, say you have a query like:
• Project Early and Often
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
• There is no need for v, y, or z to participate in this query, and there is no need to
carry both t and x past the join; just one will suffice. Changing the query above to the
query below will greatly reduce the amount of data being carried through the map and
reduce phases by Pig.
– A = load 'myfile' as (t, u, v);
– A1 = foreach A generate t, u;
– B = load 'myotherfile' as (x, y, z);
– B1 = foreach B generate x;
– C = join A1 by t, B1 by x;
– C1 = foreach C generate t, u;
– D = group C1 by u;
– E = foreach D generate group, COUNT($1);
50. Performance Tuning
As with early projection, in most cases it is beneficial to apply filters as early as possible
to reduce the amount of data flowing through the pipeline.
• Filter Early and Often
-- Query 1
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = filter A by t == 1;
– D = join C by t, B by x;
– E = group D by u;
– F = foreach E generate group, COUNT($1);
-- Query 2
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
– F = filter E by C.t == 1;
• The first query is clearly more efficient than the second one because
it reduces the amount of data going into the join.
51. Performance Tuning
Often you are not interested in the entire output but rather a
sample or the top results. In such cases, LIMIT can yield
much better performance, as Pig pushes the limit as high in the
plan as possible to minimize the amount of data travelling through
the pipeline.
• Use the LIMIT Operator
– A = load 'myfile' as (t, u, v);
– B = order A by t;
– C = limit B 500;
52. Performance Tuning
If types are not specified in the load statement, Pig assumes the
type double for numeric computations. Much of the time, your
data could be represented by a smaller type, such as int or long. Specifying
the real type will help with the speed of arithmetic computation.
• Use Types
– --Query 1
• A = load 'myfile' as (t, u, v);
• B = foreach A generate t + u;
– --Query 2
• A = load 'myfile' as (t: int, u: int, v);
• B = foreach A generate t + u;
• The second query will run more efficiently than the first. In
some of our queries we have seen a 2x speedup.
53. Performance Tuning
• Use joins appropriately.
– Understand skewed vs. replicated vs. merge joins.
• Remove null values before join.
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
• is rewritten by Pig to
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C1 = cogroup A by t INNER, B by x INNER;
– C = foreach C1 generate flatten(A), flatten(B);
Since the nulls from A and B won't be collected together,
when the nulls are flattened we're guaranteed to have an
empty bag, which will result in no output. But they will not
be dropped until the last possible moment.
54. Performance Tuning
• Hence the previous query should be rewritten as
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– A1 = filter A by t is not null;
– B1 = filter B by x is not null;
– C = join A1 by t, B1 by x;
Now nulls will be dropped before the join. Since all null
keys go to a single reducer, if your key is null even a small
percentage of the time the gain can be significant.
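The effect of dropping null keys early can be sketched in plain Python with hypothetical rows (None plays the role of Pig's null join key):

```python
a = [(1, "u1"), (None, "u2"), (2, "u3")]
b = [(1, "x1"), (None, "x2")]

# A1/B1 = filter ... is not null: nulls can never match in a join,
# so dropping them early keeps them off the single reducer that
# would otherwise receive every null key
a1 = [t for t in a if t[0] is not None]
b1 = [t for t in b if t[0] is not None]
joined = [s + t for s in a1 for t in b1 if s[0] == t[0]]
print(joined)  # [(1, 'u1', 1, 'x1')]
```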
55. Performance Tuning
• You can set the number of reduce tasks for the
MapReduce jobs generated by Pig using the parallel
reducer feature.
– The SET default_parallel command is used at the script
level.
• In this example all the MapReduce jobs that get launched use 20
reducers.
– SET default_parallel 20;
– A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
– B = GROUP A BY t;
– C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
– D = ORDER C BY mycount;
– The PARALLEL clause can be used with any operator that
starts a reduce phase, such as group, cogroup, join,
order by, and distinct.
56. Replicated Join
• One of the datasets is small enough that it fits in memory.
• A replicated join copies the small dataset to the distributed cache
(space that is available on every cluster machine) and loads it into
memory.
• Because the data is available in memory and is processed on
the map side of MapReduce, this operation works much faster than
a default join.
57. • Limitations
It isn't clear how small the dataset needs to be to use a replicated join.
According to the Pig documentation, a relation of up to 100 MB can
be used when the process has 1 GB of memory. A run-time error will
be generated if not enough memory is available for loading the data.
58. • transactions = load 'customer_transactions' as (fname, lname, city,
state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district, manager);
Regular join:
• sales = join transactions by (state, country), geography by (state,
country);
Replicated join:
• sales = join transactions by (state, country), geography by (state,
country) using 'replicated';
59. Skewed Join
• One of the keys is much more common than the others, and the data for
it is too large to fit in memory.
• Standard joins run in parallel across different reducers by splitting
key values across processes. If there is a lot of data for a certain
key, the data will not be distributed evenly across the reducers, and
one of them will be ‘stuck’ processing the majority of data.
• Skewed join handles this case. It calculates a histogram to check
which key is the most prevalent and then splits its data across
different reducers for optimal performance.
60. • transactions = load 'customer_transactions' as ( fname,
lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district,
manager);
• sales = join transactions by (state, country), geography
by (state, country) using 'skewed';
61. Merge Join
• The two datasets are both sorted in ascending order by the join key.
• Datasets may already be sorted by the join key if that’s the order in
which data was entered or they have undergone sorting before the
join operation for other needs.
• When merge join receives the pre-sorted datasets, they are read
and compared on the map side, and as a result they run faster. Both
inner and outer join are available.
62. • transactions = load 'customer_transactions' as (
fname, lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country,
district, manager);
• sales = join transactions by (state, country),
geography by (state, country) using 'merge';
Pig is made up of two components: the first is the language itself, which is called PigLatin (yes, people naming various Hadoop projects do tend to have a sense of humor associated with their naming conventions), and the second is a runtime environment where PigLatin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application.
As the example is written, this job requires both a map and a reduce phase to make the join work, which becomes increasingly inefficient as the customer data set grows in size. This is the exact scenario that is optimized by using a replicated join.
The replicated join tells Pig to distribute the geography set to each node, where it can be joined directly in the map task, eliminating the need for the reduce phase altogether.
Skewed join supports both inner and outer joins, though only with two inputs; joins between additional tables should be broken up into further joins. Also, there is a pig.skewedjoin.reduce.memusage Java parameter that specifies the heap fraction available to reducers in order to perform this join. Setting a low value means more reducers will be used, yet the cost of copying the data across them will increase. Pig's developers claim good performance when setting it between 0.1 and 0.4.