Learning Objectives - In this module, you will learn what Pig is, which use cases suit Pig, how Pig is tightly coupled with MapReduce, and how to write Pig Latin scripts.
2. What is Pig?
• Apache Pig is a Hadoop platform for creating MapReduce jobs. Pig uses a high-level,
SQL-like programming language named Pig Latin.
The benefits of Pig include:
• Run a MapReduce job with a few simple lines of code.
• Process structured data with a schema, or unstructured data without a
schema. (Pigs eat anything!)
• Pig Latin uses a familiar SQL-like syntax.
• Pig scripts read and write data from HDFS.
• Pig Latin is a data flow language, a logical solution for many MapReduce algorithms.
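The idea of a data flow, where each step consumes the result of the previous one, can be pictured in plain Java. The following is only a rough analogy under stated assumptions: it uses neither Pig nor MapReduce, and the names DataFlowSketch and countWords are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only; it uses neither Pig nor MapReduce.
final class DataFlowSketch {
    // 'loaded' stands in for records that a script would read from storage.
    static Map<String, Long> countWords(List<String> loaded) {
        // Step 1: filter out blank records.
        List<String> filtered = loaded.stream()
                .filter(w -> !w.trim().isEmpty())
                .collect(Collectors.toList());
        // Step 2: group identical words and count the size of each group.
        return filtered.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("pig", "hadoop", "pig", " ")));
    }
}
```

Each intermediate result has a name and feeds the next step, which is the style of processing that a data flow language expresses directly.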
3. Pig Latin
• Pig Latin is a high-level data flow scripting language.
Pig Latin scripts can be executed in one of three ways:
• Pig script: write a Pig Latin program in a text file and execute it using the pig
executable.
• Grunt shell: enter Pig statements manually, one at a time, from a CLI tool known
as the Grunt interactive shell.
• Embedded in Java: use the PigServer class to execute a Pig query from within Java
code.
4. The Grunt Shell
• Grunt is an interactive shell that enables users to enter Pig Latin statements
and also interact with HDFS.
• To enter the Grunt shell, run the pig executable in the PIG_HOME/bin folder.
6. Functions
• Functions in Pig come in four types:
• Eval function: A function that takes one or more expressions and returns another
expression.
• Filter function: A special type of eval function that returns a logical Boolean result.
• Load function: A function that specifies how to load data into a relation from
external storage.
• Store function: A function that specifies how to save the contents of a relation to
external storage.
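The difference between an eval function and a filter function can be pictured with ordinary Java functional interfaces. This is a minimal sketch by analogy, not Pig's API: FunctionKinds, UPPER, NOT_EMPTY and apply are hypothetical names, and load and store functions are not modelled here.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names belong to Pig.
final class FunctionKinds {
    // Eval-style: takes a value and returns another value.
    static final Function<String, String> UPPER = s -> s.toUpperCase(Locale.ROOT);

    // Filter-style: an eval-style function whose result is a Boolean.
    static final Predicate<String> NOT_EMPTY = s -> !s.isEmpty();

    // Keep the values that pass the filter, then transform each survivor.
    static List<String> apply(List<String> values) {
        return values.stream()
                .filter(NOT_EMPTY)
                .map(UPPER)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(apply(List.of("pig", "", "latin")));
    }
}
```

The filter decides which values survive, while the eval-style function changes the values themselves.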
12. User-Defined Functions: Filter UDF
• Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc.
• A filter UDF overrides EvalFunc’s only abstract method, exec(), and returns a Boolean.
13. Filter UDF (Contd.)
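import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Filter UDF: keeps a record only when its first field is an accepted quality code.
public class IsGoodQuality extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // A missing or empty tuple cannot pass the filter.
        if (tuple == null || tuple.size() == 0) { return false; }
        try {
            Object object = tuple.get(0);
            if (object == null) { return false; }
            int i = (Integer) object;
            return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}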
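To try the filtering rule without a Pig installation, the same check can be sketched in plain Java. This is an illustration under stated assumptions rather than a replacement for the UDF: QualityCheck and isGoodQuality are hypothetical names, and the class takes an Integer directly instead of a Pig Tuple.

```java
import java.util.Set;

// Hypothetical stand-alone class; it is not part of Pig and needs no Pig jars.
final class QualityCheck {
    // Quality codes accepted by the IsGoodQuality example.
    private static final Set<Integer> GOOD_CODES = Set.of(0, 1, 4, 5, 9);

    // Mirrors exec(): a missing value or an unlisted code is rejected.
    static boolean isGoodQuality(Integer code) {
        return code != null && GOOD_CODES.contains(code);
    }

    public static void main(String[] args) {
        // Print the decision for a few sample codes, including a missing value.
        for (Integer code : new Integer[] {0, 2, 4, 9, null}) {
            System.out.println(code + " -> " + isGoodQuality(code));
        }
    }
}
```

Keeping the accepted codes in a set makes the rule easy to read and to change, while the null check plays the role of the early return for a missing field.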