Apache Pig is a platform for analyzing large datasets that operates on top of Hadoop. It provides a high-level language called Pig Latin that allows users to express data analysis programs, which Pig then compiles into MapReduce jobs for execution. The main components of the Pig architecture are the Pig Latin parser and optimizer, which generate a logical plan, and the compiler, which converts this into a physical execution plan of MapReduce jobs. Pig aims to simplify big data analysis for users by hiding the complexity of MapReduce.
2. • Pig is a high-level programming language useful for analyzing large data sets. Pig was the result of a development effort at Yahoo!
• In the MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model that data analysts are familiar with. To bridge this gap, an abstraction called Pig was built on top of Hadoop.
3. • Apache Pig enables people to focus on analyzing bulk data sets and to spend less time writing MapReduce programs. Just as pigs eat anything, the Apache Pig programming language is designed to work on any kind of data. That's why the name, Pig!
4. Pig Architecture
The architecture of Pig consists of two components:
• Pig Latin, which is a language
• A runtime environment, for running Pig Latin programs.
A Pig Latin program consists of a series of operations or transformations that are applied to the input data to produce output. These operations describe a data flow, which the Hadoop Pig execution environment translates into an executable representation. Underneath, the results of these transformations are a series of MapReduce jobs of which the programmer is unaware. In this way, Pig in Hadoop allows the programmer to focus on the data rather than on the nature of execution.
Pig Latin is a fairly rigid language that uses familiar keywords from data processing, e.g., Join, Group, and Filter.
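As an illustrative sketch of such a data flow (the file name, field names, and threshold below are all hypothetical), a Pig Latin script chains these keyword-driven operations:

```pig
-- Hypothetical input: tab-separated employee records
emps = LOAD 'employees.tsv' AS (name:chararray, dept:chararray, salary:int);

-- Filter keeps only the rows of interest
well_paid = FILTER emps BY salary > 50000;

-- Group collects rows by department
by_dept = GROUP well_paid BY dept;

-- Count employees per department
counts = FOREACH by_dept GENERATE group AS dept, COUNT(well_paid) AS n;

DUMP counts;
```

Behind the scenes, Pig translates this one script into one or more MapReduce jobs; the author never writes a mapper or reducer.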
5. Apache Pig Architecture in Hadoop
• The Apache Pig architecture consists of a Pig Latin interpreter that uses Pig Latin scripts to process and analyze massive datasets. Programmers use the Pig Latin language to analyze large datasets in the Hadoop environment. Apache Pig has a rich set of operators for performing different data operations like join, filter, sort, load, group, etc.
• Programmers use the Pig Latin language to write a Pig script that performs a specific task. Pig converts these scripts into a series of MapReduce jobs to ease the programmers' work. Pig Latin programs can be executed via several mechanisms, such as the Grunt shell, script files, and embedded programs.
6. The Apache Pig architecture consists of the following major components:
• Parser
• Optimizer
• Compiler
• Execution Engine
• Execution Mode
7. Pig Latin Scripts
• Pig scripts are submitted to the Pig execution environment to produce the desired results. You can execute Pig scripts using one of the following methods:
• Grunt shell
• Script file
• Embedded script
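As a rough sketch, these three methods correspond to the following invocations (the script name is illustrative):

```
pig                  # Grunt shell: type Pig Latin statements interactively
pig wordcount.pig    # Script file: run a saved Pig Latin script in batch
# Embedded script: drive Pig from a host language,
# e.g. via the Java PigServer class
```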
8. Parser
The parser handles all Pig Latin statements or commands. It performs several checks on the Pig statements, such as syntax and type checking, and generates a DAG (Directed Acyclic Graph) as output. The DAG represents the logical operators of the script as nodes and the data flows as edges.
Optimizer
Once parsing is complete and the DAG has been generated, the output is passed to the optimizer. The optimizer then performs optimization activities on it, such as split, merge, projection, pushdown, transform, and reorder. By performing projection and pushdown, the optimizer omits unnecessary data or columns, which improves query performance.
Compiler
The compiler compiles the output generated by the optimizer into a series of MapReduce jobs. It automatically converts Pig jobs into MapReduce jobs and optimizes performance by rearranging the execution order.
Execution Engine
After all the above operations, the MapReduce jobs are submitted to the execution engine and executed on the Hadoop platform to produce the desired results. You can then use the DUMP statement to display the results on screen, or the STORE statement to store the results in HDFS (Hadoop Distributed File System).
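For example, a script typically ends with one of these two statements (the relation and path names here are hypothetical):

```pig
DUMP counts;                        -- display the relation on screen
STORE counts INTO '/output/counts'
    USING PigStorage(',');          -- write the relation to HDFS as CSV
```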
Execution Mode
Apache Pig runs in one of two execution modes: local and MapReduce. The choice of execution mode depends on where the data is stored and where you want to run the Pig script. You can either store your data locally (on a single machine) or in a distributed Hadoop cluster environment.
Local Mode: You can use local mode if your dataset is small. In local mode, Pig runs in a single JVM using the local host and file system. Parallel mapper execution is not possible in this mode, since everything runs on the local host. You can use the pig -x local command to specify local mode.
MapReduce Mode: Apache Pig uses MapReduce mode by default. In MapReduce mode, a programmer executes Pig Latin statements on data that is already stored in HDFS (Hadoop Distributed File System). You can use the pig -x mapreduce command to specify MapReduce mode.
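Assuming a script named myscript.pig, the two modes are selected like this:

```
pig -x local myscript.pig      # local mode: single JVM, local file system
pig -x mapreduce myscript.pig  # MapReduce mode (the default): data in HDFS,
                               # jobs run on the Hadoop cluster
```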
10. Apache Pig Components
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax
of the script, does type checking, and other miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph),
which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries
out the logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed, producing the desired results.
11. Pig Latin Data Model
• The data model of Pig Latin is fully nested, and it allows complex non-atomic datatypes such as map and tuple. Given below is a diagrammatic representation of Pig Latin's data model.
12. • Atom
Any single value in Pig Latin, irrespective of its data type, is known as an atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data, or a simple atomic value, is known as a field.
Example: 'raja' or '30'
• Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example: (Raja, 30)
• Bag
A bag is an unordered set of tuples. In other words, a collection of (non-unique) tuples is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example: {(Raja, 30), (Mohammad, 45)}
A bag can also be a field in a tuple; in that context, it is known as an inner bag.
Example: (Raja, 30, {(9848022338, raja@gmail.com)})
• Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by '[]'.
Example: [name#Raja, age#30]
• Relation
A relation is a bag of tuples. Relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
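A LOAD schema can declare all of these types at once. Here is a sketch with hypothetical file and field names:

```pig
-- Each record carries an atom, a tuple, a bag, and a map
people = LOAD 'people.txt' AS (
    name:chararray,                       -- atom (simple field)
    address:(city:chararray, zip:int),    -- tuple (ordered fields)
    phones:{t:(num:chararray)},           -- bag (unordered set of tuples)
    props:map[]                           -- map (chararray keys, any values)
);
-- The relation 'people' itself is a bag of such tuples.
```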
13. MapReduce vs. Apache Pig
• Apache Pig is a scripting language; MapReduce is a compiled language.
• Apache Pig provides a higher level of abstraction; MapReduce provides a low level of abstraction.
• Apache Pig requires few lines of code (10 lines of Pig can summarize 200 lines of MapReduce code); MapReduce requires more extensive code.
• Apache Pig requires less development time and effort; MapReduce requires more.
• Apache Pig has lower code efficiency; MapReduce code is more efficient in comparison.
14. Apache Pig Features
• Allows programmers to write fewer lines of code. Programmers can express 200 lines of Java code in only ten lines of Pig Latin.
• Apache Pig's multi-query approach reduces development time.
• Apache Pig has a rich set of operators for performing operations like join, filter, sort, load, group, etc.
• The Pig Latin language is very similar to SQL. Programmers with good SQL knowledge find it easy to write Pig scripts.
• Apache Pig handles both structured and unstructured data analysis.
15. Apache Pig Applications
• Processes large volumes of data
• Supports quick prototyping and ad hoc queries across large datasets
• Performs data processing in search platforms
• Processes time-sensitive data loads
• Used by telecom companies to de-identify user call data information