Apache Pig
Prashant Gupta
PIG Latin
• Pig Latin is a data flow language used for exploring large data sets.
• Rapid development
• No Java is required.
• It is a high-level platform for creating MapReduce programs that run on
Hadoop.
• Pig was originally developed at Yahoo! Research around 2006 to give
researchers an ad-hoc way of creating and executing MapReduce jobs on
very large data sets. In 2007, it was moved into the Apache Software
Foundation.
• Like actual pigs, who eat almost anything, the Pig programming
language is designed to handle any kind of data—hence the name!
What Pig Does
Pig was designed for performing a long series of data operations,
making it ideal for three categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
Features of PIG
• Provides support for data types (long, float, chararray, etc.), schemas,
and functions
• Is extensible and supports User Defined Functions
• Schema not mandatory, but used when available
• Provides common operations like JOIN, GROUP, FILTER, SORT
When not to use PIG
• Really nasty data formats or completely unstructured data.
– Video Files
– Audio Files
– Image Files
– Raw human readable text
• Pig is generally slower than hand-written MapReduce code.
• When you need finer control to optimize your code.
PIG Use Case
PIG Components
Install PIG
• To install Pig:
• untar the .gz file using tar -xvzf pig-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
• export PIG_HADOOP_VERSION=20
(Specifies the version of hadoop that is running)
• export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2
(Specifies the installation directory of hadoop to the environment
variable HADOOP_HOME. Typically defined as /home/user-
name/hadoop-version)
• export PIG_CLASSPATH=$HADOOP_HOME/conf
(Specifies the class path for pig)
• export PATH=$PATH:/home/user-name/pig-0.13.0/bin
(for setting the PATH variable)
• export JAVA_HOME=/usr
(Specifies the java home to the environment variable.)
PIG Modes
• Pig in Local mode
– No HDFS is required; all files are read from and written to the local file system.
– Command: pig -x local
• Pig in MapReduce (Hadoop) mode
– To run Pig scripts in MapReduce mode, ensure you have access to
HDFS. By default, Pig starts in MapReduce mode.
– Command: pig -x mapreduce or simply pig
PIG Program Structure
• Grunt Shell or Interactive mode
– Grunt is an interactive shell for running PIG commands.
• PIG Scripts or Batch mode
– PIG can run a script file that contains PIG commands.
– E.g. pig script.pig (see the sketch below)
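The same statements can be typed interactively at the Grunt prompt or saved in a
script file. A small sketch, assuming the student data set used later in these slides:
grunt> A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
grunt> DUMP A;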
Introducing data types
• A data type is a data storage format that can contain a specific type or
range of values.
– Scalar types
• Sample: int, long, double, chararray, bytearray
– Complex types
• Sample: Tuple, Bag, Map
• Users can declare data types at load time, as below.
– A = LOAD 'test.data' USING PigStorage(',') AS (sno:chararray,
name:chararray, marks:long);
• If a data type is not declared but the script treats a value as a certain type,
Pig will assume it is of that type and cast it.
– A = LOAD 'test.data' USING PigStorage(',') AS (sno, name,
marks);
– B = FOREACH A GENERATE marks * 100; -- marks is implicitly cast for the multiplication
Data types continued…
These types can be described as follows:
• A field/Atom is a piece of data.
Ex: 12.5 or hello world
• A tuple is an ordered set of fields.
Ex: Tuple (12.5,hello world,-2)
It’s most often used as a row in a relation.
It’s represented by fields separated by commas, enclosed by
parentheses.
• A bag is a collection of tuples.
Bag {(12.5,hello world,-2),(2.87,bye world,10)}
A bag is an unordered collection of tuples.
A bag is represented by tuples separated by commas, all
enclosed by curly braces { }.
• Map [key#value]
A map is a set of key/value pairs.
Keys must be unique and must be a chararray (string).
The value can be any type.
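A small sketch of loading and using a map field; the file name and keys here are
hypothetical, and the # operator looks up a value by key:
-- assuming tab-separated input lines like: John    [age#18,gpa#4.0]
A = LOAD 'student_map' AS (name:chararray, info:map[]);
B = FOREACH A GENERATE name, info#'gpa';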
In short…
Relations, Bags, Tuples, Fields
Pig Latin statements work with relations. A relation can be defined as
follows:
• A relation is a bag (more specifically, an outer bag).
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
PIG Latin Statements
• A Pig Latin statement is an operator that takes a relation as input
and produces another relation as output.
• This definition applies to all Pig Latin operators except the LOAD and
STORE commands, which read data from and write data to the file
system.
• In Pig, when a data element is null it means its value is unknown. Data of
any type can be null.
• Pig Latin statements can span multiple lines and must end with a
semicolon ( ; ).
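For example, one statement can span several lines, and null handling can be made
explicit with IS NULL / IS NOT NULL (a sketch; the file and fields are hypothetical):
A = LOAD 'test.data' USING PigStorage(',')
    AS (sno:chararray, name:chararray, marks:long);
B = FILTER A BY marks IS NOT NULL;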
PIG The programming language
• Pig Latin statements are generally organized in the following
manner:
– A LOAD statement reads data from the file system.
– A series of "transformation" statements process the data.
– A STORE statement writes output to the file system;
OR
– A DUMP statement displays output to the screen
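Putting these together (a sketch; the file name and fields are hypothetical):
A = LOAD 'sales.csv' USING PigStorage(',') AS (region:chararray, amount:int); -- load
B = GROUP A BY region;                                                        -- transform
C = FOREACH B GENERATE group, SUM(A.amount);                                  -- transform
STORE C INTO '/output/sales_by_region';                                       -- or: DUMP C;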
MULTIQUERY EXECUTION
• Because DUMP is a diagnostic tool, it will always trigger execution.
However, the STORE command is different.
• In interactive mode, STORE acts like DUMP and will always trigger
execution (this includes the run command), but in batch mode it will not
(this includes the exec command).
• The reason for this is efficiency. In batch mode, Pig will parse the
whole script to see whether there are any optimizations that could be
made to limit the amount of data to be written to or read from disk.
Consider the following simple example:
• A = LOAD 'input/pig/multiquery/A';
• B = FILTER A BY $1 == 'banana';
• C = FILTER A BY $1 != 'banana';
• STORE B INTO 'output/b';
• STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice,
Pig can run this script as a single MapReduce job by reading A once
and writing two output files from the job, one for each of B and C. This
feature is called multiquery execution.
Working with Data
File System Commands
Utility Commands
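A few representative Grunt shell commands (a sketch; the paths and script names
are hypothetical):
grunt> fs -ls /user/data            -- run any HDFS shell command from Grunt
grunt> cat /user/data/student.txt   -- print a file's contents
grunt> exec script.pig              -- run a Pig script in batch mode
grunt> run script.pig               -- run a Pig script in the current session
grunt> help                         -- list available commands
grunt> quit                         -- leave the Grunt shell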
Logical vs. Physical Plan
When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct
and adds it to the logical plan, but it does not load the data from the file
(or even check whether the file exists).
The point is that it makes no sense to start any processing until the
whole flow is defined. Similarly, Pig validates the GROUP and
FOREACH…GENERATE statements, and adds them to the logical
plan without executing them. The trigger for Pig to start execution is the
DUMP statement. At that point, the logical plan is compiled into a
physical plan and executed.
Practice Session
Create a sample file
John,18,4.0
Mary,19,3.8
Bill,20,3.9
Joe,18,3.8
Save it as “student.txt”
Move it to HDFS using the command below.
hadoop fs -put <local path/filename> <hdfs path>
LOAD/DUMP/STORE
A = load 'student.txt' using PigStorage(',') AS
(name:chararray,age:int,gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
store A into '/hdfspath';
Group
Groups the data in a single relation.
B = GROUP A BY age;
DUMP B;
(18,{(John,18,4.0),(Joe,18,3.8)})
(19,{(Mary,19,3.8)})
(20,{(Bill,20,3.9)})
Foreach…Generate
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
(18,2)
(19,1)
(20,1)
C = FOREACH B GENERATE $0, $1.name;
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
Create Sample File
FileA.txt
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
Move it to HDFS using the command below.
hadoop fs -put <localpath> <hdfspath>
Create another Sample File
FileB.txt
2 4
8 9
1 3
2 7
2 9
4 6
4 9
Move it to HDFS using the command below.
hadoop fs -put <localpath> <hdfspath>
Filter
Definition: Selects tuples from a relation based on some condition.
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don’t want.
Examples
A = LOAD 'data' using PigStorage(',') AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
Co-Group
Definition: The GROUP and COGROUP operators are identical. For
readability GROUP is used in statements involving one relation and
COGROUP is used in statements involving two or more relations.
X = COGROUP A BY $0, B BY $0;
(1, {(1, 2, 3)}, {(1, 3)})
(2, {}, {(2, 4), (2, 7), (2, 9)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6),(4, 9)})
(7, {(7, 2, 5)}, {})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
• To see only groups for which all inputs have at least one tuple:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
FileA
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
FileB.txt
2 4
8 9
1 3
2 7
2 9
4 6
4 9
Flatten Operator
• Flatten un-nests tuples as well as bags.
• For tuples, flatten substitutes the fields of a tuple in place of the tuple.
• For example, consider a relation (a, (b, c)).
• GENERATE $0, flatten($1)
– (a, b, c).
• For bags, flatten substitutes bags with new tuples.
• For Example, consider a bag ({(b,c),(d,e)}).
• GENERATE flatten($0),
– will end up with two tuples (b,c) and (d,e).
• When we remove a level of nesting in a bag, sometimes we cause a cross product to
happen.
• For example, consider a relation (a, {(b,c), (d,e)})
• GENERATE $0, flatten($1),
– it will create new tuples: (a, b, c) and (a, d, e).
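A concrete sketch using the grouped student relation B from the earlier GROUP
example (B has the fields group and A, where A is a bag of student tuples); it
produces rows like:
D = FOREACH B GENERATE group, FLATTEN(A.name);
DUMP D;
(18,John)
(18,Joe)
(19,Mary)
(20,Bill)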
JOIN
Definition: Performs a join of two or more relations based on common field values (inner join by default).
Syntax:
X= JOIN A BY $0, B BY $0;
which is equivalent to:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
Y = FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
The result is:
(1, 2, 3, 1, 3)
(4, 2, 1, 4, 6)
(4, 3, 3, 4, 6)
(4, 2, 1, 4, 9)
(4, 3, 3, 4, 9)
(8, 3, 4, 8, 9)
(8, 4, 3, 8, 9)
The intermediate COGROUP result (X) is:
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
Distinct
Removes duplicate tuples in a relation.
X = FOREACH A GENERATE $2;
(3)
(1)
(4)
(3)
(5)
(3)
Y = DISTINCT X;
(1)
(3)
(4)
(5)
CROSS
• Computes the cross product of two or more relations.
Example: X = CROSS A, B;
(1, 2, 3, 2, 4)
(1, 2, 3, 8, 9)
(1, 2, 3, 1, 3)
(1, 2, 3, 2, 7)
(1, 2, 3, 2, 9)
(1, 2, 3, 4, 6)
(1, 2, 3, 4, 9)
(4, 2, 1, 2, 4)
(4, 2, 1, 8, 9)
SPLIT
Partitions a relation into two or more relations.
Example: A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A; (1,2,3) (4,5,6) (7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3) (7,8,9)
Some more commands
• To select a few columns from a dataset
– S1 = FOREACH A GENERATE a1, a2;
• Simple calculation on a dataset
– K = FOREACH A GENERATE $1, $2, $1*$2;
• To display only 100 records
– B = LIMIT A 100;
• To see the structure/schema
– DESCRIBE A;
• To union two datasets
– C = UNION A, B;
Word Count Program
Create a basic wordsample.txt file and move to
HDFS
x = load '/home/pgupta5/prashant/data.txt';
y = foreach x generate flatten (TOKENIZE ((chararray) $0))
as word;
z = group y by word;
counter = foreach z generate group, COUNT(y);
store counter into '/NewPigData/WordCount';
Another Example
i/p: webcount
en google.com 70 2012
en yahoo.com 60 2013
us google.com 80 2012
en google.com 40 2014
us google.com 80 2012
records = LOAD 'webcount' using PigStorage('\t') as (country:chararray,
name:chararray, pagecount:int, year:int);
filtered_records = filter records by country == 'en';
grouped_records = group filtered_records by name;
results = foreach grouped_records generate group, SUM
(filtered_records.pagecount);
sorted_result = order results by $1 desc;
store sorted_result into '/some_external_HDFS_location/data'; -- Hive
external table path
Find Maximum Score
i/p: CricketScore.txt
a = load '/user/cloudera/SampleDataFile/CricketScore.txt'
using PigStorage('\t');
b = foreach a generate $0, $1;
c = group b by $0;
d = foreach c generate group, MAX(b.$1);
dump d;
Sorting Data
Relations are unordered in Pig.
Consider a relation A:
• grunt> DUMP A;
• (2,3)
• (1,2)
• (2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If you want to impose an order on the output, you can use the ORDER
operator to sort a relation by one or more fields.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
• (1,2)
• (2,4)
• (2,3)
Any further processing on a sorted relation is not guaranteed to retain its order.
Using Hive tables with HCatalog
• HCatalog (which is a component of Hive) provides
access to Hive's metastore, so that Pig queries can
reference table schemas by name instead of specifying them in full each time.
• For example, after running through An Example to load
data into a Hive table called records, Pig can access the
table’s schema and data as follows:
• pig -useHCatalog
• grunt> records = LOAD 'School_db.student_tbl'
USING org.apache.hcatalog.pig.HCatLoader();
• grunt> DESCRIBE records;
• grunt> DUMP records;
PIG UDFs
Pig provides extensive support for user defined functions (UDFs) to
specify custom processing.
REGISTER - Registers the JAR file with PIG runtime.
REGISTER myudfs.jar;
-- the JAR file must be available on the local (Linux) file system
A = LOAD 'student_data' using PigStorage(',') AS (name: chararray,
age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
UDF Sample Program
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}
catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}
• (A Pig Java UDF extends the EvalFunc class.)
Diagnostic operator
• DESCRIBE: Prints a relation’s schema.
• EXPLAIN: Prints the logical and physical plans.
• ILLUSTRATE: Shows a sample execution of the logical
plan, using a generated subset of the input.
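Typical usage from the Grunt shell (a sketch, applied to a relation A loaded earlier):
grunt> DESCRIBE A;
grunt> EXPLAIN A;
grunt> ILLUSTRATE A;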
Performance Tuning
Pig does not (yet) determine when a field is no longer needed and drop the field from the
row. For example, say you have a query like:
• Project Early and Often
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
• There is no need for v, y, or z to participate in this query. And there is no need to
carry both t and x past the join, just one will suffice. Changing the query above to the
query below will greatly reduce the amount of data being carried through the map and
reduce phases by pig.
– A = load 'myfile' as (t, u, v);
– A1 = foreach A generate t, u;
– B = load 'myotherfile' as (x, y, z);
– B1 = foreach B generate x;
– C = join A1 by t, B1 by x;
– C1 = foreach C generate t, u;
– D = group C1 by u;
– E = foreach D generate group, COUNT($1);
Performance Tuning
As with early projection, in most cases it is beneficial to apply filters as early as possible
to reduce the amount of data flowing through the pipeline.
• Filter Early and Often
-- Query 1
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = filter A by t == 1;
– D = join C by t, B by x;
– E = group D by u;
– F = foreach E generate group, COUNT($1);
-- Query 2
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
– F = filter E by C.t == 1;
• The first query is clearly more efficient than the second one because
it reduces the amount of data going into the join.
Performance Tuning
Often you are not interested in the entire output but rather a
sample or top results. In such cases, LIMIT can yield a
much better performance as we push the limit as high as
possible to minimize the amount of data travelling through
the pipeline.
• Use the LIMIT Operator
– A = load 'myfile' as (t, u, v);
– B = order A by t;
– C = limit B 500;
Performance Tuning
If types are not specified in the load statement, Pig assumes the
type of double for numeric computations. A lot of the time, your
data would be much smaller, maybe, integer or long. Specifying
the real type will help with speed of arithmetic computation.
• Use Types
– --Query 1
• A = load 'myfile' as (t, u, v);
• B = foreach A generate t + u;
– --Query 2
• A = load 'myfile' as (t: int, u: int, v);
• B = foreach A generate t + u;
• The second query will run more efficiently than the first. In
some of our queries we see a 2x speedup.
Performance Tuning
• Use Joins appropriately.
– Understand skewed vs. replicated vs. merge joins.
• Remove null values before join.
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
• is rewritten by Pig to
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C1 = cogroup A by t INNER, B by x INNER;
– C = foreach C1 generate flatten(A), flatten(B);
Since the nulls from A and B won't be collected together,
when the nulls are flattened we're guaranteed to have an
empty bag, which will result in no output. But they will not
be dropped until the last possible moment.
Performance Tuning
• Hence the previous query should be rewritten as
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– A1 = filter A by t is not null;
– B1 = filter B by x is not null;
– C = join A1 by t, B1 by x;
Now nulls will be dropped before the join. Since all null
keys go to a single reducer, if your key is null even a small
percentage of the time the gain can be significant.
Performance Tuning
• You can set the number of reduce tasks for the
MapReduce jobs generated by Pig using the parallel
(reducer) feature.
– The SET default_parallel command is used at the script
level.
• In this example, all the MapReduce jobs launched by the script use 20
reducers.
– SET default_parallel 20;
– A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
– B = GROUP A BY t;
– C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
– D = ORDER C BY mycount;
– The PARALLEL clause can be used with any operator that starts
a reduce phase, such as GROUP, COGROUP, JOIN, ORDER BY, and
DISTINCT (see the example below).
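A per-operator sketch, reusing relation A from the script above:
B = GROUP A BY t PARALLEL 20;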
Replicated Join
• One of the datasets is small enough that it fits in the memory.
• A replicated join copies the small dataset to the distributed cache
(space that is available on every cluster machine) and loads it into
memory.
• Because the data is available in memory (the distributed cache) and is
processed on the map side of MapReduce, this operation works much
faster than a default join.
• Limitations
It isn’t clear how small the dataset needs to be for using replicated join.
According to the Pig documentation, a relation of up to 100 MB can
be used when the process has 1 GB of memory. A run-time error will
be generated if not enough memory is available for loading the data.
• transactions = load 'customer_transactions' as ( fname, lname, city,
state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district, manager);
Regular join
• sales = join transactions by (state, country), geography by (state,
country);
Replicated join
• sales = join transactions by (state, country), geography by (state,
country) using 'replicated';
Skewed Join
• One of the keys is much more common than others, and the data for
it is too large to fit in the memory.
• Standard joins run in parallel across different reducers by splitting
key values across processes. If there is a lot of data for a certain
key, the data will not be distributed evenly across the reducers, and
one of them will be ‘stuck’ processing the majority of data.
• Skewed join handles this case. It calculates a histogram to check
which key is the most prevalent and then splits its data across
different reducers for optimal performance.
• transactions = load 'customer_transactions' as ( fname,
lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district,
manager);
• sales = join transactions by (state, country), geography
by (state, country) using 'skewed';
Merge Join
• The two datasets are both sorted in ascending order by the join key.
• Datasets may already be sorted by the join key if that’s the order in
which data was entered or they have undergone sorting before the
join operation for other needs.
• When merge join receives the pre-sorted datasets, they are read
and compared on the map side, and as a result they run faster. Both
inner and outer join are available.
• transactions = load 'customer_transactions' as (
fname, lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country,
district, manager);
• sales = join transactions by (state, country),
geography by (state, country) using 'merge';
Thank You
• Question?
• Feedback?
Editor's Notes

  1. Pig is made up of two components: the first is the language itself, which is called Pig Latin (yes, the people naming the various Hadoop projects do tend to have a sense of humor about their naming conventions), and the second is a runtime environment where Pig Latin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application.
  2. As the example is written, this job requires both a Map and a Reduce phase to make the join work, which becomes increasingly inefficient as the customer data set grows in size. This is exactly the scenario that a replicated join optimizes: it tells Pig to distribute the geography set to each node, where it can be joined directly in the Map task, eliminating the need for the Reduce phase altogether.
  3. Skewed join supports both inner and outer joins, though only with two inputs; joins between additional tables should be broken up into further joins. Also, there is a pig.skewedjoin.reduce.memusage Java parameter that specifies the heap fraction available to reducers in order to perform this join. Setting a low value means more reducers will be used, yet the cost of copying the data across them will increase. Pig's developers claim good performance when setting it between 0.1 and 0.4.