Explore other built-in Pig functions, the FLATTEN operator, and the available options for passing parameters.
2. Other Functions
Average:
grunt> grouped = GROUP dataTransaction BY CustomerName;
grunt> average = FOREACH grouped GENERATE group, AVG(dataTransaction.TransAmt1);
COUNT: does not count NULL values (a tuple is skipped when its first field is null)
grunt> cnt = FOREACH grouped GENERATE group, COUNT(dataTransaction);
grunt> dump cnt;
COUNT_STAR: counts all tuples, including those with NULL values
grunt> cntStar = FOREACH grouped GENERATE group, COUNT_STAR($1);
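To make the difference between COUNT and COUNT_STAR concrete, here is a small Python sketch that only mimics the null handling described above. It is not Pig code: the sample bag, the use of None for a Pig null, and the helper names count and count_star are all assumptions made for illustration.

```python
# Hypothetical bag of tuples for one group; None stands in for a Pig null.
bag = [("Carl", 120.0), (None, 75.5), ("Carl", None)]


def count(bag):
    """Mimic Pig COUNT: skip tuples whose first field is null."""
    return sum(1 for t in bag if t[0] is not None)


def count_star(bag):
    """Mimic Pig COUNT_STAR: count every tuple, nulls included."""
    return len(bag)


print(count(bag))       # 2
print(count_star(bag))  # 3
```

The second tuple is the only one skipped by count, because only its first field is null.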
Rupak Roy
3. Concatenate:
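grunt> c = FOREACH concatData GENERATE CONCAT($0,$1);
(concatData is a placeholder name for the relation loaded from concat.csv; a relation alias cannot contain a dot.)
Multiple concatenate:
grunt> c = FOREACH concatData GENERATE CONCAT($0,'-',Transaction_ID);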
IsEmpty: checks whether a bag or map is empty (so $1 must be a bag or map field)
grunt> F = FILTER dataTransaction BY IsEmpty($1);
or
grunt> F = FILTER dataTransaction BY NOT IsEmpty($1);
4. MAX/MIN
grunt> g = GROUP dataTransaction BY CustomerName;
grunt> m = FOREACH g GENERATE group, MIN(dataTransaction.TransAmt1);
or (here the first column is a bag of the group's customer names rather than the single group key)
grunt> m = FOREACH g GENERATE dataTransaction.CustomerName, MIN(dataTransaction.TransAmt1);
For the maximum, use MAX in the same way:
grunt> m = FOREACH g GENERATE dataTransaction.CustomerName, MAX(dataTransaction.TransAmt1);
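5. SIZE: computes the number of elements in a value,
depending on its Pig data type (for example, the number of characters in a
chararray, the number of fields in a tuple, or the number of tuples in a bag)
grunt> S = FOREACH dataTransaction GENERATE SIZE($0), SIZE(CustomerName), SIZE($2);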
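As a rough illustration of how SIZE depends on the data type, the Python sketch below mimics the behaviour with ordinary Python values. The mapping of Python types to Pig types (str for chararray, tuple for tuple, list for bag, dict for map) and the helper name pig_size are assumptions for this example only.

```python
def pig_size(value):
    """Mimic Pig SIZE: an element count that depends on the value's type."""
    if value is None:
        return None        # a null input gives a null result
    if isinstance(value, (str, bytes, tuple, list, dict)):
        # characters, bytes, fields, tuples, or key/value pairs
        return len(value)
    return 1               # numeric scalars count as a single element


print(pig_size("Carl Jackson"))           # 12 characters
print(pig_size(("a", "b", "c")))          # 3 fields
print(pig_size([("b", "c"), ("d", "e")])) # 2 tuples in the bag
print(pig_size(42))                       # 1
```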
6. SUM
grunt> g = GROUP dataTransaction BY CustomerName;
grunt> s = FOREACH g GENERATE dataTransaction.CustomerName, SUM(dataTransaction.TransAmt1);
Note: SUM, MAX/MIN, COUNT, COUNT_STAR and AVG
operate on bags, so a GROUP statement is required
before these functions are applied
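The reason a GROUP step comes first can be sketched in Python: grouping collects the rows for each key into a bag, and the aggregate then runs once per bag. This only mimics the idea; the sample rows and the helper names group_by_name and aggregate are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (CustomerName, TransAmt1)
rows = [("Carl", 10.0), ("Anna", 4.0), ("Carl", 30.0), ("Anna", 6.0)]


def group_by_name(rows):
    """Mimic GROUP ... BY CustomerName: map each key to a bag (list) of rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def aggregate(groups):
    """Mimic FOREACH ... GENERATE group, SUM, AVG, MIN, MAX over each bag."""
    result = {}
    for name, bag in groups.items():
        amounts = [amt for _, amt in bag]
        result[name] = {
            "sum": sum(amounts),
            "avg": sum(amounts) / len(amounts),
            "min": min(amounts),
            "max": max(amounts),
        }
    return result


summary = aggregate(group_by_name(rows))
print(summary["Carl"])
print(summary["Anna"])
```

Without the grouping step there is no bag per customer for the aggregate to work on, which is the point of the note above.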
7. Flatten Operator
It is used to change the structure of tuples and
bags: FLATTEN un-nests tuples and bags.
For example, consider a tuple with the structure
(a,(b,c)). GENERATE $0, FLATTEN($1) turns it into
(a,b,c).
Likewise, for a tuple of the form
(a,{(b,c),(d,e)}), as produced by the
GROUP operator, GENERATE $0, FLATTEN($1)
yields (a,b,c) and (a,d,e).
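The two cases can be sketched in Python to show the shape change. This mimics the un-nesting only and is not Pig code; a Python list stands in for a bag, and the helper names flatten_tuple and flatten_bag are hypothetical.

```python
def flatten_tuple(record):
    """Un-nest a tuple in the second field: (a,(b,c)) -> (a,b,c)."""
    first, nested = record
    return (first, *nested)


def flatten_bag(record):
    """Un-nest a bag in the second field: one output row per inner tuple."""
    first, bag = record
    return [(first, *inner) for inner in bag]


print(flatten_tuple(("a", ("b", "c"))))
# ('a', 'b', 'c')
print(flatten_bag(("a", [("b", "c"), ("d", "e")])))
# [('a', 'b', 'c'), ('a', 'd', 'e')]
```

Flattening a tuple keeps one row and widens it, while flattening a bag multiplies the row, once for each tuple inside the bag.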
8. Run Pig Scripts directly from a file
First create a file and save it with a .pig extension.
Type vi output.pig in the terminal,
then write the Pig script:
A = LOAD 'home/hduser/datasets/store.csv' USING PigStorage(',') AS ( );
B = FOREACH A GENERATE $0, $2;
Now save the file as output.pig (or under any name with a .pig extension)
and execute it from any terminal:
[bob@localhost ~]$ pig -x local /home/hduser/output.pig
Note: to run against HDFS (MapReduce mode, the default), type just 'pig';
for local mode, use 'pig -x local'.
9. Pig gives you 2 options to pass parameters:
1. Using a file: -param_file followed by the path to the parameter file.
2. Using the command line: -p (or -param) followed by a key-value pair of the form param=val.
10. Passing Parameters
USING COMMAND LINE:
Create a new file:
vi output1.pig
A = LOAD 'home/hduser/datasets/store.csv' USING PigStorage(',') AS ( );
B = FILTER A BY Place == '$Place';
DUMP B;
Save the file as output1.pig (or under any name) and execute it from the terminal:
[bob@localhost ~]$ pig -x local -p Place='Alberta' output1.pig
To pass multiple parameters:
pig -x local -p Place='Alberta' -p Age='29' -p Product='electronics' output1.pig
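What the -p option does to the script text can be sketched in Python: each $name placeholder is replaced by the supplied value before the statement runs. This is only an approximation using Python's string.Template, not Pig's own preprocessor, and the helper name substitute is hypothetical.

```python
from string import Template

# Statement text as in output1.pig; $Place is the placeholder to fill in.
script = "B = FILTER A BY Place == '$Place';"


def substitute(script_text, params):
    """Replace each $name placeholder with its value from params."""
    # Template.substitute raises KeyError when a placeholder has no value,
    # so a missing parameter is caught instead of silently left in place.
    return Template(script_text).substitute(params)


print(substitute(script, {"Place": "Alberta"}))
# B = FILTER A BY Place == 'Alberta';
```

The quotes around $Place belong to the script, so the substituted value ends up as a string literal in the filter condition.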
11. USING FILE:
Create a parameter file with vi pfile, then type i to enter insert mode.
Then type one parameter per line in the form name = value:
CustomerName = 'Carl Jackson'
threshold = 5
To exit insert mode, press Esc,
then type :wq! to save the contents.
Run the script with the parameter file:
pig -param_file pfile home/hduser/displayoutput.pig
12. Next
Flume, a distributed, reliable tool for
collecting large amounts of streaming
data.