Get to know how to use Apache Pig relational operators such as ORDER, LIMIT, DISTINCT and GROUP.
2. Pig Relational Operator: ORDER
• ORDER sorts the data in ascending or descending order.
1. Numeric fields are sorted numerically.
2. chararray fields are sorted lexically.
3. bytearray fields are sorted lexically.
4. Nulls are considered smaller than all other values, so they
always come first in ascending order and last in descending
order.
Let's work through an example:
grunt> dataTransaction = LOAD '/home/hduser/datasets/store.csv'
USING PigStorage(',') AS
(Product_Name:chararray, CustomerName:chararray,
Transaction_ID:bytearray, TransAmt1:bytearray, TransAmt2:bytearray,
TransAmt3:bytearray, Place:chararray, Department:chararray);
Rupak Roy
3. Pig Relational Operator: ORDER
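Example 1:
grunt> orderbyName = ORDER dataTransaction BY CustomerName;
Example 2:
grunt> orderbyNameNsymbol = ORDER dataTransaction BY date, symbol;
Example 3:
grunt> byCloseDesc = ORDER dataTransaction BY close DESC, open;
Note: date, symbol, close and open are illustrative field names;
they are not part of the store.csv schema.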
Here the close column is sorted in descending order; since no
order is specified for the open column, it is sorted in
ascending order by default.
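The same mixed ascending/descending sort can be sketched on the store.csv relation itself. This is a minimal sketch, not a definitive script: it assumes TransAmt1 holds numeric values that can be cast to double, and the aliases amounts and byAmount are hypothetical names chosen for illustration.

```pig
-- Same LOAD as in the ORDER example
dataTransaction = LOAD '/home/hduser/datasets/store.csv'
    USING PigStorage(',') AS
    (Product_Name:chararray, CustomerName:chararray,
     Transaction_ID:bytearray, TransAmt1:bytearray, TransAmt2:bytearray,
     TransAmt3:bytearray, Place:chararray, Department:chararray);

-- Assumption: TransAmt1 is numeric, so it is cast before sorting;
-- otherwise a bytearray field would be compared lexically
amounts = FOREACH dataTransaction
    GENERATE CustomerName, (double)TransAmt1 AS amt;

-- amt is sorted descending; CustomerName has no keyword,
-- so it falls back to the default ascending order
byAmount = ORDER amounts BY amt DESC, CustomerName;
DUMP byAmount;
```

Because nulls count as the smallest value, any record whose amt is null should appear at the end of this descending sort.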
4. Pig Relational Operator: LIMIT
• LIMIT restricts the number of records returned.
• For example, if we have 1 million rows and dump the results
for testing or any other purpose, displaying them one by one
takes a long time. To check that a script is working, it is
better to view a few results than the whole output.
• However, Pig may still read all the records even when we limit
the output to (say) 20 records, and it can return 20 different
records each time we run the same LIMIT query. We can overcome
this by using the ORDER operator immediately before the LIMIT
operator, which guarantees the same 20 records each time we
run the same query.
5. Pig Relational Operator: LIMIT
grunt> Trecords = LIMIT dataTransaction 20;
grunt> DUMP Trecords;
grunt> DUMP Trecords; -- may display a different set of 20 records
-- to overcome the limit issue, sort first
grunt> ordered = ORDER dataTransaction BY $0;
grunt> Trecords = LIMIT ordered 20;
grunt> DUMP Trecords;
Again, we test for the same results:
grunt> Trecords = LIMIT ordered 20;
grunt> DUMP Trecords;
6. Pig Relational Operator: DISTINCT
• DISTINCT removes duplicate records.
grunt> RD = DISTINCT dataTransaction;
Note: The DISTINCT operator makes use of a combiner (a
"semi-reducer" that runs between the map phase and the
reduce phase) to remove the duplicates.
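DISTINCT compares whole records, so to deduplicate on a single field the relation is first projected down to that field. The following is a minimal sketch, assuming the store.csv schema used in the ORDER example; places and uniqPlaces are hypothetical aliases.

```pig
dataTransaction = LOAD '/home/hduser/datasets/store.csv'
    USING PigStorage(',') AS
    (Product_Name:chararray, CustomerName:chararray,
     Transaction_ID:bytearray, TransAmt1:bytearray, TransAmt2:bytearray,
     TransAmt3:bytearray, Place:chararray, Department:chararray);

-- Keep only the field we want unique values for
places = FOREACH dataTransaction GENERATE Place;

-- Each distinct Place value now appears exactly once
uniqPlaces = DISTINCT places;
DUMP uniqPlaces;
```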
7. Pig Relational Operator: GROUP
• The GROUP operator collects records that share the same key,
which makes it one of the most important operators for
summarizing large datasets.
grunt> grouping = GROUP dataTransaction BY Place;
grunt> DESCRIBE grouping;
grunt> cnt = FOREACH grouping GENERATE group, COUNT($1);
grunt> DUMP cnt;
Applied to our dataset store.csv, this counts the records in each
group, that is, the number of transactions recorded for each place.
Note: to verify a result like this, load the dataset in Excel (it
is only a subset of a larger dataset, so it is easy to view there).
For example, use the filter function and select only Christy
Britain: we will see that she has purchased similar products from
6 different places.
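The number of different places per customer mentioned in the note can also be computed in Pig rather than checked by hand. This is a minimal sketch under the store.csv schema used in the ORDER example; the aliases pairs, uniqPairs, byCustomer and placesPerCustomer are hypothetical, and no output is shown because it depends on the data.

```pig
dataTransaction = LOAD '/home/hduser/datasets/store.csv'
    USING PigStorage(',') AS
    (Product_Name:chararray, CustomerName:chararray,
     Transaction_ID:bytearray, TransAmt1:bytearray, TransAmt2:bytearray,
     TransAmt3:bytearray, Place:chararray, Department:chararray);

-- One record per (customer, place) combination
pairs = FOREACH dataTransaction GENERATE CustomerName, Place;
uniqPairs = DISTINCT pairs;

-- Group the unique pairs by customer and count the places in each bag
byCustomer = GROUP uniqPairs BY CustomerName;
placesPerCustomer = FOREACH byCustomer
    GENERATE group AS CustomerName, COUNT(uniqPairs) AS numPlaces;
DUMP placesPerCustomer;
```

Removing the duplicates before grouping is what turns a plain record count into a count of different places.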
8. Pig Relational Operator: GROUP
• Group on multiple keys:
grunt> grouped = GROUP dataTransaction BY (Place, Department);
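When grouping on multiple keys, the group field is a tuple that holds both key values. The following minimal sketch, assuming the store.csv schema used in the ORDER example, counts the records for each (Place, Department) combination; counts and n are hypothetical names.

```pig
dataTransaction = LOAD '/home/hduser/datasets/store.csv'
    USING PigStorage(',') AS
    (Product_Name:chararray, CustomerName:chararray,
     Transaction_ID:bytearray, TransAmt1:bytearray, TransAmt2:bytearray,
     TransAmt3:bytearray, Place:chararray, Department:chararray);

grouped = GROUP dataTransaction BY (Place, Department);

-- FLATTEN unpacks the (Place, Department) tuple into two top-level fields
counts = FOREACH grouped
    GENERATE FLATTEN(group) AS (Place, Department),
             COUNT(dataTransaction) AS n;
DUMP counts;
```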
9. Next
• More advanced relational operators such as FOREACH, FILTER,
JOIN and more.