WELCOME - KNOWBIGDATA
Interact - Ask Questions
Lifetime access to content
Class Recording
Cluster Access
24x7 support
Real Life Project
Quizzes & Certification Test
10 x (3hr class)
Socio-Pro Visibility
Mock Interviews
ABOUT ME
2014 KnowBigData: Founded KnowBigData
2014 Amazon: Built high-throughput systems for the Amazon.com site using in-house NoSQL
2012 InMobi: Built a recommender after churning 200 TB
2011 tBits Global: Founded tBits Global; built an enterprise-grade Document Management System
2006 D.E. Shaw: Built big data systems before the term was coined
2002 IIT Roorkee: Finished B.Tech somehow.
COURSE CONTENT
I Understanding Big Data, Hadoop Architecture
II Environment Overview, MapReduce Basics
III Advanced MapReduce & Testing
IV Pig & Pig Latin
V Analytics Using Hive
VI NoSQL, HBase
VII Oozie, Mahout
VIII ZooKeeper, Apache Storm
IX Apache Flume, Apache Spark
X YARN, Big Data Sets & Project Assignment
TODAY'S CLASS
Introduction
Wordcount Example
Pig Latin vs SQL
Pig Latin vs MapReduce
Use Cases
Philosophy
Installation
Using from Cloud Labs
Local Mode
Command Line Args
Prompt - Grunt
Average Val Example
Data Types - Scalar
Data Types - Complex
Schema
Basic Constructions
LOAD
Diagnostic
STORE & DUMP
Foreach
TODAY'S CLASS (CONTD.)
FILTER
Order By
JOIN
Limit & Sample
Parallel
Flatten
Calling External Functions
Foreach Nested
Fragmented Joins
Skewed Join - Merge
Left Outer Join
Right Outer Join
Full Join
CoGroup
Union
Cross
Stream
Params
Splits
Non-Linear Execution
PIG
An engine for executing data flows in parallel on Hadoop.
• Language: Pig Latin
• Provides operators: join, order by, filter, …
• Runs on Hadoop. Makes use of both
• HDFS
• MapReduce
PIG - WORD COUNT EXAMPLE
-- Load input from the file named big.txt, and call the single
-- field in the record 'line'. (Note: 'input' is a reserved word
-- in Pig Latin, so the alias is 'lines'.)
lines = load 'big.txt' as (line);
-- TOKENIZE splits the line into a field for each word.
-- flatten will take the collection of records returned by
-- TOKENIZE and produce a separate record for each one, calling the single
-- field in the record word.
words = foreach lines generate flatten(TOKENIZE(line)) as word;
-- Now group them together by each word.
grpd = group words by word;
-- Count them.
cntd = foreach grpd generate group, COUNT(words);
-- Print out the results.
dump cntd;
PIG LATIN - A DATA FLOW LANGUAGE
Allows describing how data:
1. Should be read
2. Should be processed
3. Should be stored to multiple outputs in parallel
Supports complex workflows:
1. Multiple inputs are joined
2. Data is split into multiple streams
No ifs and fors (no control-flow constructs)
PIG LATIN vs MAP REDUCE
• Pig Latin provides join, filter, group by, order by, union; MapReduce provides operations such as group by and order by only indirectly, and joins are hard
• Pig Latin provides optimisations such as rebalancing the reducers; the same optimisations are really hard to write in MapReduce
• Pig understands your scripts, enabling error checking and optimisation; your MapReduce code is opaque to the framework
• Pig Latin has a lower cost to write and maintain than Java MapReduce code, which requires more effort and time
• Brevity: 9 lines of Pig Latin vs ~170 lines of MapReduce
• Pig Latin has limitations; MapReduce gives more control
PIG LATIN vs SQL
• Pig Latin: the user describes exactly how to process the input data; SQL: users describe what question they want answered
• Pig Latin: a long series of data operations; SQL: built around answering one question
• Pig Latin is designed for Hadoop; SQL is designed for RDBMS
• Pig Latin targets multiple data operations and parallel processing; in SQL, a query with multiple data operations gets complex
• Pig Latin can be used when the schema is unknown, incomplete, or inconsistent; SQL requires well-formatted data
PIG - USE CASES
1. ETL
2. Research on data
3. Iterative processing
4. Batch processing
(On the abstraction spectrum, Pig sits between raw MapReduce and SQL.)
PIG PHILOSOPHY
A. Pigs eat anything
• Data: Relational, nested, or unstructured.
B. Pigs live anywhere
• A parallel data processing language; not only Hadoop
C. Pigs are domestic animals
• Controllable; provides custom functions in Java/Python
• Custom load and store methods
D. Pigs fly
• Designed for performance
PIG - INSTALLING
1. Download 0.13.0 from Pig’s release page
2. tar -xzvf pig-0.13.0.tar.gz
3. mv pig-0.13.0 /usr/local/pig
4. export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
5. Run pig -x local and test if it is working
6. export PIG_CLASSPATH=/etc/hadoop
7. Add pig/bin folder to $PATH
8. Run pig and enjoy the grunts
PIG - USING FROM CLOUD LABS
ssh student@hadoop1.knowbigdata.com (the password is the same as mentioned earlier). You can also try hadoop2, 3, 4.
pig
In HDFS, you have the permissions to manipulate /user/student/
The data has been uploaded to /user/student/data in HDFS
PIG - LOCAL MODE
pig -x local
Works without a Hadoop installation
Very useful for testing locally
Improves the development pace
PIG - COMMAND LINE ARGUMENTS
The first argument can be a pig script file: pig <myscript>
To execute a single command, e.g.: pig -e ls
-version: print the version
-h: see more options
PIG - GRUNT
The prompt "grunt>"
This is where all of the commands are typed
You can also control Hadoop from here:
fs (-ls, -cat, or any other hadoop command)
kill <jobid>
exec or run <pig-script>
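A hedged sample session (the paths follow the cloud-labs slide earlier; average.pig is a hypothetical script name):

grunt> fs -ls /user/student/data
grunt> fs -cat /user/student/data/NYSE_dividends
grunt> exec average.pig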
PIG - BASIC CONSTRUCTION
• Each processing step results in a new dataset (relation)
• Keywords are case-insensitive; UDF/function names are case-sensitive
• Multiline comments: /* */
• Single-line comments: --
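A minimal sketch showing both comment styles (it borrows the NYSE_dividends file used on the following slides):

/* Load the dividends file and
   keep only the symbol column. */
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
syms = foreach divs generate symbol; -- single-line comment: project one field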
PIG - MY FIRST PIG SCRIPT - AVERAGE VAL
Input (local: data/NYSE_dividends; HDFS: /user/student/data/NYSE_dividends):
NYSE CPO 2009-12-30 0.11
NYSE CPO 2009-09-28 0.12
NYSE CPO 2009-06-26 0.13
NYSE CCS 2009-10-28 0.41
NYSE CCS 2009-04-29 0.43
NYSE CIF 2009-12-09 .029
NYSE CIF 2009-12-09 .028
Expected output:
CPO 0.12
CCS 0.42
CIF .0285
1. divs = load '/user/student/data/NYSE_dividends' as (name, ticker, date, val);
2. dump divs;
3. explain divs;
4. grped = group divs by ticker;
5. avged = foreach grped generate group, AVG(divs.val);
6. store avged into '/user/student/data-out/avged';
7. cat /user/student/data-out/avged/part*
PIG - DATATYPES
Scalar
1. int
2. long
3. float - 4 bytes
4. double - 8 bytes
5. chararray
6. bytearray
PIG - DATATYPES
Complex
1. Map
1. ['name'#'bob','age'#55]
2. Maps a chararray key to another complex type or a scalar
2. Tuple
1. Fixed length, ordered collection
2. Made up of fields and their values.
3. Think of it as one row in RDBMS and fields as columns
4. Can refer the fields by position (0 based)
5. ('bob', 55, 12.3)
3. BAG
1. Unordered collection of tuples
2. {('ram', 55, 12.3), ('sally', 52, 11.2)}
1. Two tuples each with three fields
3. Analogous to SET of rows / dataset
4. Does not need to fit in memory. Spills over to disk
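A small sketch tying the three complex types together (the 'people' file and its fields are hypothetical):

people = load 'people' as (info:map[], pos:tuple(x:int, y:int), friends:bag{t:(name:chararray)});
out = foreach people generate info#'city', pos.x, friends; -- map lookup, tuple projection, whole bag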
PIG - DATATYPES
Schemas
1. Pig is not very particular about schemas
2. You can either tell upfront, e.g.
1. dividends = load 'NYSE_dividends' as (exchange:chararray,
symbol:chararray, date:chararray, dividend:float);
3. Or it will make best type guesses based on data
4. Schema declaration for complex types:
1. a or a:map[] or a:map[a:tuple(x:int, y:bytearray)] …
2. a:tuple or a:tuple(x, y) or a:tuple(x:int, y:bytearray,…)….
3. a:bag{} or a:bag{t:(x:int, y:int)} or a:bag{t:(x, y)}
5. You can cast data from one type to another where possible
1. Similar to Java, by prefixing the value with "(type)"
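For instance, a minimal cast sketch reusing the dividends file (fields loaded without a type default to bytearray):

divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
in_float = foreach divs generate symbol, (float)dividend; -- cast bytearray to float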
PIG - LOAD
• divs = load 'NYSE_dividends'
• Loads the values guessing the datatype
• Tab is used as separator
• Absolute or relative paths both work
• divs = load 'NYSE_dividends' using PigStorage(',');
• To load comma separated values
• divs = load 'NYSE_div*'
• You can specify the wildcard character such as '*'
• divs = load 'NYSE_dividends' using HBaseStorage();
• You can load an HBASE table this way
• divs = load 'NYSE_dividends' as (name:chararray, ticker:chararray,
date, val:float);
• Define the data types upfront
• second_table = LOAD 'second_table.json' USING
JsonLoader('food:chararray, person:chararray, amount:int');
PIG - DIAGNOSTIC
explain dataset
Shows a detailed analysis of the query: the logical, physical, and MapReduce plans
describe dataset
Shows the schema
Syntax Highlighting Tools
Eclipse http://code.google.com/p/pig-eclipse
Emacs http://github.com/cloudera/piglatin-mode
TextMate http://www.github.com/kevinweil/pig.tmbundle
Vim http://www.vim.org/scripts/script.php?script_id=2186
PIG - STORE / DUMP
Store
• Stores the data to a file or other storage
• store processed into 'processed' using PigStorage(',');
Dump
• Prints the value on the screen, like a print()
• Useful for
• debugging or
• using with scripts that redirect output
• Syntax: dump variable; (or relation or field name)
PIG - RELATIONAL OPERATIONS - FOREACH
1. divs = load '/user/student/data/NYSE_dividends' as (name, ticker, date, val);
2. dump divs;
3. values = foreach divs generate ticker, val;
4. dump values;
Takes expressions and applies them to every record
PIG - FOREACH
prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
Basic Operations such as subtraction/addition
• gain = foreach prices generate close - open;
Using position references
• gain2 = foreach prices generate $6 - $3;
Ranges
• For exchange, symbol, date, open
• beginning = foreach prices generate ..open;
• For open, high, low, close
• middle = foreach prices generate open..close;
• For volume, adj_close
• end = foreach prices generate volume..;
PIG
Will it have any reducer code generated?
gain = foreach prices generate (close - open);
Answer: No. A foreach with no grouping is map-only, so no reduce phase is generated.
PIG - FOREACH
Extract Data from complex types:
bball = load 'baseball' as (name:chararray, team:chararray, traits:bag{}, bat:map[]);
avg = foreach bball generate name, bat#'batting_average';
Tuple Projection
A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$0;
Bag Projection
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.x;
A = load 'input' as (b:bag{t:(x:int, y:int, z:int)});
B = foreach A generate b.(x, y);
PIG - FOREACH
Built-in function list:
http://pig.apache.org/docs/r0.13.0/func.html
• divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
• in_cents = foreach divs generate dividends * 100.0 as dividend, dividends * 100.0;
• describe in_cents;
PIG - GROUP BY
• B = GROUP A BY f1;
• B = GROUP A BY (f1, f2);
Groups item on key(s)
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
B = GROUP A BY age;
DUMP B;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DESCRIBE B;
B: {group: int,A: {name: chararray,age: int,gpa: float}}
PIG - FLATTEN
sandeep {(c),(c++)}
arun {(shell),(c)}
cmaini {(java),(c++),(lisp)}
(sandeep,c)
(sandeep,c++)
(arun,shell)
(arun,c)
(cmaini,java)
(cmaini,c++)
(cmaini,lisp)
1. x = load 'myfile' as (name, languages:bag{t:(p)});
2. p1 = foreach x generate name, flatten(languages);
3. dump p1;
PIG - FILTER
-- filter_matches.pig
divs = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray, date:chararray, dividends:float);
Regular Expression
1. startswithcm = filter divs by symbol matches 'CM.*';
Negation
2. notstartswithcm = filter divs by not symbol matches 'CM.*';
Logical Operators
3. result = filter divs by dividends > 0.10 and dividends < .12;
More operators are at:
http://pig.apache.org/docs/r0.13.0/basic.html
PIG
Will there be any reducer generated?
startswithcm = filter divs by symbol matches 'CM.*';
Answer: No. A filter runs entirely in the map phase, so no reducer is generated.
PIG - ORDER BY / DISTINCT
Order By
divs = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray, date:chararray, dividends:float);
ordered = order divs by symbol, dividends desc;
Distinct
divs = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray, date:chararray, dividends:float);
symbols = foreach divs generate symbol;
uniqsymbols = distinct symbols;
dump uniqsymbols;
PIG
Will it generate any reducer code?
uniqsymbols = distinct symbols;
Answer: Yes. distinct must bring identical records together, which forces a shuffle and a reduce phase.
PIG - JOIN
Single Key
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,volume,
adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol1, date, dividends);
jnd = join daily by symbol, divs by symbol1;
Composite Key
jnd = join daily by (exchange,symbol), divs by (exchange, symbol);
Self Join - For each stock, find dividends that increased between two dates
divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends);
divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends);
jnd = join divs1 by symbol, divs2 by symbol;
increased = filter jnd by divs1::date < divs2::date and divs1::dividends <
divs2::dividends;
JOIN - DIFFERENCES FROM RDBMS
A join must always specify at least one key column
A join cannot take arbitrary conditions; only equality on keys is supported
THIS IS WRONG: join A, B on A.value > B.value
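Since join only accepts equality on keys, a common workaround for a non-equi condition is cross followed by filter. A hedged sketch (relations A and B and the value field are hypothetical):

A = load 'A' as (id:int, value:int);
B = load 'B' as (id:int, value:int);
crossed = cross A, B;                            -- every pair of rows from A and B
bigger = filter crossed by A::value > B::value;  -- apply the non-equi condition

Note that cross is expensive, so this pattern is only practical when at least one input is small.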
PIG - JOINS
Similar to SQL
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (2,ryan)
B (id, lname): (1,giri), (3,saha)
LO = join A by id, B by id;
dump LO;
Output: (1,sandeep,1,giri)
PIG - OUTER JOINS - LEFT
Similar to SQL
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (2,ryan)
B (id, lname): (1,giri), (3,saha)
LO = join A by id LEFT OUTER, B by id;
-- or
LO = join A by id LEFT, B by id;
dump LO;
Output:
(1,sandeep,1,giri)
(2,ryan,,)
PIG - OUTER JOINS - RIGHT
Similar to SQL
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (2,ryan)
B (id, lname): (1,giri), (3,saha)
RO = join A by id RIGHT OUTER, B by id;
-- or
RO = join A by id RIGHT, B by id;
dump RO;
Output:
(1,sandeep,1,giri)
(,,3,saha)
PIG - OUTER JOINS - RIGHT (QUIZ)
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (4,ryan)
B (id, lname): (4,giri), (3,saha)
RO = join A by id RIGHT OUTER, B by id;
-- or
RO = join A by id RIGHT, B by id;
dump RO;
Output:
(4,ryan,4,giri)
(,,3,saha)
PIG - OUTER JOINS - FULL
Similar to SQL
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (2,ryan)
B (id, lname): (1,giri), (3,saha)
RO = join A by id FULL, B by id;
dump RO;
Output:
(1,sandeep,1,giri)
(2,ryan,,)
(,,3,saha)
PIG
join A by id, B by id = n1 records
join A by id LEFT, B by id = n2 records
join A by id RIGHT, B by id = n3 records
join A by id FULL, B by id = ??? records
Answer: n2 + n3 - n1 records.
The full join returns the inner-join matches (n1), plus A's unmatched rows (n2 - n1), plus B's unmatched rows (n3 - n1): n1 + (n2 - n1) + (n3 - n1) = n2 + n3 - n1.
PIG - OUTER JOINS - QUESTION?
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (2,ryan)
B (id, lname): (1,giri), (3,saha)
LO = join A by id, B by id LEFT OUTER;
???
PIG - OUTER JOINS - ANSWER
ERROR
"LEFT/RIGHT" must come before ", B": LO = join A by id LEFT OUTER, B by id;
PIG - OUTER JOINS - QUESTION?
A = load 'A' as (id, fname); B = load 'B' as (id, lname);
A (id, fname): (1,sandeep), (2,ryan)
B (id, lname): (1,giri), (3,saha)
LO = join A by id OUTER, B by id;
???
PIG - OUTER JOINS - ANSWER
ERROR
OUTER by itself is not valid; it must be qualified as LEFT, RIGHT, or FULL.
PIG - LIMIT / SAMPLE
LIMIT
divs = load 'NYSE_dividends';
first10 = limit divs 10;
dump first10;
SAMPLE
divs = load 'NYSE_dividends';
some = sample divs 0.1; -- roughly 10% of the records
PIG
divs = load 'file.tsv';
some = sample divs 0.1;
dump some;
dump some;
If you run “dump some” twice, will it give the same
output?
PIG
Most Likely No.
Sampling gives randomised output. It is not like limit.
divs = load ‘file.tsv’
some = sample divs 0.1;
dump some;
dump some;
If you run “dump some” twice, will it give the same
output?
PIG - PARALLEL
1. Parallelism is the basic promise of pig
2. Basically controls the numbers of reducers
3. Does not impact any mappers
1. daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
2. bysymbl = group daily by symbol parallel 10;
3. — Will create 10 reducers
1. Needs to be applied per statement
2. Script wide: set default_parallel 10;
3. Without parallel, allocates reducer per 1G of input size unto 999 reducers
PIG
How many reducers will it produce?
mylist = foreach divs generate symbol, dividends*10
Answer: 0. A plain foreach runs entirely in the map phase; no reducers are produced.
PIG
Does parallel clause have any effect in this?
mylist = foreach divs generate symbol parallel 10;
[Yes/No?]
Answer: No. The parallel clause only sets the number of reducers, and a foreach triggers no reduce phase, so it has no effect here.
PIG - CALLING EXTERNAL FUNCTIONS
Java built-in
define hex InvokeForString('java.lang.Integer.toHexString', 'int');
divs = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,close, volume, adj_close);
inhex = foreach divs generate symbol, hex((int)volume);
Your Own Java Function
register 'acme.jar';
define convert com.acme.financial.CurrencyConverter('dollar', 'euro');
divs = load 'NYSE_dividends' as (exch:chararray, sym:chararray, date:chararray, vals:float);
backwards = foreach divs generate convert(exch);
Calling a Python function
register 'production.py' using jython as bball;
calcprod = foreach nonnull generate name, bball.production(somval);
PIG USER DEFINED FUNCTIONS (UDF)
Project URL: https://drive.google.com/file/d/0B73r3o5yrImQM3RETFF3VFRYbFU/view?usp=sharing
grunt>
register pigudf.jar
cd sgiri
A = load 'xyza.txt' as (col);
B = FOREACH A GENERATE myudfs.UPPER(col);
dump B;
PIG - STREAM - INTEGRATE WITH LEGACY CODE
Say you have proprietary or legacy code that needs to be applied:
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
highdivs = stream divs through `highdiv.pl` as (exchange, symbol, date, dividends);
• The program is invoked once on every map or reduce task
• Not once per record
The program is shipped to every task node:
define hd `highdiv.pl -r xyz` ship('highdiv.pl');
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
highdivs = stream divs through hd as (exchange, symbol, date, dividends);
PIGGY-BANK
A library of functions contributed by other people
• Create a directory for the Pig source code: `mkdir pig`
• cd into that directory: `cd pig`
• Checkout the Pig source code: `svn checkout http://svn.apache.org/repos/asf/pig/trunk/ .`
• Build the project: `ant`
• cd into the piggybank dir: `cd contrib/piggybank/java`
• Build the piggybank: `ant`
• You should now see a piggybank.jar file in that directory.
• REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
• TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text)
MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
• STORE TweetsInaug INTO 'meta/inaug/tweets_inaug' ;
Same as UDF, just that the code is available
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
PIG - FOREACH NESTED
Find the count of distinct names in each city
Input:
blore sandeep
blore sandeep
blore vivek
nyc john
nyc john
Expected output:
(blore,2)
(nyc,1)
???
PIG - FOREACH NESTED
Find the count of distinct names in each city
Input:
blore sandeep
blore sandeep
blore vivek
nyc john
nyc john
1. people = load 'people' as (city, name);
2. grp = group people by city;
3. counts = foreach grp {
       dnames = distinct people.name;
       generate group, COUNT(dnames);
   };
Output:
(blore,2)
(nyc,1)
PIG - FRAGMENTED JOINS
Translate the countries to country codes
persons (NAME, COUNTRY): sandeep IN; kumar US; gerald CN; … 1 billion entries …
cc (COUNTRY, CODE): IN 91; US 1; CN 86; … 100 entries …
Desired output (NAME, CODE): sandeep 91; kumar 1; gerald 86; … 1 billion entries …
1. ????
PIG - FRAGMENTED JOINS - REPLICATED
Translate the countries to country codes; the small cc table (~100 entries) is replicated to every map task, so the join happens map-side:
1. persons = load 'persons' as (name, country);
2. cc = load 'cc' as (country, code);
3. joined = join persons by country, cc by country using 'replicated';
4. result = foreach joined generate persons::name, cc::code;
PIG - JOINING SKEWED DATA
Skewed data: some keys have far more values than others
users = load 'users' as (name:chararray, city:chararray);
cinfo = load 'cityinfo' as (city:chararray, population:int);
jnd = join cinfo by city, users by city using 'skewed';
PIG - JOINING SORTED DATA
jnd = join daily by symbol, divs by symbol using 'merge';
PIG - COGROUP
Groups two tables by a column; for each key, the output holds a bag of matching rows from each table
owners (person, animal): (adam,cat), (adam,dog), (alex,fish), (alice,cat), (steve,dog)
pets (name, animal): (nemo,fish), (fido,dog), (rex,dog), (paws,cat), (wiskers,cat)
grouped = COGROUP owners BY animal, pets BY animal;
Output:
(cat,{(alice,cat),(adam,cat)},{(wiskers,cat),(paws,cat)})
(dog,{(steve,dog),(adam,dog)},{(rex,dog),(fido,dog)})
(fish,{(alex,fish)},{(nemo,fish)})
PIG - UNION
Concatenates two data sets.
Unix equivalent: cat file1 file2
SQL equivalent: union all
A = load '/user/me/data/files/input1';
B = load '/user/someoneelse/info/input2';
C = union A, B;
To force a common schema, use:
A = load 'A' as (x, y);
B = load 'B' as (z, x);
D = union onschema A, B; -- x, y, z as columns
PIG
C = join A, B
What is the result?
Answer: ERROR. A join must specify keys with "by", e.g. C = join A by id, B by id;
PIG - CROSS
Produces the cross product:
tonsodata = cross daily, divs;
tonsodata = cross daily, divs parallel 10;
PIG - PARAMETER SUBSTITUTION
As with any scripting language, you can pass arguments in and use them in the script:
--daily.pig
...
yesterday = filter daily by date == '$DATE';
pig -p DATE=2009-12-17 daily.pig
#Param file daily.params
YEAR=2009-
MONTH=12-
DAY=17
DATE=$YEAR$MONTH$DAY
pig -param_file daily.params daily.pig
PIG - SPLITS
wlogs = load 'weblogs' as (pageid, url, timestamp);
split wlogs into
apr03 if timestamp < '20110404',
apr02 if timestamp < '20110403' and timestamp > '20110401',
apr01 if timestamp < '20110402' and timestamp > '20110331';
store apr03 into '20110403';
store apr02 into '20110402';
store apr01 into '20110401';
1. A single record can go to multiple legs
2. A record could also be dropped out
PIG - NON-LINEAR EXECUTION
Pig tries to combine multiple groups and filters into one, and it also creates multiple map jobs depending on what we are trying to do.
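A minimal sketch of such a non-linear script (the 'users' file and its fields are hypothetical): both branches share a single load, and Pig compiles the whole script into one plan with two outputs.

users = load 'users' as (name:chararray, age:int);
teens = filter users by age < 20;   -- branch 1
adults = filter users by age >= 20; -- branch 2
store teens into 'teens_out';
store adults into 'adults_out';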
PIG - REFERENCES
General Docs:
http://pig.apache.org/docs/r0.13.0/
Creating Custom Load Functions:
http://pig.apache.org/docs/r0.13.0/udf.html
Introduction to pig & pig latin

More Related Content

What's hot

Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive TutorialSandeep Patil
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Database Systems
Database SystemsDatabase Systems
Database SystemsUsman Tariq
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 

What's hot (20)

Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
6.hive
6.hive6.hive
6.hive
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Pig latin
Pig latinPig latin
Pig latin
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache hive
Apache hiveApache hive
Apache hive
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Sqoop
SqoopSqoop
Sqoop
 
Database Systems
Database SystemsDatabase Systems
Database Systems
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 

Viewers also liked

Apache spark session
Apache spark sessionApache spark session
Apache spark sessionknowbigdata
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comknowbigdata
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Rohit Agrawal
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperknowbigdata
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Orienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsOrienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsKalyan Hadoop
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Differences between OpenStack and AWS
Differences between OpenStack and AWSDifferences between OpenStack and AWS
Differences between OpenStack and AWSEdureka!
 
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaMapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaEdureka!
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka Edureka!
 
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaHadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaEdureka!
 
Control Transactions using PowerCenter
Control Transactions using PowerCenterControl Transactions using PowerCenter
Control Transactions using PowerCenterEdureka!
 

Viewers also liked (20)

Apache spark session
Apache spark sessionApache spark session
Apache spark session
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.com
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Orienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsOrienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshots
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Differences between OpenStack and AWS
Differences between OpenStack and AWSDifferences between OpenStack and AWS
Differences between OpenStack and AWS
 
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaMapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
 
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaHadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
 
Control Transactions using PowerCenter
Control Transactions using PowerCenterControl Transactions using PowerCenter
Control Transactions using PowerCenter
 

Similar to Introduction to pig & pig latin

Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
 
CloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. Java
CloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. JavaCloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. Java
CloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. JavaJan Stamer
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For HadoopCloudera, Inc.
 
Hadoop and Mapreduce Certification
Hadoop and Mapreduce CertificationHadoop and Mapreduce Certification
Hadoop and Mapreduce CertificationVskills
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Jonathan Felch
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkbhargavi804095
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data CertificationAdam Doyle
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
MongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a TreeMongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a TreeMongoDB
 

Similar to Introduction to pig & pig latin (20)

Apache Pig
Apache PigApache Pig
Apache Pig
 
Hadoop content
Hadoop contentHadoop content
Hadoop content
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Apache pig
Apache pigApache pig
Apache pig
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
 
Big data
Big dataBig data
Big data
 
CloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. Java
CloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. JavaCloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. Java
CloudLand 2023: Rock, Paper, Scissors Cloud Competition - Go vs. Java
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For Hadoop
 
Hadoop and Mapreduce Certification
Hadoop and Mapreduce CertificationHadoop and Mapreduce Certification
Hadoop and Mapreduce Certification
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
MongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a TreeMongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a Tree
 
Html5 Overview
Html5 OverviewHtml5 Overview
Html5 Overview
 

Recently uploaded

INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxElton John Embodo
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 

Recently uploaded (20)

INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 

Introduction to pig & pig latin

  • 1.
  • 2. Sandeep GiriHadoop Interact - Ask Questions Lifetime access of content Class Recording Cluster Access 24x7 support WELCOME - KNOWBIGDATA Real Life Project Quizzes & Certification Test 10 x (3hr class) Socio-ProVisibility Mock Interviews
  • 3. Sandeep GiriHadoop ABOUT ME 2014 KnowBigData Founded 2014 Amazon Built HighThroughput Systems for Amazon.com site using in- house NoSql. 2012 2012 InMobi Built Recommender after churning 200TB 2011 tBits Global Founded tBits Global Built an enterprise grade Document Management System 2006 D.E.Shaw Built the big data systems before the term was coined 2002 2002 IIT Roorkee Finished B.Tech somehow.
  • 4. Sandeep GiriHadoop COURSE CONTENTCOURSE CONTENT I Understanding BigData, Hadoop Architecture II Environment Overview, MapReduce Basics III Adv MapReduce & Testing IV Pig & Pig Latin V Analytics using Hive VI NoSQL, HBASE VII Oozie, Mahout, VIII Zookeeper, Apache Storm IX Apache Flume, Apache Spark X YARN, Big Data Sets & Project Assignment
  • 5. Sandeep GiriHadoop Introduction Wordcount example Pig Latin vs SQL Pig Latin vs MapReduce Use Cases Philosophy Installation Using from Cloud Labs Local Mode Command Line Args TODAY’S CLASS Prompt - Grunt Average Val Example Data Types - Scaler Data Types - Complex Schema Basic Contructions LOAD Diagnostic STORE & DUMP Foreach
  • 6. Sandeep GiriHadoop FILTER Order by JOIN Limit & Sample Parallel Flatten Calling External Functions Foreach Nested Fragmented Joins Skewed Join - Merge TODAY’S CLASS Left Outer Join Right Outer Join Full Join CoGroup Union Cross Stream Params Splits Non-Linear Execution
  • 7. Sandeep GiriHadoop PIG An engine for executing data flows in parallel on Hadoop • Language - Pig Latin • Provides operators: Join, order by, Filter … • Runs on Hadoop. Makes use of both • HDFS • MapReduce
  • 8. Sandeep GiriHadoop PIG - WORD COUNT EXAMPLE -- Load input from the file named big.txt, and call the single -- field in the record 'line'. input = load 'big.txt' as (line); -- TOKENIZE splits the line into a field for each word. -- flatten will take the collection of records returned by -- TOKENIZE and produce a separate record for each one, calling the single -- field in the record word. words = foreach input generate flatten(TOKENIZE(line)) as word; -- Now group them together by each word. grpd = group words by word; -- Count them. cntd = foreach grpd generate group, COUNT(words); -- Print out the results. dump cntd;
  • 9. Sandeep GiriHadoop PIG LATIN - A DATA FLOW LANGUAGE Allows describing: 1. How data flows 2. Should be Read 3. Processed 4. Stored to multiple outputs in parallel 1. Complex Workflows 1. Multiple Inputs are joined 2. Data is split into multiple streams 2. No Ifs and fors
  • 10. Sandeep GiriHadoop PIG LATIN MAP REDUCE Provides join, filter, group by, order by, union Provides operations such as group by and order by indirectly. Joins are hard Provides optimisation such are rebalancing the reducers Same optimisations are really hard to write Understand your scripts: Error checking & optimise You code is opaque to the map reduce Lower cost to write and maintain than Java code for MapReduce. Requires more effort and time Brief: 9lines 170 lines Has limitations More Control. Map Reduce Programming vs Pig Latin
  • 11. Sandeep GiriHadoop PIG LATIN SQL User describes exactly how to process the input data. Allows users to describe what question they want answered Long series of data operations Around answering one question Designed for Hadoop Designed for RDBMS Targets multiple data operations & parallel processing For multiple data operations, the query gets complex Can be used when schema is unknown, incomplete, or inconsistent, requires well formatted Pig Latin vs SQL
  • 12. Sandeep GiriHadoop PIG - USE CASES 1. ETL 2. Research On Data 3. Iterative processing 1. Batch processing Map Reduce SQLPIG
  • 13. Sandeep GiriHadoop PIG PHILOSOPHY A. Pigs eat anything • Data: Relational, nested, or unstructured. B. Pigs live anywhere • Parallel processing Language. Not only Hadoop C. Pigs are domestic animals • Controllable, provides custom functions in java/python • Custom load and store methods D. Pigs fly • Designed for performance
  • 14. Sandeep GiriHadoop PIG - INSTALLING 1. Download 0.13.0 from Pig’s release page 2. tar -xvfz pig-0.13.0.tar.gz 3. mv pig-0.13.0 /usr/local/pig 4. export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64 5. Run pig -x local and test if it is working 6. export PIG_CLASSPATH=/etc/hadoop 7. Add pig/bin folder to $PATH 8. Run pig and enjoy the grunts
  • 15. Sandeep GiriHadoop ssh student@hadoop1.knowbigdata.com (password is same as mentioned earlier).You can also try hadoop2, 3, 4 pig In HDFS, you have the permissions to manipulate /users/student/ The data has been uploaded to /users/student/data in HDFS PIG - USING FROM CLOUD LABS
  • 16. Sandeep GiriHadoop pig -x local Works without Hadoop installation Very useful for testing locally Improves the development pace PIG - LOCAL MODE
  • 17. Sandeep GiriHadoop The first argument can be a pig script file: pig <myscript> To Execute a single command e.g.: pig -e ls -version , for version -h see more options PIG - COMMAND LINE ARGUMENTS
  • 18. Sandeep GiriHadoop The prompt “grunt>” This is where all of the commands are typed You can also control hadoop from here fs (-ls or -cat or any other hadoop command) kill job exec or run pig-script PIG - GRUNT
  • 19. Sandeep GiriHadoop PIG - BASIC CONSTRUCTION • Each processing step results in new set (or relation) • Case Insensitive - UDF/Function case sensitive • Multiline comments: /* */ • Singleline comments: --
  • 20. Sandeep GiriHadoop PIG - MY FIRST PIG SCRIPT - AVERAGEVAL NYSE CPO2009-12-30 0.11 NYSE CPO2009-09-28 0.12 NYSE CPO2009-06-26 0.13 NYSE CCS 2009-10-28 0.41 NYSE CCS 2009-04-29 0.43 NYSE CIF 2009-12-09 .029 NYSE CIF 2009-12-09 .028 Local: data/NYSE_dividends hdfs: /users/student/data/NYSE_dividends CPO 0.12 CCS 0.42 CIF .0285
  • 21. Sandeep GiriHadoop PIG - MY FIRST PIG SCRIPT - AVERAGEVAL NYSE CPO2009-12-30 0.11 NYSE CPO2009-09-28 0.12 NYSE CPO2009-06-26 0.13 NYSE CCS 2009-10-28 0.41 NYSE CCS 2009-04-29 0.43 NYSE CIF 2009-12-09 .029 NYSE CIF 2009-12-09 .028 Local: data/NYSE_dividends hdfs: /user/student/data/NYSE_dividends CPO 0.12 CCS 0.42 CIF .0285 1. divs = load '/user/student/data/NYSE_dividends' as (name, ticker, date, val); 2. dump divs 3. explain divs 4. grped = group divs by ticker; 5. avged = foreach grped generate group, AVG(divs.val); 6. store avged into '/user/student/data-out/avged'; 7. cat /user/student/data-out/avged/part*
  • 22. Sandeep GiriHadoop PIG - DATATYPES Scaler 1. int 2. long 3. float - 4 bytes 4. double - 8 bytes 5. chararray 6. bytearray
  • 23. Sandeep GiriHadoop PIG - DATATYPES Complex 1. Map 1. ['name'#'bob',‘age’#55] 2. chararray => another complex type or scaler 2. Tuple 1. Fixed length, ordered collection 2. Made up of fields and their values. 3. Think of it as one row in RDBMS and fields as columns 4. Can refer the fields by position (0 based) 5. ('bob', 55, 12.3) 3. BAG 1. Unordered collection of tuples 2. {('ram', 55, 12.3), ('sally', 52, 11.2)} 1. Two tuples each with three fields 3. Analogous to SET of rows / dataset 4. Does not need to fit in memory. Spills over to disk
  • 24. Sandeep GiriHadoop PIG - DATATYPES Schemas 1. Not very particular about schema 2. You can either tell upfront, e.g. 1. dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float); 3. Or it will make best type guesses based on data 4. Schema declaration for complex types: 1. a or a:map[] or a:map[a:tuple(x:int, y:bytearray)] … 2. a:tuple or a:tuple(x, y) or a:tuple(x:int, y:bytearray,…)…. 3. a:bag{} or a:bag{t:(x:int, y:int)} or a:bag{t:(x, y)} 5. You can cast the data from one type to another if possible 1. similar to Java by affixing “(type)”
  • 25. Sandeep GiriHadoop PIG - LOAD • divs = load 'NYSE_dividends' • Loads the values guessing the datatype • Tab is used as separator • Absolute or Relative URL would work • divs = load 'NYSE_dividends' using PigStorage(','); • To load comma separated values • divs = load 'NYSE_div*' • You can specify the wildcard character such as '*' • divs = load 'NYSE_dividends' using HBaseStorage(); • You can load an HBASE table this way • divs = load 'NYSE_dividends' as (name:chararray, ticker:chararray, date, val:float); • Define the data types upfront • second_table = LOAD 'second_table.json' USING JsonLoader('food:chararray, person:chararray, amount:int');
  • 26. Sandeep GiriHadoop PIG - DIAGNOSTIC explain dataset Shows a detailed analysis of the query (its execution plan) describe dataset Shows the schema Syntax Highlighting Tools Eclipse http://code.google.com/p/pig-eclipse Emacs http://github.com/cloudera/piglatin-mode TextMate http://www.github.com/kevinweil/pig.tmbundle Vim http://www.vim.org/scripts/script.php?script_id=2186
  • 27. Sandeep GiriHadoop PIG - STORE / DUMP Store • Stores the data to a file or other storage • store processed into 'processed' using PigStorage(','); Dump • Prints the relation on the screen (like a print()) • Useful for • debugging or • using with scripts that redirect output • Syntax: dump relation;
  • 28. Sandeep GiriHadoop PIG - RELATIONAL OPERATIONS - FOREACH Takes expressions and applies them to every record 1. divs = load '/user/student/data/NYSE_dividends' as (name, ticker, date, val); 2. dump divs; 3. values = foreach divs generate ticker, val; 4. dump values;
  • 29. Sandeep GiriHadoop PIG - FOREACH prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); Basic operations such as subtraction/addition • gain = foreach prices generate close - open; Using position references • gain2 = foreach prices generate $6 - $3; Ranges • For exchange, symbol, date, open • beginning = foreach prices generate ..open; • For open, high, low, close • middle = foreach prices generate open..close; • For volume, adj_close • end = foreach prices generate volume..;
  • 30. Sandeep GiriHadoop PIG Will it have any reducer code generated? gain = foreach prices generate (close - open);
  • 31. Sandeep GiriHadoop PIG Will it have any reducer code generated? gain = foreach prices generate (close - open); NO - foreach is a map-only operation
  • 32. Sandeep GiriHadoop PIG - FOREACH Extract Data from complex types: bball = load 'baseball' as (name:chararray, team:chararray, traits:bag{}, bat:map[]); avg = foreach bball generate name, bat#'batting_average'; Tuple Projection A = load 'input' as (t:tuple(x:int, y:int)); B = foreach A generate t.x, t.$0; Bag Projection A = load 'input' as (b:bag{t:(x:int, y:int)}); B = foreach A generate b.x; A = load 'input' as (b:bag{t:(x:int, y:int, z:int)}); B = foreach A generate b.(x, y);
  • 33. Sandeep GiriHadoop PIG - FOREACH Built-in functions list: http://pig.apache.org/docs/r0.13.0/func.html • divs = load 'NYSE_dividends' as • (exchange:chararray, symbol:chararray, date:chararray, dividends:float); • in_cents = foreach divs generate • dividends * 100.0 as dividend, dividends * 100.0; • describe in_cents;
  • 34. Sandeep GiriHadoop PIG - GROUP BY Groups items on key(s) • B = GROUP A BY f1; • B = GROUP A BY (f1, f2); DUMP A; (John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F) (Joe,18,3.8F) DESCRIBE A; A: {name: chararray,age: int,gpa: float} B = GROUP A BY age; DUMP B; (18,{(John,18,4.0F),(Joe,18,3.8F)}) (19,{(Mary,19,3.8F)}) (20,{(Bill,20,3.9F)}) DESCRIBE B; B: {group: int,A: {name: chararray,age: int,gpa: float}}
  • 35. Sandeep GiriHadoop PIG - FLATTEN sandeep {(c),(c++)} arun {(shell),(c)} cmaini {(java),(c++),(lisp)} (sandeep,c) (sandeep,c++) (arun,shell) (arun,c) (cmaini,java) (cmaini,c++) (cmaini,lisp) 1. x = load 'myfile' as (name, languages:bag{t:(p)}); 2. p1 = foreach x generate name, flatten(languages); 3. dump p1;
  • 36. Sandeep GiriHadoop PIG - FILTER -- filter_matches.pig divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); Regular Expression 1. startswithcm = filter divs by symbol matches 'CM.*'; Negation 2. notstartswithcm = filter divs by not symbol matches 'CM.*'; Logical Operators 3. result = filter divs by dividends > 0.10 and dividends < .12; More operators: http://pig.apache.org/docs/r0.13.0/basic.html
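Filter also supports null tests (is null / is not null); a small sketch on the same relation:
notnull = filter divs by dividends is not null;  -- drop records with a missing dividend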
  • 37. Sandeep GiriHadoop PIG Will there be any reducer generated? startswithcm = filter divs by symbol matches 'CM.*';
  • 38. Sandeep GiriHadoop PIG No - filter is a map-only operation. Will there be any reducer generated? startswithcm = filter divs by symbol matches 'CM.*';
  • 39. Sandeep GiriHadoop PIG - ORDER BY / DISTINCT ORDER BY divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); ordered = order divs by symbol, dividends desc; DISTINCT divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); symbols = foreach divs generate symbol; uniqsymbols = distinct symbols; dump uniqsymbols;
  • 40. Sandeep GiriHadoop PIG Will it generate any reducer code? uniqsymbols = distinct symbols;
  • 41. Sandeep GiriHadoop PIG Yes - distinct forces a reduce phase. Will it generate any reducer code? uniqsymbols = distinct symbols;
  • 42. Sandeep GiriHadoop PIG - JOIN Single Key daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); divs = load 'NYSE_dividends' as (exchange, symbol1, date, dividends); jnd = join daily by symbol, divs by symbol1; Composite Key jnd = join daily by (exchange, symbol), divs by (exchange, symbol); Self Join - For each stock, find dividends that increased between two dates (note: for a self join the input must be loaded twice) divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends); divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends); jnd = join divs1 by symbol, divs2 by symbol; increased = filter jnd by divs1::date < divs2::date and divs1::dividends < divs2::dividends;
  • 43. Sandeep GiriHadoop JOIN - DIFFERENCE FROM RDBMS A join should always have at least one key column. A join cannot have non-equality conditions. THIS IS WRONG: join A, B on A.value > B.value
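A hedged sketch of one workaround (assuming A and B each have a value field): emulate the non-equi join with cross followed by filter - legal Pig Latin, but expensive:
crossed = cross A, B;
gt = filter crossed by A::value > B::value;  -- the comparison the join syntax cannot express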
  • 44. Sandeep GiriHadoop PIG - JOINS Similar to SQL A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); LO = join A by id, B by id; dump LO; Output: (1,sandeep,1,giri)
  • 45. Sandeep GiriHadoop PIG - OUTER JOINS - LEFT Similar to SQL A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); LO = join A by id LEFT OUTER, B by id; or LO = join A by id LEFT, B by id; dump LO; Output: (1,sandeep,1,giri) (2,ryan,,)
  • 46. Sandeep GiriHadoop PIG - OUTER JOINS - RIGHT Similar to SQL A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); RO = join A by id RIGHT OUTER, B by id; or RO = join A by id RIGHT, B by id; dump RO; Output: (1,sandeep,1,giri) (,,3,saha)
  • 47. Sandeep GiriHadoop PIG - OUTER JOINS - RIGHT Similar to SQL A (id, name): (1,sandeep) (4,ryan) B (id, name): (4,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); RO = join A by id RIGHT OUTER, B by id; or RO = join A by id RIGHT, B by id; dump RO; ??
  • 48. Sandeep GiriHadoop PIG - OUTER JOINS - RIGHT Similar to SQL A (id, name): (1,sandeep) (4,ryan) B (id, name): (4,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); RO = join A by id RIGHT OUTER, B by id; or RO = join A by id RIGHT, B by id; dump RO; Output: (4,ryan,4,giri) (,,3,saha)
  • 49. Sandeep GiriHadoop PIG - OUTER JOINS - FULL Similar to SQL A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); RO = join A by id FULL, B by id; dump RO; Output: (1,sandeep,1,giri) (2,ryan,,) (,,3,saha)
  • 50. Sandeep GiriHadoop PIG join A by id, B by id = n1 records join A by id LEFT, B by id = n2 records join A by id RIGHT, B by id = n3 records join A by id FULL, B by id = ??? records
  • 51. Sandeep GiriHadoop PIG n2+n3-n1 (the full join is the union of the left and right join outputs, and the n1 inner matches are counted in both n2 and n3) join A by id, B by id = n1 records join A by id LEFT, B by id = n2 records join A by id RIGHT, B by id = n3 records join A by id FULL, B by id = ??? records
  • 52. Sandeep GiriHadoop PIG - OUTER JOINS - QUESTION? A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); LO = join A by id, B by id LEFT OUTER; ???
  • 53. Sandeep GiriHadoop PIG - OUTER JOINS - QUESTION? A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); LO = join A by id, B by id LEFT OUTER; ERROR - "LEFT/RIGHT" must come before ", B"
  • 54. Sandeep GiriHadoop PIG - OUTER JOINS - QUESTION? A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); LO = join A by id OUTER, B by id; ???
  • 55. Sandeep GiriHadoop PIG - OUTER JOINS - QUESTION? A (id, name): (1,sandeep) (2,ryan) B (id, name): (1,giri) (3,saha) A = load 'A' as (id, fname); B = load 'B' as (id, lname); LO = join A by id OUTER, B by id; ERROR - OUTER on its own is not valid; it must be LEFT, RIGHT, or FULL
  • 56. Sandeep GiriHadoop PIG - LIMIT / SAMPLE LIMIT divs = load 'NYSE_dividends'; first10 = limit divs 10; dump first10; SAMPLE divs = load 'NYSE_dividends'; some = sample divs 0.1; -- 10% of records
  • 57. Sandeep GiriHadoop PIG divs = load 'file.tsv'; some = sample divs 0.1; dump some; dump some; If you run "dump some" twice, will it give the same output?
  • 58. Sandeep GiriHadoop PIG Most likely no. Sampling gives randomised output; it is not like limit. divs = load 'file.tsv'; some = sample divs 0.1; dump some; dump some; If you run "dump some" twice, will it give the same output?
  • 59. Sandeep GiriHadoop PIG - PARALLEL 1. Parallelism is the basic promise of Pig 2. The parallel clause controls the number of reducers 3. It does not impact mappers 4. It needs to be applied per statement, e.g.: daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); bysymbl = group daily by symbol parallel 10; -- will create 10 reducers 5. Script wide: set default_parallel 10; 6. Without parallel, Pig allocates one reducer per 1 GB of input size, up to 999 reducers
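A small sketch combining both levels (assuming, as documented, that the statement-level clause takes precedence over the script-wide default):
set default_parallel 10;
bysymbl = group daily by symbol parallel 20;  -- this group uses 20 reducers, not 10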
  • 60. Sandeep GiriHadoop PIG How many reducers will it produce? mylist = foreach divs generate symbol, dividends*10
  • 61. Sandeep GiriHadoop PIG How many reducers will it produce? mylist = foreach divs generate symbol, dividends*10 0 - none; foreach is a map-only operation
  • 62. Sandeep GiriHadoop PIG Does parallel clause have any effect in this? mylist = foreach divs generate symbol parallel 10; [Yes/No?]
  • 63. Sandeep GiriHadoop PIG Does parallel clause have any effect in this? mylist = foreach divs generate symbol parallel 10; [Yes/No?] No - parallel only affects operators that trigger a reduce phase (group, join, order, distinct, …); foreach is map-only
  • 64. Sandeep GiriHadoop PIG - CALLING EXTERNAL FUNCTIONS JAVA Built-In define hex InvokeForString('java.lang.Integer.toHexString', 'int'); divs = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); inhex = foreach divs generate symbol, hex((int)volume); Your Own Java Function register 'acme.jar'; define convert com.acme.financial.CurrencyConverter('dollar', 'euro'); divs = load 'NYSE_dividends' as (exch:chararray, sym:chararray, date:chararray, vals:float); backwards = foreach divs generate convert(exch); Calling a Python function register 'production.py' using jython as bball; calcprod = foreach nonnull generate name, bball.production(somval);
  • 65. Sandeep GiriHadoop PIG USER DEFINED FUNCTIONS (UDF) Project URL: https://drive.google.com/file/d/0B73r3o5yrImQM3RETFF3VFRYbFU/ view?usp=sharing grunt> register pigudf.jar cd sgiri A = load 'xyza.txt' as (col); B = FOREACH A GENERATE myudfs.UPPER(col); dump B;
  • 66. Sandeep GiriHadoop PIG - STREAM - INTEGRATE WITH LEGACY CODE Say you have proprietary or legacy code that needs to be applied: divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends); highdivs = stream divs through `highdiv.pl` as (exchange, symbol, date, dividends); • The external program is invoked once on every map or reduce task • Not on every record To ship the program to every task node: define hd `highdiv.pl -r xyz` ship('highdiv.pl'); divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends); highdivs = stream divs through hd as (exchange, symbol, date, dividends);
  • 67. Sandeep GiriHadoop PIGGY-BANK Library of functions contributed by other people • Create a directory for the Pig source code: `mkdir pig` • cd into that directory: `cd pig` • Checkout the Pig source code: `svn checkout http://svn.apache.org/repos/asf/pig/trunk/ .` • Build the project: `ant` • cd into the piggybank dir: `cd contrib/piggybank/java` • Build the piggybank: `ant` • You should now see a piggybank.jar file in that directory. • REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ; • TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; • STORE TweetsInaug INTO 'meta/inaug/tweets_inaug' ; Same as a UDF, just that the code is already available https://cwiki.apache.org/confluence/display/PIG/PiggyBank
  • 68. Sandeep GiriHadoop PIG - FOREACH NESTED Find the count of distinct names in each city blore sandeep blore sandeep blore vivek nyc john nyc john Expected: (blore,2) (nyc,1) ???
  • 69. Sandeep GiriHadoop PIG - FOREACH NESTED Find the count of distinct names in each city blore sandeep blore sandeep blore vivek nyc john nyc john Expected: (blore,2) (nyc,1) 1. people = load 'people' as (city, name); 2. grp = group people by city; 3. counts = foreach grp { dnames = distinct people.name; generate group, COUNT(dnames); };
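Nested foreach also allows order, limit, and filter inside the block; a hypothetical sketch that keeps the first three names per city:
grp = group people by city;
top3 = foreach grp {
    srtd = order people by name;
    lmt = limit srtd 3;
    generate group, lmt;
};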
  • 70. Sandeep GiriHadoop PIG - FRAGMENTED JOINS Translate the countries to country codes NAME COUNTRY sandeep IN kumar US gerald CN …1 billion entries… COUNTRY CODE IN 91 US 1 CN 86 … 100 entries … Expected output: NAME CODE sandeep 91 kumar 1 gerald 86 …1 billion entries… ????
  • 71. Sandeep GiriHadoop PIG - FRAGMENTED JOINS - REPLICATED Translate the countries to country codes NAME COUNTRY sandeep IN kumar US gerald CN …1 billion entries… COUNTRY CODE IN 91 US 1 CN 86 … 100 entries … 1. persons = load 'persons' as (name, country); 2. cc = load 'cc' as (country, code); 3. joined = join persons by country, cc by country using 'replicated'; 4. result = foreach joined generate persons::name, cc::code; Output: NAME CODE sandeep 91 kumar 1 gerald 86 …1 billion entries… (replicated: the small right-hand table is loaded into memory on every mapper, so no reduce phase is needed)
  • 72. Sandeep GiriHadoop PIG - JOINING SKEWED DATA Skewed data: too many values for some keys, too few for others users = load 'users' as (name:chararray, city:chararray); cinfo = load 'cityinfo' as (city:chararray, population:int); jnd = join cinfo by city, users by city using 'skewed'; PIG - JOINING SORTED DATA (both inputs must already be sorted on the join key) jnd = join daily by symbol, divs by symbol using 'merge';
  • 73. Sandeep GiriHadoop PIG - COGROUP Groups two tables by a column: one output record per key, with one bag of matching tuples from each input owners person animal adam cat adam dog alex fish alice cat steve dog pets name animal nemo fish fido dog rex dog paws cat wiskers cat COGROUP owners BY animal, pets by animal; (cat,{(alice,cat),(adam,cat)},{(wiskers,cat),(paws,cat)}) (dog,{(steve,dog),(adam,dog)},{(rex,dog),(fido,dog)}) (fish,{(alex,fish)},{(nemo,fish)})
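A small sketch of how cogroup relates to join: flattening both bags reproduces the inner join, because a key with an empty bag on either side generates no output:
grouped = cogroup owners by animal, pets by animal;
jnd = foreach grouped generate flatten(owners), flatten(pets);  -- same records as an inner join on animal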
  • 74. Sandeep GiriHadoop PIG - UNION Concatenates two data sets. Unix equivalent: cat file1 file2 SQL: union all A = load '/user/me/data/files/input1'; B = load '/user/someoneelse/info/input2'; C = union A, B; To force a schema, use onschema: A = load 'A' as (x, y); B = load 'B' as (z, x); D = union onschema A, B; -- x, y, z as columns
  • 75. Sandeep GiriHadoop PIG - UNION [Diagram: union vs. union onschema]
  • 76. Sandeep GiriHadoop PIG C = join A, B What is the result?
  • 77. Sandeep GiriHadoop PIG C = join A, B What is the result? ERROR - a join requires key fields, e.g. join A by id, B by id;
  • 78. Sandeep GiriHadoop PIG - CROSS Produces the cross product tonsodata = cross daily, divs; tonsodata = cross daily, divs parallel 10;
  • 79. Sandeep GiriHadoop PIG - PARAMETER SUBSTITUTION As with any scripting language, it is possible to pass arguments and use them in the script --daily.pig ... yesterday = filter daily by date == '$DATE'; pig -p DATE=2009-12-17 daily.pig #Param file daily.params YEAR=2009- MONTH=12- DAY=17 DATE=$YEAR$MONTH$DAY pig -param_file daily.params daily.pig
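Pig also lets a script supply a default value for a parameter with %default; a minimal sketch inside daily.pig (the default kicks in when no -p or -param_file value is given):
%default DATE '2009-12-17';
yesterday = filter daily by date == '$DATE';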
  • 80. Sandeep GiriHadoop PIG - SPLITS wlogs = load 'weblogs' as (pageid, url, timestamp); split wlogs into apr03 if timestamp < '20110404', apr02 if timestamp < '20110403' and timestamp > '20110401', apr01 if timestamp < '20110402' and timestamp > '20110331'; store apr03 into '20110403'; store apr02 into '20110402'; store apr01 into '20110401'; 1. A single record can go to multiple legs 2. A record could also be dropped entirely
  • 81. Sandeep GiriHadoop PIG - NON LINEAR EXECUTION Pig tries to combine multiple group and filter operations into a single job. It also creates multiple map jobs when the script branches, depending on what we are trying to do.
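A hypothetical sketch of such a non-linear flow (output paths are made up): the input is read once and both outputs are written in the same pass:
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
split divs into big if dividends > 0.1, small if dividends <= 0.1;
store big into 'big_divs';
store small into 'small_divs';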
  • 82. Sandeep GiriHadoop PIG - REFERENCES General Docs: http://pig.apache.org/docs/r0.13.0/ Creating Custom Load Functions: http://pig.apache.org/docs/r0.13.0/udf.html