SDEC2011 Essentials of Pig
1. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.
Essentials of Pig
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com
Session Agenda
• What is Pig and why should you use it?
• Installing & Setting up Pig
• Pig’s Components
• Using Pig with Hadoop MapReduce
• Summary & Conclusion
What is Pig?
• Higher-level abstraction for Hadoop MapReduce
• An infrastructure for data analysis using a scripting language named Pig Latin
Why should you use Pig?
• Hadoop MapReduce:
• Requires you to be a programmer
• Forces you to design all your algorithms in terms of the map and reduce
primitives
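To make the second point concrete, here is a minimal sketch of a "queries per user" count written directly as map and reduce steps. It is plain Python with an in-memory stand-in for the shuffle, not the Hadoop API; the function names and sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (user, 1) for one (user, timestamp, query) record."""
    user, _timestamp, _query = record
    yield user, 1

def reduce_phase(user, counts):
    """Reduce: sum the ones emitted for a single user."""
    return user, sum(counts)

def run_job(records):
    """Tiny in-memory stand-in for the shuffle between map and reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            shuffled[key].append(value)
    return dict(reduce_phase(key, values) for key, values in shuffled.items())

# Hypothetical sample records, invented for illustration.
sample = [
    ("u1", "t1", "first query"),
    ("u2", "t2", "second query"),
    ("u1", "t3", "third query"),
]
queries_per_user = run_job(sample)
```

Even this small count needs three pieces of plumbing; the Pig Latin version later in the deck is a GROUP followed by a FOREACH.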
Installing & Setting Up Pig -- Required Software
• Required Software:
• Java 1.6.x
• Hadoop 0.20.x
• Ant 1.7+ (for builds)
• JUnit 4.5 (for tests)
• Cygwin (on Windows)
Download
• Source: http://pig.apache.org/
• Version:
• 0.8.1 -- current stable
Install & Configure
• Extract: tar zxvf pig-0.8.1.tar.gz
• Move & Create Symbolic Link:
• ln -s pig-0.8.1 pig
• Edit: bin/pig
• export PIG_CLASSPATH=$HADOOP_HOME/conf
Verify Installation
• Verify:
(remember to start Hadoop first.)
• bin/pig -help (command options)
• bin/pig (run the grunt shell)
Running Pig
• Run Mode
• Local Mode -- single machine
• MapReduce Mode -- needs a Hadoop cluster (with HDFS)
• Run via:
• grunt shell
• pig scripts
• embedded programs
Pig IDE
• PigPen, an Eclipse-based IDE
• graphical data flow definition
• can show example data flow
Pig Components
• Pig Latin
• Pig Engine
• execution engine on top of Hadoop
• includes default optimal configurations
A client for your cluster
• Pig itself is not installed on the Hadoop cluster
• It runs on a client machine, connects to the cluster, and submits MapReduce jobs to it
Pig Latin
• Data flow language (Not declarative like SQL)
• Increases productivity (fewer lines do more)
• Includes standard operations like join, filter, group, sort
• User code and existing binaries can be included
• Supports nested data types
• Does not require metadata
Pig Latin Example
• Will leverage the tutorial that comes with the distribution
• Check the tutorial folder in the distribution
Start Grunt Shell
• cd $PIG_HOME
• bin/pig -x local
Aggregate Data
• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp,
query);
• alternate delimiters can be used and de-serializers like PigJsonLoader can
be leveraged
• grunt> grouped = GROUP log BY user;
• grunt> counted = FOREACH grouped GENERATE group, COUNT(log);
• grunt> STORE counted INTO 'output';
Group Data
• grunt> grouped = GROUP log BY user;
• In Pig, the GROUP operation generates (key, collection) pairs, where each collection is itself a bag of tuples.
• Every tuple in a bag carries the same key value as the key of its (key, collection) pair
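The shape of a grouped relation can be sketched in plain Python. This is only an illustration of the (key, collection) structure, not how Pig implements GROUP; the helper name `group_by` and the sample rows are hypothetical.

```python
from collections import defaultdict

def group_by(relation, key_index):
    """Return (key, bag) pairs; each bag holds the complete original tuples."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    # Sorted by key only to make the printed result predictable.
    return sorted(bags.items())

# Hypothetical (user, time, query) rows.
log = [("u1", "t1", "cats"), ("u2", "t2", "dogs"), ("u1", "t3", "pigs")]
grouped = group_by(log, 0)
```

Note that the user field still appears inside every tuple of the bag, which is why COUNT(log) in the Pig script can simply count whole tuples.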
Filter Data
• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
• grunt> grouped = GROUP log BY user;
• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
• grunt> filtered = FILTER counted BY cnt > 75;
• grunt> STORE filtered INTO 'output1';
Order Data
• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
• grunt> grouped = GROUP log BY user;
• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
• grunt> filtered = FILTER counted BY cnt > 50;
• grunt> sorted = ORDER filtered BY cnt;
• grunt> STORE sorted INTO 'output2';
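For readers who want to check the logic of the count, filter, and order steps without a Pig installation, the same data flow can be sketched in plain Python. The function name, the sample rows, and the threshold of 1 are hypothetical; Pig's ORDER sorts ascending by default, which `sorted` mirrors here.

```python
from collections import Counter

def count_filter_order(log, threshold):
    """Count rows per user, keep counts above threshold, sort ascending by count."""
    counted = Counter(user for user, _time, _query in log)
    filtered = [(user, cnt) for user, cnt in counted.items() if cnt > threshold]
    return sorted(filtered, key=lambda row: row[1])

# Hypothetical (user, time, query) rows.
log = [
    ("u1", "t1", "a"), ("u2", "t2", "b"), ("u1", "t3", "c"),
    ("u3", "t4", "d"), ("u3", "t5", "e"), ("u3", "t6", "f"),
]
result = count_filter_order(log, 1)
```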
Join Data Example
• Words appearing in Adventures of Huckleberry Finn by Mark Twain
• http://www.gutenberg.org/ebooks/76
• Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan
Doyle
• http://www.gutenberg.org/ebooks/1661
Loading & Counting Huckleberry Finn Data
• grunt> A = load 'pg76.txt';
• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
• grunt> C = filter B by word matches '\\w+';
• grunt> D = group C by word;
• grunt> E = foreach D generate COUNT(C), group;
• grunt> store E into 'huckleberry_finn_freq';
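The tokenize, filter, group, and count steps can be approximated in plain Python as follows. This is a sketch under stated assumptions: `str.split` is a simplified stand-in for TOKENIZE (which also splits on some punctuation), and the sample lines are invented, not taken from the book.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")

def word_frequencies(lines):
    """Return sorted (freq, word) pairs, matching the field order stored by E."""
    counts = Counter()
    for line in lines:
        for token in line.split():      # simplified stand-in for TOKENIZE
            if WORD.fullmatch(token):   # like matches, the whole token must fit
                counts[token] += 1
    return sorted((freq, word) for word, freq in counts.items())

# Hypothetical input lines.
lines = ["the raft drifted", "the river -- the raft"]
frequencies = word_frequencies(lines)
```

The `--` token is dropped because it contains no word characters, which is the job of the filter step in the script.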
Loading & Counting Sherlock Holmes Data
• grunt> A = load 'pg1661.txt';
• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
• grunt> C = filter B by word matches '\\w+';
• grunt> D = group C by word;
• grunt> E = foreach D generate COUNT(C), group;
• grunt> store E into 'sherlock_holmes_freq';
Join Data
• grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
• grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
• grunt> inboth = JOIN hf BY word, sh BY word;
• grunt> STORE inboth INTO 'output3';
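As a rough picture of what the join produces, here is an inner join on the word field in plain Python. The helper `inner_join` and the frequency values are hypothetical; the point is that each output row carries all fields from both inputs, so the word appears twice.

```python
from collections import defaultdict

def inner_join(left, right, left_key, right_key):
    """Pair every left row with every right row that shares the same key."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]

# Hypothetical (freq, word) rows.
hf = [(120, "river"), (45, "raft")]
sh = [(30, "river"), (80, "detective")]
inboth = inner_join(hf, sh, 1, 1)
```

Words that occur in only one of the two books ("raft" and "detective" here) do not appear in the result.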
Set Difference (A - B, in A but not in B; here A is sh and B is hf)
• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
• sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
• grouped = COGROUP hf BY word, sh BY word;
• not_in_hf = FILTER grouped BY COUNT(hf) == 0;
• out = FOREACH not_in_hf GENERATE FLATTEN(sh);
• STORE out INTO 'output4';
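The net effect of the COGROUP and FILTER steps is to keep the sh rows whose word never occurs in hf. A minimal Python sketch of that result, using a set lookup rather than a cogroup (so it shows the outcome, not Pig's mechanism); the helper name and data are hypothetical.

```python
def only_in(rows, other, key_index=1):
    """Keep the rows whose key does not appear anywhere in other."""
    seen = {row[key_index] for row in other}
    return [row for row in rows if row[key_index] not in seen]

# Hypothetical (freq, word) rows.
hf = [(120, "river"), (45, "raft")]
sh = [(30, "river"), (80, "detective")]
out = only_in(sh, hf)
```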
Cogroup Data
• Extends the idea of grouping to multiple collections
• Instead of (key, collection) pair, it now emits a key and a set of tuples from
each of the multiple collections
• With two sources of input it would be (key, collection1, collection2), where
tuples from the first source will be in collection1 and tuples from the
second source will be in collection2.
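The (key, collection1, collection2) shape can be sketched in plain Python. This is an illustration only; the helper `cogroup` and the sample rows are hypothetical, and the quadratic scan is chosen for readability rather than speed.

```python
def cogroup(first, second, first_key, second_key):
    """Emit (key, bag_from_first, bag_from_second) for every key seen in either input."""
    keys = sorted({r[first_key] for r in first} | {r[second_key] for r in second})
    return [
        (
            key,
            [r for r in first if r[first_key] == key],
            [r for r in second if r[second_key] == key],
        )
        for key in keys
    ]

# Hypothetical (freq, word) rows.
hf = [(120, "river"), (45, "raft")]
sh = [(30, "river"), (80, "detective")]
grouped = cogroup(hf, sh, 1, 1)
```

A key that is missing from one source still appears in the output, with an empty bag on that side. That empty bag is what the set difference script tests for.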
Data types Supported
• int, long, float, double, chararray, bytearray
• map, tuple (ordered), bag (unordered)
Data type Declaration
• hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);
• explicit data type declaration
• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
• weighted = FOREACH hf GENERATE freq * 100;
• type inference, freq cast to int
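One way to picture the type inference case: an undeclared field arrives as raw bytes, and multiplying it by an integer literal forces a cast to int. The Python sketch below models that idea. It assumes, as a simplification, that a value which cannot be cast becomes null (None here); the helper names and sample rows are hypothetical.

```python
def to_int(raw):
    """Cast a raw field to int, returning None when the bytes are not a number."""
    try:
        return int(raw)
    except ValueError:
        return None

def weigh(rows, factor=100):
    """Multiply the first field by factor, casting it from raw bytes first."""
    weighted = []
    for raw_freq, _word in rows:
        freq = to_int(raw_freq)
        weighted.append(None if freq is None else freq * factor)
    return weighted

# Hypothetical untyped rows, stored as raw bytes.
hf = [(b"12", b"river"), (b"n/a", b"raft")]
weighted = weigh(hf)
```

Declaring freq:int up front moves the same cast to load time, so every later statement sees a typed field.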
Custom Extensions
• User defined functions can be called from Pig scripts
• Nested operations can be carried out
• FOREACH grouped { sorted = ORDER hf BY counted;
  GENERATE group, CustomFunction(sorted); }
• Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
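The SPLIT statement above can be read as two independent filters over the same input. A small Python sketch, with a hypothetical helper name and sample data, shows one consequence worth noticing: a row whose first field is exactly 0 satisfies neither condition and lands in neither output.

```python
def split_by_sign(relation):
    """Route each row by the sign of its first field; rows equal to 0 match nothing."""
    negative = [row for row in relation if row[0] < 0]
    positive = [row for row in relation if row[0] > 0]
    return negative, positive

# Hypothetical single-field rows.
negative, positive = split_by_sign([(-2,), (0,), (5,), (-1,)])
```

If zero rows matter, one of the conditions would need to become `<=` or `>=`, or a third output would be added.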
Questions?
• blog: shanky.org | twitter: @tshanky
• st@treasuryofideas.com