2. What is PIG? Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs; Pig generates and compiles Map/Reduce programs on the fly.
3. Why PIG? Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
6. Running PIG Grunt Shell: Enter Pig commands manually using Pig’s interactive shell, Grunt. Script File: Place Pig commands in a script file and run the script. Embedded Program: Embed Pig commands in a host language and run the program.
7. Run Modes Local Mode: To run Pig in local mode, you need access to a single machine. Hadoop (MapReduce) Mode: To run Pig in Hadoop (MapReduce) mode, you need access to a Hadoop cluster and an HDFS installation.
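As a sketch, the mode is selected with Pig's -x flag at launch (the script name here is illustrative):

```
# Start the interactive Grunt shell in local mode (single machine, local filesystem)
pig -x local

# Run a script against a Hadoop cluster; mapreduce is the default mode
pig -x mapreduce myscript.pig
```

Running with no script argument drops you into the Grunt shell in the chosen mode.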
8. Sample PIG script A = load 'passwd' using PigStorage(':'); B = foreach A generate $0 as id; store B into 'id.out';
9. Sample Script With Schema A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = FOREACH A GENERATE myudfs.UPPER(name);
10. Eval Functions AVG CONCAT COUNT COUNT_STAR DIFF IsEmpty MAX MIN SIZE SUM TOKENIZE
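A short Pig Latin sketch showing a few of these eval functions together (file and field names are illustrative):

```
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A), AVG(A.gpa), MAX(A.age);
```

COUNT, AVG, and MAX operate on the bag produced by GROUP, which is why the grouping step comes first.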
11. Math Functions ABS ACOS ASIN ATAN CBRT CEIL COS COSH EXP FLOOR LOG LOG10 RANDOM ROUND SIN SINH SQRT TAN TANH
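The math functions apply per row inside a FOREACH; a minimal Pig Latin sketch (input file and field name are illustrative):

```
A = LOAD 'measurements' AS (x: double);
B = FOREACH A GENERATE ABS(x), ROUND(x), SQRT(ABS(x));
```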
13. Sample CW PIG script RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml'); input = FOREACH RawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions; GroupedInput = GROUP input BY (Category, TagId, URL); result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions; STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
14. Sample PIG script (Filtering) RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml'); input = FOREACH RawInput GENERATE ContextCategoryId as Category, DefLevelId, TagId, URL, Impressions; defFilter = FILTER input BY (DefLevelId == 8) OR (DefLevelId == 12); GroupedInput = GROUP defFilter BY (Category, TagId, URL); result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) as Impressions; STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
15. What is PIG UDF? UDF - User Defined Function. Types of UDFs: Eval Functions (extends EvalFunc&lt;String&gt;), Aggregate Functions (extends EvalFunc&lt;Long&gt; implements Algebraic), Filter Functions (extends FilterFunc). UDFContext: allows UDFs to get access to the JobConf object, and to pass configuration information between instantiations of the UDF on the frontend and backend.
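A plain-Java sketch (no Pig dependency; class and method names here are illustrative) of why Aggregate Functions implement Algebraic: the computation is split into initial, intermediate, and final stages so the MapReduce combiner can do partial work. Pig's real contract is the org.apache.pig.Algebraic interface, whose getInitial()/getIntermed()/getFinal() return the class names of these stages.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative decomposition of an algebraic aggregate (SUM) into the
// three stages Pig's Algebraic interface models. Not Pig API code.
public class AlgebraicSumSketch {
    // Initial stage: runs map-side, turns one input value into a partial result.
    static long initial(long value) {
        return value;
    }

    // Intermediate stage: runs in the combiner, merges partial results.
    static long intermed(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    // Final stage: runs reduce-side, merges combined partials into the answer.
    static long finalResult(List<Long> combined) {
        return combined.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Two "map tasks" each produce partials; the combiner merges per task,
        // then the reducer merges across tasks.
        long task1 = intermed(Arrays.asList(initial(1), initial(2)));
        long task2 = intermed(Arrays.asList(initial(3), initial(4)));
        System.out.println(finalResult(Arrays.asList(task1, task2))); // prints 10
    }
}
```

Because each stage is associative, Pig can run intermed on any subset of partials in any order, which is what makes the combiner optimization safe.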
16. Sample UDF
public class TopLevelDomain extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        Object o = tuple.get(0);
        if (o == null) {
            return null;
        }
        return Validator.getTLD(o.toString());
    }
}
17. UDF In Action
REGISTER '$WORK_DIR/pig-support.jar';
DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
AA = foreach input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain;