ETL with Apache Pig
By
Arjun Shah
Under the guidance of
Dr Duc Thanh Tran
Agenda
• What is Pig?
• Introduction to Pig Latin
• Installation of Pig
• Getting Started with Pig
• Examples
What is Pig?
• Pig is a dataflow language
• Language is called PigLatin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are compiled into MapReduce jobs
and executed on the cluster
• Built for Hadoop
• Originally developed at Yahoo!
• Huge contributions from Hortonworks, Twitter
What Pig Does
• Pig was designed for performing a long series of
data operations, making it ideal for three
categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
Features of Pig
• Joining datasets
• Grouping data
• Referring to elements by position rather than name ($0, $1, etc)
• Loading non-delimited data using a custom load/store function (writing a custom reader and writer)
• Creation of user-defined functions (UDF), written in Java
• And more..
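• For instance, joining and grouping datasets are one-liners; a quick sketch with made-up file and column names:
users = LOAD 'users.csv' USING PigStorage(',') AS (user_id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (order_id:int, user_id:int, amount:double);
joined = JOIN orders BY user_id, users BY user_id;
slim = FOREACH joined GENERATE users::name AS name, orders::amount AS amount;
by_user = GROUP slim BY name;
spend = FOREACH by_user GENERATE group AS name, SUM(slim.amount) AS total_spent;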
Pig: Install
• Prerequisites for installing Pig:
• JAVA_HOME must be set
• Hadoop must be installed (a single-node cluster is enough)
• Useful link: http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html
Pig: Install(2)
pig.apache.org/docs/r0.12.0/start.html
Pig: Install(3)
Pig: Install(4)
Pig: Install(5)
Move tar file to any location (here /usr/local)
• $ cd /usr/local
• $ sudo cp ~/Download/pig-0.12.0.tar.gz .
• $ sudo tar xzf pig-0.12.0.tar.gz
• $ sudo mv pig-0.12.0 pig
Change .bashrc
• Edit the .bashrc file:
• $ gedit ~/.bashrc
• Add to .bashrc:
• export PIG_HOME=/usr/local/pig
• export PATH=$PATH:$PIG_HOME/bin
• Close and reopen the terminal, then try pig -h
pig -h : Output
Pig: Configure
• The user can run Pig in two modes:
• Local mode (pig -x local) - Everything runs on a single machine, using the local host and local file system.
• Hadoop (MapReduce) mode - The default mode, which requires access to a Hadoop cluster and HDFS.
• The user can run Pig in either mode using the “pig” command or the “java” command.
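• For example, the mode can be chosen on the command line (a quick sketch; the script name is a placeholder):
$ pig -x local (start the Grunt shell in local mode)
$ pig -x mapreduce (start Grunt in Hadoop mode, same as plain pig)
$ pig -x local myscript.pig (run a script in local mode)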
Pig: Run
• Script: Pig can run a script file that contains Pig commands.
• For example,
% pig script.pig
• Runs the commands in the local file “script.pig”.
• Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.
• Grunt: Grunt is an interactive shell for running Pig commands.
• Grunt is started when no file is specified for Pig to run, and the -e option is not used.
• Note: It is also possible to run Pig scripts from within Grunt using run and exec.
• Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs
from Java.
• There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig
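• A quick sketch of these options (file names are placeholders):
• $ pig -x local -e "A = LOAD 'data'; DUMP A;" (the -e option runs an inline script)
• grunt> exec script.pig (runs the script in a separate, batch-like context)
• grunt> run script.pig (runs the script as if its lines were typed into the current Grunt session)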
Pig Latin: Loading Data
• LOAD
- Reads data from the file system
• Syntax
- LOAD ‘input’ [USING function] [AS schema];
-Eg, A = LOAD ‘input’ USING PigStorage(‘\t’) AS
(name:chararray, age:int, gpa:float);
Schema
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
-name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int,
gpa:float);
-name is now a String (chararray), age is integer
and gpa is float
Describing Schema
• Describe
• Provides the schema of a relation
• Syntax
• DESCRIBE [alias];
• If schema is not provided, describe will say “Schema for alias unknown”
• grunt> A = load 'data' as (a:int, b: long, c: float);
• grunt> describe A;
• A: {a: int, b: long, c: float}
• grunt> B = load 'somemoredata';
• grunt> describe B;
• Schema for B unknown.
Dump and Store
• Dump writes the output to console
• grunt> A = load ‘data’;
• grunt> DUMP A; //This will print contents of A on Console
• Store writes output to an HDFS location
• grunt> A = load ‘data’;
• grunt> STORE A INTO ‘/user/username/output’; //This will
write contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered
Referencing Fields
• Fields are referred to by positional notation OR by name (alias)
• Positional notation is generated by the system
• Starts with $0
• Names are assigned by you using schemas. Eg, A = load
‘data’ as (name:chararray, age:int);
• With positional notation, fields can be accessed as
• A = load ‘data’;
• B = foreach A generate $0, $1; //1st & 2nd column
Limit
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
• grunt> A = load 'data';
• grunt> B = LIMIT A 10;
• grunt> DUMP B; --Prints only 10 rows
Foreach.. Generate
• Used for data transformations and projections
• Syntax
• alias = FOREACH { block | nested_block };
• nested_block usage later in the deck
• grunt> A = load ‘data’ as (a1,a2,a3);
• grunt> B = FOREACH A GENERATE *;
• grunt> DUMP B;
• (1,2,3)
• (4,2,1)
• grunt> C = FOREACH A GENERATE a1, a3;
• grunt> DUMP C;
• (1,3)
• (4,1)
Filter
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• For example, to filter for ‘marcbenioff’
• A = LOAD ‘sfdcemployees’ USING PigStorage(‘,’) as
(name:chararray,employeesince:int,age:int);
• B = FILTER A BY name == ‘marcbenioff’;
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND (NOT(name ==
‘marcbenioff’));
Group By
• Groups data in one or more relations (similar to SQL GROUP BY)
• Syntax:
• alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL
n];
• Eg, to group by employee start year at Salesforce
• A = LOAD ‘sfdcemployees’ USING PigStorage(‘,’) as (name:chararray,
employeesince:int, age:int);
• B = GROUP A BY (employeesince);
• You can also put all tuples into a single group
• B = GROUP A ALL;
• Or Group by multiple fields
• B = GROUP A BY (age, employeesince);
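• A minimal sketch of the nested_block form mentioned earlier, reusing the relations above (the age threshold is just for illustration):
C = FOREACH B {
under30 = FILTER A BY age < 30;
GENERATE group AS employeesince, COUNT(under30) AS under30_count;
};
• Inside the braces you can apply FILTER, ORDER, DISTINCT and LIMIT to the per-group bag (here A) before the final GENERATE.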
Demo: Sample Data (employee.txt)
• Example contents of ‘employee.txt’, a tab-delimited text file:
• 1 Peter 234000000 none
• 2 Peter_01 234000000 none
• 124163 Jacob 10000 cloud
• 124164 Arthur 1000000 setlabs
• 124165 Robert 1000000 setlabs
• 124166 Ram 450000 es
• 124167 Madhusudhan 450000 e&r
• 124168 Alex 6500000 e&r
• 124169 Bob 50000 cloud
Demo: Employees with salary > 1 lakh (100,000)
• Load the data from employee.txt into the empls relation, with a schema
empls = LOAD ‘employee.txt’ AS (id:int, name:chararray, salary:double,
dept:chararray);
• Filtering the data as required
rich = FILTER empls BY $2 > 100000;
• Sorting
sortd = ORDER rich BY salary DESC;
• Storing the final results
STORE sortd INTO ‘rich_employees.txt’;
• Or, alternatively, dump the records to the screen
DUMP sortd;
------------------------------------------------------------------
• Group by salary
grp = GROUP empls BY salary;
• Get count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) as emp_cnt;
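• To run the whole demo as a script in local mode (assuming it is saved as, say, employee_demo.pig):
$ pig -x local employee_demo.pig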
Output
More PigLatin (1/2)
• Load using PigStorage
• empls = LOAD ‘employee.txt’ USING PigStorage('\t') AS (id:int, name:chararray, salary:double, dept:chararray);
• Store using PigStorage
• STORE sortd INTO ‘rich_employees.txt’ USING PigStorage('\t');
More PigLatin (2/2)
• To view the schema of a relation
• DESCRIBE empls;
• To view step-by-step execution of a series of
statements
• ILLUSTRATE empls;
• To view the execution plan of a relation
• EXPLAIN empls;
Exploring Pig with Project Data Set
Pig: Local Mode using Project Example
Pig: Hadoop Mode (GUI) using Project Example
Output
Crimes having category as VANDALISM
Output
Crimes occurring on Saturday & Sunday
Output
Grouping crimes by category
Output
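The screenshots behind these slides are not reproduced here; a rough sketch of the kinds of queries they show, assuming a comma-delimited crime dataset with hypothetical field names:
crimes = LOAD 'crime_data.csv' USING PigStorage(',') AS (incident_id:chararray, category:chararray, day_of_week:chararray, district:chararray);
vandalism = FILTER crimes BY category == 'VANDALISM';
weekend = FILTER crimes BY (day_of_week == 'Saturday') OR (day_of_week == 'Sunday');
by_category = GROUP crimes BY category;
category_counts = FOREACH by_category GENERATE group, COUNT(crimes);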
PigLatin: UDF
• Pig provides extensive support for user-defined
functions (UDFs) as a way to specify custom
processing. Functions can be a part of almost
every operator in Pig
• All UDF names are case-sensitive
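• A minimal sketch of using a Java UDF from Pig Latin (the jar and class names are hypothetical):
REGISTER myudfs.jar;
DEFINE UPPER com.example.pig.UPPER();
A = LOAD 'data' AS (name:chararray, age:int);
B = FOREACH A GENERATE UPPER(name), age;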
UDF: Types
• Eval Functions (EvalFunc)
• Ex: StringConcat (built-in) : Generates the concatenation of the first two fields
of a tuple.
• Aggregate Functions (EvalFunc & Algebraic)
• Ex: COUNT, AVG (both built-in)
• Filter Functions (FilterFunc)
• Ex: IsEmpty (built-in)
• Load/Store Functions (LoadFunc/StoreFunc)
• Ex: PigStorage (built-in)
• Note: the built-in functions are documented at http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
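• For example, the aggregate built-ins can be used directly in a script (reusing the empls relation from the demo):
by_dept = GROUP empls BY dept;
dept_stats = FOREACH by_dept GENERATE group AS dept, COUNT(empls) AS headcount, AVG(empls.salary) AS avg_salary;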
Summary
• Pig can be used to run ETL jobs on Hadoop. It
saves you from writing MapReduce code in Java
while its syntax may look familiar to SQL users.
Nonetheless, it is important to take some time to
learn Pig and to understand its advantages and
limitations. Who knows, maybe pigs can fly after
all.