SlideShare a Scribd company logo
1 of 19
Introduction To PIG The evolution of data processing frameworks
What is PIG? Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs Pig generates and compiles a Map/Reduce program(s) on the fly.
Why PIG? Ease of programming - It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
File Formats PigStorage Custom Load / Store Functions
Installing PIG Download / Unpack tarball (pig.apache.org) Install RPM / DEB package (cloudera.com)
Running PIG Grunt Shell: Enter Pig commands manually using Pig’s interactive shell, Grunt. Script File: Place Pig commands in a script file and run the script. Embedded Program: Embed Pig commands in a host language and run the program.
Run Modes Local Mode: To run Pig in local mode, you need access to a single machine. Hadoop(mapreduce) Mode: To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation.
Sample PIG script A = load 'passwd' using PigStorage(':');  B = foreach A generate $0 as id; store B into ‘id.out’;
Sample Script With Schema A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = FOREACH A GENERATE myudfs.UPPER(name);
Eval Functions AVG CONCAT Example COUNT COUNT_STAR DIFF IsEmpty MAX MIN SIZE SUM TOKENIZE
Math Functions # Math Functions ABS ACOS ASIN ATAN CBRT CEIL COSH COS EXP FLOOR LOG LOG10 RANDOM ROUND SIN SINH SQRT TAN TANH
Pig Types
Sample CW PIG script RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml'); input = foreachRawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions; GroupedInput = GROUP input BY (Category, TagId, URL); result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions; STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
Sample PIG script (Filtering) RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml'); input = foreachRawInput GENERATE ContextCategoryId as Category, DefLevelId , TagId, URL,Impressions; defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12); GroupedInput = GROUP defFilter BY (Category, TagId, URL); result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions; STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
What is PIG UDF? UDF  - User Defined Function Types of UDF’s: Eval Functions (extends EvalFunc<String>) Aggregate Functions (extends EvalFunc<Long> implements Algebraic) Filter Functions (extends FilterFunc) UDFContext Allows UDFs to get access to the JobConfobject Allows UDFs to pass configuration information between instantiations of the UDF on the front and backends.
Sample UDF public class TopLevelDomain extends EvalFunc<String> { 	@Override 	public String exec(Tupletuple) throws IOException { 		Object o = tuple.get(0); 		if (o == null) { 			return null; 		} 		return Validator.getTLD(o.toString()); 	} }
UDF In Action REGISTER '$WORK_DIR/pig-support.jar'; DEFINE getTopLevelDomaincom.contextweb.pig.udf.TopLevelDomain(); AA = foreach input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain
Resources Apache PIG http://pig.apache.org/ Apache Hadoophttp://hadoop.apache.org/ Cloudera CDH https://wiki.cloudera.com/display/DOC/CDH3+Installation
PIG DEMO

More Related Content

What's hot

Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...Edureka!
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 

What's hot (20)

Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Pig latin
Pig latinPig latin
Pig latin
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 

Viewers also liked

Viewers also liked (7)

Hive ppt (1)
Hive ppt (1)Hive ppt (1)
Hive ppt (1)
 
Une introduction à Hive
Une introduction à HiveUne introduction à Hive
Une introduction à Hive
 
Un introduction à Pig
Un introduction à PigUn introduction à Pig
Un introduction à Pig
 
Big Data : concepts, cas d'usage et tendances
Big Data : concepts, cas d'usage et tendancesBig Data : concepts, cas d'usage et tendances
Big Data : concepts, cas d'usage et tendances
 
Big data - Cours d'introduction l Data-business
Big data - Cours d'introduction l Data-businessBig data - Cours d'introduction l Data-business
Big data - Cours d'introduction l Data-business
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 

Similar to Introduction to Apache Pig

AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewDan Morrill
 
Practical pig
Practical pigPractical pig
Practical pigtrihug
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateYahoo Developer Network
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
Introduction To Groovy 2005
Introduction To Groovy 2005Introduction To Groovy 2005
Introduction To Groovy 2005Tugdual Grall
 
PSGI and Plack from first principles
PSGI and Plack from first principlesPSGI and Plack from first principles
PSGI and Plack from first principlesPerl Careers
 
Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Guillaume Laforge
 
Scripting GeoServer with GeoScript
Scripting GeoServer with GeoScriptScripting GeoServer with GeoScript
Scripting GeoServer with GeoScriptJustin Deoliveira
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...DrPDShebaKeziaMalarc
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to PigMike Unwin
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Pig_Presentation
Pig_PresentationPig_Presentation
Pig_PresentationArjun Shah
 

Similar to Introduction to Apache Pig (20)

AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Practical pig
Practical pigPractical pig
Practical pig
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Introduction To Groovy 2005
Introduction To Groovy 2005Introduction To Groovy 2005
Introduction To Groovy 2005
 
Pig
PigPig
Pig
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
PSGI and Plack from first principles
PSGI and Plack from first principlesPSGI and Plack from first principles
PSGI and Plack from first principles
 
Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007
 
Scripting GeoServer with GeoScript
Scripting GeoServer with GeoScriptScripting GeoServer with GeoScript
Scripting GeoServer with GeoScript
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
 
pig.ppt
pig.pptpig.ppt
pig.ppt
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
What's New in ZF 1.10
What's New in ZF 1.10What's New in ZF 1.10
What's New in ZF 1.10
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Pig_Presentation
Pig_PresentationPig_Presentation
Pig_Presentation
 

More from Jason Shao

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Managing Hadoop with Puppet
Managing Hadoop with PuppetManaging Hadoop with Puppet
Managing Hadoop with PuppetJason Shao
 
NYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and PerformanceNYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and PerformanceJason Shao
 
Sakai NYC User Group
Sakai NYC User GroupSakai NYC User Group
Sakai NYC User GroupJason Shao
 

More from Jason Shao (6)

Tune hadoop
Tune hadoopTune hadoop
Tune hadoop
 
Sgi hadoop
Sgi hadoopSgi hadoop
Sgi hadoop
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Managing Hadoop with Puppet
Managing Hadoop with PuppetManaging Hadoop with Puppet
Managing Hadoop with Puppet
 
NYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and PerformanceNYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and Performance
 
Sakai NYC User Group
Sakai NYC User GroupSakai NYC User Group
Sakai NYC User Group
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Introduction to Apache Pig

  • 1. Introduction To PIG The evolution of data processing frameworks
  • 2. What is PIG? Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs Pig generates and compiles a Map/Reduce program(s) on the fly.
  • 3. Why PIG? Ease of programming - It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • 4. File Formats PigStorage Custom Load / Store Functions
  • 5. Installing PIG Download / Unpack tarball (pig.apache.org) Install RPM / DEB package (cloudera.com)
  • 6. Running PIG Grunt Shell: Enter Pig commands manually using Pig’s interactive shell, Grunt. Script File: Place Pig commands in a script file and run the script. Embedded Program: Embed Pig commands in a host language and run the program.
  • 7. Run Modes Local Mode: To run Pig in local mode, you need access to a single machine. Hadoop(mapreduce) Mode: To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation.
  • 8. Sample PIG script A = load 'passwd' using PigStorage(':'); B = foreach A generate $0 as id; store B into ‘id.out’;
  • 9. Sample Script With Schema A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = FOREACH A GENERATE myudfs.UPPER(name);
  • 10. Eval Functions AVG CONCAT Example COUNT COUNT_STAR DIFF IsEmpty MAX MIN SIZE SUM TOKENIZE
  • 11. Math Functions # Math Functions ABS ACOS ASIN ATAN CBRT CEIL COSH COS EXP FLOOR LOG LOG10 RANDOM ROUND SIN SINH SQRT TAN TANH
  • 13. Sample CW PIG script RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml'); input = foreachRawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions; GroupedInput = GROUP input BY (Category, TagId, URL); result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions; STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
  • 14. Sample PIG script (Filtering) RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml'); input = foreachRawInput GENERATE ContextCategoryId as Category, DefLevelId , TagId, URL,Impressions; defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12); GroupedInput = GROUP defFilter BY (Category, TagId, URL); result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions; STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
  • 15. What is PIG UDF? UDF - User Defined Function Types of UDF’s: Eval Functions (extends EvalFunc<String>) Aggregate Functions (extends EvalFunc<Long> implements Algebraic) Filter Functions (extends FilterFunc) UDFContext Allows UDFs to get access to the JobConfobject Allows UDFs to pass configuration information between instantiations of the UDF on the front and backends.
  • 16. Sample UDF public class TopLevelDomain extends EvalFunc<String> { @Override public String exec(Tupletuple) throws IOException { Object o = tuple.get(0); if (o == null) { return null; } return Validator.getTLD(o.toString()); } }
  • 17. UDF In Action REGISTER '$WORK_DIR/pig-support.jar'; DEFINE getTopLevelDomaincom.contextweb.pig.udf.TopLevelDomain(); AA = foreach input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain
  • 18. Resources Apache PIG http://pig.apache.org/ Apache Hadoophttp://hadoop.apache.org/ Cloudera CDH https://wiki.cloudera.com/display/DOC/CDH3+Installation