HADOOP SESSION-4



   Introduction to Pig
Session Outline

What is Pig?
Motivation
Background
Components & Architecture
Pig & Map-Reduce
Case Study – Log Analytics
Conclusion

Sunday, April 29, 2012       © Sabre Holdings, 2012   2
What is Pig?

Framework for analyzing large data sets
Sits on top of Hadoop




Pig has map-reduce powers!




Pig Food?
       Pig has a great taste for structured and unstructured data.


            CSVs, TSVs, and other delimited data
            Any kind of logs
            Unstructured sentences
            Databases via JDBC connections
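
A minimal sketch of how such inputs might be loaded. The file names and schemas are hypothetical; PigStorage and TextLoader are Pig's built-in load functions. Note that PigStorage(',') splits on every comma, so it does not handle quoted CSV fields.

```pig
-- Hypothetical file names and schemas, for illustration only.

-- Comma-delimited data with a declared schema.
users = LOAD 'users.csv' USING PigStorage(',')
        AS (id:int, name:chararray, country:chararray);

-- Tab-delimited data: the default loader (PigStorage) splits on tabs.
visits = LOAD 'visits.tsv' AS (user_id:int, url:chararray);

-- Raw log lines or free-form sentences: one chararray field per line.
lines = LOAD 'access.log' USING TextLoader() AS (line:chararray);
```

Declaring a schema with AS is optional, but it lets later statements refer to fields by name instead of by position ($0, $1, ...).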




Pig Language?

      Pig understands Pig Latin (a simple query algebra)
      - Data flow language
             - An interdependent series of operations
      - Handles ELT workloads very effectively
      - Filtering/aggregations/applying functions
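
As a rough illustration of the data-flow idea, each statement below names a relation and the next statement consumes it. The input file, field names, and the 1024-byte threshold are assumptions made up for this sketch.

```pig
-- Hypothetical input: tab-separated (user, url, bytes) records.
hits    = LOAD 'hits.tsv' AS (user:chararray, url:chararray, bytes:long);

-- Filtering: keep only the larger responses.
big     = FILTER hits BY bytes > 1024;

-- Applying a function: normalize user names with the built-in LOWER.
cleaned = FOREACH big GENERATE LOWER(user) AS user, bytes;

-- Aggregation: total bytes per user.
by_user = GROUP cleaned BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(cleaned.bytes) AS total_bytes;

STORE totals INTO 'totals_by_user';
```

Nothing runs until the STORE (or a DUMP) is reached; Pig then plans the whole chain as one or more Map-Reduce jobs.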




Pig is not Racist!!

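     Pig Streaming
     - Pig's STREAM operator lets Pig's food (its data) interact with
     alien (external) scripts/binaries

-- Load the raw log, then pipe every record through an external Perl script.
A = LOAD 'log.txt';
C = STREAM A THROUGH `extractor.pl`;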



Pig vs Traditional Map-Reduce (Challenges/Solutions)

Resources
   • Problem: Map-Reduce requires Java programmers.
   • Solution: Users familiar with scripting languages like Python/Perl can easily code.

Time
   • Problem: Map-Reduce involves multiple stages to arrive at a solution.
   • Solution: 100 lines of Java ~ 10 lines of Pig;
     4 hours of Java programming ~ 15 minutes of Pig programming.

Baked
   • Problem: In Map-Reduce, users have to re-invent common functionality like Join/Cross/Filter.
   • Solution: Programmers can leverage built-in libraries and functions for Join/Regex extraction etc.
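
To give a feel for the size comparison, here is the classic word count as a Pig Latin sketch. The input and output paths are hypothetical, and this only illustrates brevity; it is not a measurement behind the figures above.

```pig
-- Hypothetical input: a plain text file, one line per record.
lines   = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE splits a line into a bag of words; FLATTEN turns that bag into rows.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count the size of each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

STORE counts INTO 'wordcount_out';
```

The same job written directly against the Map-Reduce API needs a mapper class, a reducer class, and driver code.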



Appetite!

Pig can digest huge datasets
  - Batch log processing



NOTE:
Do NOT feed small datasets to Pig. It gets angry: the overhead of launching Map-Reduce jobs outweighs the work on small inputs.



Winner in Map-Reduce Race! (1.1x)
     If Pig was first, who was second?



Any Guesses?




How to Access Pig?




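   - Local Mode (pig -x local): runs on a single machine against the local file system
   - MapReduce Mode (pig -x mapreduce, the default): runs on a Hadoop cluster against HDFS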
Let’s Ride a Pig
•    LOAD
•    FOREACH ... GENERATE
•    FILTER
•    DUMP
•    STORE
•    STREAM
•    Regular expression extraction (REGEX_EXTRACT)
•    GROUP, COUNT, JOIN
•    Bags vs sets?
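
The operators listed above can be combined roughly as follows. This is a sketch under stated assumptions: both input files, their schemas, and the output path are hypothetical.

```pig
-- Hypothetical inputs: orders.tsv (order_id, cust_id, amount)
-- and customers.tsv (cust_id, email).
orders    = LOAD 'orders.tsv' AS (order_id:int, cust_id:int, amount:double);
customers = LOAD 'customers.tsv' AS (cust_id:int, email:chararray);

-- FILTER keeps only the tuples that satisfy the condition.
paid = FILTER orders BY amount > 0.0;

-- FOREACH ... GENERATE projects fields; REGEX_EXTRACT returns capture
-- group 1, here the mail domain after the '@'.
cust = FOREACH customers GENERATE cust_id, REGEX_EXTRACT(email, '@(.+)', 1) AS domain;

-- JOIN on the shared key. Ambiguous names must then be qualified
-- (paid::cust_id); unambiguous ones such as amount and domain need not be.
joined = JOIN paid BY cust_id, cust BY cust_id;

-- GROUP yields one tuple per key, holding a bag of the matching tuples.
by_domain = GROUP joined BY domain;
summary   = FOREACH by_domain GENERATE group AS domain,
                COUNT(joined) AS n_orders, SUM(joined.amount) AS revenue;

DUMP summary;                      -- print the result to the console
STORE summary INTO 'summary_out';  -- write the result to the file system
```

On bags vs sets: a bag may contain duplicate tuples, whereas a set may not. DISTINCT removes duplicates when set-like behavior is wanted.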

How can you forget this one?
• Piggy Bank
       – A shared library of user-contributed Pig functions (UDFs)
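
A minimal sketch of calling a Piggy Bank function. The jar path, the input file, and the PB_UPPER alias are assumptions; adjust the path to wherever piggybank.jar lives in your installation.

```pig
-- Assumed location of the jar; change to match your environment.
REGISTER /path/to/piggybank.jar;

-- Give the fully qualified Piggy Bank class a short alias (name chosen here).
DEFINE PB_UPPER org.apache.pig.piggybank.evaluation.string.UPPER();

-- Hypothetical input: one name per line.
names   = LOAD 'names.txt' AS (name:chararray);
shouted = FOREACH names GENERATE PB_UPPER(name) AS name_upper;

DUMP shouted;
```

REGISTER makes the jar's classes visible to the script, and DEFINE saves typing the full package name on every call.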




Theoretical Summarization

• Let us not be afraid of swine flu; we can still
  be friends with pigs.




CASE STUDY – LOG Analytics

• Apache Access Logs



                         Let’s work on it!
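
One possible starting point for the case study is sketched below: the ten most requested URLs in an access log. It assumes the log is in Apache common or combined format and that every line matches the pattern; the file name and field names are made up for illustration.

```pig
-- Hypothetical input file; each record is one raw log line.
raw = LOAD 'access_log' USING TextLoader() AS (line:chararray);

-- REGEX_EXTRACT_ALL returns all capture groups as a tuple; FLATTEN
-- spreads them into fields. Backslashes are doubled inside Pig strings.
-- Assumption: every line matches; malformed lines would need filtering first.
parsed = FOREACH raw GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line,
               '(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "\\S+ (\\S+) [^"]*" (\\d+) (\\S+).*'))
         AS (ip:chararray, ts:chararray, url:chararray,
             status:chararray, bytes:chararray);

-- Count requests per URL and keep the ten most requested.
by_url = GROUP parsed BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(parsed) AS n;
ranked = ORDER counts BY n DESC;
top10  = LIMIT ranked 10;

DUMP top10;
```

Swapping the grouping key to ip or status gives per-client or per-status-code counts with the same flow.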


RESOURCES

• Documentation – Apache Wiki (not enough)
• Questions –> Forums
       – Stack Overflow is my favorite
• Overview
       – Cloudera Video Training
• Best tutorial on the internet:
  http://pig.apache.org/docs/r0.7.0/tutorial.html

Introduction to Apache Pig

  • 1. HADOOP SESSION-4 Introduction to Pig
  • 2. Session Outline: What is Pig? Motivation, Background, Components & Architecture, Pig & Map-Reduce, Case Study – Log Analytics, Conclusion. Sunday, April 29, 2012 © Sabre Holdings, 2012
  • 3. What is Pig? A framework for analyzing large data sets that sits on top of Hadoop.
  • 4. Pig has map-reduce powers!
  • 5. Pig Food? Pig has a great taste for structured and unstructured data: CSVs, TSVs, and other delimited data; any kind of logs; unstructured sentences; databases via JDBC connections.
  • 6. Pig Language? Pig understands Pig Latin (a simple query algebra). It is a data-flow language: a script is an interdependent series of operations. It allows ELT very effectively and supports filtering, aggregations, and applying functions.
  • 7. Pig is not Racist!! Pig Streaming: the STREAM operator lets Pig’s food (its data) interact with alien (external) scripts and binaries. A = LOAD 'log.txt'; C = STREAM A THROUGH `extractor.pl`;
  • 8. Pig vs Traditional Map-Reduce (Challenges/Solutions). Resources. Problem: Map-Reduce requires a Java programmer. Solution: users familiar with scripting languages like Python/Perl can easily code. Time. Problem: Map-Reduce involves multiple stages to arrive at a solution. Solution: 100 lines of Java ~ 10 lines of Pig; 4 hours of Java programming ~ 15 minutes of Pig programming. Problem: in Map-Reduce, users have to re-invent common functionalities like Baked Join/Cross/Filter. Solution: programmers can leverage built-in libraries and functions for Join/Regex Extraction etc.
  • 9. Appetite! Pigs can digest huge datasets, such as batch log processing. NOTE: Do NOT feed small datasets to Pig. It gets angry.
  • 10. Winner in Map-Reduce Race! (1.1x) If Pig was first, who was second? Any guesses?
  • 11. How to Access Pig? Local mode; MapReduce mode.
  • 12. Let’s Ride a Pig: LOAD; FOREACH and GENERATE; FILTER; DUMP; STORE; STREAM; regular expression extraction; GROUP, COUNT, JOIN; bags vs sets?
  • 13. How can you forget this one? Piggy Bank: a Pig library of already-defined functions.
  • 14. Theoretical Summarization: Let us not be afraid of swine flu; we can still be friends with them.
  • 15. CASE STUDY – LOG Analytics: Apache access logs. Let’s work on it!
  • 16. RESOURCES. Documentation: Apache Wiki (not enough). Doubts: forums; Stack Overflow is my favorite. Overview: Cloudera video training. Best tutorial on the internet: http://pig.apache.org/docs/r0.7.0/tutorial.html
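
The operators listed on slide 12 and the log-analytics case study on slide 15 can be sketched together as one short Pig Latin script. This is a minimal sketch, not the presenter's actual demo. The file names access_log.txt and not_found_by_path, the relation names, and the regular expression (which assumes Apache Common Log Format) are all assumptions, and REGEX_EXTRACT_ALL assumes Pig 0.8 or later.

```pig
-- LOAD: read each log line as a single string.
-- 'access_log.txt' is a hypothetical input path.
raw = LOAD 'access_log.txt' USING TextLoader() AS (line:chararray);

-- REGULAR EXPRESSION EXTRACTION: pull out client IP, request path, and
-- status code. The pattern assumes Common Log Format, for example:
--   127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
parsed = FOREACH raw GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] "\\S+ (\\S+)[^"]*" (\\d{3}) .*$'))
    AS (ip:chararray, path:chararray, status:chararray);

-- FILTER: keep only "not found" responses. Lines that failed to parse
-- have a null status and are dropped by this comparison as well.
not_found = FILTER parsed BY status == '404';

-- GROUP and COUNT: number of 404 responses per requested path.
by_path = GROUP not_found BY path;
counts  = FOREACH by_path GENERATE group AS path, COUNT(not_found) AS hits;

-- Sort so the most frequently missing paths come first.
ordered = ORDER counts BY hits DESC;

-- DUMP prints the relation to the console (handy while developing);
-- STORE writes it out as tab-separated text.
-- 'not_found_by_path' is a hypothetical output directory.
DUMP ordered;
STORE ordered INTO 'not_found_by_path' USING PigStorage('\t');
```

Assuming the script is saved as top_404.pig (a hypothetical name), `pig -x local top_404.pig` runs it in local mode against the local file system, and `pig -x mapreduce top_404.pig` runs it on a Hadoop cluster. These correspond to the two access modes on slide 11.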