SlideShare ist ein Scribd-Unternehmen logo
1 von 16
HADOOP SESSION-4



   Introduction to Pig
Session Outline

What is Pig?
Motivation
Background
Components & Architecture
Pig & Map-Reduce
Case Study – Log Analytics
Conclusion

Sunday, April 29, 2012       © Sabre Holdings, 2012   2
What is Pig?

Framework for Analyzing large Data Sets
Sits on top of hadoop




Sunday, April 29, 2012     © Sabre Holdings, 2012   3
Pig has map-reduce powers!




                         +                            =
Sunday, April 29, 2012       © Sabre Holdings, 2012       4
Pig Food?
       Pig has great taste for structured and Unstructured Data.


            CSV’s, TSV’s, Delimited Data
            Any Kind of Logs
            Unstructured Sentences.
            Databases via JDBC Connections




Sunday, April 29, 2012       © Sabre Holdings, 2012                5
Pig Language?

      Pig Understands Pig-Latin (Simple Query Algebra)
      - Data Flow Language
             - Interdependent series of operations
      - Allows ELT’s very effectively
      - Filtering/Aggregations/Applying Functions




Sunday, April 29, 2012          © Sabre Holdings, 2012   6
Pig is not Racist!!

     Pig Streaming
     - Pig Stream allows pig’s food to interact with
     alien scripts/binaries

A= LOAD ‘log.txt’
C= STREAM A THROUGH ‘extractor.pl’



Sunday, April 29, 2012        © Sabre Holdings, 2012   7
Pig vs Traditional Map-Reduce
                              (Challenges/Solutions)

                                            •Problem:

                         Resources           Map-Reduce requires Java Programmer
                                            •Solution:
                                             Users familiar with scripting languages like Python/Perl can easily code.




                                            •Problem:


                         Time                Map-Reduce involves multiple stages to arrive at a solution
                                            • Solution:
                                             100 lines of Java ~ 10 lines of Pig
                                             4 hours of Java Programming ~ 15 minutes of Pig Programming




                                            •Problem:
                                             In Map-Reduce, users have to re-invent common functionalities like


                     Baked                   Join/Cross/Filter
                                            •Solution:
                                             Programmers can leverage inbuilt libraries and functions for Join/Regex Extraction
                                             etc.



Sunday, April 29, 2012               © Sabre Holdings, 2012                                                              8
Appetite!

Pigs can digest huge datasets
  - Batch Log Processing



NOTE:
Do NOT FEED small datasets to pig. It gets angry.



Sunday, April 29, 2012    © Sabre Holdings, 2012   9
Winner in Map-Reduce Race! (1.1x)
     If Pig was first, who was second?



Any Guesses?




Sunday, April 29, 2012   © Sabre Holdings, 2012   10
How to Access Pig?




                                                       Local Mode
              MapReduce Mode
Sunday, April 29, 2012        © Sabre Holdings, 2012                11
Let’s Ride a Pig
•    LOAD
•    GENERATE, FOREACH
•    FILTERS
•    DUMP
•    STORE
•    STREAM
•    REGULAR EXPRESSION EXTRACTION
•    Group, Count, Joins
•    BAGS vs SETS?

Sunday, April 29, 2012       © Sabre Holdings, 2012   12
How can you forget this one?
• Piggy Bank
       – Pig library for already defined functions




Sunday, April 29, 2012     © Sabre Holdings, 2012    13
Theoretical Summarization

• Let us not be afraid of Swine Flu, We can still
  be friends with them.




Sunday, April 29, 2012   © Sabre Holdings, 2012     14
CASE STUDY – LOG Analytics

• Apache Access Logs



                         Let’s work on it!


Sunday, April 29, 2012         © Sabre Holdings, 2012   15
RESOURCES

• Documentation – Apache Wiki (not enough)
• Doubts –> Forums
       – Stack overflow is my favorite
• Overview
       – Cloudera Video Training
• Best Tutorial on internet:
  http://pig.apache.org/docs/r0.7.0/tutorial.ht
  ml
Sunday, April 29, 2012     © Sabre Holdings, 2012   16

Más contenido relacionado

Was ist angesagt?

Big Data Training in Ludhiana
Big Data Training in LudhianaBig Data Training in Ludhiana
Big Data Training in LudhianaE2MATRIX
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersAmal G Jose
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
A Day in the Life of a Hadoop Administrator
A Day in the Life of a Hadoop AdministratorA Day in the Life of a Hadoop Administrator
A Day in the Life of a Hadoop AdministratorEdureka!
 
Yahoo! - Arun Murthy - Hadoop World 2010
Yahoo! - Arun Murthy - Hadoop World 2010Yahoo! - Arun Murthy - Hadoop World 2010
Yahoo! - Arun Murthy - Hadoop World 2010Cloudera, Inc.
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data AnalyticsData 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data AnalyticsAvkash Chauhan
 
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...Hadoop / Spark Conference Japan
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
A day in the life of hadoop administrator!
A day in the life of hadoop administrator!A day in the life of hadoop administrator!
A day in the life of hadoop administrator!Edureka!
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Edureka!
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pigRavi Mutyala
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data LaboratoryJ Singh
 
Best hadoop-online-training
Best hadoop-online-trainingBest hadoop-online-training
Best hadoop-online-trainingGeohedrick
 
Introduction to Hive for Hadoop
Introduction to Hive for HadoopIntroduction to Hive for Hadoop
Introduction to Hive for Hadoopryanlecompte
 

Was ist angesagt? (20)

Big Data Training in Ludhiana
Big Data Training in LudhianaBig Data Training in Ludhiana
Big Data Training in Ludhiana
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
A Day in the Life of a Hadoop Administrator
A Day in the Life of a Hadoop AdministratorA Day in the Life of a Hadoop Administrator
A Day in the Life of a Hadoop Administrator
 
Yahoo! - Arun Murthy - Hadoop World 2010
Yahoo! - Arun Murthy - Hadoop World 2010Yahoo! - Arun Murthy - Hadoop World 2010
Yahoo! - Arun Murthy - Hadoop World 2010
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data AnalyticsData 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
Data 360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics
 
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
A day in the life of hadoop administrator!
A day in the life of hadoop administrator!A day in the life of hadoop administrator!
A day in the life of hadoop administrator!
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data Laboratory
 
Best hadoop-online-training
Best hadoop-online-trainingBest hadoop-online-training
Best hadoop-online-training
 
hadoop_module6
hadoop_module6hadoop_module6
hadoop_module6
 
Introduction to Hive for Hadoop
Introduction to Hive for HadoopIntroduction to Hive for Hadoop
Introduction to Hive for Hadoop
 

Andere mochten auch

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Introduction to Hadoop and Pig
Introduction to Hadoop and PigIntroduction to Hadoop and Pig
Introduction to Hadoop and Pigprash1784
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Hue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorHue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorRomain Rigaux
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data setsCreditas
 
An Introduction to JVM Internals and Garbage Collection in Java
An Introduction to JVM Internals and Garbage Collection in JavaAn Introduction to JVM Internals and Garbage Collection in Java
An Introduction to JVM Internals and Garbage Collection in JavaAbhishek Asthana
 
Understanding Java Garbage Collection
Understanding Java Garbage CollectionUnderstanding Java Garbage Collection
Understanding Java Garbage CollectionAzul Systems Inc.
 
Java Garbage Collection - How it works
Java Garbage Collection - How it worksJava Garbage Collection - How it works
Java Garbage Collection - How it worksMindfire Solutions
 
An introduction to hadoop
An introduction to hadoopAn introduction to hadoop
An introduction to hadoopMinJae Kang
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopJeyamariappan Guru
 

Andere mochten auch (20)

Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Introduction to Hadoop and Pig
Introduction to Hadoop and PigIntroduction to Hadoop and Pig
Introduction to Hadoop and Pig
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Hue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorHue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL Editor
 
Hadoop - Apache Pig
Hadoop - Apache PigHadoop - Apache Pig
Hadoop - Apache Pig
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data sets
 
An Introduction to JVM Internals and Garbage Collection in Java
An Introduction to JVM Internals and Garbage Collection in JavaAn Introduction to JVM Internals and Garbage Collection in Java
An Introduction to JVM Internals and Garbage Collection in Java
 
Understanding Java Garbage Collection
Understanding Java Garbage CollectionUnderstanding Java Garbage Collection
Understanding Java Garbage Collection
 
Java Garbage Collection - How it works
Java Garbage Collection - How it worksJava Garbage Collection - How it works
Java Garbage Collection - How it works
 
An introduction to hadoop
An introduction to hadoopAn introduction to hadoop
An introduction to hadoop
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 

Ähnlich wie Introduction to Apache Pig

An Analytics Toolkit Tour
An Analytics Toolkit TourAn Analytics Toolkit Tour
An Analytics Toolkit TourRory Winston
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
eLearning Suite 6 Workflow
eLearning Suite 6 WorkfloweLearning Suite 6 Workflow
eLearning Suite 6 WorkflowKirsten Rourke
 
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Michael Arnold
 
Building infrastructure for Big Data
Building infrastructure for Big DataBuilding infrastructure for Big Data
Building infrastructure for Big DataPromptCloud
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopGrant Ingersoll
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopJoey Jablonski
 
Ada 2012
Ada 2012Ada 2012
Ada 2012AdaCore
 
Back-end with SonataAdminBundle (and Symfony2, of course...)
Back-end with SonataAdminBundle (and Symfony2, of course...)Back-end with SonataAdminBundle (and Symfony2, of course...)
Back-end with SonataAdminBundle (and Symfony2, of course...)Andrea Delfino
 
Making Sense of Big data with Hadoop
Making Sense of Big data with HadoopMaking Sense of Big data with Hadoop
Making Sense of Big data with HadoopGwen (Chen) Shapira
 
The state of drupal 8 - Drupalcamp Gent
The state of drupal 8  - Drupalcamp GentThe state of drupal 8  - Drupalcamp Gent
The state of drupal 8 - Drupalcamp Gentswentel
 
Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012
Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012
Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012Bjarni Kristjánsson
 
Extend starfish to Support the Growing Hadoop Ecosystem
Extend starfish to Support the Growing Hadoop EcosystemExtend starfish to Support the Growing Hadoop Ecosystem
Extend starfish to Support the Growing Hadoop EcosystemFei Dong
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014Repository Fringe
 

Ähnlich wie Introduction to Apache Pig (20)

An Analytics Toolkit Tour
An Analytics Toolkit TourAn Analytics Toolkit Tour
An Analytics Toolkit Tour
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
eLearning Suite 6 Workflow
eLearning Suite 6 WorkfloweLearning Suite 6 Workflow
eLearning Suite 6 Workflow
 
Hadoop operations
Hadoop operationsHadoop operations
Hadoop operations
 
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
 
Building infrastructure for Big Data
Building infrastructure for Big DataBuilding infrastructure for Big Data
Building infrastructure for Big Data
 
Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Ada 2012
Ada 2012Ada 2012
Ada 2012
 
Back-end with SonataAdminBundle (and Symfony2, of course...)
Back-end with SonataAdminBundle (and Symfony2, of course...)Back-end with SonataAdminBundle (and Symfony2, of course...)
Back-end with SonataAdminBundle (and Symfony2, of course...)
 
Making Sense of Big data with Hadoop
Making Sense of Big data with HadoopMaking Sense of Big data with Hadoop
Making Sense of Big data with Hadoop
 
The state of drupal 8 - Drupalcamp Gent
The state of drupal 8  - Drupalcamp GentThe state of drupal 8  - Drupalcamp Gent
The state of drupal 8 - Drupalcamp Gent
 
Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012
Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012
Seminar: Embedding Optimization in Applications with MPL OptiMax - April 2012
 
Extend starfish to Support the Growing Hadoop Ecosystem
Extend starfish to Support the Growing Hadoop EcosystemExtend starfish to Support the Growing Hadoop Ecosystem
Extend starfish to Support the Growing Hadoop Ecosystem
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014
 

Último

CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024Brian Pichman
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxNeo4j
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3DianaGray10
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxKaustubhBhavsar6
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfInfopole1
 

Último (20)

CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptx
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 

Introduction to Apache Pig

  • 1. HADOOP SESSION-4 Introduction to Pig
  • 2. Session Outline What is Pig? Motivation Background Components & Architecture Pig & Map-Reduce Case Study – Log Analytics Conclusion Sunday, April 29, 2012 © Sabre Holdings, 2012 2
  • 3. What is Pig? Framework for Analyzing large Data Sets Sits on top of hadoop Sunday, April 29, 2012 © Sabre Holdings, 2012 3
  • 4. Pig has map-reduce powers! + = Sunday, April 29, 2012 © Sabre Holdings, 2012 4
  • 5. Pig Food? Pig has great taste for structured and Unstructured Data. CSV’s, TSV’s, Delimited Data Any Kind of Logs Unstructured Sentences. Databases via JDBC Connections Sunday, April 29, 2012 © Sabre Holdings, 2012 5
  • 6. Pig Language? Pig Understands Pig-Latin (Simple Query Algebra) - Data Flow Language - Interdependent series of operations - Allows ELT’s very effectively - Filtering/Aggregations/Applying Functions Sunday, April 29, 2012 © Sabre Holdings, 2012 6
  • 7. Pig is not Racist!! Pig Streaming - Pig Stream allows pig’s food to interact with alien scripts/binaries A= LOAD ‘log.txt’ C= STREAM A THROUGH ‘extractor.pl’ Sunday, April 29, 2012 © Sabre Holdings, 2012 7
  • 8. Pig vs Traditional Map-Reduce (Challenges/Solutions) •Problem: Resources Map-Reduce requires Java Programmer •Solution: Users familiar with scripting languages like Python/Perl can easily code. •Problem: Time Map-Reduce involves multiple stages to arrive at a solution • Solution: 100 lines of Java ~ 10 lines of Pig 4 hours of Java Programming ~ 15 minutes of Pig Programming •Problem: In Map-Reduce, users have to re-invent common functionalities like Baked Join/Cross/Filter •Solution: Programmers can leverage inbuilt libraries and functions for Join/Regex Extraction etc. Sunday, April 29, 2012 © Sabre Holdings, 2012 8
  • 9. Appetite! Pigs can digest huge datasets - Batch Log Processing NOTE: Do NOT FEED small datasets to pig. It gets angry. Sunday, April 29, 2012 © Sabre Holdings, 2012 9
  • 10. Winner in Map-Reduce Race! (1.1x) If Pig was first, who was second? Any Guesses? Sunday, April 29, 2012 © Sabre Holdings, 2012 10
  • 11. How to Access Pig? Local Mode MapReduce Mode Sunday, April 29, 2012 © Sabre Holdings, 2012 11
  • 12. Let’s Ride a Pig • LOAD • GENERATE, FOREACH • FILTERS • DUMP • STORE • STREAM • REGULAR EXPRESSION EXTRACTION • Group, Count, Joins • BAGS vs SETS? Sunday, April 29, 2012 © Sabre Holdings, 2012 12
  • 13. How can you forget this one? • Piggy Bank – Pig library for already defined functions Sunday, April 29, 2012 © Sabre Holdings, 2012 13
  • 14. Theoretical Summarization • Let us not be afraid of Swine Flu, We can still be friends with them. Sunday, April 29, 2012 © Sabre Holdings, 2012 14
  • 15. CASE STUDY – LOG Analytics • Apache Access Logs Let’s work on it! Sunday, April 29, 2012 © Sabre Holdings, 2012 15
  • 16. RESOURCES • Documentation – Apache Wiki (not enough) • Doubts –> Forums – Stack overflow is my favorite • Overview – Cloudera Video Training • Best Tutorial on internet: http://pig.apache.org/docs/r0.7.0/tutorial.ht ml Sunday, April 29, 2012 © Sabre Holdings, 2012 16