Introduction to Apache Pig

HADOOP SESSION-4: Introduction to Pig

Pig is a framework for analyzing large datasets that sits on top of Hadoop. It allows users to write scripts for processing data in a simple query language called Pig Latin. Pig provides built-in functions and libraries for common tasks like joins, filters, and aggregations. It aims to make analyzing large datasets with MapReduce easier for users than writing Java code. The document then provides an example case study of using Pig to analyze Apache access logs and lists some resources for learning more about Pig.

Session Outline

What is Pig?
Motivation
Background
Components & Architecture
Pig & Map-Reduce
Case Study – Log Analytics
Conclusion

Sunday, April 29, 2012 © Sabre Holdings, 2012 2

What is Pig?

Framework for analyzing large data sets
Sits on top of Hadoop

Pig has map-reduce powers!


Pig Food?
Pig has a great taste for structured and unstructured data.

CSVs, TSVs, and other delimited data
Any kind of logs
Unstructured sentences
Databases via JDBC connections

Pig Language?

Pig understands Pig Latin (a simple query algebra)
- Data flow language
- Interdependent series of operations
- Handles ELT workloads very effectively
- Filtering, aggregations, applying functions

Pig is not Racist!!

Pig Streaming
- Pig streaming lets Pig's food (its data) interact with alien (external) scripts and binaries

A = LOAD 'log.txt';
C = STREAM A THROUGH `extractor.pl`;
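
A minimal sketch of what such an external script could look like, written in Python rather than Perl. The function names `extract` and `run`, the choice of output fields (client host and status code), and the sample log lines are assumptions made for illustration; none of them come from the slides. By default, Pig streaming hands each record to the script as one tab-delimited line on standard input and reads tab-delimited lines back from standard output.

```python
import io


def extract(line):
    """Return (host, status) from one whitespace-separated log line, or None."""
    fields = line.split()
    if len(fields) < 2:
        return None  # blank or malformed line: skip it
    # Assumption: host is the first field, status is the second-to-last one.
    return fields[0], fields[-2]


def run(infile, outfile):
    """Read records line by line and write tab-delimited results."""
    for line in infile:
        record = extract(line)
        if record is not None:
            outfile.write("\t".join(record) + "\n")


# Demo with made-up input; an in-memory buffer stands in for stdin/stdout.
demo_in = io.StringIO(
    '10.0.0.1 - - [29/Apr/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512\n'
    "\n"
    '10.0.0.2 - - [29/Apr/2012:10:00:01 +0000] "GET /missing HTTP/1.1" 404 0\n'
)
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

When used as a real streaming script, the demo lines would be replaced by a call to `run(sys.stdin, sys.stdout)` after `import sys`.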


Pig vs Traditional Map-Reduce
(Challenges/Solutions)

• Problem (resources): Map-Reduce requires Java programmers.
• Solution: Users familiar with scripting languages like Python/Perl can easily write Pig scripts.

• Problem (time): Map-Reduce involves multiple stages to arrive at a solution.
• Solution: 100 lines of Java ~ 10 lines of Pig; 4 hours of Java programming ~ 15 minutes of Pig programming.

• Problem (baked-in features): In Map-Reduce, users have to reinvent common functionality like Join/Cross/Filter.
• Solution: Programmers can leverage built-in libraries and functions for Join, Regex Extraction, etc.

Appetite!

Pigs can digest huge datasets
- Batch Log Processing

NOTE:
Do NOT feed small datasets to Pig. It gets angry.

Winner in Map-Reduce Race! (1.1x)
If Pig was first, who was second?

Any Guesses?


How to Access Pig?

Local Mode: runs in a single JVM against the local file system (pig -x local)
MapReduce Mode: runs on a Hadoop cluster against HDFS (pig -x mapreduce, the default)

Let’s Ride a Pig
• LOAD
• FOREACH ... GENERATE
• FILTER
• DUMP
• STORE
• STREAM
• Regular expression extraction
• GROUP, COUNT, JOIN
• Bags vs. sets?
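
To make the data flow idea concrete, here is a rough Python analogy of how a few of these operators chain together. It is only a sketch: the rows are made-up sample data, the relation names (`logs`, `ok`, `grp`, `counts`) are hypothetical, and the Pig Latin shown in the comments is the statement each step roughly corresponds to, not code taken from the slides.

```python
from collections import defaultdict

# Made-up rows standing in for a loaded relation of (ip, status) tuples.
# Roughly: logs = LOAD 'access.tsv' AS (ip:chararray, status:int);
logs = [
    ("10.0.0.1", 200),
    ("10.0.0.2", 404),
    ("10.0.0.1", 200),
    ("10.0.0.3", 200),
]

# Roughly: ok = FILTER logs BY status == 200;
ok = [row for row in logs if row[1] == 200]

# Roughly: grp = GROUP ok BY ip;
# Each group holds a bag of tuples; unlike a set, a bag keeps duplicates.
grp = defaultdict(list)
for row in ok:
    grp[row[0]].append(row)

# Roughly: counts = FOREACH grp GENERATE group, COUNT(ok);
counts = {ip: len(bag) for ip, bag in grp.items()}

# Roughly: DUMP counts;
for ip, n in sorted(counts.items()):
    print(ip, n)
```

The group for `10.0.0.1` contains the same tuple twice, which is the practical difference behind the "bags vs. sets" question: a set would collapse the two into one and the count would be wrong.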


How can you forget this one?
• Piggy Bank
– A library of ready-made, user-contributed functions for Pig

Theoretical Summarization

• Let us not be afraid of swine flu. We can still be friends with pigs.

CASE STUDY – LOG Analytics

• Apache Access Logs

Let’s work on it!
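
The slides do not include the case study code itself. As a hedged illustration of the kind of regular expression extraction and counting such an analysis involves, the Python sketch below parses lines in the Apache Common Log Format and counts successful hits per path. The pattern, the function names `parse_line` and `hits_per_path`, and the sample lines are assumptions for illustration only.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def hits_per_path(lines):
    """Count requests with status 200, grouped by requested path."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record["status"] == 200:
            counts[record["path"]] += 1
    return counts


# Made-up sample lines; the last one is deliberately malformed.
SAMPLE = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '10.0.0.2 - - [29/Apr/2012:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.3 - - [29/Apr/2012:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.3 - - [29/Apr/2012:10:00:03 +0000] "GET /missing HTTP/1.1" 404 0',
    "not a log line",
]

for path, n in hits_per_path(SAMPLE).most_common():
    print(path, n)
```

In Pig, the same flow would be expressed as a LOAD, a regular expression extraction, a FILTER on the status field, and a GROUP with COUNT.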


RESOURCES

• Documentation – Apache Wiki (not enough)
• Questions –> Forums
– Stack Overflow is my favorite
• Overview
– Cloudera Video Training
• Best tutorial on the internet:
http://pig.apache.org/docs/r0.7.0/tutorial.html
