Pig Power Tools
a quick tour
By
Viswanath Gangavaram
Data Scientist
R&D, DSG, iLabs, [24]7 Inc.
4/27/2014 1
Pig provides a higher level of abstraction for data users, giving them access to the power and
flexibility of Hadoop without requiring them to write extensive data processing applications
in low-level Java code (MapReduce code). From the preface of “Programming Pig”
What we are going to cover
 A very short introduction to Apache Pig
 Using the Grunt shell to work with the Hadoop Distributed File System
 Advanced Pig operators (relational)
 Pig macros and modularity features
 Embedding Pig Latin in Python for iterative processing and other advanced tasks (SIMOD golden journeys)
 JSON parsing
 XML parsing
 UDFs (Jython)
 Pig streaming
 UDFs vs. streaming
 Custom load and store functions to handle data formats and storage mechanisms
 Single-row relations
 Python in Pig (bringing nltk, numpy, scipy, pandas into Pig)
 Lipstick
 Hue
 Performance tips
 External libraries
 Piggybank, DataFu, DataFu Hourglass, SimpleJson, Elephant Bird
Note: This is a general Pig tutorial; it makes minimal reference to any particular data set.
A short introduction to “Apache Pig” in five minutes
• Apache Pig is a high-level platform for executing data flows in parallel on Hadoop. The language for this
platform is called Pig Latin, which includes operators for many of the traditional data operations (join,
sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and
writing data.
– Pigs fly
• Pig processes data quickly. Its designers aim to consistently improve its performance, and not to
implement features in ways that weigh Pig down so it can't fly.
• What does it mean to be Pig?
– Pigs Eat Everything
• Pig can operate on data whether it has metadata or not. It can operate on data that is
relational, nested, or unstructured. And it can easily be extended to operate on data beyond
files, including key/value stores, databases, etc.
– Pigs Live Everywhere
• Pig is intended to be a language for parallel data processing. It is not tied to one
particular parallel framework. See, for example, Pig on Tez.
– Pigs Are Domestic Animals
• Pig is designed to be easily controlled and modified by its users.
• Pig allows integration of user code wherever possible, so it currently supports user-defined
field transformation functions, user-defined aggregates, and user-defined conditionals.
• Pig supports user provided load and store functions.
• It supports external executables via its stream command and Map Reduce jars via its
MapReduce command.
• It allows users to provide a custom partitioner for their jobs in some circumstances and to set
the level of reduce parallelism for their jobs.
Apache Pig: “Word counting is the hello world of MapReduce”
inputFile = LOAD 'mary' AS (line:chararray);
words = FOREACH inputFile GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
DUMP cntd;
Output:
(This, 2)
(is, 2)
(my, 2)
(first, 2)
(apache, 2)
(pig, 2)
(program, 2)
“mary” file content:
This is my first apache pig program
This is my first apache pig program
Apache Pig Latin: A data flow language
• Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs
should be read, processed, and then stored to one or more outputs in parallel.
• To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges
are data flows and the nodes are operators that process the data.
Comparing query languages (Hive/SQL) and data flow languages (Pig)
• After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are
certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to
form queries. It allows users to describe what question they want answered, but not how they want it
answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.
• Another major difference is that SQL is oriented around answering one question. When users want to do
several data operations together, they must either write separate queries, storing the intermediate data
into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of
the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also,
using subqueries creates an inside-out design where the first step in the data pipeline is the innermost
query.
• Pig, however, is designed with a long series of data operations in mind, so there is no need to write the
data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables.
• SQL is the English of data processing. It has the nice feature that everyone and every tool knows it, which
means the barrier to adoption is very low. Our goal is to make Pig Latin the native language of parallel
data-processing systems such as Hadoop. It may take some learning, but it will allow users to utilize the
power of Hadoop much more fully. - Extracted from “Programming Pig”
Pig’s Data types
 Scalar types
• int, long, float, double, chararray, bytearray
 Complex types
• Map
– A map in Pig is a chararray to data element mapping, where that element can be any Pig
type, including a complex type.
– The chararray is called a key and is used as an index to find the element, referred to as the
value.
– Map constants are formed using brackets to delimit the map, a hash between keys and
values, and a comma between key-value pairs.
» ['dept'#'dsg', 'team'#'r&d']
• Tuple
– A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into
fields, with each field containing one data element. These elements can be of any type.
– Tuple constants use parentheses to indicate the tuple and commas to delimit fields in
the tuple.
» ('boss', 55)
• Bag
– A bag is an unordered collection of tuples.
– Bag constants are constructed using braces, with the tuples in the bag separated by
commas.
» { ('a', 20), ('b', 20), ('c', 30) }
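These complex types can also be declared in a LOAD schema and accessed in a FOREACH; a minimal sketch (file and field names here are hypothetical):

```pig
-- Declare a map and a bag of tuples in the schema
emp = LOAD 'employees' AS (name:chararray,
                           props:map[chararray],
                           scores:{t:(subject:chararray, marks:int)});

-- '#' looks up a map value by key; FLATTEN turns bag tuples into rows
out = FOREACH emp GENERATE name, props#'dept' AS dept, FLATTEN(scores);
DUMP out;
```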
• Nulls
– Pig includes the concept of a data element being null. Data of any type can be null. A null data
element means the value is unknown. This might be because the data is missing, an error occurred
in processing it, etc.
• Schemas
– Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig’s philosophy of
eating anything
• Casts
Basic operators
1. LOAD
2. STORE
3. LIMIT
4. DEFINE
5. FOREACH
6. FILTER
7. DISTINCT
8. (CO)GROUP
9. JOIN
10. UNION
11. CROSS
12. ORDER BY
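These operators compose into a single data flow; a hedged sketch with made-up file and field names:

```pig
logs   = LOAD 'access_log' USING PigStorage('\t')
             AS (user:chararray, url:chararray, bytes:long);
big    = FILTER logs BY bytes > 1024L;
grpd   = GROUP big BY user;
stats  = FOREACH grpd GENERATE group AS user,
                               COUNT(big) AS hits,
                               SUM(big.bytes) AS total_bytes;
sorted = ORDER stats BY total_bytes DESC;
top10  = LIMIT sorted 10;
STORE top10 INTO 'heavy_users' USING PigStorage(',');
```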
Grunt shell
• Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and
provides a shell to interact with HDFS.
• Command line history, editing, Tab completion.
• No pipes, no redirection, and no background execution
• Grunt’s shell commands
– Shell for HDFS*
• fs -ls, fs -du, fs -stat, etc.
– Shell for Unix commands (working in the local directory)
• sh ls, sh cat
– exec
– run
– kill jobid
– set
– dump
– explain
– describe
• *: fs is the default
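A few of these in a Grunt session (paths and script names are illustrative):

```pig
grunt> fs -ls /user/data;         -- HDFS command; 'fs' is the default
grunt> sh ls;                     -- Unix command in the local directory
grunt> exec wordcount.pig;        -- run a script in a separate context
grunt> run wordcount.pig;         -- run a script in the current context
grunt> set default_parallel 10;   -- set the default reduce parallelism
```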
Advanced operators
1. ASSERT
2. CUBE
3. IMPORT
4. MAPREDUCE
5. ORDER BY
6. RANK
7. SAMPLE
8. SPLIT
9. STREAM
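A few of these in action (relation and field names are made up):

```pig
users = LOAD 'users' AS (name:chararray, age:int);

-- SPLIT routes rows into several relations in one pass
SPLIT users INTO minors IF age < 18, adults OTHERWISE;

-- SAMPLE keeps roughly the given fraction of rows
tenpct = SAMPLE adults 0.1;

-- RANK prepends a rank field, here ordered by age
ranked = RANK adults BY age DESC;
```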
Pig’s Debugging tools
 Use the DUMP operator to display results to your terminal screen.
 Use the DESCRIBE operator to review the schema of a relation.
 Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to
compute a relation.
 Use the ILLUSTRATE operator to view the step-by-step execution of a series of
statements.
Shortcuts for Debugging operators
 d alias - shortcut for DUMP. If alias is omitted, the last defined alias is used.
 de alias - shortcut for DESCRIBE. If alias is omitted, the last defined alias is used.
 e alias - shortcut for EXPLAIN. If alias is omitted, the last defined alias is used.
 i alias - shortcut for ILLUSTRATE. If alias is omitted, the last defined alias is used.
 q - quit the Grunt shell
JSON Parsing
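One common approach is Elephant Bird's JsonLoader, which yields each record as a single map of field names to values (the jar name and map keys below are assumptions, not from this deck):

```pig
REGISTER 'elephant-bird-pig-4.1.jar';  -- plus its dependency jars

events = LOAD 'events.json'
         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
         AS (json:map[]);

-- Pull fields out of the map by key
out = FOREACH events GENERATE json#'user' AS user, json#'ts' AS ts;
```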
XML Parsing
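Piggybank's XMLLoader is the usual starting point: it emits the text of each matching element as one chararray, which the script then picks apart (the element and field names here are hypothetical):

```pig
REGISTER 'piggybank.jar';

raw = LOAD 'data.xml'
      USING org.apache.pig.piggybank.storage.XMLLoader('record')
      AS (doc:chararray);

-- Regex extraction works for flat, well-formed elements
ids = FOREACH raw GENERATE REGEX_EXTRACT(doc, '<id>(.*)</id>', 1) AS id;
```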
User Defined Functions
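A Jython UDF is a Python function carrying an @outputSchema decorator, registered from the Pig script; a minimal sketch (udfs.py and reverse are hypothetical names):

```pig
-- udfs.py (a Jython file) would contain:
--   @outputSchema("rev:chararray")
--   def reverse(s):
--       return None if s is None else s[::-1]

REGISTER 'udfs.py' USING jython AS myfuncs;

words = LOAD 'words' AS (w:chararray);
revd  = FOREACH words GENERATE w, myfuncs.reverse(w);
```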
Pig Streaming
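STREAM pipes each row of a relation through an external program, and SHIP copies the script to every task node (cleanup.py is a hypothetical stdin-to-stdout filter):

```pig
DEFINE cleaner `python cleanup.py` SHIP('cleanup.py');

raw   = LOAD 'raw_text' AS (line:chararray);
clean = STREAM raw THROUGH cleaner AS (line:chararray);
STORE clean INTO 'clean_text';
```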
UDFs vs. Pig streaming
Cython in Pig (bringing nltk, numpy, scipy, pandas into Pig)
Lipstick: Let’s add some color to Pig
Hue: Hadoop and its ecosystem in the browser
Piggybank
DataFu
DataFu Hourglass
SimpleJson
Elephant Bird
So what is Pig?
Pig is a champion