CSC 5800:
Intelligent Systems: Algorithms and Tools

Pig Latin: A Not-So-Foreign
Language for Data Processing

By Siddharth Mathur

1
What we will be covering
 Introduction
 MapReduce Overview
 Pig Overview
 Pig Features
 Pig Latin
 Pig Debugger
 Demo

2
Introduction
 Enormous data

 Innovation critically depends on analyzing the terabytes of
data collected every day
 SQL can solve structured-data problems

 Parallel database processing
– Data is enormous and can't be analyzed serially.
– It has to be analyzed in parallel.
– Shared-nothing clusters are the way to go.

3
Parallel DB Products
 Teradata, Oracle RAC, Netezza
 Expensive at web scale
 Programmers have to write complex SQL queries;
because of this, declarative programming is not preferred

4
Procedural programming
 Map-Reduce programming model
 It can easily perform a group-by aggregation in parallel
over a cluster of machines
 The programmer provides a map function, which is used
to filter or transform the input records
 The reduce function performs the aggregation
 Appealing to programmers because only two high-level
functions are needed to enable parallel processing

5
MapReduce Overview
 Programming Model
– Caters to large-scale data analytics
– Works on Hadoop
– Java based
– Splits data into independent chunks and processes them
in parallel
 Program structure

– Mapper
– Reducer
– Driver Program

6
MapReduce Driver Program
 Works as the 'main' function for an MR job
 Takes care of
– Number of arguments
– Input Data Location
– Input Data Types
– Output Data Location
– Output Data types

– Number of Mappers
– Number of Reducers

7
Mapper and Reducer Class
 Mapper Class
– Main task is to perform the user-defined processing logic
– Computes tasks like:
• Filtering
• Splitting
• Tokenizing
• Transforming

 Reducer Class
– Works as an aggregator
– Aggregates the intermediate results gathered from
Mapper
8
Word Count Execution

Data flow: Input → Map → Shuffle & Sort → Reduce → Output

Input (one line per map task):
"the quick brown fox"
"the fox ate the mouse"
"how now brown cow"

Map output (one (word, 1) pair per token), e.g.:
(the, 1) (quick, 1) (brown, 1) (fox, 1) …

Shuffle & Sort groups the pairs by word; each Reduce task
sums the counts for the words assigned to it.

Output:
(ate, 1) (brown, 2) (cow, 1) (fox, 2) (how, 1)
(mouse, 1) (now, 1) (quick, 1) (the, 3)

9
MapReduce Word Count Program
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts collected for each word across all mappers
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
10
Map Reduce Limitations
 The one-input, two-stage data flow is extremely rigid.
– To perform tasks like joins or iterative computations,
workarounds have to be devised.
– Custom code is needed for common tasks like filtering,
transforming, or projection
– The code is difficult to reuse and maintain
 Moreover, its own data types, its rigid workflow, and the
fact that programmers have to learn Java make it a tough
choice to take.

11
Pig
 An Apache open source project.
 Provides an engine for executing data flows in parallel on
Hadoop.
 Includes a language called 'Pig Latin' for expressing
these data flows.
 A high-level data-flow language.
 It has the best of both worlds:
– High-level declarative querying, like SQL
– Low-level procedural programming, like MapReduce

12
Hadoop Stack

Data Processing Layer:     Pig, Hive, HBase, …
                           Hadoop MR
Resource Management Layer: Hadoop YARN
Storage Layer:             HDFS
13
Why Choose Pig
 Written like SQL, compiled into MapReduce
 Fully nested data model
 Extensive support for UDFs
 Can answer multiple questions in one single workflow.
A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './output';

14
Features and Motivation
 The design goal of Pig is to give programmers an
appealing experience for performing ad-hoc analysis of
extremely large data sets.
– Data-flow language

– Quick start and interoperability
– Nested data model
– UDFs
– Debugging environment

15
Data Flow Language
 Each step specifies a single high-level data
transformation
 Different from SQL, where the whole query describes only
a single final output.

 This gives the system the opportunity to optimize the
plan, e.g., by reordering steps.
– Example:
A = LOAD 'input.txt';

B = FILTER A BY UDF(column1);
C = FILTER B BY column1 > 0.8;

16
Quick start and Interoperability
 Data Load
– Capability for ad-hoc analysis
– Can run queries directly on data dumped from search
engines
– The user just has to provide a function that tells Pig how
to parse the content of the file into tuples.
– Similarly for output
• Any output format.
• These functions can be reused.
• Output can be used for visualization or dumped to Excel directly

17
Pig as part of workflow
 Pig easily becomes a part of workflow eco-system
– Can take most of the input types
– Can output in many of the forms
– Doesn't take over the data, i.e., it does not lock the
data that is being processed.
– Read only data analysis

18
Optional data schemas
 Schema can be provided by the user:
– In the beginning
– On the fly

– Example:
• A = LOAD 'input.txt' AS (column1, column2);
• B = FILTER A BY column1 > 5;

 If the schema is not provided, then the columns can be
referred to as $0, $1, $2, … for the 1st, 2nd, 3rd column,
etc.
 Example:

 A = LOAD 'input.txt';
 B = FILTER A BY $0 > 5;
19
Nested Data Model
 Suppose, for a document, we want to extract each term and
its positions.
 Format of output: Map<document, Set<position>>
 SQL data model:

Term | Document ID | Position
-----+-------------+---------
Hi   | 1           | 2
Hi   | 1           | 5

 Or keep it in normalized form, i.e.,
– term_info(termid, string)
– position_info(termid, position, document)

20
Problem resolved using Pig
 In Pig, complex data types like map, tuple, or bag can
occur as a field of a table itself.
 Example:

Term | Document ID | Position
-----+-------------+-------------
Hi   | 1           | (2, 5, 8, …)

 This approach is good because it is closer to what a
programmer thinks.
 Data is stored on disk in the same nested fashion
 It makes it easier for users to write UDFs.

21
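The nested model shows up directly in Pig Latin schemas. A sketch of declaring the term-position table above with a bag-valued column; the file and field names here are hypothetical:

```pig
-- Each input row carries a term, a document id, and a bag of positions,
-- e.g.:  hi    1    {(2),(5),(8)}
terms = LOAD 'term_positions.txt'
        AS (term:chararray, docId:int, positions:bag{t:(pos:int)});
```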
UDFs
 A significant part of data analysis is custom processing
 For example, a user might want to perform natural
language stemming
 Or check whether a page is spam, among many other
tasks
 For this, Pig Latin has extensive support for UDFs;
most such tasks can be solved with UDFs
 A UDF can take non-atomic inputs and produce non-atomic outputs as well
 Currently, UDFs can be written in Java or Python

22
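To make the role of UDFs concrete, a small sketch; the jar name and both function names are made up for illustration:

```pig
-- Register a jar of user-defined functions, then apply them
REGISTER myudfs.jar;
stems = FOREACH docs GENERATE myudfs.Stem(word);      -- hypothetical stemming UDF
clean = FILTER pages BY NOT myudfs.IsSpam(content);   -- hypothetical spam-check UDF
```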
Debugging Environment
 In any language, getting a data processing program to work
correctly usually takes many iterations
 The first few iterations mostly produce errors
 With large-scale data, this would result in serious time
and resource wastage
 Debuggers can help
 Pig has a novel debugging environment
 It generates concise examples from the input data
 The data samples are carefully chosen to resemble the real
data as far as possible
 The sample data is specially constructed for this purpose
23
Pig Latin
 The language in which data workflow statements are written
 It runs on a shell called 'Grunt'
 It has a shared repository named Piggybank
 We can create our own custom UDFs and add them to
Piggybank

24
Data Model
 A rich, yet simple data model
 Atom

– A simple atomic value like a string or a number
 Tuple
– A collection of fields, each of which can be of any data
type
– Analogous to rows in SQL
 Bag
– A collection of tuples, or of both tuples and atoms
– Can also be heterogeneous

25
Data Model (cont.)

 Example of a relation:

T = ('alice', (labours, 1), {('ipod', 2), 'james'})

– 'alice' is an atom
– (labours, 1) is a tuple; tuples are represented with round braces
– {('ipod', 2), 'james'} is a bag; bags are represented with curly braces
26
Specifying Input Data : LOAD
 It's the first step in a Pig Latin program
 Specifies what the input files are
 And how their contents are to be deserialized, i.e., converted to
the Pig data model.

 LOAD command
– Example
queries = LOAD 'query_log.csv'
USING PigStorage(',')
AS (userId, queryString, timestamp);

27
LOAD (cont.)
 Both the USING clause and the AS clause are optional
 We can work without them, as shown earlier ($0 for the first
field)
 PigStorage is a pre-defined load function

 A custom function can be used instead of PigStorage

28
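As a sketch of the two options side by side; the custom loader class name here is made up for illustration:

```pig
-- Built-in loader: comma-delimited text via PigStorage
queries = LOAD 'query_log.csv' USING PigStorage(',');

-- Custom loader supplied by the user (class name is hypothetical)
raw = LOAD 'search_dump' USING myloaders.SearchLogLoader();
```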
Per Tuple Processing : FOREACH
 Similar to a FOR statement
 It's used to apply processing to each tuple of
the dataset

 Example
– expanded_query = FOREACH queries GENERATE
userId, Expand(queryString), timestamp;

 It's not a filtering command
 Expand can take an atomic input and generate a bag of
outputs

29
Per Tuple Processing : FOREACH(cont.)
 The semantics of FOREACH are such that there is no
dependency between different tuples of the input, thereby
permitting an efficient parallel implementation

30
Discarding Unwanted Data : FILTER
 Works like a WHERE clause
 Any expression can be provided
– query = FILTER queries BY user_id neq 'bot';

 A UDF can also be provided, like
– query = FILTER queries BY IsBot(user_id);

31
COGROUP
 Similar to JOIN, but keeps the tuples of each input in
separate nested bags instead of flattening them
 Groups bags of the different inputs together
 Eases the use of UDFs
– grouped_data = COGROUP results BY querystring, revenue BY
querystring;

32
JOIN
 Not all users want to use COGROUP
 Simple equi-join is all that is required
– Example
Join_result = JOIN results by querystring,
revenue by querystring;

 Other types of join are also supported:
– Left outer
– Right outer
– Full outer

33
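The outer-join variants use the same shape with an extra keyword; a sketch reusing the relations from the slide:

```pig
left_result  = JOIN results BY querystring LEFT OUTER,  revenue BY querystring;
right_result = JOIN results BY querystring RIGHT OUTER, revenue BY querystring;
full_result  = JOIN results BY querystring FULL OUTER,  revenue BY querystring;
```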
Other Commands
 Relational Operators
– UNION
– CROSS
– ORDER
– DISTINCT
– LIMIT
 Eval Functions

– CONCAT
– COUNT
– DIFF
34
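A short sketch chaining some of these relational operators; the relation and field names are reused from the earlier LOAD example:

```pig
users   = FOREACH queries GENERATE userId;
uniq    = DISTINCT users;         -- drop duplicate tuples
ordered = ORDER uniq BY userId;   -- globally sort the relation
top10   = LIMIT ordered 10;       -- keep only the first 10 tuples
```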
PARALLEL clause
 It is used to increase the parallelism of the job
 We can specify the number of reduce tasks of the MR
jobs created by Pig
 It only affects the reduce tasks

 No control over the map side
 The system can also figure out the number of reducers itself
 Without it, often only one reduce task is used

35
PARALLEL clause (cont.)
 Can be applied only to those commands that trigger a
reduce phase
– COGROUP
– CROSS

– DISTINCT
– GROUP
– JOINS

– ORDER
A = LOAD 'file1';
B = LOAD 'file2';
C = CROSS A, B PARALLEL 10;
36
Split Clause
 We can split an input relation into several by providing
conditions
A = LOAD 'data' AS (F1:int, F2:int, F3:int);

A contains: (1,2,3) and (2,5,7)

SPLIT A INTO B IF F3 < 7, C IF F2 == 5;

B: (1,2,3)

C: (2,5,7)

 Any expression can be written
 UDFs can be used
 It is not partitioning: the conditions need not be mutually
exclusive, so a tuple can go to several outputs or to none
37
Output
 There are two ways to display
– STORE
• If you want to store the output at some location
STORE output_1 INTO 'hadoopuser/output';

– DUMP
• Basically used to display the result in the Grunt
shell itself
• Dumping doesn't store the output anywhere
DUMP query_result;

38
Building a Logical Plan
 The Pig interpreter first parses all the commands that the
client issues
 It verifies that the input files, bags, or columns referred to
by the command are valid
 It builds a logical plan for every bag the user defines
 No processing is carried out yet
 Processing is triggered only when the user invokes a
STORE or DUMP command
 This is called a lazy execution approach
 It helps with optimizations such as FILTER reordering

39
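A sketch of the lazy behavior (file names hypothetical): the first three statements only extend the logical plan; compilation and execution happen when STORE is reached.

```pig
A = LOAD 'input.txt';      -- logical plan only, nothing runs
B = FILTER A BY $0 > 5;    -- still nothing runs
C = GROUP B BY $1;         -- still nothing runs
STORE C INTO 'out';        -- the whole plan is now optimized and executed
```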
Debugging Environment
 This is used to avoid running the complete code on the
entire dataset
 The user can create a sample dataset
 But it is difficult to tailor such datasets by hand, and users
end up with self-cooked data
 Pig Pen is Pig's debugging environment
 It creates a side dataset automatically, called the sandbox
dataset
 Pig Pen has its own user interface

40
Pig Pen

 Outputs can be easily analyzed
 Errors can be rectified earlier
41
Future Work
 User Interface
– A drag-and-drop style would help
– Easier creation of logical-plan diagrams
 UDF support for other languages
 Unified Environment
– Currently lacks control structures like loops
– Pig has to be embedded in a host language for all
iterative tasks

42
Summary
 A not-so-foreign language
 Aims at a sweet spot between SQL and MapReduce
 Reusable and easy to use
 A novel debugging environment: Pig Pen
 Pig has an active and growing user base at Yahoo!
 Pigs
– Eat anything

– Live anywhere
– Are domestic

43
44
Based on "Pig Latin: A Not-So-Foreign Language for Data
Processing"
SIGMOD '08, June 9–12, 2008, Vancouver, BC, Canada

Christopher Olston, Benjamin Reed, Utkarsh Srivastava,
Ravi Kumar, and Andrew Tomkins (Yahoo! Research)

45
References
 http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
 http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html
 Book: Programming Pig
 http://www.brentozar.com/archive/2011/11/good-pig/
 http://hortonworks.com/hadoop/pig/

46

More Related Content

What's hot

Unit 3 writable collections
Unit 3 writable collectionsUnit 3 writable collections
Unit 3 writable collectionsvishal choudhary
 
R tutorial for a windows environment
R tutorial for a windows environmentR tutorial for a windows environment
R tutorial for a windows environmentYogendra Chaubey
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce AnandMHadoop
 
Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Flink Forward
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programmingizahn
 
Unit 5-hive data types – primitive and complex data
Unit 5-hive data types – primitive and complex dataUnit 5-hive data types – primitive and complex data
Unit 5-hive data types – primitive and complex datavishal choudhary
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)RichardWarburton
 
Hadoop源码分析 mapreduce部分
Hadoop源码分析 mapreduce部分Hadoop源码分析 mapreduce部分
Hadoop源码分析 mapreduce部分sg7879
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in RFlorian Uhlitz
 

What's hot (20)

Unit 3 writable collections
Unit 3 writable collectionsUnit 3 writable collections
Unit 3 writable collections
 
R tutorial for a windows environment
R tutorial for a windows environmentR tutorial for a windows environment
R tutorial for a windows environment
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
 
Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Unit 3 lecture-2
Unit 3 lecture-2Unit 3 lecture-2
Unit 3 lecture-2
 
Collections forceawakens
Collections forceawakensCollections forceawakens
Collections forceawakens
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
Unit 5-hive data types – primitive and complex data
Unit 5-hive data types – primitive and complex dataUnit 5-hive data types – primitive and complex data
Unit 5-hive data types – primitive and complex data
 
Gur1009
Gur1009Gur1009
Gur1009
 
Unit 4 lecture-3
Unit 4 lecture-3Unit 4 lecture-3
Unit 4 lecture-3
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Hadoop源码分析 mapreduce部分
Hadoop源码分析 mapreduce部分Hadoop源码分析 mapreduce部分
Hadoop源码分析 mapreduce部分
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 

Similar to Apache pig presentation_siddharth_mathur

Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...DrPDShebaKeziaMalarc
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdfssuser92282c
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 

Similar to Apache pig presentation_siddharth_mathur (20)

Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
 
pig.ppt
pig.pptpig.ppt
pig.ppt
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
4.1-Pig.pptx
4.1-Pig.pptx4.1-Pig.pptx
4.1-Pig.pptx
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
04 pig data operations
04 pig data operations04 pig data operations
04 pig data operations
 
Lect1.pptx
Lect1.pptxLect1.pptx
Lect1.pptx
 
unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdf
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 

Recently uploaded

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 

Recently uploaded (20)

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx

Apache pig presentation_siddharth_mathur

  • 1. CSC 5800: Intelligent Systems: Algorithms and Tools. Pig Latin: A Not-So-Foreign Language for Data Processing. By Siddharth Mathur
  • 2. What we will be covering  Introduction  MapReduce Overview  Pig Overview  Pig Features  Pig Latin  Pig Debugger  Demo 2
  • 3. Introduction  Enormous data  Innovation critically depends on analyzing the terabytes of data collected every day  SQL can solve structured-data problems  Parallel database processing – The data is too large to be analyzed serially – It has to be analyzed in parallel – Shared-nothing clusters are the way to go
  • 4. Parallel DB Products  Teradata, Oracle RAC, Netezza  Expensive at web scale  Programmers have to write complex SQL queries; because of this, declarative programming is not preferred
  • 5. Procedural Programming  Map-Reduce programming model  It can easily perform a group-by aggregation in parallel over a cluster of machines  The programmer provides map functions, which act as filtering or transforming methods  The reduce function performs the aggregation  Appealing to the programmer because only two high-level functions are needed to enable parallel processing
  • 6. MapReduce Overview  Programming model – Caters to large-scale data analytics – Works over Hadoop – Java based – Splits data into independent chunks and processes them in parallel  Program structure – Mapper – Reducer – Driver program
  • 7. MapReduce Driver Program  Acts as the 'main' function for an MR job  Takes care of – Number of arguments – Input data location – Input data types – Output data location – Output data types – Number of mappers – Number of reducers
  • 8. Mapper and Reducer Class  Mapper class – Performs the per-record function logic – Handles tasks like: • Filtering • Splitting • Tokenizing • Transforming  Reducer class – Works as an aggregator – Aggregates the intermediate results gathered from the Mapper
  • 9. Word Count Execution  Input: "the quick brown fox", "the fox ate the mouse", "how now brown cow"  Map: each line is tokenized into (word, 1) pairs, e.g. (the, 1) (quick, 1) (brown, 1) (fox, 1)  Shuffle & Sort: pairs with the same word are routed to the same reducer  Reduce: the counts per word are summed  Output: (ate, 1) (brown, 2) (cow, 1) (fox, 2) (how, 1) (mouse, 1) (now, 1) (quick, 1) (the, 3)
  • 10. MapReduce Word Count Program
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one); // emit (word, 1) for each token
    }
  }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get(); // sum all counts for this word
    }
    context.write(key, new IntWritable(sum));
  }
}
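The same dataflow can be sketched outside Hadoop. The following is a minimal Python simulation of the three phases (map, shuffle and sort, reduce) applied to the example lines from the word-count slide; it models the semantics only, not Hadoop's distributed execution.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair per token, mirroring the Mapper logic
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate values by key, as Hadoop's shuffle & sort does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one key, mirroring the Reducer logic
    return key, sum(values)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["the"], counts["brown"])  # prints: 3 2
```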
  • 11. MapReduce Limitations  The one-input, two-stage data flow is extremely rigid – To perform a task like a join or an iterative computation, a workaround has to be devised – Custom code is needed for common tasks like filtering, transforming, or projection – The code is difficult to reuse and maintain  Moreover, its own data types, its workflow, and the fact that people have to learn Java make it a tough choice
  • 12. Pig  An Apache open-source project  Provides an engine for executing data flows in parallel on Hadoop  Includes a language called 'Pig Latin' for expressing these data flows  A high-level data workflow language  It has the best of both worlds: – High-level declarative querying, like SQL – Low-level procedural style, like MapReduce
  • 13. Hadoop Stack  Data processing layer: Pig, Hive, HBase, Hadoop MR, …  Resource management layer: Hadoop YARN  Storage layer: HDFS
  • 14. Why Choose Pig  Written like SQL, compiled into MapReduce  Fully nested data model  Extensive support for UDFs  Can answer multiple questions in one single workflow
A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './output';
  • 15. Features and Motivation  The design goal of Pig is to give programmers an appealing experience for performing ad-hoc analysis of extremely large data sets – Dataflow language – Quick start and interoperability – Nested data model – UDFs – Debugging environment
  • 16. Data Flow Language  Each step specifies a single high-level data transformation  Different from SQL, where everything is expressed as a single query  Exposing the individual steps gives the system an opportunity to optimize, e.g. by reordering filters – Example: A = LOAD 'input.txt'; B = FILTER A BY UDF(Column1); C = FILTER B BY Column1 > 0.8;
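Why filter reordering matters can be sketched in a few lines of Python. The `expensive_udf` below is a hypothetical stand-in for a costly user-defined predicate; evaluating the cheap comparison first yields the same result with fewer UDF calls (a reordering that is only safe when the UDF is side-effect-free).

```python
calls = {"udf": 0}

def expensive_udf(x):
    # Hypothetical costly predicate; we count invocations to compare plans
    calls["udf"] += 1
    return x != 0.99

data = [0.1, 0.95, 0.99, 0.85, 0.2]

# Plan as written: UDF filter first, then the cheap comparison
out1 = [x for x in data if expensive_udf(x) and x > 0.8]
naive_calls = calls["udf"]

# Reordered plan: cheap comparison first, UDF only on the survivors
calls["udf"] = 0
out2 = [x for x in data if x > 0.8 and expensive_udf(x)]

assert out1 == out2  # same answer, fewer UDF calls (3 instead of 5)
```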
  • 17. Quick Start and Interoperability  Data load – Capability for ad-hoc analysis – Can run queries directly on data from a search-engine dump – You just have to provide a function that tells Pig how to parse the content of the file into tuples – Similarly for output: • Any output format • These functions can be reused • Output can be visualized or dumped to Excel directly
  • 18. Pig as Part of a Workflow  Pig easily becomes part of a workflow ecosystem – Can consume most input types – Can emit output in many forms – Doesn't take over the data, i.e., it does not lock the data being processed – Read-only data analysis
  • 19. Optional Data Schemas  A schema can be provided by the user: – In the beginning – On the fly – Example: • A = LOAD 'input.txt' AS (Column1, Column2); • B = FILTER A BY Column1 > 5;  If no schema is provided, the columns can be referred to as $0, $1, $2, … for the 1st, 2nd, 3rd column, etc.  Example: A = LOAD 'input.txt'; B = FILTER A BY $0 > 5;
  • 20. Nested Data Model  Suppose, for a document, we want to extract each term and its positions  Desired output format: Map<document, Set<position>>  SQL data model, one row per occurrence: (Term=Hi, DocumentID=1, Position=2), (Term=Hi, DocumentID=1, Position=5)  Or keep it in normalized form, i.e., – term_info(termId, termString) – position_info(termId, position, documentId)
  • 21. Problem Resolved Using Pig  In Pig, complex data types like map, tuple, or bag can occur as a field of a table itself  Example: (Term=Hi, DocumentID=1, Positions=(2, 5, 8, …))  This approach is good because it is closer to what a programmer thinks  Data is stored on disk in the same nested fashion  It makes writing UDFs easier
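The difference between the two models can be shown concretely. This sketch collapses normalized SQL-style rows into the nested, Pig-style representation where all positions for a (term, document) pair live in one bag:

```python
# Normalized, SQL-style: one row per (term, document, position)
rows = [("Hi", 1, 2), ("Hi", 1, 5), ("Bye", 1, 7)]

# Nested, Pig-style: positions collected into one bag per (term, document)
nested = {}
for term, doc, pos in rows:
    nested.setdefault((term, doc), set()).add(pos)

# A UDF now sees all positions for a term at once instead of scattered rows
assert nested[("Hi", 1)] == {2, 5}
```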
  • 22. UDFs  A significant part of data analysis is custom processing  For example, a user might want to perform natural-language stemming  Or check whether a page is spam, among many other tasks  For this, Pig Latin has extensive support for UDFs; most such tasks can be solved with them  A UDF can take non-atomic input and produce non-atomic output  Currently, UDFs can be written in Java or Python
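As a sketch of the spam-checking example, here is the kind of logic a Python UDF body might hold. The marker list and function name are illustrative only; real spam detection is far more involved.

```python
# Toy "does this query look like spam?" predicate (illustrative only)
SPAM_MARKERS = ("free $$$", "click here", "viagra")

def is_spam(query):
    # Case-insensitive check for any known spam marker
    q = query.lower()
    return any(marker in q for marker in SPAM_MARKERS)

assert is_spam("CLICK HERE for prizes")
assert not is_spam("pig latin tutorial")
```

In Pig, a Python script like this can be registered and called from a dataflow, along the lines of `REGISTER 'udfs.py' USING jython AS myfuncs;` followed by `FILTER queries BY NOT myfuncs.is_spam(queryString);` (the file and alias names here are assumptions for illustration).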
  • 23. Debugging Environment  In any language, getting a data processing program to work correctly usually takes many iterations  The first few iterations mostly produce errors  With large-scale data this results in serious waste of time and resources  Debuggers can help  Pig has a novel debugging environment  It generates concise examples from the input data  The sample data is carefully chosen to resemble the real data as closely as possible
  • 24. Pig Latin  The language in which data workflow statements are written  It runs on a shell called 'Grunt'  It has a shared UDF repository named Piggybank  We can create our own custom UDFs and add them to Piggybank
  • 25. Data Model  A rich, yet simple, data model  Atom – A simple atomic value like a string or a number  Tuple – A collection of fields, each of which can be of any data type – Analogous to rows in SQL  Bag – A collection of tuples, or of tuples and atoms – Can be heterogeneous
  • 26. Data Model (cont.)  Example of a relation: T = ('alice', (labours, 1), {('ipod', 2), 'james'}) – here 'alice' is an atom, (labours, 1) is a tuple, and {('ipod', 2), 'james'} is a (heterogeneous) bag  A tuple is written with round braces  A bag is written with curly braces
  • 27. Specifying Input Data: LOAD  It is the first step in a Pig Latin program  Specifies what the input files are  And how their contents are to be deserialized, i.e., converted to the Pig data model  LOAD command – Example: queries = LOAD 'query_log.csv' USING PigStorage(',') AS (userId, queryString, timestamp);
  • 28. LOAD (cont.)  Both the USING clause and the AS clause are optional  We can work without them, as shown earlier ($0 for the first field)  PigStorage is a pre-defined load function  A custom function can be used instead of PigStorage
  • 29. Per-Tuple Processing: FOREACH  Similar to a FOR statement  It is used to apply processing to each tuple of the dataset  Example – expanded_query = FOREACH queries GENERATE userId, Expand(queryString), timestamp;  It is not a filtering command  Expand can take atomic input and generate a bag of outputs
  • 30. Per-Tuple Processing: FOREACH (cont.)  The semantics of FOREACH are such that there is no dependency between different input tuples, permitting an efficient parallel implementation
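The per-tuple independence can be sketched as a plain map over tuples. The `expand` function here is a hypothetical stand-in for the Expand UDF from the slide; because each output tuple depends only on its own input tuple, the comprehension could be evaluated on any number of workers in any order.

```python
def expand(query_string):
    # Hypothetical stand-in for the Expand UDF: one output token per word
    return query_string.split()

# Tuples of (userId, queryString, timestamp)
queries = [(1, "cheap flights", 100), (2, "pig latin", 200)]

# FOREACH queries GENERATE userId, Expand(queryString), timestamp
expanded = [(uid, expand(q), ts) for uid, q, ts in queries]

assert expanded[0] == (1, ["cheap", "flights"], 100)
```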
  • 31. Discarding Unwanted Data: FILTER  Acts like a WHERE clause  Any expression can be provided in the condition – query = FILTER queries BY user_id neq 'bot';  A UDF can also be provided, e.g. – query = FILTER queries BY IsBot(user_id);
  • 32. COGROUP  Similar to a join  Groups bags from different inputs together, per key  Keeping the bags separate makes it easy to feed them to UDFs – grouped_data = COGROUP results BY queryString, revenue BY queryString;
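COGROUP's output shape can be sketched in Python: one tuple per key, holding the key plus one bag of matching tuples from each input. The datasets below are illustrative.

```python
from collections import defaultdict

results = [("ipod", "url1"), ("ipod", "url2"), ("pig", "url3")]
revenue = [("ipod", 2.5), ("flights", 1.0)]

def cogroup(left, right):
    # One output per key: (key -> (bag of left tuples, bag of right tuples));
    # unlike JOIN, the two bags stay separate
    grouped = defaultdict(lambda: ([], []))
    for t in left:
        grouped[t[0]][0].append(t)
    for t in right:
        grouped[t[0]][1].append(t)
    return dict(grouped)

g = cogroup(results, revenue)
assert g["ipod"] == ([("ipod", "url1"), ("ipod", "url2")], [("ipod", 2.5)])
assert g["flights"] == ([], [("flights", 1.0)])  # keys with no match keep an empty bag
```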
  • 33. JOIN  Not all users want to use COGROUP  Often a simple equi-join is all that is required – Example: join_result = JOIN results BY queryString, revenue BY queryString;  Other types of join are also supported: – Left outer – Right outer – Full outer
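An inner equi-join can be sketched as the cross product of matching tuples within each key (which is also how JOIN relates to COGROUP: group by key, then flatten). The data below is illustrative.

```python
results = [("ipod", "url1"), ("ipod", "url2")]
revenue = [("ipod", 2.5), ("ipod", 3.0), ("flights", 1.0)]

def equijoin(left, right):
    # Inner equi-join on the first field: emit one flattened tuple per
    # matching (left, right) pair
    out = []
    for lkey, lval in left:
        for rkey, rval in right:
            if lkey == rkey:
                out.append((lkey, lval, rval))
    return out

joined = equijoin(results, revenue)
assert len(joined) == 4  # 2 results rows x 2 revenue rows for 'ipod'
assert ("ipod", "url1", 2.5) in joined
```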
  • 34. Other Commands  Relational operators – UNION – CROSS – ORDER – DISTINCT – LIMIT  Eval functions – CONCAT – COUNT – DIFF
  • 35. PARALLEL Clause  Used to increase the parallelization of a job  We can specify the number of reduce tasks of the MR jobs created by Pig  It only affects the reduce tasks  There is no control over the map side  The system can also figure out the number of reducers on its own  If nothing is specified, a default (often a single reduce task) is used
  • 36. PARALLEL Clause (cont.)  Can be applied only to those commands that involve a reduce phase – COGROUP – CROSS – DISTINCT – GROUP – JOIN – ORDER
A = LOAD 'File1';
B = LOAD 'File2';
C = CROSS A, B PARALLEL 10;
  • 37. SPLIT Clause  We can split the input records into several outputs by providing conditions
A = LOAD 'data' AS (F1:int, F2:int, F3:int);  -- contains (1,2,3) and (2,5,7)
SPLIT A INTO B IF F3 < 7, C IF F2 == 5;
B: (1,2,3)
C: (2,5,7)
 Any expression can be written  UDFs can be used  It is not partitioning: the conditions may overlap, or leave some tuples out entirely
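SPLIT semantics can be sketched with independent filters over the same input (illustrative data and conditions). Because each output applies its own condition, a tuple can land in several outputs or in none, which is why SPLIT is not partitioning.

```python
# Relation A with fields (f1, f2, f3)
A = [(1, 2, 3), (2, 5, 7)]

# SPLIT A INTO B IF f3 < 7, C IF f2 == 5: each output gets its own
# independent filter over A, so outputs may overlap or miss tuples
B = [t for t in A if t[2] < 7]
C = [t for t in A if t[1] == 5]

assert B == [(1, 2, 3)]
assert C == [(2, 5, 7)]
```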
  • 38. Output  There are two ways to produce output – STORE • If you want to write the output to a location: STORE output_1 INTO 'hadoopuser/output'; – DUMP • Used to display the result in the Grunt shell itself • Dumping doesn't store the output anywhere: DUMP query_result;
  • 39. Building a Logical Plan  The Pig interpreter first parses every command the client issues  It verifies that the input files, bags, and columns referred to by the command are valid  It builds a logical plan for every bag the user defines  No processing is carried out yet  Processing is triggered only when the user invokes a STORE or DUMP command  This is a lazy execution approach  It enables optimizations such as filter reordering
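The lazy approach can be sketched with a toy plan builder: operators only record themselves, and nothing runs until a dump (the stand-in for STORE/DUMP) is requested. This is a simplification of what Pig actually builds, shown only to make the execution trigger concrete.

```python
class Plan:
    """Toy lazy dataflow: operators are recorded, not executed."""

    def __init__(self, data):
        self.data = data
        self.ops = []  # the "logical plan": a list of recorded steps

    def filter(self, pred):
        self.ops.append(("filter", pred))  # record, don't run
        return self

    def foreach(self, fn):
        self.ops.append(("foreach", fn))  # record, don't run
        return self

    def dump(self):
        # Only now is the recorded plan actually executed
        rows = self.data
        for kind, fn in self.ops:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

plan = Plan([1, 5, 9]).filter(lambda x: x > 2).foreach(lambda x: x * 10)
assert len(plan.ops) == 2          # plan built, nothing computed yet
assert plan.dump() == [50, 90]     # execution happens only on dump
```

Because the whole plan is visible before execution, a real system can rewrite it (e.g. push cheap filters ahead of expensive ones) before running anything.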
  • 40. Debugging Environment  Used to avoid running the complete code on the entire dataset  A user can create sample data by hand  But it is difficult to tailor such datasets, and users end up with unrealistic, self-cooked data  Pig Pen is Pig's debugging environment  It automatically creates a side dataset, called the sandbox dataset  Pig Pen has its own user interface
  • 41. Pig Pen  Outputs can be easily analyzed  Errors can be rectified earlier 41
  • 42. Future Work  User interface – A drag-and-drop style would help – Easier creation of logical-plan diagrams  UDF support for more languages  Unified environment – Currently lacks control structures like loops – Pig has to be embedded in a host language for iterative tasks
  • 43. Summary  A not-so-foreign language  Aims at a sweet spot between SQL and MapReduce  Reusable and easy to use  Novel debugging environment: Pig Pen  Pig has an active and growing user base at Yahoo!  Pigs – Eat anything – Live anywhere – Are domestic
  • 45. Based on "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD '08, June 9–12, 2008, Vancouver, BC, Canada. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins (Yahoo! Research)
  • 46. References  http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf  http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html  Book: Programming pig  http://www.brentozar.com/archive/2011/11/good-pig/  http://hortonworks.com/hadoop/pig/ 46