SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Downloaden Sie, um offline zu lesen
© 2012 coreservlets.com and Dima May
Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android.
Developed and taught by well-known author and developer. At public venues or onsite at your location.
Apache Pig
Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/
Also see the customized Hadoop training courses (onsite or at public venues) – http://courses.coreservlets.com/hadoop-training.html
© 2012 coreservlets.com and Dima May
Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android.
Developed and taught by well-known author and developer. At public venues or onsite at your location.
For live customized Hadoop training (including prep
for the Cloudera certification exam), please email
info@coreservlets.com
Taught by recognized Hadoop expert who spoke on Hadoop
several times at JavaOne, and who uses Hadoop daily in
real-world apps. Available at public venues, or customized
versions can be held on-site at your organization.
• Courses developed and taught by Marty Hall
– JSF 2.2, PrimeFaces, servlets/JSP, Ajax, jQuery, Android development, Java 7 or 8 programming, custom mix of topics
– Courses available in any state or country. Maryland/DC area companies can also choose afternoon/evening courses.
• Courses developed and taught by coreservlets.com experts (edited by Marty)
– Spring, Hibernate/JPA, GWT, Hadoop, HTML5, RESTful Web Services
Contact info@coreservlets.com for details
Agenda
• Pig Overview
• Execution Modes
• Installation
• Pig Latin Basics
• Developing Pig Script
– Most Occurred Start Letter
• Resources
4
Pig
5
• Top Level Apache Project
– http://pig.apache.org
• Pig is an abstraction on top of Hadoop
– Provides high level programming language designed for data
processing
– Converted into MapReduce and executed on Hadoop Clusters
• Pig is widely accepted and used
– Yahoo!, Twitter, Netflix, etc...
“is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating
these programs. “
Pig and MapReduce
• MapReduce requires programmers
– Must think in terms of map and reduce functions
– More than likely will require Java programmers
• Pig provides high-level language that can be
used by
– Analysts
– Data Scientists
– Statisticians
– Etc...
• Originally implemented at Yahoo! to allow
analysts to access data
6
Pig’s Features
• Join Datasets
• Sort Datasets
• Filter
• Data Types
• Group By
• User Defined Functions
• Etc..
7
Pig’s Use Cases
• Extract Transform Load (ETL)
– Ex: Processing large amounts of log data
• clean bad entries, join with other data-sets
• Research of “raw” information
– Ex. User Audit Logs
– Schema maybe unknown or inconsistent
– Data Scientists and Analysts may like Pig’s data
transformation paradigm
8
Pig Components
9
• Pig Latin
– Command based language
– Designed specifically for data transformation and flow
expression
• Execution Environment
– The environment in which Pig Latin commands are executed
– Currently there is support for Local and Hadoop modes
• Pig compiler converts Pig Latin to MapReduce
– Compiler strives to optimize execution
– You automatically get optimization improvements with Pig
updates
Execution Modes
• Local
– Executes in a single JVM
– Works exclusively with local file system
– Great for development, experimentation and prototyping
• Hadoop Mode
– Also known as MapReduce mode
– Pig renders Pig Latin into MapReduce jobs and executes
them on the cluster
– Can execute against semi-distributed or fully-distributed
hadoop installation
• We will run on semi-distributed cluster
10
Hadoop Mode
11
-- 1: Load text into a bag, where a row is a line of
text
lines = LOAD '/training/playArea/hamlet.txt' AS
(line:chararray);
-- 2: Tokenize the provided text
tokens = FOREACH lines GENERATE
flatten(TOKENIZE(line)) AS token:chararray;
PigLatin.pig
...
Pig
Hadoop
Execution
Environment
Hadoop
Cluster
Execute on
Hadoop Cluster
Monitor/Report
Parse Pig script and
compile into a set of
MapReduce jobs
Installation Prerequisites
• Java 6
– With $JAVA_HOME environment variable properly set
• Cygwin on Windows
12
Installation
• Add pig script to path
– export PIG_HOME=$CDH_HOME/pig-0.9.2-cdh4.0.0
– export PATH=$PATH:$PIG_HOME/bin
• $ pig -help
• That’s all we need to run in local mode
– Think of Pig as a ‘Pig Latin’ compiler, development tool
and executor
– Not tightly coupled with Hadoop clusters
13
Pig Installation for Hadoop
Mode
• Make sure Pig compiles with Hadoop
– Not a problem when using a distribution such as Cloudera
Distribution for Hadoop (CDH)
• Pig will utilize $HADOOP_HOME and
$HADOOP_CONF_DIR variables to locate
Hadoop configuration
– We already set these properties during MapReduce
installation
– Pig will use these properties to locate Namenode and
Resource Manager
14
Running Modes
• Can manually override the default mode via
‘-x’ or ‘-exectype’ options
– $pig -x local
– $pig -x mapreduce
15
$ pig
2012-07-14 13:38:58,139 [main] INFO org.apache.pig.Main - Logging error messages
to: /home/hadoop/Training/play_area/pig/pig_1342287538128.log
2012-07-14 13:38:58,458 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: hdfs://localhost:8020
$ pig -x local
2012-07-14 13:39:31,029 [main] INFO org.apache.pig.Main - Logging error messages
to: /home/hadoop/Training/play_area/pig/pig_1342287571019.log
2012-07-14 13:39:31,232 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: file:///
Running Pig
• Script
– Execute commands in a file
– $pig scriptFile.pig
• Grunt
– Interactive Shell for executing Pig Commands
– Started when script file is NOT provided
– Can execute scripts from Grunt via run or exec
commands
• Embedded
– Execute Pig commands using PigServer class
• Just like JDBC to execute SQL
– Can have programmatic access to Grunt via PigRunner
class16
Pig Latin Concepts
• Building blocks
– Field – piece of data
– Tuple – ordered set of fields, represented with “(“ and “)”
• (10.4, 5, word, 4, field1)
– Bag – collection of tuples, represented with “{“ and “}”
• { (10.4, 5, word, 4, field1), (this, 1, blah) }
• Similar to Relational Database
– Bag is a table in the database
– Tuple is a row in a table
– Bags do not require that all tuples contain the same
number
• Unlike relational table
17
Simple Pig Latin Example
18
$ pig
grunt> cat /training/playArea/pig/a.txt
a 1
d 4
c 9
k 6
grunt> records = LOAD '/training/playArea/pig/a.txt' as
(letter:chararray, count:int);
grunt> dump records;
...
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
.MapReduceLauncher - 50% complete
2012-07-14 17:36:22,040 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
.MapReduceLauncher - 100% complete
...
(a,1)
(d,4)
(c,9)
(k,6)
grunt>
Start Grunt with default
MapReduce mode
Grunt supports file
system commands
Load contents of text files
into a Bag named records
Display records bag to
the screen
Results of the bag named records
are printed to the screen
DUMP and STORE statements
• No action is taken until DUMP or STORE
commands are encountered
– Pig will parse, validate and analyze statements but not
execute them
• DUMP – displays the results to the screen
• STORE – saves results (typically to a file)
19
records = LOAD '/training/playArea/pig/a.txt' as
(letter:chararray, count:int);
...
...
...
...
...
DUMP final_bag;
Nothing is
executed;
Pig will
optimize this
entire
chunk of
script
The fun begins here
Large Data
• Hadoop data is usually quite large and it
doesn’t make sense to print it to the screen
• The common pattern is to persist results to
Hadoop (HDFS, HBase)
– This is done with STORE command
• For information and debugging purposes
you can print a small sub-set to the screen
20
grunt> records = LOAD '/training/playArea/pig/excite-small.log'
AS (userId:chararray, timestamp:long, query:chararray);
grunt> toPrint = LIMIT records 5;
grunt> DUMP toPrint;
Only 5 records will be
displayed
LOAD Command
• data – name of the directory or file
– Must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage which parses each line into
fields using a delimiter
• Default delimiter is tab (‘t’)
• The delimiter can be customized using regular
expressions
• AS – assign a schema to incoming data
– Assigns names to fields
– Declares types to fields
21
LOAD 'data' [USING function] [AS schema];
LOAD Command Example
22
records =
LOAD '/training/playArea/pig/excite-small.log'
USING PigStorage()
AS (userId:chararray, timestamp:long, query:chararray);
Data
User selected Load Function, there
are a lot of choices or you can
implement your own
Schema
Schema Data Types
23 Source: Apache Pig Documentation 0.9.2; “Pig Latin Basics”. 2012
Type Description Example
Simple
int Signed 32-bit integer 10
long Signed 64-bit integer 10L or 10l
float 32-bit floating point 10.5F or 10.5f
double 64-bit floating point 10.5 or 10.5e2 or 10.5E2
Arrays
chararray Character array (string) in Unicode UTF-8 hello world
bytearray Byte array (blob)
Complex Data Types
tuple An ordered set of fields (19,2)
bag An collection of tuples {(19,2), (18,1)}
map An collection of tuples [open#apache]
Pig Latin – Diagnostic Tools
• Display the structure of the Bag
– grunt> DESCRIBE <bag_name>;
• Display Execution Plan
– Produces Various reports
• Logical Plan
• MapReduce Plan
– grunt> EXPLAIN <bag_name>;
• Illustrate how Pig engine transforms the
data
– grunt> ILLUSTRATE <bag_name>;
24
Pig Latin - Grouping
25
grunt> chars = LOAD '/training/playArea/pig/b.txt' AS
(c:chararray);
grunt> describe chars;
chars: {c: chararray}
grunt> dump chars;
(a)
(k)
...
...
(k)
(c)
(k)
grunt> charGroup = GROUP chars by c;
grunt> describe charGroup;
charGroup: {group: chararray,chars: {(c: chararray)}}
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
Creates a new bag with element named
group and element named chars
The chars bag is
grouped by “c”;
therefore ‘group’
element will contain
unique values
‘chars’ element is a bag itself and
contains all tuples from ‘chars’
bag that match the value form ‘c’
ILUSTRATE Command
26
grunt> chars = LOAD ‘/training/playArea/pig/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> ILLUSTRATE charGroup;
------------------------------
| chars | c:chararray |
------------------------------
| | c |
| | c |
------------------------------
--------------------------------------------------------------------------------
| charGroup | group:chararray | chars:bag{:tuple(c:chararray)} |
--------------------------------------------------------------------------------
| | c | {(c), (c)} |
--------------------------------------------------------------------------------
Inner vs. Outer Bag
27
grunt> chars = LOAD ‘/training/playArea/pig/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> ILLUSTRATE charGroup;
------------------------------
| chars | c:chararray |
------------------------------
| | c |
| | c |
------------------------------
--------------------------------------------------------------------------------
| charGroup | group:chararray | chars:bag{:tuple(c:chararray)} |
--------------------------------------------------------------------------------
| | c | {(c), (c)} |
--------------------------------------------------------------------------------
Inner Bag
Outer Bag
Inner vs. Outer Bag
28
grunt> chars = LOAD '/training/playArea/pig/b.txt' AS
(c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
Inner Bag
Outer Bag
Pig Latin - FOREACH
• FOREACH <bag> GENERATE <data>
– Iterate over each element in the bag and produce a result
– Ex: grunt> result = FOREACH bag GENERATE f1;
29
grunt> records = LOAD 'data/a.txt' AS (c:chararray, i:int);
grunt> dump records;
(a,1)
(d,4)
(c,9)
(k,6)
grunt> counts = foreach records generate i;
grunt> dump counts;
(1)
(4)
(9)
(6)
For each row emit ‘i’ field
FOREACH with Functions
30
FOREACH B GENERATE group, FUNCTION(A);
– Pig comes with many functions including COUNT, FLATTEN, CONCAT,
etc...
– Can implement a custom function
grunt> chars = LOAD 'data/b.txt' AS (c:chararray);
grunt> charGroup = GROUP chars by c;
grunt> dump charGroup;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
grunt> describe charGroup;
charGroup: {group: chararray,chars: {(c: chararray)}}
grunt> counts = FOREACH charGroup GENERATE group, COUNT(chars);
grunt> dump counts;
(a,3)
(c,2)
(i,3)
(k,4)
(l,2)
For each row in ‘charGroup’ bag, emit
group field and count the number of
items in ‘chars’ bag
TOKENIZE Function
• Splits a string into tokens and outputs as a
bag of tokens
– Separators are: space, double quote("), coma(,)
parenthesis(()), star(*)
31
grunt> linesOfText = LOAD 'data/c.txt' AS (line:chararray);
grunt> dump linesOfText;
(this is a line of text)
(yet another line of text)
(third line of words)
grunt> tokenBag = FOREACH linesOfText GENERATE TOKENIZE(line);
grunt> dump tokenBag;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> describe tokenBag;
tokenBag: {bag_of_tokenTuples: {tuple_of_tokens: (token: chararray)}}
Split each row line by space
and return a bag of tokens
Each row is a bag of
words produced by
TOKENIZE function
FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it’s an operator
– Re-arranges output
32
grunt> dump tokenBag;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> flatBag = FOREACH tokenBag GENERATE flatten($0);
grunt> dump flatBag;
(this)
(is)
(a)
...
...
(text)
(third)
(line)
(of)
(words)
Nested structure: bag of
bags of tuples
Each row is flatten resulting in a
bag of simple tokens
Elements in a bag can
be referenced by index
Conventions and Case
Sensitivity
33
• Case Sensitive
– Alias names
– Pig Latin Functions
• Case Insensitive
– Pig Latin Keywords
• General conventions
– Upper case is a system keyword
– Lowercase is something that you provide
counts = FOREACH charGroup GENERATE group, COUNT(c);
Alias
Case Sensitive
Keywords
Case Insensitive
Alias
Case Sensitive
Function
Case Sensitive
Problem: Locate Most Occurred
Start Letter
34
• Calculate number of
occurrences of each letter in
the provided body of text
• Traverse each letter
comparing occurrence count
• Produce start letter that has
the most occurrences
(For so this side of our known world esteem'd him)
Did slay this Fortinbras; who, by a seal'd compact,
Well ratified by law and heraldry,
Did forfeit, with his life, all those his lands
Which he stood seiz'd of, to the conqueror;
Against the which a moiety competent
Was gaged by our king; which had return'd
To the inheritance of Fortinbras,
A 89530
B 3920
..
..
Z 876
T 495959
‘Most Occurred Start Letter’ Pig
Way
1. Load text into a bag (named ‘lines’)
2. Tokenize the text in the ‘lines’ bag
3. Retain first letter of each token
4. Group by letter
5. Count the number of occurrences in each
group
6. Descending order the group by the count
7. Grab the first element => Most occurring
letter
8. Persist result on a file system
35
1: Load Text Into a Bag
36
grunt> lines = LOAD '/training/data/hamlet.txt'
AS (line:chararray);
Load text file into a bag, stick entire line into
element ‘line’ of type ‘chararray’
grunt> describe lines;
lines: {line: chararray}
grunt> toDisplay = LIMIT lines 5;
grunt> dump toDisplay;
(This Etext file is presented by Project Gutenberg, in)
(This etext is a typo-corrected version of Shakespeare's Hamlet,)
(cooperation with World Library, Inc., from their Library of the)
(*This Etext has certain copyright implications you should read!*
(Future and Shakespeare CDROMS. Project Gutenberg often releases
INSPECT lines bag:
Each row is a line of text
2: Tokenize the Text in the
‘Lines’ Bag
37
grunt> tokens = FOREACH lines GENERATE
flatten(TOKENIZE(line)) AS token:chararray;
For each line of text (1) tokenize that line
(2) flatten the structure to produce 1 word per row
grunt> describe tokens
tokens: {token: chararray}
grunt> toDisplay = LIMIT tokens 5;
grunt> dump toDisplay;
(a)
(is)
(of)
(This)
(etext)
INSPECT tokens bag:
Each row is now a token
3: Retain First Letter of Each
Token
38
grunt> letters = FOREACH tokens GENERATE
SUBSTRING(token,0,1) AS letter:chararray;
For each token grab the first letter; utilize
SUBSTRING function
grunt> describe letters;
letters: {letter: chararray}
grunt> toDisplay = LIMIT letters 5;
grunt> dump toDisplay;
(a)
(i)
(T)
(e)
(t)
INSPECT letters bag:
What we have no is 1
character per row
4: Group by Letter
39
grunt> letterGroup = GROUP letters BY letter;
Create a bag for each unique character; the
“grouped” bag will contain the same character
for each occurrence of that character
grunt> describe letterGroup;
letterGroup: {group: chararray,letters: {(letter: chararray)}}
grunt> toDisplay = LIMIT letterGroup 5;
grunt> dump toDisplay;
(0,{(0),(0),(0)})
(a,{(a),(a))
(2,{(2),(2),(2),(2),(2))
(3,{(3),(3),(3)})
(b,{(b)})
INSPECT letterGroup bag:
Next we’ll need to convert
characters occurrences into
counts; Note this display was
modified as there were too many
characters to fit on the screen
5: Count the Number of
Occurrences in Each Group
40
grunt> countPerLetter = FOREACH letterGroup
GENERATE group, COUNT(letters);
For each row, count occurrence of the letter
grunt> describe countPerLetter;
countPerLetter: {group: chararray,long}
grunt> toDisplay = LIMIT countPerLetter 5;
grunt> dump toDisplay;
(A,728)
(B,325)
(C,291)
(D,194)
(E,264)
INSPECT countPerLetter bag:
Each row now has the
character and the
number of times it was
found to start a word.
All we have to do is find
the maximum
6: Descending Order the Group
by the Count
41
grunt> orderedCountPerLetter = ORDER
countPerLetter BY $1 DESC;
Simply order the bag by the first element, a
number of occurrences for that element
grunt> describe orderedCountPerLetter;
orderedCountPerLetter: {group: chararray,long}
grunt> toDisplay = LIMIT orderedCountPerLetter 5;
grunt> dump toDisplay;
(t,3711)
(a,2379)
(s,1938)
(m,1787)
(h,1725)
INSPECT orderedCountPerLetter bag:
All we have to do now is
just grab the first element
7: Grab the First Element
42
grunt> result = LIMIT orderedCountPerLetter 1;
The rows were already ordered in descending
order, so simply limiting to one element gives
us the result
grunt> describe result;
result: {group: chararray,long}
grunt> dump result;
(t,3711)
INSPECT orderedCountPerLetter bag:
There it is
8: Persist Result on a File
System
43
grunt> STORE result INTO
'/training/playArea/pig/mostSeenLetterOutput';
Result is saved under the provided directory
$ hdfs dfs -cat
/training/playArea/pig/mostSeenLetterOutput/part-r-00000
t 3711
INSPECT result
result Notice that result was stored int part-r-0000, the
regular artifact of a MapReduce reducer; Pig
compiles Pig Latin into MapReduce code and
executes.
MostSeenStartLetter.pig Script
• Execute the script:
– $ pig MostSeenStartLetter.pig44
-- 1: Load text into a bag, where a row is a line of text
lines = LOAD '/training/data/hamlet.txt' AS (line:chararray);
-- 2: Tokenize the provided text
tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;
-- 3: Retain first letter of each token
letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
-- 4: Group by letter
letterGroup = GROUP letters BY letter;
-- 5: Count the number of occurrences in each group
countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters);
-- 6: Descending order the group by the count
orderedCountPerLetter = ORDER countPerLetter BY $1 DESC;
-- 7: Grab the first element => Most occurring letter
result = LIMIT orderedCountPerLetter 1;
-- 8: Persist result on a file system
STORE result INTO '/training/playArea/pig/mostSeenLetterOutput';
Pig Tools
• Community has developed several tools to
support Pig
– https://cwiki.apache.org/confluence/display/PIG/PigTools
• We have PigPen Eclipse Plugin installed:
– Download the latest jar release at
https://issues.apache.org/jira/browse/PIG-366
• As of writing org.apache.pig.pigpen_0.7.5.jar
– Place jar in eclupse/plugins/
– Restart eclipse
45
Pig Resources
• Apache Pig Documentation
– http://pig.apache.org
46
Chapter About Pig
Hadoop: The Definitive Guide
Tom White (Author)
O'Reilly Media; 3rd Edition (May6, 2012)
Programming Pig
Alan Gates (Author)
O'Reilly Media; 1st Edition (October, 2011)
Chapter About Pig
Hadoop in Action
Chuck Lam (Author)
Manning Publications; 1st Edition (December, 2010)
Pig Resources
47
Hadoop in Practice
Alex Holmes (Author)
Manning Publications; (October 10, 2012)
© 2012 coreservlets.com and Dima May
Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android.
Developed and taught by well-known author and developer. At public venues or onsite at your location.
Wrap-Up
Summary
• We learned about
– Pig Overview
– Execution Modes
– Installation
– Pig Latin Basics
– Resources
• We developed Pig Script to locate “Most
Occurred Start Letter”
49
© 2012 coreservlets.com and Dima May
Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android.
Developed and taught by well-known author and developer. At public venues or onsite at your location.
Questions?
More info:
http://www.coreservlets.com/hadoop-tutorial/ – Hadoop programming tutorial
http://courses.coreservlets.com/hadoop-training.html – Customized Hadoop training courses, at public venues or onsite at your organization
http://courses.coreservlets.com/Course-Materials/java.html – General Java programming tutorial
http://www.coreservlets.com/java-8-tutorial/ – Java 8 tutorial
http://www.coreservlets.com/JSF-Tutorial/jsf2/ – JSF 2.2 tutorial
http://www.coreservlets.com/JSF-Tutorial/primefaces/ – PrimeFaces tutorial
http://coreservlets.com/ – JSF 2, PrimeFaces, Java 7 or 8, Ajax, jQuery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 

Was ist angesagt? (20)

Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
 
Apache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBaseApache Drill with Oracle, Hive and HBase
Apache Drill with Oracle, Hive and HBase
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
 
01 hbase
01 hbase01 hbase
01 hbase
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Hdfs java api
Hdfs java apiHdfs java api
Hdfs java api
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
H cat berlinbuzzwords2012
H cat berlinbuzzwords2012H cat berlinbuzzwords2012
H cat berlinbuzzwords2012
 

Andere mochten auch (8)

Vendendo (com) WordPress
Vendendo (com) WordPressVendendo (com) WordPress
Vendendo (com) WordPress
 
Reason 3: Therapy
Reason 3: TherapyReason 3: Therapy
Reason 3: Therapy
 
Quality 2
Quality 2Quality 2
Quality 2
 
4 pencernaan
4 pencernaan4 pencernaan
4 pencernaan
 
Final تعليمات خاصة بمعمل السموم الفطرية copy
Final تعليمات خاصة بمعمل السموم الفطرية   copyFinal تعليمات خاصة بمعمل السموم الفطرية   copy
Final تعليمات خاصة بمعمل السموم الفطرية copy
 
Trillenium
TrilleniumTrillenium
Trillenium
 
Powerpoint miami 2
Powerpoint miami 2Powerpoint miami 2
Powerpoint miami 2
 
Peneranganadjktif2
Peneranganadjktif2Peneranganadjktif2
Peneranganadjktif2
 

Ähnlich wie 06 pig-01-intro

power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
bhargavi804095
 
Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaram
Viswanath Gangavaram
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
DrPDShebaKeziaMalarc
 

Ähnlich wie 06 pig-01-intro (20)

power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
 
Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaram
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
 
pig.ppt
pig.pptpig.ppt
pig.ppt
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
PigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxPigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptx
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdf
 
Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 

Kürzlich hochgeladen

Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Kürzlich hochgeladen (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

06 pig-01-intro

  • 1. © 2012 coreservlets.com and Dima May Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location. Apache Pig Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses (onsite or at public venues) – http://courses.coreservlets.com/hadoop-training.html © 2012 coreservlets.com and Dima May Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location. For live customized Hadoop training (including prep for the Cloudera certification exam), please email info@coreservlets.com Taught by recognized Hadoop expert who spoke on Hadoop several times at JavaOne, and who uses Hadoop daily in real-world apps. Available at public venues, or customized versions can be held on-site at your organization. • Courses developed and taught by Marty Hall – JSF 2.2, PrimeFaces, servlets/JSP, Ajax, jQuery, Android development, Java 7 or 8 programming, custom mix of topics – Courses available in any state or country. Maryland/DC area companies can also choose afternoon/evening courses. • Courses developed and taught by coreservlets.com experts (edited by Marty) – Spring, Hibernate/JPA, GWT, Hadoop, HTML5, RESTful Web Services Contact info@coreservlets.com for details
  • 2. Agenda • Pig Overview • Execution Modes • Installation • Pig Latin Basics • Developing Pig Script – Most Occurred Start Letter • Resources 4 Pig 5 • Top Level Apache Project – http://pig.apache.org • Pig is an abstraction on top of Hadoop – Provides high level programming language designed for data processing – Converted into MapReduce and executed on Hadoop Clusters • Pig is widely accepted and used – Yahoo!, Twitter, Netflix, etc... “is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. “
  • 3. Pig and MapReduce • MapReduce requires programmers – Must think in terms of map and reduce functions – More than likely will require Java programmers • Pig provides high-level language that can be used by – Analysts – Data Scientists – Statisticians – Etc... • Originally implemented at Yahoo! to allow analysts to access data 6 Pig’s Features • Join Datasets • Sort Datasets • Filter • Data Types • Group By • User Defined Functions • Etc.. 7
  • 4. Pig’s Use Cases • Extract Transform Load (ETL) – Ex: Processing large amounts of log data • clean bad entries, join with other data-sets • Research of “raw” information – Ex. User Audit Logs – Schema maybe unknown or inconsistent – Data Scientists and Analysts may like Pig’s data transformation paradigm 8 Pig Components 9 • Pig Latin – Command based language – Designed specifically for data transformation and flow expression • Execution Environment – The environment in which Pig Latin commands are executed – Currently there is support for Local and Hadoop modes • Pig compiler converts Pig Latin to MapReduce – Compiler strives to optimize execution – You automatically get optimization improvements with Pig updates
  • 5. Execution Modes • Local – Executes in a single JVM – Works exclusively with local file system – Great for development, experimentation and prototyping • Hadoop Mode – Also known as MapReduce mode – Pig renders Pig Latin into MapReduce jobs and executes them on the cluster – Can execute against semi-distributed or fully-distributed hadoop installation • We will run on semi-distributed cluster 10 Hadoop Mode 11 -- 1: Load text into a bag, where a row is a line of text lines = LOAD '/training/playArea/hamlet.txt' AS (line:chararray); -- 2: Tokenize the provided text tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray; PigLatin.pig ... Pig Hadoop Execution Environment Hadoop Cluster Execute on Hadoop Cluster Monitor/Report Parse Pig script and compile into a set of MapReduce jobs
  • 6. Installation Prerequisites • Java 6 – With $JAVA_HOME environment variable properly set • Cygwin on Windows 12 Installation • Add pig script to path – export PIG_HOME=$CDH_HOME/pig-0.9.2-cdh4.0.0 – export PATH=$PATH:$PIG_HOME/bin • $ pig -help • That’s all we need to run in local mode – Think of Pig as a ‘Pig Latin’ compiler, development tool and executor – Not tightly coupled with Hadoop clusters 13
  • 7. Pig Installation for Hadoop Mode • Make sure Pig compiles with Hadoop – Not a problem when using a distribution such as Cloudera Distribution for Hadoop (CDH) • Pig will utilize $HADOOP_HOME and $HADOOP_CONF_DIR variables to locate Hadoop configuration – We already set these properties during MapReduce installation – Pig will use these properties to locate Namenode and Resource Manager 14 Running Modes • Can manually override the default mode via ‘-x’ or ‘-exectype’ options – $pig -x local – $pig -x mapreduce 15 $ pig 2012-07-14 13:38:58,139 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Training/play_area/pig/pig_1342287538128.log 2012-07-14 13:38:58,458 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020 $ pig -x local 2012-07-14 13:39:31,029 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Training/play_area/pig/pig_1342287571019.log 2012-07-14 13:39:31,232 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
  • 8. Running Pig • Script – Execute commands in a file – $pig scriptFile.pig • Grunt – Interactive Shell for executing Pig Commands – Started when script file is NOT provided – Can execute scripts from Grunt via run or exec commands • Embedded – Execute Pig commands using PigServer class • Just like JDBC to execute SQL – Can have programmatic access to Grunt via PigRunner class16 Pig Latin Concepts • Building blocks – Field – piece of data – Tuple – ordered set of fields, represented with “(“ and “)” • (10.4, 5, word, 4, field1) – Bag – collection of tuples, represented with “{“ and “}” • { (10.4, 5, word, 4, field1), (this, 1, blah) } • Similar to Relational Database – Bag is a table in the database – Tuple is a row in a table – Bags do not require that all tuples contain the same number • Unlike relational table 17
  • 9. Simple Pig Latin Example 18 $ pig grunt> cat /training/playArea/pig/a.txt a 1 d 4 c 9 k 6 grunt> records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int); grunt> dump records; ... org.apache.pig.backend.hadoop.executionengine.mapReduceLayer .MapReduceLauncher - 50% complete 2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer .MapReduceLauncher - 100% complete ... (a,1) (d,4) (c,9) (k,6) grunt> Start Grunt with default MapReduce mode Grunt supports file system commands Load contents of text files into a Bag named records Display records bag to the screen Results of the bag named records are printed to the screen DUMP and STORE statements • No action is taken until DUMP or STORE commands are encountered – Pig will parse, validate and analyze statements but not execute them • DUMP – displays the results to the screen • STORE – saves results (typically to a file) 19 records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int); ... ... ... ... ... DUMP final_bag; Nothing is executed; Pig will optimize this entire chunk of script The fun begins here
  • 10. Large Data • Hadoop data is usually quite large and it doesn’t make sense to print it to the screen • The common pattern is to persist results to Hadoop (HDFS, HBase) – This is done with STORE command • For information and debugging purposes you can print a small sub-set to the screen 20 grunt> records = LOAD '/training/playArea/pig/excite-small.log' AS (userId:chararray, timestamp:long, query:chararray); grunt> toPrint = LIMIT records 5; grunt> DUMP toPrint; Only 5 records will be displayed LOAD Command • data – name of the directory or file – Must be in single quotes • USING – specifies the load function to use – By default uses PigStorage which parses each line into fields using a delimiter • Default delimiter is tab (‘t’) • The delimiter can be customized using regular expressions • AS – assign a schema to incoming data – Assigns names to fields – Declares types to fields 21 LOAD 'data' [USING function] [AS schema];
  • 11. LOAD Command Example 22 records = LOAD '/training/playArea/pig/excite-small.log' USING PigStorage() AS (userId:chararray, timestamp:long, query:chararray); Data User selected Load Function, there are a lot of choices or you can implement your own Schema Schema Data Types 23 Source: Apache Pig Documentation 0.9.2; “Pig Latin Basics”. 2012 Type Description Example Simple int Signed 32-bit integer 10 long Signed 64-bit integer 10L or 10l float 32-bit floating point 10.5F or 10.5f double 64-bit floating point 10.5 or 10.5e2 or 10.5E2 Arrays chararray Character array (string) in Unicode UTF-8 hello world bytearray Byte array (blob) Complex Data Types tuple An ordered set of fields (19,2) bag An collection of tuples {(19,2), (18,1)} map An collection of tuples [open#apache]
  • 12. Pig Latin – Diagnostic Tools • Display the structure of the Bag – grunt> DESCRIBE <bag_name>; • Display Execution Plan – Produces Various reports • Logical Plan • MapReduce Plan – grunt> EXPLAIN <bag_name>; • Illustrate how Pig engine transforms the data – grunt> ILLUSTRATE <bag_name>; 24 Pig Latin - Grouping 25 grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray); grunt> describe chars; chars: {c: chararray} grunt> dump chars; (a) (k) ... ... (k) (c) (k) grunt> charGroup = GROUP chars by c; grunt> describe charGroup; charGroup: {group: chararray,chars: {(c: chararray)}} grunt> dump charGroup; (a,{(a),(a),(a)}) (c,{(c),(c)}) (i,{(i),(i),(i)}) (k,{(k),(k),(k),(k)}) (l,{(l),(l)}) Creates a new bag with element named group and element named chars The chars bag is grouped by “c”; therefore ‘group’ element will contain unique values ‘chars’ element is a bag itself and contains all tuples from ‘chars’ bag that match the value form ‘c’
  • 13. ILUSTRATE Command 26 grunt> chars = LOAD ‘/training/playArea/pig/b.txt' AS (c:chararray); grunt> charGroup = GROUP chars by c; grunt> ILLUSTRATE charGroup; ------------------------------ | chars | c:chararray | ------------------------------ | | c | | | c | ------------------------------ -------------------------------------------------------------------------------- | charGroup | group:chararray | chars:bag{:tuple(c:chararray)} | -------------------------------------------------------------------------------- | | c | {(c), (c)} | -------------------------------------------------------------------------------- Inner vs. Outer Bag 27 grunt> chars = LOAD ‘/training/playArea/pig/b.txt' AS (c:chararray); grunt> charGroup = GROUP chars by c; grunt> ILLUSTRATE charGroup; ------------------------------ | chars | c:chararray | ------------------------------ | | c | | | c | ------------------------------ -------------------------------------------------------------------------------- | charGroup | group:chararray | chars:bag{:tuple(c:chararray)} | -------------------------------------------------------------------------------- | | c | {(c), (c)} | -------------------------------------------------------------------------------- Inner Bag Outer Bag
  • 14. Inner vs. Outer Bag 28 grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray); grunt> charGroup = GROUP chars by c; grunt> dump charGroup; (a,{(a),(a),(a)}) (c,{(c),(c)}) (i,{(i),(i),(i)}) (k,{(k),(k),(k),(k)}) (l,{(l),(l)}) Inner Bag Outer Bag Pig Latin - FOREACH • FOREACH <bag> GENERATE <data> – Iterate over each element in the bag and produce a result – Ex: grunt> result = FOREACH bag GENERATE f1; 29 grunt> records = LOAD 'data/a.txt' AS (c:chararray, i:int); grunt> dump records; (a,1) (d,4) (c,9) (k,6) grunt> counts = foreach records generate i; grunt> dump counts; (1) (4) (9) (6) For each row emit ‘i’ field
  • 15. FOREACH with Functions 30 FOREACH B GENERATE group, FUNCTION(A); – Pig comes with many functions including COUNT, FLATTEN, CONCAT, etc... – Can implement a custom function grunt> chars = LOAD 'data/b.txt' AS (c:chararray); grunt> charGroup = GROUP chars by c; grunt> dump charGroup; (a,{(a),(a),(a)}) (c,{(c),(c)}) (i,{(i),(i),(i)}) (k,{(k),(k),(k),(k)}) (l,{(l),(l)}) grunt> describe charGroup; charGroup: {group: chararray,chars: {(c: chararray)}} grunt> counts = FOREACH charGroup GENERATE group, COUNT(chars); grunt> dump counts; (a,3) (c,2) (i,3) (k,4) (l,2) For each row in ‘charGroup’ bag, emit group field and count the number of items in ‘chars’ bag TOKENIZE Function • Splits a string into tokens and outputs as a bag of tokens – Separators are: space, double quote("), coma(,) parenthesis(()), star(*) 31 grunt> linesOfText = LOAD 'data/c.txt' AS (line:chararray); grunt> dump linesOfText; (this is a line of text) (yet another line of text) (third line of words) grunt> tokenBag = FOREACH linesOfText GENERATE TOKENIZE(line); grunt> dump tokenBag; ({(this),(is),(a),(line),(of),(text)}) ({(yet),(another),(line),(of),(text)}) ({(third),(line),(of),(words)}) grunt> describe tokenBag; tokenBag: {bag_of_tokenTuples: {tuple_of_tokens: (token: chararray)}} Split each row line by space and return a bag of tokens Each row is a bag of words produced by TOKENIZE function
  • 16. FLATTEN Operator • Flattens nested bags and data types • FLATTEN is not a function, it’s an operator – Re-arranges output 32 grunt> dump tokenBag; ({(this),(is),(a),(line),(of),(text)}) ({(yet),(another),(line),(of),(text)}) ({(third),(line),(of),(words)}) grunt> flatBag = FOREACH tokenBag GENERATE flatten($0); grunt> dump flatBag; (this) (is) (a) ... ... (text) (third) (line) (of) (words) Nested structure: bag of bags of tuples Each row is flatten resulting in a bag of simple tokens Elements in a bag can be referenced by index Conventions and Case Sensitivity 33 • Case Sensitive – Alias names – Pig Latin Functions • Case Insensitive – Pig Latin Keywords • General conventions – Upper case is a system keyword – Lowercase is something that you provide counts = FOREACH charGroup GENERATE group, COUNT(c); Alias Case Sensitive Keywords Case Insensitive Alias Case Sensitive Function Case Sensitive
  • 17. Problem: Locate Most Occurred Start Letter 34 • Calculate number of occurrences of each letter in the provided body of text • Traverse each letter comparing occurrence count • Produce start letter that has the most occurrences (For so this side of our known world esteem'd him) Did slay this Fortinbras; who, by a seal'd compact, Well ratified by law and heraldry, Did forfeit, with his life, all those his lands Which he stood seiz'd of, to the conqueror; Against the which a moiety competent Was gaged by our king; which had return'd To the inheritance of Fortinbras, A 89530 B 3920 .. .. Z 876 T 495959 ‘Most Occurred Start Letter’ Pig Way 1. Load text into a bag (named ‘lines’) 2. Tokenize the text in the ‘lines’ bag 3. Retain first letter of each token 4. Group by letter 5. Count the number of occurrences in each group 6. Descending order the group by the count 7. Grab the first element => Most occurring letter 8. Persist result on a file system 35
  • 18. 1: Load Text Into a Bag 36 grunt> lines = LOAD '/training/data/hamlet.txt' AS (line:chararray); Load text file into a bag, stick entire line into element ‘line’ of type ‘chararray’ grunt> describe lines; lines: {line: chararray} grunt> toDisplay = LIMIT lines 5; grunt> dump toDisplay; (This Etext file is presented by Project Gutenberg, in) (This etext is a typo-corrected version of Shakespeare's Hamlet,) (cooperation with World Library, Inc., from their Library of the) (*This Etext has certain copyright implications you should read!* (Future and Shakespeare CDROMS. Project Gutenberg often releases INSPECT lines bag: Each row is a line of text 2: Tokenize the Text in the ‘Lines’ Bag 37 grunt> tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray; For each line of text (1) tokenize that line (2) flatten the structure to produce 1 word per row grunt> describe tokens tokens: {token: chararray} grunt> toDisplay = LIMIT tokens 5; grunt> dump toDisplay; (a) (is) (of) (This) (etext) INSPECT tokens bag: Each row is now a token
  • 19. 3: Retain First Letter of Each Token 38 grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray; For each token grab the first letter; utilize SUBSTRING function grunt> describe letters; letters: {letter: chararray} grunt> toDisplay = LIMIT letters 5; grunt> dump toDisplay; (a) (i) (T) (e) (t) INSPECT letters bag: What we have no is 1 character per row 4: Group by Letter 39 grunt> letterGroup = GROUP letters BY letter; Create a bag for each unique character; the “grouped” bag will contain the same character for each occurrence of that character grunt> describe letterGroup; letterGroup: {group: chararray,letters: {(letter: chararray)}} grunt> toDisplay = LIMIT letterGroup 5; grunt> dump toDisplay; (0,{(0),(0),(0)}) (a,{(a),(a)) (2,{(2),(2),(2),(2),(2)) (3,{(3),(3),(3)}) (b,{(b)}) INSPECT letterGroup bag: Next we’ll need to convert characters occurrences into counts; Note this display was modified as there were too many characters to fit on the screen
  • 20. 5: Count the Number of Occurrences in Each Group 40 grunt> countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters); For each row, count occurrence of the letter grunt> describe countPerLetter; countPerLetter: {group: chararray,long} grunt> toDisplay = LIMIT countPerLetter 5; grunt> dump toDisplay; (A,728) (B,325) (C,291) (D,194) (E,264) INSPECT countPerLetter bag: Each row now has the character and the number of times it was found to start a word. All we have to do is find the maximum 6: Descending Order the Group by the Count 41 grunt> orderedCountPerLetter = ORDER countPerLetter BY $1 DESC; Simply order the bag by the first element, a number of occurrences for that element grunt> describe orderedCountPerLetter; orderedCountPerLetter: {group: chararray,long} grunt> toDisplay = LIMIT orderedCountPerLetter 5; grunt> dump toDisplay; (t,3711) (a,2379) (s,1938) (m,1787) (h,1725) INSPECT orderedCountPerLetter bag: All we have to do now is just grab the first element
  • 21. 7: Grab the First Element 42 grunt> result = LIMIT orderedCountPerLetter 1; The rows were already ordered in descending order, so simply limiting to one element gives us the result grunt> describe result; result: {group: chararray,long} grunt> dump result; (t,3711) INSPECT orderedCountPerLetter bag: There it is 8: Persist Result on a File System 43 grunt> STORE result INTO '/training/playArea/pig/mostSeenLetterOutput'; Result is saved under the provided directory $ hdfs dfs -cat /training/playArea/pig/mostSeenLetterOutput/part-r-00000 t 3711 INSPECT result result Notice that result was stored int part-r-0000, the regular artifact of a MapReduce reducer; Pig compiles Pig Latin into MapReduce code and executes.
  • 22. MostSeenStartLetter.pig Script • Execute the script: – $ pig MostSeenStartLetter.pig44 -- 1: Load text into a bag, where a row is a line of text lines = LOAD '/training/data/hamlet.txt' AS (line:chararray); -- 2: Tokenize the provided text tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray; -- 3: Retain first letter of each token letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray; -- 4: Group by letter letterGroup = GROUP letters BY letter; -- 5: Count the number of occurrences in each group countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters); -- 6: Descending order the group by the count orderedCountPerLetter = ORDER countPerLetter BY $1 DESC; -- 7: Grab the first element => Most occurring letter result = LIMIT orderedCountPerLetter 1; -- 8: Persist result on a file system STORE result INTO '/training/playArea/pig/mostSeenLetterOutput'; Pig Tools • Community has developed several tools to support Pig – https://cwiki.apache.org/confluence/display/PIG/PigTools • We have PigPen Eclipse Plugin installed: – Download the latest jar release at https://issues.apache.org/jira/browse/PIG-366 • As of writing org.apache.pig.pigpen_0.7.5.jar – Place jar in eclupse/plugins/ – Restart eclipse 45
  • 23. Pig Resources • Apache Pig Documentation – http://pig.apache.org 46 Chapter About Pig Hadoop: The Definitive Guide Tom White (Author) O'Reilly Media; 3rd Edition (May6, 2012) Programming Pig Alan Gates (Author) O'Reilly Media; 1st Edition (October, 2011) Chapter About Pig Hadoop in Action Chuck Lam (Author) Manning Publications; 1st Edition (December, 2010) Pig Resources 47 Hadoop in Practice Alex Holmes (Author) Manning Publications; (October 10, 2012)
  • 24. © 2012 coreservlets.com and Dima May Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location. Wrap-Up Summary • We learned about – Pig Overview – Execution Modes – Installation – Pig Latin Basics – Resources • We developed Pig Script to locate “Most Occurred Start Letter” 49
  • 25. © 2012 coreservlets.com and Dima May Customized Java EE Training: http://courses.coreservlets.com/ Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location. Questions? More info: http://www.coreservlets.com/hadoop-tutorial/ – Hadoop programming tutorial http://courses.coreservlets.com/hadoop-training.html – Customized Hadoop training courses, at public venues or onsite at your organization http://courses.coreservlets.com/Course-Materials/java.html – General Java programming tutorial http://www.coreservlets.com/java-8-tutorial/ – Java 8 tutorial http://www.coreservlets.com/JSF-Tutorial/jsf2/ – JSF 2.2 tutorial http://www.coreservlets.com/JSF-Tutorial/primefaces/ – PrimeFaces tutorial http://coreservlets.com/ – JSF 2, PrimeFaces, Java 7 or 8, Ajax, jQuery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training