SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Pig
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Session Agenda

• What is Pig and why should you use it?


• Installing & Setting up Pig


• Pig’s Components


• Using Pig with Hadoop MapReduce


• Summary & Conclusion
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




What is Pig?

• Higher-level abstraction for Hadoop MapReduce


• An infrastructure for data analysis using a scripting language


   • named, Pig Latin
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                       Copyright for all other & referenced work is retained by their respective owners.




Why should you use Pig?

• Hadoop MapReduce:


  • Requires you to be a programmer


  • Forces you to design all your algorithms in terms of the map and reduce
    primitives
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Installing & Setting Up Pig -- Required Software

• Required Software:


  • Java 1.6.x


  • Hadoop 0.20.x


  • Ant 1.7+ (for builds)


  • JUnit 4.5 (for tests)


  • Cygwin (on Windows)
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Download

• Source: http://pig.apache.org/


• Version:


   • 0.8.1 -- current stable
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                          Copyright for all other & referenced work is retained by their respective owners.




Install & Configure

• Extract: tar zxvf pig-0.8.1.tar.gz


• Move & Create Symbolic Link:


   • ln -s pig-0.8.1 pig


• Edit: bin/pig


   • export PIG_CLASSPATH=$HADOOP_HOME/conf
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                          Copyright for all other & referenced work is retained by their respective owners.




Verify Installation

• Verify:


   (remember to start Hadoop first.)


   • bin/pig -help (command options)


   • bin/pig (run the grunt shell)
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                          Copyright for all other & referenced work is retained by their respective owners.




Running Pig

• Run Mode


   • Local Mode -- single machine


   • MapReduce Mode -- needs a Hadoop cluster (with HDFS)


• Run via:


   • grunt shell


   • pig scripts


   • embedded programs
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                        Copyright for all other & referenced work is retained by their respective owners.




Pig IDE

• PigPen, an eclipse based IDE


  • graphical data flow definition


  • can show example data flow
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                        Copyright for all other & referenced work is retained by their respective owners.




Pig Components

• Pig Latin


• Pig Engine


   • execution engine on top of Hadoop


   • includes default optimal configurations
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




A client for your cluster

• Pig does not run on a Hadoop cluster


• It connects to one
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Pig Latin

• Data flow language (Not declarative like SQL)


• Increases productivity (less lines do more)


• Includes standard operations like join, filter, group, sort


• User code and existing binaries can be included


• Supports nested data types


• Does not require metadata
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                          Copyright for all other & referenced work is retained by their respective owners.




Pig Latin Example

• Will leverage the tutorial that comes with the distribution


• Check the tutorial folder in the distribution
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                            Copyright for all other & referenced work is retained by their respective owners.




Start Grunt Shell

• cd $PIG_HOME


• bin/pig -x local
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Aggregate Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp,
  query);


   • alternate delimiters can be used and de-serializers like PigJsonLoader can
     be leveraged


• grunt> grouped = GROUP log BY user;


• grunt> counted = FOREACH grouped GENERATE group, COUNT(log);


• grunt> STORE counted INTO 'output';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Group Data

• grunt> grouped = GROUP log BY user;


• In Pig group operation generates (key, collection) pair , where the collection
  itself is a collection of tuples.


   • The key of the tuples is the same key as that of the (key, collection) pair
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Filter Data

• grunt> log= LOAD 'tutorial/data/excite-small.log' AS (user, time, query);


• grunt> grouped = GROUP log BY user;


• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;


• grunt> filtered = FILTER counted BY cnt > 75;


• grunt> STORE filtered INTO 'output1';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Order Data

• grunt> log= LOAD 'tutorial/data/excite-small.log' AS (user, time, query);


• grunt> grouped = GROUP log BY user;


• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;


• grunt> filtered = FILTER counted BY cnt > 50;


• grunt> sorted = ORDER filtered BY cnt;


• grunt> STORE sorted INTO 'output2';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                        Copyright for all other & referenced work is retained by their respective owners.




Join Data Example

• Words appearing in Adventures of Huckleberry Finn by Mark Twain


  • http://www.gutenberg.org/ebooks/76


• Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan
  Doyle


  • http://www.gutenberg.org/ebooks/1661
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Loading & Counting Huckleberry Finn Data

• grunt> A = load 'pg76.txt';


• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;


• grunt> C = filter B by word matches 'w+';


• grunt> D = group C by word;


• grunt> E = foreach D generate COUNT(C), group;


• store E into 'huckleberry_finn_freq';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Loading & Counting Sherlock Holmes Data

• grunt> A = load 'pg1661.txt';


• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;


• grunt> C = filter B by word matches 'w+';


• grunt> D = group C by word;


• grunt> E = foreach D generate COUNT(C), group;


• grunt> store E into 'sherlock_holmes_freq';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                        Copyright for all other & referenced work is retained by their respective owners.




Join Data

• grunt> hf= LOAD 'huckleberry_finn_freq' AS (freq, word);


• grunt> sh= LOAD 'sherlock_holmes_freq' AS (freq, word);


• grunt> inboth = JOIN hf BY word, sh BY word;


• grunt> STORE inboth INTO 'output3';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                        Copyright for all other & referenced work is retained by their respective owners.




Set Difference (A - B, in A but not in B)

• hf	= LOAD 'huckleberry_finn_freq' AS (freq, word);


• sh = LOAD 'sherlock_holmes_freq' AS (freq, word);


• grouped = COGROUP hf BY word, sh BY word;


• not_in_hf = FILTER grouped BY COUNT(hf) == 0;


• out = FOREACH not_in_hf GENERATE FLATTEN(sh);


• STORE out INTO 'output4';
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Cogroup Data

• Extends the idea of grouping to multiple collections


• Instead of (key, collection) pair, it now emits a key and a set of tuples from
  each of the multiple collections


   • With two sources of input it would be (key, collection1, collection2), where
     tuples from the first source will be in collection1 and tuples from the
     second source will be in collection2.
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Data types Supported

• int, long, double, chararray, bytearray


• map, tuple (ordered), bag (unordered)
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                          Copyright for all other & referenced work is retained by their respective owners.




Data type Declaration

• hf	= LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);


   • explicit data type declaration


• hf	= LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);


• weighted = FOREACH hf GENERATE freq * 100;


   • type inference, freq cast to int
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                          Copyright for all other & referenced work is retained by their respective owners.




Data type Declaration

• hf	= LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);


   • explicit data type declaration


• hf	= LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);


• weighted = FOREACH hf GENERATE freq * 100;


   • type inference, freq cast to int
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Custom Extensions

• User defined functions can be called from Pig scripts


• Nested operations can be carried out


   • FOREACH grouped { sorted = ORDER hf BY counted;


   • GENERATE group, CustomFunction(sorted); }


• Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com

Weitere ähnliche Inhalte

Was ist angesagt?

Elasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English versionElasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English versionDavid Pilato
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache PigSachin Vakkund
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystemtfmailru
 
Asset Pipeline
Asset PipelineAsset Pipeline
Asset PipelineEric Berry
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayDataWorks Summit
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Gera Shegalov
 
API Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFAPI Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFManish Pandit
 
Apache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersApache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersYash Sharma
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning ElasticsearchAnurag Patel
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Sematext Group, Inc.
 
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet EnterprisePuppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet EnterprisecbowlesUT
 
Battle of the Giants round 2
Battle of the Giants round 2Battle of the Giants round 2
Battle of the Giants round 2Rafał Kuć
 

Was ist angesagt? (18)

Elasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English versionElasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English version
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Beginning hive and_apache_pig
Beginning hive and_apache_pigBeginning hive and_apache_pig
Beginning hive and_apache_pig
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Asset Pipeline
Asset PipelineAsset Pipeline
Asset Pipeline
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
API Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFAPI Design Antipatterns - APICon SF
API Design Antipatterns - APICon SF
 
Apache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersApache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCoders
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
API Design
API DesignAPI Design
API Design
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
 
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet EnterprisePuppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
 
Battle of the Giants round 2
Battle of the Giants round 2Battle of the Giants round 2
Battle of the Giants round 2
 

Andere mochten auch

Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
Benchmark Mail Tutorial
Benchmark Mail TutorialBenchmark Mail Tutorial
Benchmark Mail Tutorialchairmanarnold
 
HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyRohit Dubey
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinesecolorant
 
SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerKorea Sdec
 
Adobe Spark Step by Step Guide
Adobe Spark Step by Step GuideAdobe Spark Step by Step Guide
Adobe Spark Step by Step Guidechairmanarnold
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveKorea Sdec
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on MesosPaco Nathan
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopSomeshwar Kale
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 

Andere mochten auch (20)

Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Benchmark Mail Tutorial
Benchmark Mail TutorialBenchmark Mail Tutorial
Benchmark Mail Tutorial
 
HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubey
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
 
Hive
HiveHive
Hive
 
SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuer
 
Adobe Spark Step by Step Guide
Adobe Spark Step by Step GuideAdobe Spark Step by Step Guide
Adobe Spark Step by Step Guide
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 

Ähnlich wie Master Pig Latin for analyzing big data

Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopKorea Sdec
 
SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopKorea Sdec
 
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutSDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutKorea Sdec
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Puppet
 
M4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptxM4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptxShrinivasa6
 
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
   Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...   Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...Yahoo!デベロッパーネットワーク
 
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...In-Memory Computing Summit
 
Habitat Workshop at Velocity London 2017
Habitat Workshop at Velocity London 2017Habitat Workshop at Velocity London 2017
Habitat Workshop at Velocity London 2017Mandi Walls
 
Giving back with GitHub - Putting the Open Source back in iOS
Giving back with GitHub - Putting the Open Source back in iOSGiving back with GitHub - Putting the Open Source back in iOS
Giving back with GitHub - Putting the Open Source back in iOSMadhava Jay
 
Habitat Overview
Habitat OverviewHabitat Overview
Habitat OverviewMandi Walls
 
13 practical tips for writing secure golang applications
13 practical tips for writing secure golang applications13 practical tips for writing secure golang applications
13 practical tips for writing secure golang applicationsKarthik Gaekwad
 
Introduction to PHP - SDPHP
Introduction to PHP - SDPHPIntroduction to PHP - SDPHP
Introduction to PHP - SDPHPEric Johnson
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pigRavi Mutyala
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 

Ähnlich wie Master Pig Latin for analyzing big data (20)

Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
 
SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing Hadoop
 
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutSDEC2011 Essentials of Mahout
SDEC2011 Essentials of Mahout
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
hotdog a TD tool for DD
hotdog a TD tool for DDhotdog a TD tool for DD
hotdog a TD tool for DD
 
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
 
M4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptxM4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptx
 
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
   Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...   Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
 
HDP2 and YARN operations point
HDP2 and YARN operations pointHDP2 and YARN operations point
HDP2 and YARN operations point
 
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
 
Habitat Workshop at Velocity London 2017
Habitat Workshop at Velocity London 2017Habitat Workshop at Velocity London 2017
Habitat Workshop at Velocity London 2017
 
Giving back with GitHub - Putting the Open Source back in iOS
Giving back with GitHub - Putting the Open Source back in iOSGiving back with GitHub - Putting the Open Source back in iOS
Giving back with GitHub - Putting the Open Source back in iOS
 
Habitat Overview
Habitat OverviewHabitat Overview
Habitat Overview
 
Recon like a pro
Recon like a proRecon like a pro
Recon like a pro
 
ki
kiki
ki
 
Google Hacking Basic
Google Hacking BasicGoogle Hacking Basic
Google Hacking Basic
 
13 practical tips for writing secure golang applications
13 practical tips for writing secure golang applications13 practical tips for writing secure golang applications
13 practical tips for writing secure golang applications
 
Introduction to PHP - SDPHP
Introduction to PHP - SDPHPIntroduction to PHP - SDPHP
Introduction to PHP - SDPHP
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 

Mehr von Korea Sdec

SDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionSDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionKorea Sdec
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingKorea Sdec
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsKorea Sdec
 
SDEC2011 Rapidant
SDEC2011 RapidantSDEC2011 Rapidant
SDEC2011 RapidantKorea Sdec
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 
SDEC2011 Going by TACC
SDEC2011 Going by TACCSDEC2011 Going by TACC
SDEC2011 Going by TACCKorea Sdec
 
SDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesSDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesKorea Sdec
 
SDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedSDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedKorea Sdec
 
SDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudSDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudKorea Sdec
 

Mehr von Korea Sdec (9)

SDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionSDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestion
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modelling
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and models
 
SDEC2011 Rapidant
SDEC2011 RapidantSDEC2011 Rapidant
SDEC2011 Rapidant
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
SDEC2011 Going by TACC
SDEC2011 Going by TACCSDEC2011 Going by TACC
SDEC2011 Going by TACC
 
SDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesSDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & Experiences
 
SDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedSDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speed
 
SDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudSDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloud
 

Kürzlich hochgeladen

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Kürzlich hochgeladen (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Master Pig Latin for analyzing big data

  • 1. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Essentials of Pig Mastering Hadoop Map-reduce for Data Analysis Shashank Tiwari blog: shanky.org | twitter: @tshanky st@treasuryofideas.com
  • 2. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Session Agenda • What is Pig and why should you use it? • Installing & Setting up Pig • Pig’s Components • Using Pig with Hadoop MapReduce • Summary & Conclusion
  • 3. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. What is Pig? • Higher-level abstraction for Hadoop MapReduce • An infrastructure for data analysis using a scripting language • named, Pig Latin
  • 4. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Why should you use Pig? • Hadoop MapReduce: • Requires you to be a programmer • Forces you to design all your algorithms in terms of the map and reduce primitives
  • 5. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Installing & Setting Up Pig -- Required Software • Required Software: • Java 1.6.x • Hadoop 0.20.x • Ant 1.7+ (for builds) • JUnit 4.5 (for tests) • Cygwin (on Windows)
  • 6. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Download • Source: http://pig.apache.org/ • Version: • 0.8.1 -- current stable
  • 7. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Install & Configure • Extract: tar zxvf pig-0.8.1.tar.gz • Move & Create Symbolic Link: • ln -s pig-0.8.1 pig • Edit: bin/pig • export PIG_CLASSPATH=$HADOOP_HOME/conf
  • 8. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Verify Installation • Verify: (remember to start Hadoop first.) • bin/pig -help (command options) • bin/pig (run the grunt shell)
  • 9. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Running Pig • Run Mode • Local Mode -- single machine • MapReduce Mode -- needs a Hadoop cluster (with HDFS) • Run via: • grunt shell • pig scripts • embedded programs
  • 10. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Pig IDE • PigPen, an eclipse based IDE • graphical data flow definition • can show example data flow
  • 11. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Pig Components • Pig Latin • Pig Engine • execution engine on top of Hadoop • includes default optimal configurations
  • 12. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. A client for your cluster • Pig does not run on a Hadoop cluster • It connects to one
  • 13. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Pig Latin • Data flow language (Not declarative like SQL) • Increases productivity (less lines do more) • Includes standard operations like join, filter, group, sort • User code and existing binaries can be included • Supports nested data types • Does not require metadata
  • 14. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Pig Latin Example • Will leverage the tutorial that comes with the distribution • Check the tutorial folder in the distribution
  • 15. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Start Grunt Shell • cd $PIG_HOME • bin/pig -x local
  • 16. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Aggregate Data • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query); • alternate delimiters can be used and de-serializers like PigJsonLoader can be leveraged • grunt> grouped = GROUP log BY user; • grunt> counted = FOREACH grouped GENERATE group, COUNT(log); • grunt> STORE counted INTO 'output';
  • 17. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Group Data • grunt> grouped = GROUP log BY user; • In Pig group operation generates (key, collection) pair , where the collection itself is a collection of tuples. • The key of the tuples is the same key as that of the (key, collection) pair
  • 18. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Filter Data • grunt> log= LOAD 'tutorial/data/excite-small.log' AS (user, time, query); • grunt> grouped = GROUP log BY user; • grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt; • grunt> filtered = FILTER counted BY cnt > 75; • grunt> STORE filtered INTO 'output1';
  • 19. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Order Data • grunt> log= LOAD 'tutorial/data/excite-small.log' AS (user, time, query); • grunt> grouped = GROUP log BY user; • grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt; • grunt> filtered = FILTER counted BY cnt > 50; • grunt> sorted = ORDER filtered BY cnt; • grunt> STORE sorted INTO 'output2';
  • 20. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Join Data Example • Words appearing in Adventures of Huckleberry Finn by Mark Twain • http://www.gutenberg.org/ebooks/76 • Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle • http://www.gutenberg.org/ebooks/1661
  • 21. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Loading & Counting Huckleberry Finn Data • grunt> A = load 'pg76.txt'; • grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word; • grunt> C = filter B by word matches 'w+'; • grunt> D = group C by word; • grunt> E = foreach D generate COUNT(C), group; • store E into 'huckleberry_finn_freq';
  • 22. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Loading & Counting Sherlock Holmes Data • grunt> A = load 'pg1661.txt'; • grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word; • grunt> C = filter B by word matches 'w+'; • grunt> D = group C by word; • grunt> E = foreach D generate COUNT(C), group; • grunt> store E into 'sherlock_holmes_freq';
  • 23. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Join Data • grunt> hf= LOAD 'huckleberry_finn_freq' AS (freq, word); • grunt> sh= LOAD 'sherlock_holmes_freq' AS (freq, word); • grunt> inboth = JOIN hf BY word, sh BY word; • grunt> STORE inboth INTO 'output3';
  • 24. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Set Difference (A - B, in A but not in B) • hf = LOAD 'huckleberry_finn_freq' AS (freq, word); • sh = LOAD 'sherlock_holmes_freq' AS (freq, word); • grouped = COGROUP hf BY word, sh BY word; • not_in_hf = FILTER grouped BY COUNT(hf) == 0; • out = FOREACH not_in_hf GENERATE FLATTEN(sh); • STORE out INTO 'output4';
  • 25. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Cogroup Data • Extends the idea of grouping to multiple collections • Instead of (key, collection) pair, it now emits a key and a set of tuples from each of the multiple collections • With two sources of input it would be (key, collection1, collection2), where tuples from the first source will be in collection1 and tuples from the second source will be in collection2.
  • 26. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Data types Supported • int, long, double, chararray, bytearray • map, tuple (ordered), bag (unordered)
  • 27. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Data type Declaration • hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray); • explicit data type declaration • hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray); • weighted = FOREACH hf GENERATE freq * 100; • type inference, freq cast to int
  • 28. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Data type Declaration • hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray); • explicit data type declaration • hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray); • weighted = FOREACH hf GENERATE freq * 100; • type inference, freq cast to int
  • 29. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Custom Extensions • User defined functions can be called from Pig scripts • Nested operations can be carried out • FOREACH grouped { sorted = ORDER hf BY counted; • GENERATE group, CustomFunction(sorted); } • Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
  • 30. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Questions? • blog: shanky.org | twitter: @tshanky • st@treasuryofideas.com