Introduction to Apache Hadoop-Pig
Prashant Kommireddi
Hadoop Infrastructure, Salesforce.com
pkommireddi@salesforce.com
Agenda
•   Hadoop Overview
•   Hadoop at Salesforce
•   MapReduce and HDFS
•   What is Pig
•   Introduction to Pig Latin
•   Getting Started with Pig
•   Examples
Hadoop Overview
What Can Hadoop Do For You
• Handle large data volume
   – Run queries spanning days/months
   – GB/TB/PBs
• Structured, semi-structured, and unstructured data
• Computationally intensive
   – Deep analytics
   – Machine learning algorithms
What Hadoop Can NOT Do
• Real-time/near-real-time processing
   – Some lag involved
• Hadoop is batch-oriented (full dataset scans)
   – For real-time queries consider HBase, which is built on top of HDFS
• Example
   – Give me log lines with url containing “login” in the last 30 secs:
     difficult to achieve with Hadoop (MapReduce), which is not really
     suited to it
Why Hadoop?
• Data is growing, so we need to be able to scale out
  computation
• Uses cheap(er) hardware to grow horizontally
• Tolerates a few machines going down
   – Happens all the time
• Store all your data from all systems
   – Don’t throw it away!
Who’s using it…
Hadoop at Salesforce
• Several clusters in production and internal
  environments
• Driving search relevancy and recommendations on
  Salesforce.com/Chatter
• Data ingest from app servers (logs), Oracle and other
  sources
• Several internal use cases – product intelligence,
  security, performance, UX, TechOps….
A few use-cases at Salesforce ….
Product Metrics
Click-through analysis
What is Hadoop?
System for processing large (Giga, Tera, Peta) amounts of data
MapReduce (Computation) + HDFS (Storage)
What is HDFS?
• Hadoop Distributed File System

• Provides common File System functionality such as
  create, delete, write, read, copy, move, list …
pkommireddi@pkommireddi-wsl:~$ hadoop fs -ls /user/pkommireddi
Found 2 items
drwxr-xr-x - pkommireddi supergroup      0 2012-03-27 19:02 /user/pkommireddi/dir1
drwxr-xr-x - pkommireddi supergroup      0 2012-03-28 15:37 /user/pkommireddi/dir2

pkommireddi@pkommireddi-wsl:~$ hadoop fs -mkdir /user/pkommireddi/dir3

pkommireddi@pkommireddi-wsl:~$ hadoop fs -ls /user/pkommireddi
Found 3 items
drwxr-xr-x - pkommireddi supergroup     0 2012-03-29 13:33 /user/pkommireddi/dir1
drwxr-xr-x - pkommireddi supergroup     0 2012-03-27 19:02 /user/pkommireddi/dir2
drwxr-xr-x - pkommireddi supergroup     0 2012-03-28 15:37 /user/pkommireddi/dir3

pkommireddi@pkommireddi-wsl:~$ hadoop fs -rmr dir3
Moved to trash: hdfs://gforce1-nn1-1-sfm.ops.sfdc.net:54310/user/pkommireddi/dir3
How does HDFS work?
A file we want to store on HDFS … (600 MB)
HDFS splits the file into blocks … (256 MB, 256 MB, and 88 MB)
HDFS will create 3 replicas of each block … (3 copies of each block)
HDFS distributes these replicas across the cluster … (Node 1, Node 2, Node 3, Node 4)
If a node goes down, we have copies elsewhere (Node 1, Node 2, Node 3, Node 4)
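The block arithmetic and replica placement from the HDFS slides can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the 600 MB file, 256 MB block size, replication factor of 3, and four nodes come from the slides, while the round-robin placement and all function names are assumptions made for illustration (real HDFS placement is more involved).

```python
# Toy model of HDFS block splitting and replication (not real HDFS code).

def split_into_blocks(file_mb, block_mb=256):
    """Return the block sizes (in MB) for a file of file_mb megabytes."""
    full, rest = divmod(file_mb, block_mb)
    return [block_mb] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch.
    """
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, dead_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != dead_node for node in replicas)
        for replicas in placement.values()
    )

blocks = split_into_blocks(600)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes)

print(blocks)
print(placement)
print(readable_after_failure(placement, "node2"))
```

With three replicas spread over four nodes, losing any single node still leaves at least two copies of every block.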
What is MapReduce?
MapReduce: High-Level Overview
• Consists of two phases: Map and Reduce
   – Between M and R is a stage known as the shuffle and sort!

• Each Map task operates on a certain portion of the
  overall dataset
   – Typically 1 HDFS block of data!
It’s all Keys & Values
• Map: extract data you care about.
   – map(K, V) -> <K', V'>*
   – Note that the original input key (K) and the output key from map (K') could
     be different
• Shuffle: distribute sorted Map output to Reducers
• Reduce: aggregate, summarize, output results
   – reduce(K', List<V'>) -> <K'', V''>*
   – All V' with the same K' are reduced together
   – Again, the input key (K') could be different from the
     reducer output key (K'')
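To make the key/value flow concrete, here is a minimal single-process sketch of the three stages in Python. It is not Hadoop code; the tab-delimited log layout (event type, org id, user id) and the names map_fn, reduce_fn, and run_job are assumptions for illustration.

```python
from collections import defaultdict

def map_fn(_offset, line):
    """map(K, V) -> <K', V'>*: emit ((orgId, userId), 1) for each 'U' log line."""
    fields = line.split("\t")
    if fields[0] == "U":
        yield (fields[1], fields[2]), 1

def reduce_fn(key, values):
    """reduce(K', List<V'>) -> <K'', V''>*: sum the counts for one key."""
    yield key, sum(values)

def run_job(lines):
    # Map phase: each input record yields zero or more (K', V') pairs.
    mapped = [kv for offset, line in enumerate(lines) for kv in map_fn(offset, line)]
    # Shuffle and sort: collect all V' that share a K', then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return [out for key in sorted(groups) for out in reduce_fn(key, groups[key])]

# Hypothetical sample log lines: type, orgId, userId.
logs = ["U\torg1\talice", "U\torg1\talice", "L\torg1\talice", "U\torg2\tbob"]
result = run_job(logs)
print(result)
```

Here the input key is a line offset, the intermediate key is an (orgId, userId) pair, and the output key happens to be the same pair, which shows that K, K', and K'' need not match.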
But, writing MapReduce
 jobs in Java is painful.
    Let’s see why …
Pig Job
• Generate COUNT of 'U' log events for each (OrgId, UserId)

   A = load '/app_logs/2012/01/*/' using PigStorage();

   uLogs = FILTER A BY $0 == 'U';

   uLogFields = FOREACH uLogs GENERATE $1 as orgId,
                                       $2 as userId;

   orgUserGroup = GROUP uLogFields BY (orgId, userId);

   uCount = FOREACH orgUserGroup GENERATE group, COUNT(uLogFields);

   STORE uCount INTO 'output';
Same job in Java MR ..
And …
Let’s talk about Pig!
What is Pig?
• Apache project (began as a sub-project of Apache Hadoop)
• Platform for analyzing large data sets
• Includes a data-flow language Pig Latin
• Built for Hadoop
   – Translates script to MapReduce program under the hood

• Originally developed at Yahoo!
   – Huge contributions from Hortonworks, Twitter
Pig Execution Stages
Pig Script -> Pig Execution Engine (client machine) -> MapReduce -> Hadoop Job (Hadoop cluster)
Why Pig?
• Makes writing Hadoop jobs a lot simpler
   – 5% of the code, 5% of time

   – You don’t have to be a programmer to write Pig scripts

• Provides major functionality required for DW and Analytics
   – Load, Filter, Join, Group By, Order, Transform, UDFs, Store

• Users can write custom UDFs (User Defined Functions)
Hive
•   Hive has the advantage that its syntax is similar to SQL.
•   Requires a schema (of some sort)
     –   Difficult to define a schema for semi-structured data, e.g. app logs

•   Writing data-flow queries gets complex
     –   Sub queries

     –   Temporary tables

•   Integration with Spark
•   Integration with HBase in the works
•   Heavily used at Facebook
•   We at Salesforce adopted Pig more widely
     –   Pig is easier for variable schema
Pig Latin – the dataflow language
•   Pig Latin statements work with relations
     – A relation (analogous to a database table) is a bag
     – A bag is a collection of tuples
     – A tuple (analogous to a database row) is an ordered set of fields
     – A field is a piece of data
•   Example: A = LOAD 'input.dat';
     – Here 'A' is a relation
     – All records in 'A' (from the file 'input.dat') collectively form a bag
     – Each record in 'A' is a tuple
     – A field is a single cell in each tuple


To remember: A Pig relation is a bag of tuples
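One way to picture the relation / bag / tuple / field hierarchy is with plain Python containers. This is only an analogy, not how Pig stores data, and the sample values and the helper name column are made up for illustration.

```python
# A tuple is an ordered sequence of fields; a field is a single value.
alice = ("alice", 30, 3.5)
bob = ("bob", 25, 3.9)

# A bag is a collection of tuples. A list is used here for simplicity,
# although a Pig bag does not promise any particular order.
relation_a = [alice, bob, alice]   # duplicate tuples are allowed in a bag

def column(relation, position):
    """Return the field at `position` from every tuple in the relation."""
    return [row[position] for row in relation]

first_tuple = relation_a[0]        # one record of the relation
first_field = first_tuple[0]       # one cell of that record
names = column(relation_a, 0)
print(first_tuple, first_field, names)
```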
Getting started
•   Download a recent stable release from one of the Apache Download Mirrors
    (see Pig Releases).

•   Unpack the downloaded Pig distribution

•   Add pig-x.y.z/bin to your path.
     –   Use export (bash,sh,ksh) or setenv (tcsh,csh).

     –   For example:
         $ export PATH=/<my-path-to-pig>/pig-x.y.z/bin:$PATH

•   Test the Pig installation with this simple command: $ pig -help
Local mode
•   Everything is installed and runs on your local host and file system
     –   Does not involve a real Hadoop cluster

•   Great for starting off, debugging

•   Specify local mode using the -x flag
     –   $ pig -x local

     –   grunt> a = load 'foo'; -- here the file 'foo' resides on the local filesystem
MapReduce mode
•   Default mode

•   Requires access to a Hadoop cluster and HDFS installation

•   Point Pig to a remote cluster by placing HADOOP_CONF_DIR on
    PIG_CLASSPATH
     –   HADOOP_CONF_DIR is the directory containing your hadoop-site.xml, hdfs-site.xml,
         mapred-site.xml files

     –   Example: $ export PIG_CLASSPATH=<path_to_hadoop_conf_dir>

     –   $ pig

     –   grunt> a = load 'foo';     -- here 'foo' refers to a file on HDFS
Data types
• int, long
• float, double
• chararray – Java String
• bytearray
   – default type of all fields if schema not specified
• Complex data types
   – tuple, e.g. (abc,def)
   – bag, e.g. {(19,2), (18,1)}
   – map, e.g. [sfdc#logs]
Loading data
• LOAD
   – Reads data from the file system
• Syntax
   – LOAD 'input' [USING function] [AS schema];
   – E.g., A = LOAD 'input' USING PigStorage('\t') AS
     (name:chararray, age:int, gpa:float);
Schema
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
   – name, age, gpa default to bytearrays

• A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
   – name is now a String (chararray), age is integer and gpa is float
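The effect of a schema on loaded fields can be imitated in Python. The sketch below is an analogy only: the function name load, the sample rows, and the use of Python's str/int/float as stand-ins for chararray/int/float are assumptions for illustration.

```python
def load(lines, delimiter="\t", schema=None):
    """Yield one tuple per input line.

    Without a schema every field stays as raw bytes (like Pig's bytearray);
    with a schema each field is cast to the declared type.
    """
    for line in lines:
        raw = line.split(delimiter)
        if schema is None:
            yield tuple(field.encode("utf-8") for field in raw)
        else:
            yield tuple(cast(field) for (_name, cast), field in zip(schema, raw))

# Made-up sample rows: name, age, gpa.
data = ["alice\t30\t3.5", "bob\t25\t3.9"]

untyped = list(load(data))
typed = list(load(data, schema=[("name", str), ("age", int), ("gpa", float)]))

print(untyped)
print(typed)
```

Only the typed version lets later steps do arithmetic on age or gpa without an explicit cast.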
Describing Schema
• Describe
     – Provides the schema of a relation
• Syntax
     – DESCRIBE alias;
     – If no schema was provided, describe will say “Schema for alias
       unknown”

 grunt> A = load 'data' as (a:int, b: long, c: float);
 grunt> describe A;
 A: {a: int, b: long, c: float}

 grunt> B = load 'somemoredata';
 grunt> describe B;
 Schema for B unknown.
Dump and Store
• Dump writes the output to the console
   – grunt> A = load 'data';
   – grunt> DUMP A; -- This will print the contents of A on the console
• Store writes output to an HDFS location
   – grunt> A = load 'data';
   – grunt> STORE A INTO '/user/username/output'; -- This will write the contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is
  encountered
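The idea that nothing runs until a DUMP or STORE can be sketched with a small lazy pipeline in Python. This is a toy illustration of deferred execution, not Pig's actual planner; the class Relation, its methods, and the jobs_started counter are all hypothetical.

```python
class Relation:
    """Records a chain of steps; nothing runs until dump() or store()."""

    jobs_started = 0  # how many times a recorded plan was actually executed

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Only extends the plan; no data is touched here.
        return Relation(self.source, self.steps + (("filter", predicate),))

    def foreach(self, fn):
        # Only extends the plan; no data is touched here.
        return Relation(self.source, self.steps + (("foreach", fn),))

    def _execute(self):
        Relation.jobs_started += 1
        rows = list(self.source)
        for kind, fn in self.steps:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            else:
                rows = [fn(row) for row in rows]
        return rows

    def dump(self):
        rows = self._execute()
        for row in rows:
            print(row)
        return rows

    def store(self, sink):
        sink.extend(self._execute())

a = Relation([("alice", 2003), ("bob", 2010)])   # made-up sample rows
b = a.filter(lambda row: row[1] < 2005)
c = b.foreach(lambda row: (row[0],))
started_before = Relation.jobs_started           # still 0: only a plan exists
rows = c.dump()                                  # execution happens here
```

Deferring work this way lets the engine see the whole pipeline before deciding how to run it.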
Referencing Fields
• Fields are referred to by positional notation OR by
  name (alias)
   – Positional notation is generated by the system
   – Starts with $0
   – Names are assigned by you using schemas. E.g., A = load 'data' as
     (name:chararray, age:int);
• With positional notation, fields can be accessed as
   – A = load 'data';
   – B = foreach A generate $0, $1; -- 1st & 2nd columns
Limit
• Limits the number of output tuples
• Syntax
   – alias = LIMIT alias n;
   grunt> A = load 'data';

   grunt> B = LIMIT A 10;

   grunt> DUMP B; --Prints only 10 rows
Foreach.. Generate
• Used for data transformations and projections
• Syntax
   – alias = FOREACH { block | nested_block };
   – nested_block usage later in the deck

   grunt> A = load 'data' as (a1,a2,a3);

   grunt> B = FOREACH A GENERATE *;

   grunt> DUMP B;
   (1,2,3)
   (4,2,1)

   grunt> C = FOREACH A GENERATE a1, a3;

   grunt> DUMP C;
   (1,3)
   (4,1)
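The projection performed by FOREACH .. GENERATE can be mirrored in Python using the same two rows as the grunt session above. The helper name generate and the schema tuple are assumptions for illustration, not part of Pig.

```python
schema = ("a1", "a2", "a3")
a = [(1, 2, 3), (4, 2, 1)]

def generate(relation, schema, *names):
    """Project the named fields, in the order given, from every tuple."""
    positions = [schema.index(name) for name in names]
    return [tuple(row[p] for p in positions) for row in relation]

b = generate(a, schema, *schema)       # like GENERATE *
c = generate(a, schema, "a1", "a3")    # like GENERATE a1, a3

print(b)
print(c)
```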
Filter
• Selects tuples from a relation based on some condition
• Syntax
   – alias = FILTER alias BY expression;
   – Example, to filter for 'marcbenioff'
       • A = LOAD 'sfdcemployees' USING PigStorage(',') as
         (name:chararray,employeesince:int,age:int);
       • B = FILTER A BY name == 'marcbenioff';
   – You can use boolean operators (AND, OR, NOT)
       • B = FILTER A BY (employeesince < 2005) AND
         (NOT(name == 'marcbenioff'));
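The two FILTER expressions above translate almost word for word into Python list comprehensions. The sample rows below are invented purely for illustration (none of the values come from real data), and the Employee type is a hypothetical stand-in for the schema.

```python
from typing import NamedTuple

class Employee(NamedTuple):
    name: str
    employeesince: int
    age: int

# Invented sample rows; the numbers are placeholders, not real data.
a = [
    Employee("marcbenioff", 1999, 0),
    Employee("alice", 2003, 35),
    Employee("bob", 2010, 29),
]

# B = FILTER A BY name == 'marcbenioff';
by_name = [e for e in a if e.name == "marcbenioff"]

# B = FILTER A BY (employeesince < 2005) AND (NOT(name == 'marcbenioff'));
early_others = [e for e in a if e.employeesince < 2005 and not (e.name == "marcbenioff")]

print(by_name)
print(early_others)
```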
Group By
•   Groups data in one or more relations (similar to SQL GROUP BY)
•   Syntax:
     – alias = GROUP alias { ALL | BY expression} [, alias ALL | BY
       expression …] [PARALLEL n];
     – E.g., to group by employee start year at Salesforce
         • A = LOAD 'sfdcemployees' USING PigStorage(',') as (name:chararray,
           employeesince:int, age:int);
         • B = GROUP A BY (employeesince);
     – You can also put all tuples into a single group
         • B = GROUP A ALL;
     – Or group by multiple fields
         • B = GROUP A BY (age, employeesince);
Using Grouped Results
• FOREACH works for grouped data
• Let’s see an example to count the number of rows
  grouped by employee start year
  grunt> A = load 'data' as (name, employeesince, age);
  grunt> B = GROUP A by employeesince;
  grunt> C = FOREACH B GENERATE group, COUNT(A);

• 'group' is an implicit field name given to the group key
• Use the alias that was grouped (here A) inside the aggregation function:
  COUNT(A)
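The shape of a grouped relation, a group key paired with a bag of the original tuples, is easy to see in a Python sketch of the same three statements. The sample rows and the helper name group_by are made up for illustration; sorting the groups is only there to make the printed output repeatable.

```python
from collections import defaultdict

# Made-up sample rows: name, employeesince, age.
a = [("alice", 2003, 35), ("bob", 2010, 29), ("carol", 2003, 41)]

def group_by(relation, key_fn):
    """Return (group key, bag of matching tuples) pairs, sorted by key."""
    groups = defaultdict(list)
    for row in relation:
        groups[key_fn(row)].append(row)
    return sorted(groups.items())

# B = GROUP A by employeesince;
b = group_by(a, lambda row: row[1])

# C = FOREACH B GENERATE group, COUNT(A);
c = [(group, len(bag)) for group, bag in b]

print(b)
print(c)
```

Each element of b carries the full original tuples, which is why the aggregation is applied to the grouped alias rather than to a single field.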
Aggregation
• Pig provides several built-in aggregation functions
   –   AVG
   –   COUNT
   –   COUNT_STAR
   –   SUM
   –   MAX
   –   MIN
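The difference between COUNT and COUNT_STAR only shows up when nulls are present: per the Pig documentation, COUNT skips tuples whose first field is null, while COUNT_STAR counts them. The Python sketch below imitates that behaviour on a made-up bag, with None standing in for null; the function names are illustrative, not Pig's implementation.

```python
# A bag of single-field tuples; None stands in for Pig's null.
ages = [(35,), (None,), (41,)]

def count(bag):
    """Like COUNT: skip tuples whose first field is null."""
    return sum(1 for t in bag if t[0] is not None)

def count_star(bag):
    """Like COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

def non_null(bag):
    """First-field values with nulls dropped, as the other aggregates ignore nulls."""
    return [t[0] for t in bag if t[0] is not None]

values = non_null(ages)
summary = {
    "COUNT": count(ages),
    "COUNT_STAR": count_star(ages),
    "SUM": sum(values),
    "AVG": sum(values) / len(values),
    "MAX": max(values),
    "MIN": min(values),
}
print(summary)
```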
Define
• Assigns an alias to a UDF
• Syntax
   – DEFINE alias {function}
• Use DEFINE to specify a UDF function when:
   – UDF has a long package name
   – UDF constructor takes string parameters.
  grunt> DEFINE LEN org.apache.pig.piggybank.evaluation.string.LENGTH();
  grunt> A = load 'data' as (name:chararray, age:int);
  grunt> B = FOREACH A GENERATE LEN(name) as namelength;
Case Sensitivity
• Names (aliases) of relations and fields are case
  sensitive
   – A = load 'input'; B = foreach a generate $0; -- Won’t work
• UDF names are case sensitive
   – 'LENGTH' is not the same as 'length'
• Pig Latin keywords are case insensitive
   – Load, dump, Group by, foreach..generate, join
And we’re done
• Goal of this presentation was only to get you started
   – There’s a lot more to Hadoop and Pig, and this only serves as a starting point
Good Stuff
• Pig Latin basics - http://pig.apache.org/docs/r0.10.0/basic.html
• Programming Pig - http://ofps.oreilly.com/titles/9781449302641/
• Pig Mailing List - http://pig.apache.org/mailing_lists.html#Users
• How Salesforce.com uses Hadoop -
  http://www.youtube.com/watch?v=BT8WvQMMaV0
• New features in Pig 0.11 -
  http://www.slideshare.net/hortonworks/new-features-in-pig-011
We are hiring 
http://www.salesforce.com/careers/tech/

Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Introduction to Hadoop and Pig

  • 1. Introduction to Apache Hadoop-Pig. Prashant Kommireddi, Hadoop Infrastructure, Salesforce.com, pkommireddi@salesforce.com
  • 2. Agenda • Hadoop Overview • Hadoop at Salesforce • MapReduce and HDFS • What is Pig • Introduction to Pig Latin • Getting Started with Pig • Examples
  • 4. What Can Hadoop Do For You • Handle large data volume – Run queries spanning days/months – GB/TB/PBs • Structured, Semi and Unstructured data • Computationally intensive – Deep analytics – Machine learning algorithms
  • 5. What Hadoop Can NOT Do • Real-time/near-real-time processing – Some lag involved • Hadoop is batch-oriented (full dataset scans) – For real-time queries consider HBase, which is built on top of HDFS • Example – "Give me log lines with URL containing “login” in the last 30 secs": difficult to achieve with Hadoop (MapReduce), which is not really suited to it
  • 7. Why Hadoop? • Data is growing, so we need to be able to scale out computation • Uses cheap(er) hardware to grow horizontally • Tolerates a few machines going down – Happens all the time • Store all your data from all systems – Don't throw it away!
  • 9.
  • 10. Agenda • Hadoop Overview • Hadoop at Salesforce • MapReduce and HDFS • What is Pig • Introduction to Pig Latin • Getting Started with Pig • Examples
  • 11. Hadoop at Salesforce • Several clusters in production and internal environments • Driving search relevancy and recommendations on Salesforce.com/Chatter • Data ingest from app servers (logs), Oracle and other sources • Several internal use cases – product intelligence, security, performance, UX, TechOps….
  • 12. A few use-cases at Salesforce ….
  • 16. System for Processing Large (Giga, Tera, Peta) Amounts of Data
  • 17. MapReduce + HDFS
  • 18. MapReduce (Computation) + HDFS (Storage)
  • 20. What is HDFS? • Hadoop Distributed File System • Provides common File System functionality such as create, delete, write, read, copy, move, list … pkommireddi@pkommireddi-wsl:~$ hadoop fs -ls /user/pkommireddi Found 2 items drwxr-xr-x - pkommireddi supergroup 0 2012-03-27 19:02 /user/pkommireddi/dir1 drwxr-xr-x - pkommireddi supergroup 0 2012-03-28 15:37 /user/pkommireddi/dir2 pkommireddi@pkommireddi-wsl:~$ hadoop fs -mkdir /user/pkommireddi/dir3 pkommireddi@pkommireddi-wsl:~$ hadoop fs -ls /user/pkommireddi Found 3 items drwxr-xr-x - pkommireddi supergroup 0 2012-03-27 19:02 /user/pkommireddi/dir1 drwxr-xr-x - pkommireddi supergroup 0 2012-03-28 15:37 /user/pkommireddi/dir2 drwxr-xr-x - pkommireddi supergroup 0 2012-03-29 13:33 /user/pkommireddi/dir3 pkommireddi@pkommireddi-wsl:~$ hadoop fs -rmr dir3 Moved to trash: hdfs://gforce1-nn1-1-sfm.ops.sfdc.net:54310/user/pkommireddi/dir3
  • 21. How does HDFS work?
  • 22. A file we want to store on HDFS … (600 MB). Sample file contents: "We’re raising the question because no one else wants to, because no one else wants to say what needs to be said. And let’s be real, it’s the two-ton elephant in the room with nearly every other star’s name on the trade rumor radar these days. We’ve read over and over again about Nash refusing to ask for a trade, refusing to play the game that so many others have late in their careers."
  • 23. HDFS splits the file into blocks … • Block 1 (256 MB): "We’re raising the question because no one else wants to, because no one else wants to say what needs to be said." • Block 2 (256 MB): "And let’s be real, it’s the two-ton elephant in the room with nearly every other star’s name on the trade rumor radar these days." • Block 3 (88 MB): "We’ve read over and over again about Nash refusing to play the game that so many others have late in their careers."
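The block arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration of the 600 MB = 256 + 256 + 88 split shown above; the function name `split_into_blocks` is hypothetical, and the 256 MB default simply mirrors the slide (block size is a cluster setting).

```python
def split_into_blocks(file_size_mb, block_size_mb=256):
    """Return the sizes (in MB) of the blocks a file would occupy.

    Every block is full-sized except possibly the last one,
    which holds whatever remains.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


print(split_into_blocks(600))  # [256, 256, 88]
```

Note that the last block only takes up as much space as the data it holds (88 MB here), not a full 256 MB.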
  • 24. HDFS will create 3 replicas of each block … [Figure: each of the three blocks (256 MB, 256 MB, 88 MB) is shown as 3 copies]
  • 25. HDFS distributes these replicas across the cluster … [Figure: the nine block replicas are spread across Node 1, Node 2, Node 3 and Node 4, with no node holding two copies of the same block]
  • 26. If a node goes down, we have copies elsewhere [Figure: the same four-node layout; every block still has replicas on the remaining nodes]
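The replication idea from the last three slides can be sketched as follows. This is a toy model under stated assumptions: replicas are placed round-robin, which is not HDFS's real (rack-aware) placement policy, and the names `place_replicas`, `surviving_replicas` and `node1`..`node4` are made up for illustration.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Assumption: real HDFS placement is rack-aware; this only shows that
    copies of one block land on different machines.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Return the nodes that still hold each block after one node fails."""
    return {
        block: [node for node in holders if node != failed_node]
        for block, holders in placement.items()
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(3, nodes)          # 3 blocks, 3 replicas each
after_failure = surviving_replicas(placement, "node2")

# Every block is still readable from at least one other node.
print(all(after_failure.values()))  # True
```

Because the three copies of a block always sit on three different nodes, losing any single node leaves at least two copies of every block.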
  • 28. MapReduce: High-Level Overview • Consists of two phases: Map and Reduce – Between M and R is a stage known as the shuffle and sort! • Each Map task operates on a certain portion of the overall dataset – Typically 1 HDFS block of data!
  • 29. It's all Keys & Values • Map: extract data you care about. – map(K,V) -> <K',V'>* – Note the original input key (K) and output key from map (K') could be different • Shuffle: distribute sorted Map output to Reducers • Reduce: aggregate, summarize, output results – reduce(K',List<V'>) -> <K'',V''>* – All V' with same K' are reduced together – Again, input key (K') could be different from reducer output key (K'')
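The map, shuffle and reduce signatures above can be illustrated with a small in-memory simulation in Python. It is a sketch of the data flow only, not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the tab-separated log layout (event type, org id, user id) is an assumption made for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(K, V) -> <K', V'>*: emit ((orgId, userId), 1) for each 'U' event.

    K is the line offset and is ignored; K' is a different key, as the
    slide notes it may be.
    """
    fields = value.split("\t")
    if fields[0] == "U":
        yield (fields[1], fields[2]), 1


def reduce_fn(key, values):
    """reduce(K', List<V'>) -> <K'', V''>*: sum the counts for one key."""
    yield key, sum(values)


def run_job(records):
    """Run map, then shuffle (group by K' and sort), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # all V' with the same K' together
    results = []
    for out_key in sorted(groups):              # reducers see keys in sorted order
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


lines = ["U\torg1\talice", "U\torg1\talice", "L\torg1\tbob", "U\torg2\tcarol"]
records = list(enumerate(lines))                # (line number, line) pairs
print(run_job(records))
# [(('org1', 'alice'), 2), (('org2', 'carol'), 1)]
```

On a real cluster the map calls run in parallel, roughly one task per HDFS block, and the shuffle moves data between machines; the grouping logic is the same.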
  • 30. But, writing MapReduce jobs in Java is painful. Let’s see why …
  • 31. Pig Job • Generate COUNT of 'U' log events for each (OrgId, UserId) A = LOAD '/app_logs/2012/01/*/' USING PigStorage(); uLogs = FILTER A BY $0 == 'U'; uLogFields = FOREACH uLogs GENERATE $1 AS orgId, $2 AS userId; orgUserGroup = GROUP uLogFields BY (orgId, userId); uCount = FOREACH orgUserGroup GENERATE group, COUNT(uLogFields); STORE uCount INTO 'output';
  • 32. Same job in Java MR ..
  • 35. Agenda • Hadoop Overview • Hadoop at Salesforce • MapReduce and HDFS • What is Pig • Introduction to Pig Latin • Getting Started with Pig • Examples
  • 36. What is Pig? • Apache project that started as a sub-project of Apache Hadoop • Platform for analyzing large data sets • Includes a data-flow language, Pig Latin • Built for Hadoop – Translates script to MapReduce program under the hood • Originally developed at Yahoo! – Huge contributions from Hortonworks, Twitter
  • 37. Pig Execution Stages [Diagram: on the client machine, a Pig Script is handed to the Pig Execution Engine; on the Hadoop cluster, the resulting MapReduce Job runs on Hadoop]
  • 38. Why Pig? • Makes writing Hadoop jobs a lot simpler – 5% of the code, 5% of time – You don't have to be a programmer to write Pig scripts • Provides major functionality required for DW and Analytics – Load, Filter, Join, Group By, Order, Transform, UDFs, Store • Users can write custom UDFs (User Defined Functions)
  • 39. Hive • Hive has the advantage that its syntax is similar to SQL. • Requires Schema (some sort of) – Difficult to define schema for semi-structured data, e.g. app logs • Writing data-flow queries gets complex – Sub queries – Temporary tables • Integration with Spark • Integration with HBase in the works • Heavily used at Facebook • We at Salesforce adopted Pig more widely – Pig is easier for variable schema
  • 40. Agenda • Hadoop Overview • Hadoop at SFDC • MapReduce and HDFS • What is Pig • Introduction to Pig Latin • Getting Started with Pig • Examples
  • 41. PigLatin – the dataflow language • PigLatin statements work with relations – A relation (analogous to database table) is a bag – A bag is a collection of tuples – A tuple (analogous to database row) is an ordered set of fields – A field is a piece of data • Example, A = LOAD 'input.dat'; – Here 'A' is a relation – All records in 'A' (from the file 'input.dat') collectively form a bag – Each record in 'A' is a tuple – A field is a single cell in each tuple To remember: A Pig relation is a bag of tuples
  • 42. Getting started • Download a recent stable release from one of the Apache Download Mirrors (see Pig Releases). • Unpack the downloaded Pig distribution • Add pig-x.y.z/bin to your path. – Use export (bash,sh,ksh) or setenv (tcsh,csh). – For example: $ export PATH=/<my-path-to-pig>/pig-x.y.z/bin:$PATH • Test the Pig installation with this simple command: $ pig -help
  • 43. Local mode • All files are installed and run using your local host and file system – Does not involve a real Hadoop cluster • Great for starting off, debugging • Specify local mode using the -x flag – $ pig -x local – grunt> a = load 'foo'; -- here the file 'foo' resides on local filesystem
  • 44. Mapreduce mode • Default mode • Requires access to a Hadoop cluster and HDFS installation • Point Pig to remote cluster by placing HADOOP_CONF_DIR on PIG_CLASSPATH – HADOOP_CONF_DIR is the directory containing your hadoop-site.xml, hdfs-site.xml, mapred-site.xml files – Example: $ export PIG_CLASSPATH=<path_to_hadoop_conf_dir> – $ pig – grunt> a = load 'foo'; -- here 'foo' refers to a file on HDFS
  • 45. Data types • int, long • float, double • chararray – Java String • bytearray – default type of all fields if schema not specified • Complex data types – tuple, eg (abc,def) – bag, eg {(19,2), (18,1)} – map, eg [sfdc#logs]
  • 46. Loading data • LOAD – Reads data from the file system • Syntax – LOAD 'input' [USING function] [AS schema]; – Eg, A = LOAD 'input' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
  • 47. Schema • Use schemas to assign types to fields • A = LOAD 'data' AS (name, age, gpa); – name, age, gpa default to bytearrays • A = LOAD 'data' AS (name:chararray, age:int, gpa:float); – name is now a String (chararray), age is integer and gpa is float
  • 48. Describing Schema • Describe – Provides the schema of a relation • Syntax – DESCRIBE [alias]; – If schema is not provided, describe will say “Schema for alias unknown” grunt> A = load 'data' as (a:int, b: long, c: float); grunt> describe A; A: {a: int, b: long, c: float} grunt> B = load 'somemoredata'; grunt> describe B; Schema for B unknown.
  • 49. Dump and Store • Dump writes the output to console – grunt> A = load 'data'; – grunt> DUMP A; -- This will print contents of A on Console • Store writes output to an HDFS location – grunt> A = load 'data'; – grunt> STORE A INTO '/user/username/output'; -- This will write contents of A to HDFS • Pig starts a job only when a DUMP or STORE is encountered
  • 50. Referencing Fields • Fields are referred to by positional notation OR by name (alias) – Positional notation is generated by the system – Starts with $0 – Names are assigned by you using schemas. Eg, A = load 'data' as (name:chararray, age:int); • With positional notation, fields can be accessed as – A = load 'data'; – B = foreach A generate $0, $1; -- 1st and 2nd columns
  • 51. Limit • Limits the number of output tuples • Syntax – alias = LIMIT alias n; grunt> A = load 'data'; grunt> B = LIMIT A 10; grunt> DUMP B; --Prints only 10 rows
  • 52. Foreach.. Generate • Used for data transformations and projections • Syntax – alias = FOREACH { block | nested_block }; – nested_block usage later in the deck grunt> A = load 'data' as (a1,a2,a3); grunt> B = FOREACH A GENERATE *; grunt> DUMP B; (1,2,3) (4,2,1) grunt> C = FOREACH A GENERATE a1, a3; grunt> DUMP C; (1,3) (4,1)
  • 53. Filter • Selects tuples from a relation based on some condition • Syntax – alias = FILTER alias BY expression; – Example, to filter for 'marcbenioff' • A = LOAD 'sfdcemployees' USING PigStorage(',') as (name:chararray, employeesince:int, age:int); • B = FILTER A BY name == 'marcbenioff'; – You can use boolean operators (AND, OR, NOT) • B = FILTER A BY (employeesince < 2005) AND (NOT(name == 'marcbenioff'));
  • 54. Group By • Groups data in one or more relations (similar to SQL GROUP BY) • Syntax: – alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL n]; – Eg, to group by (employee start year at Salesforce) • A = LOAD 'sfdcemployees' USING PigStorage(',') as (name:chararray, employeesince:int, age:int); • B = GROUP A BY (employeesince); – You can also put all tuples into a single group • B = GROUP A ALL; – Or Group by multiple fields • B = GROUP A BY (age, employeesince);
  • 55. Using Grouped Results • FOREACH works for grouped data • Let's see an example to count the number of rows grouped by employee start year grunt> A = load 'data' as (name, employeesince, age); grunt> B = GROUP A by employeesince; grunt> C = FOREACH B GENERATE group, COUNT(A); • 'group' is the implicit field name given to the group key • Use the alias of the relation that was grouped inside the aggregation function, e.g. COUNT(A)
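To make the shape of grouped data concrete, here is a rough Python analogue of the GROUP and FOREACH steps above. It is not Pig code: the helper `group_by` and the sample rows are invented for illustration. Each grouped row is a pair of the group key and a bag (here a list) of the original tuples, which is why the aggregate is applied to the bag named after the original relation.

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Mimic `GROUP relation BY field`: return (group, bag) pairs.

    `group` is the key value; the bag holds every original tuple
    that carries that key.
    """
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return sorted(groups.items())


# Hypothetical rows: (name, employeesince, age)
A = [("ann", 2004, 41), ("bob", 2004, 35), ("cy", 2010, 29)]

B = group_by(A, 1)                             # B = GROUP A BY employeesince;
C = [(group, len(bag)) for group, bag in B]    # C = FOREACH B GENERATE group, COUNT(A);

print(C)  # [(2004, 2), (2010, 1)]
```

One difference worth knowing: Pig's COUNT skips tuples whose first field is null, whereas `len` here counts every row; COUNT_STAR is the Pig function that counts all tuples.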
  • 56. Aggregation • Pig provides a bunch of aggregation functions – AVG – COUNT – COUNT_STAR – SUM – MAX – MIN
  • 57. Define • Assigns an alias to a UDF • Syntax – DEFINE alias {function} • Use DEFINE to specify a UDF function when: – UDF has a long package name – UDF constructor takes string parameters. grunt> DEFINE LEN org.apache.pig.piggybank.evaluation.string.LENGTH(); grunt> A = load 'data' as (name:chararray, age:int); grunt> B = FOREACH A GENERATE LEN(name) as namelength;
  • 58. Case Sensitivity • Names (aliases) of relations and fields are case sensitive – A = load 'input'; B = foreach a generate $0; -- Won't work • UDF names are case sensitive – 'LENGTH' is not the same as 'length' • PigLatin keywords are case insensitive – Load, dump, Group by, foreach..generate, join
  • 59. And we're done • The goal of this presentation was only to get you started – There's a lot more to Hadoop and Pig, and this only serves as a starting ground
  • 60. Good Stuff • Pig Latin basics - http://pig.apache.org/docs/r0.10.0/basic.html • Programming Pig - http://ofps.oreilly.com/titles/9781449302641/ • Pig Mailing List - http://pig.apache.org/mailing_lists.html#Users • How Salesforce.com uses Hadoop - http://www.youtube.com/watch?v=BT8WvQMMaV0 • New features in Pig 0.11 - http://www.slideshare.net/hortonworks/new-features-in-pig-011
  • 61. We are hiring: http://www.salesforce.com/careers/tech/