Pig: Recent Work and Next Steps

Alan F. Gates
@alanfgates
Who Am I?
•  Pig committer and PMC member
•  Original member of the engineering team at Yahoo that took Pig from research to production
•  Author of Programming Pig from O'Reilly
•  HCatalog committer and mentor
•  Co-founder of Hortonworks
•  Tech lead of the team at Hortonworks that works on Pig, Hive, and HCatalog
•  Member of the Apache Software Foundation and the Incubator PMC




   © 2012 Hortonworks
                                                                         Page 2
What is Pig?
•  A data flow language that translates a script into a series of MapReduce jobs and then executes those jobs:

                                            Map Reduce Job:
users   = load 'users';                     Input: ./users
grouped = group users by zipcode;
byzip   = foreach grouped                   Map: project(zipcode, userid)
          generate zipcode, COUNT(users);   Shuffle key: zipcode
store byzip into 'count_by_zip';
                                            Reduce: count
                                            Output: ./count_by_zip

•  Pig lives on the client machine; there is nothing to install on the cluster
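The script-to-MapReduce translation above can be sketched as a toy simulation in Python (the function names are illustrative; this is not Pig's actual runtime):

```python
from collections import defaultdict

def map_phase(records):
    # project(zipcode, userid): emit (shuffle key, value) pairs
    for userid, zipcode in records:
        yield zipcode, userid

def shuffle(pairs):
    # Group values by shuffle key, as the MapReduce framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # COUNT(users) per zipcode
    return {zipcode: len(users) for zipcode, users in groups.items()}

users = [(1, '95014'), (2, '95014'), (3, '10001')]
counts = reduce_phase(shuffle(map_phase(users)))
# counts == {'95014': 2, '10001': 1}
```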
Recent Work




New Features in Pig 0.10
•  Released in April 2012
•  A collaborative release, with major features contributed by Twitter, Yahoo, Hortonworks, and Google Summer of Code students
•  Not all the new features are covered here; see
   http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ for a
   complete list




Ruby UDFs
•  In Pig 0.8 and 0.9, UDFs could be written in Java and Python; Pig 0.10 adds Ruby
•  Evaluated via JRuby

power.pig:
register 'power.rb' using jruby as rf;
data    = load 'input' as (a:int, b:int);
powered = foreach data generate rf.power(a, b);

power.rb:
require 'pigudf'
class Power < PigUdf
  outputSchema "a:int"
  def power(mantissa, exponent)
    return nil if mantissa.nil? or exponent.nil?
    mantissa**exponent
  end
end

•  Can also do Algebraic and Accumulator UDFs in Ruby (like in Java, but unlike in
   Python)
PigStorage With Schemas
•  By default, PigStorage (the default load/store function) does not use a schema
•  In 0.10, it can store a schema if instructed to
•  The schema is stored in a side file, .pig_schema
•  If a schema is available, it is used automatically

A = load 'studenttab10k' as
    (name:chararray, age:int, gpa:double);
store A into 'foo' using PigStorage('\t', '-schema');

A = load 'foo';
B = foreach A generate name, age;




Additional UDF Improvements
•  Automatic generation of simpler UDFs
    –  If you implement an Algebraic UDF, Pig can generate the Accumulator and basic versions
    –  If you implement an Accumulator UDF, Pig can generate the basic version
•  JSON load and store functions
    –  Require a schema that describes the JSON; they do not intuit the schema from the data
    –  Schema is stored in a side file, so there is no need to declare it in the script
•  Built-in UDFs for Bloom filters
    –  BuildBloom builds a Bloom filter over one or more columns of a given input
        –  Can be sized explicitly (number of hash functions and number of bits) or by a desired false positive rate
    –  Bloom takes the file generated by BuildBloom and applies it to an input

define bb BuildBloom('Hash.JENKINS_HASH', '1000', '0.01');
A = load 'users';
B = group A all;
C = foreach B generate bb(A.name);
store C into 'mybloom';

define bloom Bloom('mybloom');
A = load 'transactions';
B = filter A by bloom(name);

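To illustrate what BuildBloom and Bloom are doing, here is a toy Bloom filter in Python (a minimal sketch of the data structure, not Pig's implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array.
    Membership tests may give false positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # k deterministic hash positions for this item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        # BuildBloom side: set the bits for each key
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # Bloom side: a record passes only if all its bits are set
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for name in ['alice', 'bob']:
    bf.add(name)
assert 'alice' in bf and 'bob' in bf
# Keys never added are almost certainly rejected (false positives possible)
```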
Language Improvements
•  Boolean now supported as a first-class data type
   a = load 'foo' as (n:chararray, a:int, g:double, b:boolean);
•  Default split destination: otherwise
   –  Records that match none of the if conditions go to this destination
   –  Records can still go to multiple ifs
   split a into b if id < 3, c if id > 5, d otherwise;
•  Maps, tuples, and bags can now be generated without UDFs:
   B = foreach A generate [key, value], (col1, col2), {col1, col2};
•  Register a collection of jars at once with globs:
   –  Uses HDFS globbing syntax
   register '/home/me/jars/*.jar';
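The split … otherwise semantics can be mimicked in a few lines of Python (a sketch of the routing rules, with a hypothetical record shape):

```python
def split_with_otherwise(records):
    # Mirrors: split a into b if id < 3, c if id > 5, d otherwise;
    b, c, d = [], [], []
    for rec in records:
        matched = False
        if rec['id'] < 3:            # records can match multiple ifs
            b.append(rec); matched = True
        if rec['id'] > 5:
            c.append(rec); matched = True
        if not matched:              # 'otherwise': matched no condition
            d.append(rec)
    return b, c, d

b, c, d = split_with_otherwise([{'id': 1}, {'id': 4}, {'id': 9}])
# b == [{'id': 1}], c == [{'id': 9}], d == [{'id': 4}]
```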




Performance Improvements
•  Hash-based aggregation
    –  Up to 50% faster aggregation for data sets with a small number of distinct keys
    –  The Pig runtime automatically selects the aggregation implementation
•  Push limit to the loader
    –  When a limit can be applied at load time, Pig now stops reading records once the limit is reached
    –  Does not apply after group, join, distinct, or order by
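The idea behind hash-based (map-side) aggregation can be sketched in Python: keep a running hash table of partial counts per key instead of emitting one pair per record, then merge partials in the reduce. With few distinct keys this shrinks shuffle output dramatically. (A toy illustration, not Pig's code.)

```python
from collections import Counter

def map_side_hash_aggregate(records, key_fn):
    # Hash-based partial aggregation: one running count per key
    partial = Counter()
    for rec in records:
        partial[key_fn(rec)] += 1
    return partial  # one (key, count) pair per distinct key, not per record

def reduce_side_merge(partials):
    # Counts combine associatively, so partials can be merged in any order
    total = Counter()
    for p in partials:
        total.update(p)
    return total

m1 = map_side_hash_aggregate(['95014', '95014', '10001'], lambda z: z)
m2 = map_side_hash_aggregate(['95014'], lambda z: z)
merged = reduce_side_merge([m1, m2])
# merged == Counter({'95014': 3, '10001': 1})
```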




Current Work in Pig – Not Yet Released
•  Work on the internal data representation and the map → reduce transfer format to lower memory footprint and improve performance
•  A datetime type has been added
•  CUBE, ROLLUP, and RANK operators in development; patches posted and under review
•  Pig running natively on Windows; patches in the process of being posted




Pig with Hadoop 2.0
•  Pig 0.10 is the first Pig release that works with Hadoop 2.0 (formerly known as Hadoop 0.23)
•  By default, Pig 0.10 works with Hadoop 1.0
•  Must be recompiled to work with Hadoop 2.0
    –  All the pieces are included in the released code; just run ant with the right flags set
•  Does not yet take advantage of new features in Hadoop 2.0




Next Steps




Pig Execution Today
                                            Map Reduce Job:
users   = load 'users';                     Input: ./users
grouped = group users by zipcode;
byzip   = foreach grouped                   Map: project(zipcode, userid)
          generate zipcode, COUNT(users);   Shuffle key: zipcode
store byzip into 'count_by_zip';
                                            Reduce: count
                                            Output: ./count_by_zip

•  All planning is done up front
•  No use is made of any statistics or information that we have
•  Pig (mostly) uses vanilla MapReduce
Re-optimize on the Fly

[Diagram: a DAG of MapReduce jobs, some already executed and some still
only planned. Two executed jobs (outputs: 50G and 1G) feed a planned
join job, which feeds one final planned MR job.]

•  Observe the output size of both jobs; notice that one of them is small enough to fit in memory
•  Can change the join to a fragment-replicate (FR) join, which is map-only, and combine it with the last MR job
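A fragment-replicate join loads the small input into memory on every map task and streams the large input past it, so no shuffle or reduce phase is needed. A toy version in Python (illustrative only; record shapes are assumptions):

```python
def fragment_replicate_join(large, small, key_fn):
    # FR join sketch: the small (1G) side is replicated into an in-memory
    # hash table; the large (50G) side streams through and probes it.
    table = {}
    for rec in small:
        table.setdefault(key_fn(rec), []).append(rec)
    for rec in large:                 # streamed, one record at a time
        for match in table.get(key_fn(rec), []):
            yield rec, match

large = [('alice', 'txn1'), ('bob', 'txn2'), ('eve', 'txn3')]
small = [('alice', '95014'), ('bob', '10001')]
out = list(fragment_replicate_join(large, small, lambda r: r[0]))
# 'eve' has no match on the small side, so out has 2 joined pairs
```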
Modify MapReduce

users   = load 'users';
grouped = group users by zipcode;    Map
byzip   = foreach grouped
          generate zipcode,
          COUNT(users) as cnt;
sorted  = order byzip by cnt;        Reduce
store sorted into 'count_by_zip';
                                     Map
                                     Reduce

The second map is useless: whatever can be done in it can always be
done in the preceding reduce. Having it costs an extra write to and
read from HDFS.
Modify MapReduce

users   = load 'users';
grouped = group users by zipcode;    Map
byzip   = foreach grouped
          generate zipcode,
          COUNT(users) as cnt;
sorted  = order byzip by cnt;        Reduce
store sorted into 'count_by_zip';
                                     Reduce

With the useless map removed, the plan becomes map → reduce → reduce.
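This optimization can be pictured as a pass over the stage list that drops any map stage directly following a reduce, since its work can be folded into that reduce (a toy plan rewrite, not Pig's planner):

```python
def drop_useless_maps(stages):
    # Remove any 'map' stage that directly follows a 'reduce' stage:
    # its work folds into that reduce, saving an HDFS write/read
    # between the two jobs.
    out = []
    for stage in stages:
        if stage == 'map' and out and out[-1] == 'reduce':
            continue  # fold into the preceding reduce
        out.append(stage)
    return out

plan = ['map', 'reduce', 'map', 'reduce']   # group-by, then order-by
# drop_useless_maps(plan) == ['map', 'reduce', 'reduce']
```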
Today

Pig:    Plan → Optimize → Execute
Hive:   Plan → Optimize → Execute
Others: Plan → Optimize → Execute

•  Different in the front end; very similar in the back end
•  With HCatalog, different apps can share metadata
•  No ability to share UDFs, operators, or innovations between projects
Data Virtual Machine

Pig: Plan        Hive: Plan        Others: Plan
        \             |             /
        Data Virtual Machine (Optimize, Execute)
Questions & Answers

TRY: download at hortonworks.com
LEARN: Hortonworks University
FOLLOW: twitter: @hortonworks, Facebook: facebook.com/hortonworks
MORE EVENTS: hortonworks.com/events

August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
DC HUG Hadoop for Windows
DC HUG Hadoop for WindowsDC HUG Hadoop for Windows
DC HUG Hadoop for Windows
 
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...
 
JDD2014: Docker.io - versioned linux containers for JVM devops - Dominik Dorn
JDD2014: Docker.io - versioned linux containers for JVM devops - Dominik DornJDD2014: Docker.io - versioned linux containers for JVM devops - Dominik Dorn
JDD2014: Docker.io - versioned linux containers for JVM devops - Dominik Dorn
 
How bigtop leveraged docker for build automation and one click hadoop provis...
How bigtop leveraged docker for build automation and  one click hadoop provis...How bigtop leveraged docker for build automation and  one click hadoop provis...
How bigtop leveraged docker for build automation and one click hadoop provis...
 

Mehr von Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

Mehr von Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Kürzlich hochgeladen

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 

Kürzlich hochgeladen (20)

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 

Pig Out to Hadoop

  • 1. Pig: Recent Work and Next Steps
     Alan F. Gates (@alanfgates)
  • 2. Who Am I?
     •  Pig committer and PMC member
     •  Original member of the engineering team at Yahoo that took Pig from research to production
     •  Author of Programming Pig from O’Reilly
     •  HCatalog committer and mentor
     •  Co-founder of Hortonworks
     •  Tech lead of the team at Hortonworks that does Pig, Hive, and HCatalog
     •  Member of the Apache Software Foundation and Incubator PMC
  • 3. What is Pig?
  • 4. What is Pig?
     •  A data flow language
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
  • 5. What is Pig?
     •  A data flow language
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
     •  that translates a script into a series of MapReduce jobs and then executes those jobs
  • 6. What is Pig?
     •  A data flow language
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
     •  That translates a script into a series of MapReduce jobs and then executes those jobs
  • 7. What is Pig?
     •  A data flow language
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
     •  That translates a script into a series of MapReduce jobs and then executes those jobs
        MapReduce job:
           Input: ./users
           Map: project(zipcode, userid)
           Shuffle key: zipcode
           Reduce: count
           Output: ./count_by_zip
  • 8. What is Pig?
     •  A data flow language
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
     •  That translates a script into a series of MapReduce jobs and then executes those jobs
        MapReduce job:
           Input: ./users
           Map: project(zipcode, userid)
           Shuffle key: zipcode
           Reduce: count
           Output: ./count_by_zip
     •  Lives on the client machine; nothing to install on the cluster
  • 9. Recent Work
  • 10. New Features in Pig 0.10
     •  Released April 2012
     •  This release was a collaborative effort, with major features added by Twitter, Yahoo, Hortonworks, and Google Summer of Code students
     •  Not all the new features are covered here; see http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ for a complete list
  • 11. Ruby UDFs
     •  In Pig 0.8 and 0.9, UDFs could be written in Python and Java; now Ruby is also supported
     •  Evaluated via JRuby
        power.pig:
           register 'power.rb' using jruby as rf;
           data = load 'input' as (a:int, b:int);
           powered = foreach data generate rf.power(a, b);
        power.rb:
           require 'pigudf'
           class Power < PigUdf
             outputSchema "a:int"
             def power(mantissa, exponent)
               return nil if mantissa.nil? or exponent.nil?
               mantissa**exponent
             end
           end
     •  Can also do Algebraic and Accumulator UDFs in Ruby (like in Java, but unlike in Python)
  • 12. PigStorage With Schemas
     •  By default, PigStorage (the default load/store function) does not use a schema
     •  In 0.10, it can store a schema if instructed to
     •  Schema stored in the side file .pig_schema
     •  If a schema is available it will automatically be used
        A = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
        store A into 'foo' using PigStorage('\t', '-schema');
        A = load 'foo';
        B = foreach A generate name, age;
  • 13. Additional UDF Improvements
     •  Automatic generation of simpler UDFs
        –  If you implement an Algebraic UDF, Pig can generate the Accumulator and basic UDFs
        –  If you implement an Accumulator UDF, Pig can generate a basic UDF
     •  JSON load and store functions
        –  Requires a schema that describes the JSON; does not intuit the schema from data
        –  Schema stored in a side file, so no need to declare it in the script
     •  Built-in UDFs for Bloom filters
        –  BuildBloom builds a Bloom filter over one or more columns of a given input
        –  Can be constructed to a certain size (# of hash functions and # of bits) or based on the desired false-positive rate
        –  Bloom takes the file generated by BuildBloom and applies it to an input
        define bb BuildBloom('Hash.JENKINS_HASH', '1000', '0.01');
        A = load 'users';
        B = group A all;
        C = foreach B generate bb(A.name);
        store C into 'mybloom';
        define bloom Bloom('mybloom');
        A = load 'transactions';
        B = filter A by bloom(name);
  • 14. Language Improvements
     •  Boolean now supported as a first-class data type
        a = load 'foo' as (n:chararray, a:int, g:double, b:boolean);
     •  Default split destination: otherwise
        –  Records that do not match any of the ifs go to this destination
        –  Records can still go to multiple ifs
        split a into b if id < 3, c if id > 5, d otherwise;
     •  Maps, tuples, and bags can now be generated without UDFs:
        B = foreach A generate [key, value], (col1, col2), {col1, col2};
     •  Register a collection of jars at once with globs:
        –  Uses HDFS globbing syntax
        register '/home/me/jars/*.jar';
  • 15. Performance Improvements
     •  Hash-based aggregation
        –  Up to 50% faster aggregation for data sets with a small number of distinct keys
        –  The Pig runtime automatically selects the aggregation implementation
     •  Push limit to loader
        –  When a limit can be applied at load time, Pig now stops reading records after reaching the limit
        –  Does not work after group, join, distinct, or order by
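As a minimal sketch of the limit-pushdown point above (file and field names are illustrative, not from the talk), a limit placed directly after the load is the case the loader optimization targets, while a limit after a group is not:

```pig
-- Pushdown applies: Pig can stop reading input after 100 records
-- instead of scanning the whole file and discarding the rest.
users    = load 'users' as (name:chararray, zipcode:chararray);
first100 = limit users 100;
store first100 into 'sample';

-- No pushdown: the limit follows a group, so the full input must
-- still be read and aggregated before the limit is applied.
grouped = group users by zipcode;
counts  = foreach grouped generate group, COUNT(users);
top10   = limit counts 10;
```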
  • 16. Current Work in Pig – Not Yet Released
     •  Work done on the internal data representation and the map-to-reduce transfer to lower memory footprint and enhance performance
     •  A datetime type has been added
     •  Development of CUBE, ROLLUP, and RANK operators; patches posted and being reviewed
     •  Pig running natively on Windows; patches are in the process of being posted
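The CUBE and RANK operators mentioned above were still under review at the time of this talk; as a hedged sketch, the syntax below is the form they later shipped with (in Pig 0.11), and the relation and field names are illustrative:

```pig
-- Illustrative data: one record per sale.
sales = load 'sales' as (region:chararray, product:chararray, amount:double);

-- CUBE: aggregates over every combination of the listed dimensions
-- (region x product, region only, product only, grand total).
cubed  = cube sales by cube(region, product);
totals = foreach cubed generate flatten(group), SUM(cube.amount);

-- RANK: prepends a rank field to each record by the given ordering.
ranked = rank sales by amount desc;
```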
  • 17. Pig with Hadoop 2.0
     •  Pig 0.10 is the first release of Pig that works with Hadoop 2.0 (fka Hadoop 0.23)
     •  By default Pig 0.10 works with Hadoop 1.0
     •  Must be recompiled to work with Hadoop 2.0
        –  All the pieces are included with the released code; just run ant with the right flags set
     •  Does not yet take advantage of new features in Hadoop 2.0
  • 18. Next Steps
  • 19. Pig Execution Today
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
     MapReduce job:
        Input: ./users
        Map: project(zipcode, userid)
        Shuffle key: zipcode
        Reduce: count
        Output: ./count_by_zip
     •  All planning done up front
     •  No use made of any statistics or information that we have
     •  Pig (mostly) uses vanilla MapReduce
  • 20.–23. Re-optimize on the Fly
     [Diagram, built up over four slides: a plan of MR jobs, with a legend distinguishing planned MR jobs from executed MR jobs as execution progresses through the plan]
  • 24. Re-optimize on the Fly
     [Diagram: two executed MR jobs report output sizes of 50G and 1G]
     •  Observe the output size from both jobs; notice that one of them is small enough to fit in memory
  • 25.–27. Re-optimize on the Fly
     [Diagram advances over three slides: the join is rewritten and merged into the final job]
     •  Observe the output size from both jobs; notice that one of them is small enough to fit in memory
     •  Can change the join to an FR join, thus map only, and combine it with the last MR job
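The FR (fragment-replicate) join the slides rewrite to is also available explicitly in Pig Latin; as a hedged sketch with illustrative relation and field names:

```pig
-- Fragment-replicate join: the second (small) input is loaded into
-- memory on every map task, so the join runs map-only with no
-- shuffle or reduce phase.
big    = load 'transactions' as (userid:chararray, amount:double);
small  = load 'users' as (userid:chararray, zipcode:chararray);
joined = join big by userid, small by userid using 'replicated';
```

The slides' point is that the optimizer could make this choice on the fly, once observed output sizes show an input is small enough, rather than requiring the user to request it up front.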
  • 28. Modify MapReduce
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted  = order byzip by cnt;
        store sorted into 'count_by_zip';
     [Diagram: the script runs as Map → Reduce, Map → Reduce]
  • 29. Modify MapReduce
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted  = order byzip by cnt;
        store sorted into 'count_by_zip';
     •  This map is useless. Whatever can be done in it can always be done in the preceding reduce. Having it costs an extra write to and read from HDFS.
  • 30. Modify MapReduce
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted  = order byzip by cnt;
        store sorted into 'count_by_zip';
     [Diagram: the useless map is removed; the pipeline becomes Map → Reduce → Reduce]
  • 31. Today
     [Diagram: Hive, Pig, and others each with their own Plan → Optimize → Execute stack]
     •  Different in the front end; very similar in the back end
     •  With HCatalog different apps can share metadata
     •  No ability to share UDFs, operators, or innovations between projects
  • 32. Data Virtual Machine
     [Diagram: Pig, Hive, and others each produce a plan; a shared layer optimizes and executes them on a common Data Virtual Machine]
  • 33. Questions & Answers
     •  TRY: download at hortonworks.com
     •  LEARN: Hortonworks University
     •  FOLLOW: twitter @hortonworks, Facebook facebook.com/hortonworks
     •  MORE EVENTS: hortonworks.com/events