Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011                        Page 1
What is Apache Pig?
  • Pig Latin, a high-level data processing language.
  • An engine that executes Pig Latin locally or on a Hadoop cluster.




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

                  Architecting the Future of Big Data
                                                                                               Page 2
                  © Hortonworks Inc. 2011
Pig-latin example
• Query: Get the list of web pages visited by users whose
  age is between 20 and 29 years.

USERS = load 'users' as (uid, age);

USERS_20s = filter USERS by age >= 20 and age <= 29;

PVs = load 'pages' as (url, uid, timestamp);

PVs_u20s = join USERS_20s by uid, PVs by uid;
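The same filter-then-join logic can be sketched in plain Python over invented in-memory tuples of (uid, age) and (url, uid, timestamp); the data values below are made up purely for illustration.

```python
# Plain-Python sketch of what the Pig script computes, over invented data.
users = [("u1", 25), ("u2", 43), ("u3", 22)]
pages = [("a.com", "u1", 1), ("b.com", "u3", 2), ("c.com", "u2", 3)]

# filter USERS by age >= 20 and age <= 29
users_20s = [(uid, age) for uid, age in users if 20 <= age <= 29]

# join USERS_20s by uid, PVs by uid (set membership stands in for the join)
uids = {uid for uid, _ in users_20s}
pvs_u20s = [(url, uid, ts) for url, uid, ts in pages if uid in uids]
```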



Why Pig?
• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel

• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Before Pig 0.9
   p1.pig                           p2.pig   p3.pig




With Pig macros
                                  p1.pig           p2.pig   p3.pig

macro1.pig                                                           macro2.pig




With Pig macros
  p1.pig                                   p1.pig   rm_bots.pig




                                                    get_top.pig




Pig macro example
• Page_views data : (url, timestamp, uname, …)
• Find
  1. top 5 users (uname) by page views
  2. top 10 most visited urls




Pig Macro example

Without a macro:

page_views = LOAD ..
/* get top 5 users by page view */
u_grp = GROUP .. by uname;
u_count = FOREACH .. COUNT ..
ord_u_count = ORDER u_count ..
top_5_users = LIMIT ord_u_count .. 5;
DUMP top_5_users;

/* get top 10 urls by page view */
url_grp = GROUP .. by url;
url_count = FOREACH .. COUNT ..
ord_url_count = ORDER url_count ..
top_10_urls = LIMIT ord_url_count .. 10;
DUMP top_10_urls;

With a macro:

/* top x macro */
DEFINE topCount (rel, col, topNum)
RETURNS top_num_recs {
  grped = GROUP $rel by $col;
  cnt_grp = FOREACH .. COUNT($rel) ..
  ord_cnt = ORDER .. by cnt;
  $top_num_recs = LIMIT .. $topNum;
};

page_views = LOAD ..
/* get top 5 users by page view */
top_5_users = topCount(page_views, uname, 5);
…
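The macro's four steps (group by a column, count per group, order by count, limit to the top N) can be mimicked in plain Python; `top_count` below is an illustrative stand-in over invented (url, timestamp, uname) tuples, not Pig's implementation.

```python
# Plain-Python stand-in for the topCount macro's group/count/order/limit steps.
from collections import Counter

def top_count(records, key_index, top_num):
    counts = Counter(rec[key_index] for rec in records)
    return counts.most_common(top_num)  # ordered by count, descending

page_views = [("a.com", 1, "alice"), ("b.com", 2, "bob"),
              ("a.com", 3, "alice"), ("a.com", 4, "carol")]
top_2_users = top_count(page_views, 2, 2)  # top 2 users by page views
top_1_urls = top_count(page_views, 0, 1)   # most visited url
```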



Pig macro
• Coming soon: Piggybank with Pig macros




Writing data flow program
• Writing a complex data pipeline is an iterative process

     Load --> Transform --\
                           +--> Join --> Group --> Transform --> Filter
     Load ----------------/




Writing data flow program


     Load --> Transform --\
                           +--> Join --> Group --> Transform --> Filter --> No output!
     Load ----------------/

Writing data flow program
• Debug!

     Load --> Transform --\
                           +--> Join --> Group --> Transform --> Filter
     Load ----------------/

  – Bug in transform?
  – Was join on wrong attributes?
  – Did filter drop everything?


Common approaches to debug
• Running on real (large) data
   –Inefficient, takes longer
• Running on (small) samples
   –Joins and selective filters often produce empty results




Pig illustrate command
• Objective: show example input/output for each statement that is
  –Realistic
  –Complete
  –Concise
  –Generated fast
• Steps
  –Downstream – sample and process
  –Prune
  –Upstream – generate realistic missing classes of examples
  –Prune


Illustrate command demo




Pig relation-as-scalar
• In Pig, each statement alias is a relation
   –Relation is a set of records
• Task: Get list of pages whose load time was more
  than average.
• Steps
   1. Compute average load time
   2. Get list of pages whose load time is > average




Pig relation-as-scalar
• Step 1 looks like
 .. = load ..
 .. = group ..
 al_rel = foreach .. AVG(ltime) as avg_ltime;


• Step 2 looks like
   page_views = load 'pviews.txt' as
                   (url, ltime, ..);

   slow_views = filter page_views by
               ltime > avg_ltime;




Pig relation-as-scalar
• Getting the result of step 1 (avg_ltime) previously required either
   –Joining the result of step 1 with the page_views relation, or
   –Writing the result into a file, then using a UDF to read from that file
• The Pig scalar feature now simplifies this:
   slow_views = filter page_views by
               ltime > al_rel.avg_ltime;

   –Runtime exception if al_rel has more than one record.
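The pattern can be sketched in memory with plain Python over invented (url, ltime) tuples: compute the average load time once, then use it as a scalar inside the filter, as `al_rel.avg_ltime` is used above.

```python
# In-memory sketch of the relation-as-scalar pattern, over invented data.
page_views = [("a.com", 2.0), ("b.com", 8.0), ("c.com", 5.0)]  # (url, ltime)

# step 1: compute the average load time (the single-record "scalar" relation)
avg_ltime = sum(lt for _, lt in page_views) / len(page_views)

# step 2: filter pages whose load time exceeds the average
slow_views = [(url, lt) for url, lt in page_views if lt > avg_ltime]
```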




UDF in Scripting Language
• Benefits
   –Use legacy code
   –Use libraries in the scripting language
   –Leverage Hadoop for non-Java programmers
• Currently supported languages
   –Python (0.8)
   –JavaScript (0.8)
   –Ruby (0.10)
• Extensible interface
   –Minimum effort to support another language



Writing a Python UDF
Write a Python UDF (util.py):

@outputSchema("word:chararray")
def concat(word):
  return word + word

@outputSchemaFunction("squareSchema")
def square(num):
  if num is None:
      return None
  return num * num

def squareSchema(input):
  return input

Use it from Pig:

register 'util.py' using jython as util;
B = foreach A generate util.square(i);

• Python functions are invoked only when needed
• Type conversion is automatic:
  – Python simple type <-> Pig simple type
  – Python Array <-> Pig Bag
  – Python Dict <-> Pig Map
  – Python Tuple <-> Pig Tuple
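The UDF bodies are plain Python, so they can be sanity-checked locally before registering them with Pig; the `@outputSchema` decorators exist only under Pig's Jython engine, so this standalone sketch omits them.

```python
# Local sanity check of the UDF bodies, without Pig's decorators.
def concat(word):
    return word + word

def square(num):
    if num is None:
        return None
    return num * num

def squareSchema(input):
    # the output schema of square() is the same as its input schema
    return input
```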

Use NLTK in Pig
• Example

register 'nltk_util.py' using jython as nltk;
……
B = foreach A generate nltk.tokenize(sentence);

nltk_util.py:

import nltk
porter = nltk.PorterStemmer()

@outputSchema("words:{(word:chararray)}")
def tokenize(sentence):
  tokens = nltk.word_tokenize(sentence)
  words = [porter.stem(t) for t in tokens]
  return words

"Pig eats everything" --> tokenize --> stem --> (Pig) (eat) (everything)



Comparison with Pig Streaming

                     Pig Streaming                      Scripting UDF

 Syntax              B = stream A through `perl         B = foreach A generate
                     sample.pl`;                        myfunc.concat(a0, a1), a2;

 Input/Output        stdin/stdout,                      function parameters / return value,
                     entire relation                    particular fields

 Type conversion     Need to parse input and            Type conversion is automatic
                     convert types

 Modularity          Every streaming operator           Functions are organized
                     needs a separate script            into modules




Writing a Script Engine
Writing a bridge UDF
class JythonFunction extends EvalFunc<Object> {
   public Object exec(Tuple tuple) {
     // Convert the Pig input into Python objects
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     // Invoke the Python UDF
     PyObject result = f.__call__(params);
     // Convert the result back to Pig
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}




Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig: the script engine compiles myudf.py and registers
each of its functions (square, concat, count, ...) as a bridge UDF:

class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace,
       PigContext pigContext) {
     // square -> JythonFunction("square")
     // concat -> JythonFunction("concat")
     // count  -> JythonFunction("count")
   }
}



Algebraic UDF in JRuby
class SUM < AlgebraicPigUdf
  output_schema Schema.long

  # Initial function: runs per input record on the map side
  def initial num
    num
  end

  # Intermediate function: combines partial results in the combiner
  def intermed num
    num.flatten.inject(:+)
  end

  # Final function: produces the result in the reducer
  def final num
    intermed(num)
  end
end
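The algebraic contract the Ruby SUM implements can be sketched in Python: Pig calls `initial` per record on the map side, `intermed` on bags of partial results in combiners, and `final` in the reducer, so partial sums can be combined in any grouping. The two-way split below is an invented stand-in for combiner boundaries.

```python
# Sketch of the algebraic initial/intermed/final contract for SUM.
def initial(num):
    return num

def intermed(nums):
    return sum(nums)

def final(nums):
    return intermed(nums)

records = [1, 2, 3, 4]
partials = [initial(n) for n in records]                      # map phase
combined = [intermed(partials[:2]), intermed(partials[2:])]   # two combiners
total = final(combined)                                       # reduce phase
```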


Pig Embedding
• Embed Pig inside scripting language
  –Python
  –JavaScript
• Algorithms that cannot be expressed in a single Pig script
  –Iterative algorithms
       – PageRank, Kmeans, Neural Network, Apriori, etc.

  – Parallel independent execution
       – Ensembles

  – Divide and conquer
  – Branching




Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa);
                   store A into 'output';""")

# Bind variables
Q = P.bind({'in': input})

# Launch Pig script
stats = Q.runSingle()

# Iterate over the result
result = stats.result('A')
for t in result.iterator():
   print t

Convergence Example
P = Pig.compile("""DEFINE myudf MyUDF('$param');
                   A = load 'input';
                   B = foreach A generate MyUDF(*);
                   store B into 'output';""")

while True:
  # Bind to new parameter
  Q = P.bind({'param': new_parameter})
  results = Q.runSingle()
  iter = results.result("result").iterator()
  # Convergence check
  if converged:
      break
  # Change parameter
  new_parameter = xxxxxx
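The driver loop above has a simple general shape: run, check convergence, rebind, repeat. The sketch below shows that shape with a toy `run_job` (an invented stand-in for the bind + runSingle round trip) that just halves the parameter until it stabilizes.

```python
# Generic iterate-until-converged driver, with run_job() standing in
# for binding a parameter and running a Pig script.
def run_job(param):
    return param / 2.0  # placeholder for a Pig run producing a new value

param = 100.0
iterations = 0
while True:
    result = run_job(param)          # bind to new parameter and run
    if abs(result - param) < 1e-6:   # convergence check
        break
    param = result                   # change parameter for the next run
    iterations += 1
```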




Pig Embedding
 • Running an embedded Pig script:
    pig sample.py
 • What happens within Pig?
   –sample.py runs as a Python script under Jython
   –each pass through the while loop (Q = P.bind(); results = Q.runSingle(); converged?)
    launches a Pig script, until the loop ends

Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate flatten(C0);
  }


• Prior to Pig 0.10, the supported nested operators were
  –DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  –CROSS, FOREACH
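What the nested LIMIT computes can be sketched in plain Python: group rows by their first field, then emit at most N rows per group, mirroring `C0 = limit A 10; generate flatten(C0);` from the example above (the data below is invented).

```python
# Plain-Python sketch of a nested LIMIT inside foreach: at most n rows per group.
from collections import defaultdict

def limit_per_group(records, n):
    groups = defaultdict(list)
    for rec in records:
        groups[rec[0]].append(rec)   # group A by first field
    # flatten at most n records out of each group
    return [rec for recs in groups.values() for rec in recs[:n]]

data = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
out = limit_per_group(data, 2)
```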



Nested Cross/ForEach
  A = { (i0, a), (i0, b) }            B = { (i0, 0), (i0, 1) }

CoGroup A, B:
  C = { (i0, {a, b}, {0, 1}) }

Cross A, B (inside the foreach):
  { (i0, { (a, 0), (a, 1), (b, 0), (b, 1) }) }

ForEach … CONCAT:
  { (i0, { (a0), (a1), (b0), (b1) }) }

The corresponding script:

  C = CoGroup A, B;
  D = ForEach C {
    X = Cross A, B;
    Y = ForEach X generate CONCAT(f1, f2);
    Generate Y;
  }
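The same walk-through in plain Python: cogroup A and B on their key, cross the two grouped bags, and concatenate the paired fields (string values stand in for Pig fields).

```python
# Plain-Python walk-through of the nested CROSS + FOREACH example.
A = [("i0", "a"), ("i0", "b")]
B = [("i0", "0"), ("i0", "1")]

# C = CoGroup A, B;  -- one row per key, with the bag from each relation
keys = sorted({k for k, _ in A} | {k for k, _ in B})
C = [(k, [v for kk, v in A if kk == k], [v for kk, v in B if kk == k])
     for k in keys]

# D = ForEach C { X = Cross A, B; Y = ForEach X generate CONCAT(f1, f2); }
D = [(k, [x + y for x in xs for y in ys]) for k, xs, ys in C]
```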
HCatalog Integration
• HCatalog: table and metadata management shared across tools

              Pig          Map Reduce          Hive

                           HCatalog

• HCatLoader/HCatStorer
  –Load/store HCatalog tables from Pig
• HCatalog DDL integration (Pig 0.11)
  –sql "create table student(name string, age int, gpa double);"

Misc Loaders
• HBaseStorage
  –Pig builtin
• AvroStorage
  –Piggybank
• CassandraStorage
  –In Cassandra code base
• MongoStorage
  –In Mongo DB code base
• JsonLoader/JsonStorage
  –Pig builtin



Talend
Enterprise Data Integration
• Talend Open Studio for Big Data
   – Feature-rich Job Designer
   – Rich palette of pre-built templates
   – Supports HDFS, Pig, Hive, HBase, HCatalog
   – Apache-licensed, bundled with HDP


• Key benefits
   – Graphical development
   – Robust and scalable execution
   – Broadest connectivity to support
     all systems:
     450+ components
   – Real-time debugging




Questions





Enterprise IIoT Edge Processing with Apache NiFiEnterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFi
 

Último

TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxNeo4j
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4DianaGray10
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FESTBillieHyde
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3DianaGray10
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 

Último (20)

TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 

Pig programming is more fun: New features in Pig

  • 1. Pig programming is more fun: New features in Pig Daniel Dai (@daijy) Thejas Nair (@thejasn) © Hortonworks Inc. 2011 Page 1
  • 2. What is Apache Pig? Pig Latin, a high level data processing language. An engine that executes Pig Latin locally or on a Hadoop cluster. (Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/)
  • 3. Pig-latin example
    Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
      USERS = load 'users' as (uid, age);
      USERS_20s = filter USERS by age >= 20 and age <= 29;
      PVs = load 'pages' as (url, uid, timestamp);
      PVs_u20s = join USERS_20s by uid, PVs by uid;
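One hedged way to finish the query on this slide is to project and store the urls; the disambiguated field name and output path below are assumptions, not part of the slide:

```pig
-- After the join, fields carry their source alias as a prefix.
PAGES_u20s = FOREACH PVs_u20s GENERATE PVs::url;
STORE PAGES_u20s INTO 'pages_visited_by_20s';
```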
  • 4. Why pig?
    Faster development: fewer lines of code; don't re-invent the wheel.
    Flexible: metadata is optional; extensible; procedural programming.
    (Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/)
  • 5. Before pig 0.9: p1.pig, p2.pig, p3.pig, each script standing alone.
  • 6. With pig macros: p1.pig, p2.pig, p3.pig share macro1.pig and macro2.pig.
  • 7. With pig macros: p1.pig reuses rm_bots.pig and get_top.pig.
  • 8. Pig macro example
    Page_views data: (url, timestamp, uname, …)
    Find: 1. top 5 users (uname) by page views; 2. top 10 most visited urls.
  • 9. Pig Macro example
    Without a macro:
      page_views = LOAD ..
      /* get top 5 users by page view */
      u_grp = GROUP .. by uname;
      u_count = FOREACH .. COUNT ..
      ord_u_count = ORDER u_count ..
      top_5_users = LIMIT ordered.. 5;
      DUMP top_5_users;
      /* get top 10 urls by page view */
      url_grp = GROUP .. by url;
      url_count = FOREACH .. COUNT .
      ord_url_count = ORDER url_count..
      top_10_urls = LIMIT ord_url.. 10;
      DUMP top_10_urls;
    With a macro:
      /* top x macro */
      DEFINE topCount (rel, col, topNum) RETURNS top_num_recs {
        grped = GROUP $rel by $col;
        cnt_grp = FOREACH ..COUNT($rel)..
        ord_cnt = ORDER .. by cnt;
        $top_num_recs = LIMIT.. $topNum;
      }
      page_views = LOAD ..
      /* get top 5 users by page view */
      top_5_users = topCount(page_views, uname, 5);
      …
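A fully written-out version of the macro could look like this; the slide elides the bodies with "..", so the field names, input path, and sort direction here are assumptions for illustration:

```pig
/* top-x macro, spelled out; page_views schema and the DESC ordering
   are assumptions, since the slide elides them with ".." */
DEFINE topCount (rel, col, topNum) RETURNS top_num_recs {
    grped         = GROUP $rel BY $col;
    cnt_grp       = FOREACH grped GENERATE group, COUNT($rel) AS cnt;
    ord_cnt       = ORDER cnt_grp BY cnt DESC;
    $top_num_recs = LIMIT ord_cnt $topNum;
};

page_views  = LOAD 'page_views' AS (url, timestamp, uname);
top_5_users = topCount(page_views, uname, 5);
top_10_urls = topCount(page_views, url, 10);
DUMP top_5_users;
DUMP top_10_urls;
```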
  • 10. Pig macro: coming soon, piggybank with pig macros.
  • 11. Writing data flow program: writing a complex data pipeline is an iterative process. (dataflow: Load, Load, Transform, Join, Group, Transform, Filter)
  • 12. Writing data flow program: the same pipeline runs, but no output!
  • 13. Writing data flow program: Debug! Was join on wrong attributes? Bug in transform? Did filter drop everything?
  • 14. Common approaches to debug
    Running on real (large) data: inefficient, takes longer.
    Running on (small) samples: empty results on join, selective filters.
  • 15. Pig illustrate command
    Objective: show examples for i/o of each statement that are realistic, complete, concise, generated fast.
    Steps: downstream, sample and process; prune; upstream, generate realistic missing classes of examples; prune.
  • 16. Illustrate command demo
  • 17. Pig relation-as-scalar
    In pig each statement alias is a relation; a relation is a set of records.
    Task: Get list of pages whose load time was more than average.
    Steps: 1. Compute average load time. 2. Get list of pages whose load time is > average.
  • 18. Pig relation-as-scalar
    Step 1 is like:
      .. = load ..
      .. = group ..
      al_rel = foreach .. AVG(ltime) as avg_ltime;
    Step 2 looks like:
      page_views = load 'pviews.txt' as (url, ltime, ..);
      slow_views = filter page_views by ltime > avg_ltime
  • 19. Pig relation-as-scalar
    Getting the result of step 1 previously meant: join the result of step 1 with the page_views relation, or write the result into a file, then use a udf to read from that file.
    The Pig scalar feature now simplifies this:
      slow_views = filter page_views by ltime > al_rel.avg_ltime
    Runtime exception if al_rel has more than one record.
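Putting slides 17 through 19 together, an end-to-end sketch of the scalar pattern could look like this; the GROUP ALL step and the schema are assumptions filling in what the slides elide with "..":

```pig
-- Sketch of the full scalar pattern (schema and paths assumed).
page_views = LOAD 'pviews.txt' AS (url:chararray, ltime:double);
all_views  = GROUP page_views ALL;   -- one group covering everything
al_rel     = FOREACH all_views GENERATE AVG(page_views.ltime) AS avg_ltime;
-- al_rel has exactly one record, so it can be used as a scalar:
slow_views = FILTER page_views BY ltime > al_rel.avg_ltime;
STORE slow_views INTO 'slow_pages';
```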
  • 20. UDF in Scripting Language
    Benefit: use legacy code; use libraries in a scripting language; leverage Hadoop for the non-Java programmer.
    Currently supported languages: Python (0.8), JavaScript (0.8), Ruby (0.10).
    Extensible interface: minimum effort to support another language.
  • 21. Writing a Python UDF
    Write a Python UDF (util.py):
      @outputSchema("word:chararray")
      def concat(word):
          return word + word
      @outputSchemaFunction("squareSchema")
      def square(num):
          if num == None:
              return None
          return ((num)*(num))
      def squareSchema(input):
          return input
    Invoke Python functions when needed:
      register 'util.py' using jython as util;
      B = foreach A generate util.square(i);
    Type conversion is automatic: Python simple type <-> Pig simple type; Python Array <-> Pig Bag; Python Dict <-> Pig Map; Python Tuple <-> Pig Tuple.
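The UDF bodies above can be exercised outside Pig with a stub standing in for Pig's jython-provided @outputSchema decorator; the stub below is a hypothetical stand-in for local testing, not Pig's implementation:

```python
# Hypothetical stand-in for Pig's @outputSchema decorator, so the
# util.py functions from the slide run in plain Python.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema  # Pig reads this to type the output
        return func
    return wrap

@outputSchema("word:chararray")
def concat(word):
    # chararray in, the word repeated twice out
    return word + word

@outputSchema("num:long")
def square(num):
    # Pig nulls arrive as None; propagate them
    if num is None:
        return None
    return num * num
```

(The slide uses @outputSchemaFunction for square; a fixed schema is assumed here to keep the stub minimal.)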
  • 22. Use NLTK in Pig
    Example:
      register 'nltk_util.py' using jython as nltk;
      ……
      B = foreach A generate nltk.tokenize(sentence)
    nltk_util.py:
      import nltk
      porter = nltk.PorterStemmer()
      @outputSchema("words:{(word:chararray)}")
      def tokenize(sentence):
          tokens = nltk.word_tokenize(sentence)
          words = [porter.stem(t) for t in tokens]
          return words
    "Pig eats everything" -> tokenize -> stemming -> (Pig) (eat) (everything)
  • 23. Comparison with Pig Streaming
    Syntax: Streaming: B = stream A through `perl sample.pl`; Scripting UDF: B = foreach A generate myfunc.concat(a0, a1), a2;
    Input/Output: Streaming: stdin/stdout, entire relation; Scripting UDF: function parameter/return value, particular fields.
    Type conversion: Streaming: need to parse input/convert type; Scripting UDF: type conversion is automatic.
    Modularize: Streaming: every streaming operator needs a separate script; Scripting UDF: organize the functions into a module.
  • 24. Writing a Script Engine
    Writing a bridge UDF:
      class JythonFunction extends EvalFunc<Object> {
          public Object exec(Tuple tuple) {
              // Convert Pig input into Python
              PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
              // Invoke Python UDF
              PyObject result = f.__call__(params);
              // Convert result to Pig
              return JythonUtils.pythonToPig(result);
          }
          public Schema outputSchema(Schema input) {
              PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
              return Utils.getSchemaFromString(outputSchemaDef.toString());
          }
      }
  • 25. Writing a Script Engine
    Register scripting UDF: register 'util.py' using jython as util;
    What happens in Pig: the engine scans the script and wraps each function in a bridge UDF.
      class JythonScriptEngine extends ScriptEngine {
          public void registerFunctions(String path, String namespace, PigContext pigContext) {
              // for myudf.py, each def is mapped to a bridge:
              //   square -> JythonFunction("square")
              //   concat -> JythonFunction("concat")
              //   count  -> JythonFunction("count")
          }
      }
  • 26. Algebraic UDF in JRuby
      class SUM < AlgebraicPigUdf
        output_schema Schema.long
        def initial num        # initial function
          num
        end
        def intermed num       # intermediate function
          num.flatten.inject(:+)
        end
        def final num          # final function
          intermed(num)
        end
      end
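The three combine phases of the algebraic SUM above can be sketched in plain Ruby; AlgebraicPigUdf and Schema are supplied by Pig, so the stand-in class below (SumSketch, a hypothetical name) only mirrors the logic:

```ruby
# Plain-Ruby sketch of the algebraic SUM phases; not Pig's base class.
class SumSketch
  # initial: called once per input value
  def initial(num)
    num
  end

  # intermed: combines a bag of partial results
  def intermed(nums)
    nums.flatten.inject(:+)
  end

  # final: combines the intermediate results the same way
  def final(nums)
    intermed(nums)
  end
end
```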
  • 27. Pig Embedding
    Embed Pig inside a scripting language: Python, JavaScript.
    For algorithms which cannot be expressed in one Pig script:
      Iterative algorithms: PageRank, Kmeans, Neural Network, Apriori, etc.
      Parallel independent execution: Ensemble.
      Divide and Conquer.
      Branching.
  • 28. Pig Embedding
      from org.apache.pig.scripting import Pig
      input = ":INPATH:/singlefile/studenttab10k"
      # Compile Pig script
      P = Pig.compile("""A = load '$in' as (name, age, gpa);
      store A into 'output';""")
      # Bind variables
      Q = P.bind({'in': input})
      # Launch Pig script
      result = Q.runSingle()
      # Iterate result
      result = stats.result('A')
      for t in result.iterator():
          print t
  • 29. Convergence Example
      P = Pig.compile("""DEFINE myudf MyUDF('$param');
      A = load 'input';
      B = foreach A generate MyUDF(*);
      store B into 'output';""")
      while True:
          # Bind to new parameter
          Q = P.bind({'param': new_parameter})
          results = Q.runSingle()
          iter = results.result("result").iterator()
          # Convergence check
          if converged:
              break
          # Change parameter
          new_parameter = xxxxxx
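The bind/run/converge loop can be run locally with hypothetical stand-ins for compile/bind/runSingle; the real classes live in org.apache.pig.scripting and need a Hadoop cluster, so the "job" below just doubles the parameter to make the loop observable:

```python
# Hypothetical stand-ins for Pig's compiled/bound script objects.
class FakeBoundScript:
    def __init__(self, param):
        self.param = param

    def runSingle(self):
        # pretend the Pig job computes double the parameter
        return {"result": self.param * 2}

class FakeCompiledScript:
    def bind(self, params):
        return FakeBoundScript(params["param"])

P = FakeCompiledScript()
new_parameter = 1
while True:
    Q = P.bind({"param": new_parameter})   # bind to new parameter
    result = Q.runSingle()["result"]
    if result >= 8:                        # convergence check
        break
    new_parameter = result                 # change parameter, iterate again
```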
  • 30. Pig Embedding
    Running an embedded Pig script: pig sample.py
    What happens within Pig: Pig hands sample.py to Jython; the Python script runs its while loop, each iteration binding and launching a Pig script, until the convergence check passes and the script ends.
  • 31. Nested Operator
    Nested operator: an operator inside foreach.
      B = group A by name;
      C = foreach B {
          C0 = limit A 10;
          generate flatten(C0);
      }
    Prior to Pig 0.10, supported nested operators: DISTINCT, FILTER, LIMIT, and ORDER BY.
    New operators added in 0.10: CROSS, FOREACH.
  • 32. Nested Cross/ForEach
    A = {(i0, a), (i0, b)}   B = {(i0, 0), (i0, 1)}
    CoGroup A, B:       C = {(i0, {a, b}, {0, 1})}
    Cross A, B:         (i0, {(a,0), (a,1), (b,0), (b,1)})
    ForEach … CONCAT:   (i0, {(a0), (a1), (b0), (b1)})
    The script:
      C = CoGroup A, B;
      D = ForEach C {
          X = Cross A, B;
          Y = ForEach X generate CONCAT(f1, f2);
          Generate Y;
      }
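What the nested CROSS plus FOREACH computes can be simulated in Python: per group key, take the cross product of the two bags, then concatenate each pair (the helper name below is illustrative, not a Pig API):

```python
from itertools import product

A = [("i0", "a"), ("i0", "b")]
B = [("i0", "0"), ("i0", "1")]

def cogroup_cross_concat(left, right):
    """Per key: cross the two bags, then CONCAT each pair."""
    keys = {k for k, _ in left} | {k for k, _ in right}
    out = {}
    for k in keys:
        bag_l = [v for key, v in left if key == k]
        bag_r = [v for key, v in right if key == k]
        out[k] = [x + y for x, y in product(bag_l, bag_r)]
    return out

# cogroup_cross_concat(A, B) -> {"i0": ["a0", "a1", "b0", "b1"]}
```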
  • 33. HCatalog Integration
    HCatalog sits underneath Pig, MapReduce, and Hive.
    HCatLoader/HCatStorer: load/store from HCatalog from Pig.
    HCatalog DDL integration (Pig 0.11):
      sql "create table student(name string, age int, gpa double);"
  • 34. Misc Loaders
    HBaseStorage: Pig builtin.
    AvroStorage: Piggybank.
    CassandraStorage: in Cassandra code base.
    MongoStorage: in Mongo DB code base.
    JsonLoader/JsonStorage: Pig builtin.
  • 35. Talend Enterprise Data Integration
    Talend Open Studio for Big Data: feature-rich Job Designer; rich palette of pre-built templates; supports HDFS, Pig, Hive, HBase, HCatalog; Apache-licensed, bundled with HDP.
    Key benefits: graphical development; robust and scalable execution; broadest connectivity to support all systems (450+ components); real-time debugging.
  • 36. Questions