Introduction to
   Pig UDFs

       Chris Wilkes
cwilkes@seattlehadoop.org
Agenda


  1 What, Why, How

  2 EvalFunc basics

  3 More EvalFunc
  4 LoadFunc

  5 Piggybank
Agenda Point 1: What, Why, How
What is a UDF?



  User Defined Function

  • Way to do an operation on a field or fields
  • Note: not on the group
  • Called from within a pig script
  • b = FOREACH a GENERATE foo(color)
  • Currently all done in java
Why use a UDF?



  • You need to do more than grouping or
   filtering
  • Actually filtering is a UDF
  • Probably using them already
  • Maybe more comfortable in java land
   than in SQL / Pig Latin
How to write and use one?




 • Just extend / implement an
   interface
 • No need for administrator
   rights, just call your script
 • Very simple java, just think
   about your small problem

                                   Magical Powers not required
Moving right along




      Now to the informative part of the talk
Agenda Point 2: EvalFunc basics
EvalFunc : probably what you need to do



  •Easiest to understand: takes one or more
   fields and spits back a generic object
  •Extend the EvalFunc abstract class and it
   practically writes itself
  •Let’s look at the UPPER example from the
   piggybank
The UPPER EvalFunc

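 import java.io.IOException;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.PigWarning;
 import org.apache.pig.data.Tuple;

 public class UPPER extends EvalFunc<String> {

     @Override
     public String exec(Tuple input) throws IOException {
        // Nothing to upper-case: return null so the record is skipped
        if (input == null || input.size() == 0 || input.get(0) == null)
           return null;
        try {
           return ((String) input.get(0)).toUpperCase();
        } catch (ClassCastException e) {
           warn("error msg", PigWarning.UDF_WARNING_1);
        } catch (Exception e) {
           warn("Error", PigWarning.UDF_WARNING_1);
        }
        return null;
     }

 }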



         modified version from the piggybank SVN
The UPPER EvalFunc


     The generic <String> tells Pig what class will be
              returned from this method
The UPPER EvalFunc



 The Tuple input contains the fields passed in from the script
The UPPER EvalFunc





          Check your inputs for empties or nulls
The UPPER EvalFunc




 You have to know that the 1st parameter inside the
                 tuple is a String
The UPPER EvalFunc



  Catch errors that are acceptable and return null so
          the record can be skipped over
The UPPER EvalFunc




 public class UPPER extends EvalFunc<String> {
    public List<FuncSpec> getArgToFuncMapping() {
       List<FuncSpec> funcList = new ArrayList<FuncSpec>();
       funcList.add(new FuncSpec(this.getClass().getName(),
          new Schema(new Schema.FieldSchema(null,
             DataType.CHARARRAY))));
       return funcList;
    }
 }




   Tells Pig what parameters this function takes
Recap of UPPER


 • Generics outline the contract for the return type
 • Schemas are preserved (chararray / String)
 • Check inputs for empty or null
 • Return null if item should be skipped
     • Throw an exception if deadly
 • Name “UPPER” can be used if known to
   PigContext’s packageImportList, otherwise need
   full classname
 • Cast items inside of the Tuple parameter
Another simple EvalFunc: AstroDist




 • Two input files: planet names with coordinates
   and pairs of planets
 • Goal: find the distance between the pairs
 • Loading is slightly different: coords in a tuple
 • Input to EvalFunc is a Tuple that contains a Tuple
AstroDist input files

$ cat src/test/resources/cosmo
aaa bbb
aaa ccc
ddd aaa

$ cat src/test/resources/planets
aaa (1,0,10)
bbb (2,-5,15)
ccc (-7,12,48)
ddd (3,3,8)
AstroDist pig script

 REGISTER target/pig-demo-1.0-SNAPSHOT.jar;

 planets = load '$dir/planets' as (name : chararray,
               l:tuple(x : int, y : int, z : int));
 cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);

 A = JOIN cosmo BY planet1, planets BY name;
 B = JOIN A by planet2, planets BY name;

 locations = FOREACH B GENERATE
   $1 AS p1name:chararray,
   $2 AS p2name : chararray,
   AstroDist($3,$5) as distance;

 dump locations;
AstroDist output


   $ pig -x local -f src/test/resources/distances.pig
    -param dir=src/test/resources/

   What B looks like:
   (ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
   (aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
   (aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))

   Output:
   (aaa,ddd,4.123105625617661)
   (bbb,aaa,7.14142842854285)
   (ccc,aaa,40.64480286580315)
AstroDist program


 public class AstroDist extends EvalFunc<Double> {
   @Override
    public Double exec(Tuple input) throws IOException {
      Point3D astroPos1 = new Point3D((Tuple) input.get(0));
      Point3D astroPos2 = new Point3D((Tuple) input.get(1));
      return astroPos1.distance(astroPos2);
    }
    @Override
    public List<FuncSpec> getArgToFuncMapping() {
      Schema s = new Schema();
      s.add(new Schema.FieldSchema(null, DataType.TUPLE));
      s.add(new Schema.FieldSchema(null, DataType.TUPLE));
      return Arrays.asList(
        new FuncSpec(this.getClass().getName(), s));
    }
 }
AstroDist program (cont)


 private static class Point3D {
   private final int x, y, z;

   private Point3D(Tuple tuple) throws ExecException {
     if (tuple.size() != 3) {
       throw new ExecException("Received " + tuple.size() +
         " points in 3D tuple", ERROR_CODE_BAD_TUPLE, PigException.BUG);
     }
     x = (Integer) tuple.get(0);
     y = (Integer) tuple.get(1);
     z = (Integer) tuple.get(2);
   }

   private double distance(Point3D other) {
     return Math.sqrt(Math.pow(x - other.x, 2) +
         Math.pow(y - other.y, 2) + Math.pow(z - other.z, 2));
   }
 }
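The distance math can be checked without Pig at all. A minimal plain-Java sketch, assuming the coordinates have already been pulled out of the tuples into int arrays (the class name DistanceSketch and the array layout are made up for illustration, they are not part of the demo code):

```java
// Plain-Java sketch of the distance math only; no Pig classes involved.
final class DistanceSketch {

    /** Euclidean distance between two 3D integer points given as {x, y, z}. */
    static double distance(int[] a, int[] b) {
        if (a.length != 3 || b.length != 3) {
            throw new IllegalArgumentException("expected 3 coordinates per point");
        }
        long dx = a[0] - b[0];
        long dy = a[1] - b[1];
        long dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    public static void main(String[] args) {
        int[] aaa = {1, 0, 10};
        int[] ddd = {3, 3, 8};
        // Same pair as the first row of the AstroDist output
        System.out.println(distance(ddd, aaa));
    }
}
```

Running the three planet pairs through this should agree with the distances in the AstroDist output slide, which is a quick sanity check before wiring the logic into an EvalFunc.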
Fun times when running this script

  • Looking through PigContext and Main found
    that /pig.properties in the classpath is parsed for
    the key/value “udf.import.list”
  • Put this into my jar (src/main/resources with
    maven) but it didn’t appear to load
  • Debug log should show what’s going on, except
    debug isn’t turned on till after this load
  • Ended up putting into ~/.pigrc but Pig warns that
    it should go into conf/pig.properties, a file that
    isn’t read
  • Schemas and UDFs are picky, use trial and error
Agenda Point 3: More EvalFunc
Returning a Tuple from a UDF


 • Sometimes you want to return more than one
   thing from a function
 • For example an expensive calculation was done
   and its results can be reused
 • But what should be returned?
   • Of course a Tuple
   • “tuple” is the answer 92% of the time


                                                   http://tuplemusic.org/
                                     Tuple is dedicated to exploring and expanding
                                     the contemporary repertoire for two bassoons
BestBook: returns the highest scored book

   $ cat src/test/resources/bookscores
   book1 aaa 1
   book1 bbb 3
   book1 ccc 12
   book2 aaa 4
   book2 bbb 1
   book3 ccc 1
   book3 bbb 5

   Want output showing that, for book3, reviewer bbb was
   the highest at 5
BestBook EvalFunc

 public class BestBook extends EvalFunc<Tuple> {

   @Override
   public Tuple exec(Tuple p_input) throws IOException {
     Iterator<Tuple> bagReviewers =
          ((DataBag) p_input.get(0)).iterator();
     Iterator<Tuple> bagScores =
          ((DataBag) p_input.get(1)).iterator();
     int bestScore = -1;
     String bestReviewer = null;
     // Walk both bags in step; each bag tuple holds a single field
     while (bagReviewers.hasNext() && bagScores.hasNext()) {
       String reviewerName = (String) bagReviewers.next().get(0);
       Integer score = (Integer) bagScores.next().get(0);
       if (score.intValue() > bestScore) {
         bestScore = score;
         bestReviewer = reviewerName;
       }
     }
     return TupleFactory.getInstance().newTuple(
          Arrays.asList(bestReviewer, (Integer) bestScore));
   }
 }
BestBook EvalFunc

                      The inputs are bag “columns”
BestBook EvalFunc

                 return a Tuple that’s just like the inputs
BestBook EvalFunc




 public class BestBook extends EvalFunc<Tuple> {

   @Override
   public Schema outputSchema(Schema p_input) {
     try {
       return Schema.generateNestedSchema(DataType.TUPLE,
           DataType.CHARARRAY, DataType.INTEGER);
     } catch (FrontendException e) {
       throw new IllegalStateException(e);
     }
   }
 }



                     How to define the outbound
                       schema inside the Tuple
BestBook: returns the highest scored book


   REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;

   A = LOAD '$dir/bookscores' as (name : chararray,
     reviewer : chararray, score : int);

   B = group A by name;
   describe B;
   dump B;

   C = FOREACH B GENERATE group,
       BestBook(A.reviewer, A.score) as reviewandscore;

   describe C;
   dump C;
BestBook: returns the highest scored book



B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
(book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
(book2,{(book2,aaa,4),(book2,bbb,1)})
(book3,{(book3,ccc,1),(book3,bbb,5)})

C: {group: chararray,reviewandscore: (chararray,int)}
(book1,(ccc,12))
(book2,(aaa,4))
(book3,(bbb,5))
BestBook: improve by implementing Algebraic



 •If EvalFunc can be run in stages and summed up consider
 implementing Algebraic
 •Three methods to override:
   •String getInitial();
   •String getIntermed();
   •String getFinal()
 •See COUNT and DoubleAvg
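Why BestBook fits this pattern: "highest score" can be computed on chunks of the bag and the partial winners merged afterwards. The sketch below shows only that staging idea in plain Java. It does not use the Pig Algebraic API (there, the three methods return the class names of the EvalFuncs that run each stage), and the names AlgebraicMaxSketch, Best, initial and merge are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Staged "best score" computation; a stand-in for the Algebraic idea, not Pig code.
final class AlgebraicMaxSketch {

    /** A (reviewer, score) pair; stands in for the Tuple a real UDF would return. */
    static final class Best {
        final String reviewer;
        final int score;

        Best(String reviewer, int score) {
            this.reviewer = reviewer;
            this.score = score;
        }
    }

    /** Initial stage: turn one input row into a partial result. */
    static Best initial(String reviewer, int score) {
        return new Best(reviewer, score);
    }

    /** Intermediate and final stages: merge partial results, keeping the highest score. */
    static Best merge(List<Best> partials) {
        Best best = null;
        for (Best p : partials) {
            if (best == null || p.score > best.score) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // book1 rows, split across two hypothetical chunks
        Best chunk1 = merge(Arrays.asList(initial("aaa", 1), initial("bbb", 3)));
        Best chunk2 = merge(Arrays.asList(initial("ccc", 12)));
        Best result = merge(Arrays.asList(chunk1, chunk2));
        System.out.println(result.reviewer + " " + result.score);
    }
}
```

Because merging partial winners gives the same answer as one pass over all rows, the work can be pushed into the combiner instead of shipping whole bags to the reducer.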
FilterFunc: a filter that’s an EvalFunc




  • For keeping and discarding entries write a filter
  • FilterFunc extends EvalFunc<Boolean>
    • Adds a method “void finish()” for cleanup
  • Example: only want dates that are within 10
   minutes of one another
FilterFunc: DateWithinFilter


 public class DateWithinFilter extends FilterFunc {

   @Override
   public Boolean exec(Tuple input) throws IOException {
     if (input.size() != 3) {
       throw new IOException("error msg");
     }
     Date[] startAndTryDates = getColumnDates(input);
     if (startAndTryDates == null)
       return false;
     long dateDiff = startAndTryDates[1].getTime() -
         startAndTryDates[0].getTime();
     if (dateDiff < 0) {
       return false; // maybe make optional
     }
     // third argument: maximum allowed gap, in milliseconds
     int maxDateDiff = (Integer) input.get(2);
     return dateDiff <= maxDateDiff;
   }
FilterFunc: DateWithinFilter
 // df is a DateFormat field whose declaration is not shown on the slides
 private Date[] getColumnDates(Tuple input) throws ExecException {
   String strDate1 = (String) input.get(0);
   String strDate2 = (String) input.get(1);
   if (strDate1 == null || strDate2 == null) {
     return null;
   }
   Date date1 = null;
   try {
     date1 = df.parse(strDate1);
   } catch (ParseException e) {
     warn("date format err", PigWarning.UDF_WARNING_1);
     return null;
   }
   Date date2 = null;
   try {
     date2 = df.parse(strDate2);
   } catch (ParseException e) {
     warn("date format err", PigWarning.UDF_WARNING_1);
     return null;
   }
   return new Date[] { date1, date2 };
 }
FilterFunc: DateWithinFilter

 @Override
 public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
   List<FuncSpec> funcList = new ArrayList<FuncSpec>();
   Schema s = new Schema();
   s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
   s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
   s.add(new Schema.FieldSchema(null, DataType.INTEGER));
   funcList.add(new FuncSpec(this.getClass().getName(), s));
   return funcList;
 }




            Defining what inputs we accept
        stay tuned for what happens when violated
FilterFunc: DateWithinFilter
$ cat src/test/resources/purchasetimes
1234 2010-06-01 10:31:22 2010-06-01 10:32:22
7121 2010-06-01 10:30:18 2010-06-01 11:02:59
1234 2010-06-01 10:40:18 2010-06-01 10:45:32
7681 lol wut
4532 2010-06-01 11:37:18 2010-06-01 11:42:59

$ cat src/test/resources/purchasetimes.pig
REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
purchasetimes = LOAD '$dir/purchasetimes' AS
  (userid: int, datein: chararray, dateout: chararray);
quickybuyers = FILTER purchasetimes BY
  DateWithinFilter(datein, dateout, 600000);
DUMP quickybuyers;

$ pig -x local -f src/test/resources/purchasetimes.pig
    -param dir=src/test/resources/
(1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
(1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
(4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
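To see why only three rows survive, here is the filter's core check as a plain-Java sketch with no Pig classes. It assumes the dates are in "yyyy-MM-dd HH:mm:ss" format and parsed as UTC (the slides do not show how df is built), and the names DateWithinSketch and within are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Stand-alone version of the "within N milliseconds" check; not the real FilterFunc.
final class DateWithinSketch {

    // Assumed format; the slides do not show how the DateFormat is constructed.
    private static final String FORMAT = "yyyy-MM-dd HH:mm:ss";

    /**
     * True when end is at or after start and at most maxMillis later.
     * Unparseable input gives false, mirroring the "return null, skip the row" path.
     */
    static boolean within(String start, String end, long maxMillis) {
        SimpleDateFormat df = new SimpleDateFormat(FORMAT);
        df.setLenient(false);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            long diff = df.parse(end).getTime() - df.parse(start).getTime();
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long tenMinutes = 600000L;
        System.out.println(within("2010-06-01 10:31:22", "2010-06-01 10:32:22", tenMinutes));
        System.out.println(within("2010-06-01 10:30:18", "2010-06-01 11:02:59", tenMinutes));
        System.out.println(within("lol", "wut", tenMinutes));
    }
}
```

The 7121 row is about 32 minutes apart and the "lol wut" row does not parse, so both are dropped, which matches the DUMP output.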
EvalFunc: not passing in the correct number of args
$ cat src/test/resources/purchasetimes.pig
quickybuyers = FILTER purchasetimes BY
  DateWithinFilter(datein, dateout);

$ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/

2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1045: Could not infer the matching function for
org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit.
Please use an explicit cast.
Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log

The log file has:
       at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
so the error is caught before any data is loaded
Agenda Point 4: LoadFunc
LoadFunc: definition




  • How does something get loaded into Pig?
  • A = load 'B';
  • But what is actually going on?
  • A = load 'B' using PigStorage();
  • PigStorage is a LoadFunc that reads off of disk
   and splits on tab to create a Tuple
LoadFunc: definition


  •  LoadFunc is an interface with a number of
    methods, the most interesting being
    • bindTo(fileName,inputStream,offset,end)
    • Tuple getNext()
  • Extend from UTF8StorageConverter like
    PigStorage to get defaults
  • Overview: PigStorage’s getNext() creates an array
    of objects after splitting on a tab and puts those
    into a Tuple
LoadFunc: make our own



 • Have a lot of log files, some just contain a URL
   • http://example.com?use=mind+bullets&target=yak
 • Want to load URLs and do analysis
 • Write your own LoadFunc to do this that takes a
   URL and returns a Map of the query parameters
 • Know what parameters you care about, only look
   for those
 • Goal:
 • A = LOAD 'urls' USING
   QuerystringLoader('query', 'userid') AS (query:
   chararray, userid : int);
LoadFunc: QuerystringLoader

 • Passing in constructor arguments from the pig
   script is easy:
   • public QuerystringLoader(String... fieldNames)
 • bindTo is almost exactly the same as the
   PigStorage one, using the PigLineRecordReader
   to parse the InputStream
 • Tuple getNext() is where the action happens
   • parse the querystring into a Map
   • loop through the fields given in the constructor
   • return a Tuple of a list of those objects
LoadFunc: QuerystringLoader getNext()

   @Override
   public Tuple getNext() throws IOException {
     if (in == null || in.getPosition() > end) {
       return null;
     }
     Text value = new Text();
     boolean notDone = in.next(value);
     if (!notDone) {
       return null;
     }
     Map<String, Object> parameters = getParameterMap(value.toString());
     List<String> output = new ArrayList<String>();
     for (String fieldName : m_fieldsInOrder) {
       Object object = parameters.get(fieldName);
       if (object == null) {
         output.add(null);
         continue;
       }
       if (object instanceof String) {
         output.add((String) object);
       } else {
         // repeated parameter: keep only the first value
         List<String> objectVal = (List<String>) object;
         output.add(objectVal.get(0));
       }
     }
     return mTupleFactory.newTupleNoCopy(output);
   }
LoadFunc: notes



 • boolean notDone = in.next(value) is how to get the next
   parsed line
 • getParameterMap(url) splits querystring into a
   Map<String,Object>
 • Pig handles type conversion for you, just hand back
   a Tuple.
 • In this case the Tuple can be made up of anything so
   user specifies the schema in the script
 • AS (query:chararray, userid:int)
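The slides do not show getParameterMap itself. A minimal stand-in, assuming a value is stored as a String for a single occurrence and as a List<String> when a key repeats (which is what getNext() expects), could look like the sketch below. The class name QuerystringSketch is made up, and URL decoding is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the loader's getParameterMap helper; not the original code.
final class QuerystringSketch {

    /** Splits the query string of a URL into key -> String, or key -> List<String> for repeats. */
    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string on this line
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip pairs with no key or no '='
            }
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            Object existing = params.get(key);
            if (existing == null) {
                params.put(key, value);
            } else if (existing instanceof String) {
                // second occurrence: promote the single value to a list
                List<String> values = new ArrayList<String>();
                values.add((String) existing);
                values.add(value);
                params.put(key, values);
            } else {
                @SuppressWarnings("unchecked")
                List<String> values = (List<String>) existing;
                values.add(value);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(getParameterMap("http://example.com?use=mind+bullets&target=yak"));
    }
}
```

A line with no "?" yields an empty map, so every requested field comes back null and the row still has the right number of columns.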
RegexLoader


 Same concept: pass in a Pattern for the constructor
 and have getNext() return only the matched parts
    @Override
    public Tuple getNext() throws IOException {
      // value holds the current line; reading it is not shown on the slide
      Matcher m = m_linePattern.matcher(value.toString());
      if (!m.matches()) {
        return EmptyTuple.getInstance();
      }
      List<String> regexMatches = new ArrayList<String>();
      for (int i = 1; i <= m.groupCount(); i++) {
        regexMatches.add(m.group(i));
      }
      return mTupleFactory.newTupleNoCopy(regexMatches);
    }
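The group-collecting loop is easy to try outside Pig. A small plain-Java sketch, where the class name RegexGroupsSketch, the method captureGroups and the example pattern are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone version of the capture-group loop; returns a List instead of a Pig Tuple.
final class RegexGroupsSketch {

    /** Returns the capture groups of a full-line match, or an empty list when the line does not match. */
    static List<String> captureGroups(Pattern linePattern, String line) {
        List<String> groups = new ArrayList<String>();
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return groups;
        }
        // group(0) is the whole match, so start at 1
        for (int i = 1; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\w+) (\\w+) (\\d+)");
        System.out.println(captureGroups(p, "book1 ccc 12"));
    }
}
```

Note that matches() requires the whole line to match the pattern; a loader that should pick fields out of longer lines would need find() or a pattern padded with ".*".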
Agenda Point 5: Piggybank
Piggybank

  • SVN repository of common UDFs
  • Excited about it at first, but it doesn’t appear to be
    used that much
  • There needs to be an easier way of doing this
  • CPAN (Perl) for Pig would be great
      • register pigpan://Math::FFT
  • brings down the jars from a maven-like
    repository and tells pig where to load from
  • any takers? Looking into it
Bonus section: unit testing
 @Test
 public void testRepeatQueryParams() throws IOException {
   // three input lines: a repeated parameter, a line with no URL, a normal URL
   String url = "http://localhost/foo?a=123&a=456\nx=y"
       + "\nhttp://localhost/bar?a=761&b=hi";
   QuerystringLoader loader = new QuerystringLoader("a", "b");
   InputStream in = new ByteArrayInputStream(url.getBytes());
   loader.bindTo(null,
         new BufferedPositionedInputStream(in), 0, url.length());
   Tuple tuple = loader.getNext();
   assertEquals("123", (String) tuple.get(0));
   assertNull(tuple.get(1));
   tuple = loader.getNext();
   assertEquals(2, tuple.size());
   assertNull(tuple.get(0));
   assertNull(tuple.get(1));
   tuple = loader.getNext();
   assertEquals("761", (String) tuple.get(0));
   assertEquals("hi", (String) tuple.get(1));
 }
Resources


 UDF reference:
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

 Code samples:
 http://github.com/seattlehadoop

 Presentation:
 http://www.slideshare.net/seattlehadoop

More Related Content

What's hot

Native interfaces for R
Native interfaces for RNative interfaces for R
Native interfaces for RSeth Falcon
 
If You Think You Can Stay Away from Functional Programming, You Are Wrong
If You Think You Can Stay Away from Functional Programming, You Are WrongIf You Think You Can Stay Away from Functional Programming, You Are Wrong
If You Think You Can Stay Away from Functional Programming, You Are WrongMario Fusco
 
响应式编程及框架
响应式编程及框架响应式编程及框架
响应式编程及框架jeffz
 
The Evolution of Async-Programming on .NET Platform (TUP, Full)
The Evolution of Async-Programming on .NET Platform (TUP, Full)The Evolution of Async-Programming on .NET Platform (TUP, Full)
The Evolution of Async-Programming on .NET Platform (TUP, Full)jeffz
 
Java8 stream
Java8 streamJava8 stream
Java8 streamkoji lin
 
Jscex: Write Sexy JavaScript
Jscex: Write Sexy JavaScriptJscex: Write Sexy JavaScript
Jscex: Write Sexy JavaScriptjeffz
 
Phil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonPhil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonRoss McDonald
 
자바 8 스트림 API
자바 8 스트림 API자바 8 스트림 API
자바 8 스트림 APINAVER Corp
 
The Evolution of Async-Programming (SD 2.0, JavaScript)
The Evolution of Async-Programming (SD 2.0, JavaScript)The Evolution of Async-Programming (SD 2.0, JavaScript)
The Evolution of Async-Programming (SD 2.0, JavaScript)jeffz
 
OOP and FP - Become a Better Programmer
OOP and FP - Become a Better ProgrammerOOP and FP - Become a Better Programmer
OOP and FP - Become a Better ProgrammerMario Fusco
 
Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...Mario Fusco
 
Introduction to functional programming using Ocaml
Introduction to functional programming using OcamlIntroduction to functional programming using Ocaml
Introduction to functional programming using Ocamlpramode_ce
 
«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co
«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co
«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&CoMail.ru Group
 
From object oriented to functional domain modeling
From object oriented to functional domain modelingFrom object oriented to functional domain modeling
From object oriented to functional domain modelingMario Fusco
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance PythonIan Ozsvald
 
Scala introduction
Scala introductionScala introduction
Scala introductionvito jeng
 
Orthogonal Functional Architecture
Orthogonal Functional ArchitectureOrthogonal Functional Architecture
Orthogonal Functional ArchitectureJohn De Goes
 

What's hot (20)

Java 8 Workshop
Java 8 WorkshopJava 8 Workshop
Java 8 Workshop
 
Native interfaces for R
Native interfaces for RNative interfaces for R
Native interfaces for R
 
If You Think You Can Stay Away from Functional Programming, You Are Wrong
If You Think You Can Stay Away from Functional Programming, You Are WrongIf You Think You Can Stay Away from Functional Programming, You Are Wrong
If You Think You Can Stay Away from Functional Programming, You Are Wrong
 
MTL Versus Free
MTL Versus FreeMTL Versus Free
MTL Versus Free
 
响应式编程及框架
响应式编程及框架响应式编程及框架
响应式编程及框架
 
The Evolution of Async-Programming on .NET Platform (TUP, Full)
The Evolution of Async-Programming on .NET Platform (TUP, Full)The Evolution of Async-Programming on .NET Platform (TUP, Full)
The Evolution of Async-Programming on .NET Platform (TUP, Full)
 
Fun with Kotlin
Fun with KotlinFun with Kotlin
Fun with Kotlin
 
Java8 stream
Java8 streamJava8 stream
Java8 stream
 
Jscex: Write Sexy JavaScript
Jscex: Write Sexy JavaScriptJscex: Write Sexy JavaScript
Jscex: Write Sexy JavaScript
 
Phil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonPhil Bartie QGIS PLPython
Phil Bartie QGIS PLPython
 
자바 8 스트림 API
자바 8 스트림 API자바 8 스트림 API
자바 8 스트림 API
 
The Evolution of Async-Programming (SD 2.0, JavaScript)
The Evolution of Async-Programming (SD 2.0, JavaScript)The Evolution of Async-Programming (SD 2.0, JavaScript)
The Evolution of Async-Programming (SD 2.0, JavaScript)
 
OOP and FP - Become a Better Programmer
OOP and FP - Become a Better ProgrammerOOP and FP - Become a Better Programmer
OOP and FP - Become a Better Programmer
 
Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...
 
Introduction to functional programming using Ocaml
Introduction to functional programming using OcamlIntroduction to functional programming using Ocaml
Introduction to functional programming using Ocaml
 
«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co
«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co
«iPython & Jupyter: 4 fun & profit», Лев Тонких, Rambler&Co
 
From object oriented to functional domain modeling
From object oriented to functional domain modelingFrom object oriented to functional domain modeling
From object oriented to functional domain modeling
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance Python
 
Scala introduction
Scala introductionScala introduction
Scala introduction
 
Orthogonal Functional Architecture
Orthogonal Functional ArchitectureOrthogonal Functional Architecture
Orthogonal Functional Architecture
 

Viewers also liked

Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­TimeSeven Nguyen
 
Twitter Storm: Ereignisverarbeitung in Echtzeit
Twitter Storm: Ereignisverarbeitung in EchtzeitTwitter Storm: Ereignisverarbeitung in Echtzeit
Twitter Storm: Ereignisverarbeitung in EchtzeitGuido Schmutz
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Monitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBMonitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBGeoffrey Anderson
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Intro to Pig UDF

  • 1. Introduction to Pig UDFs Chris Wilkes cwilkes@seattlehadoop.org
  • 2. Agenda 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 3. Agenda Point 1 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 4. What is a UDF? User Defined Function • Way to do an operation on a field or fields • Note: not on the group • Called from within a pig script • b = FOREACH a GENERATE foo(color) • Currently all done in java
  • 5. Why use a UDF? • You need to do more than grouping or filtering • Actually filtering is a UDF • Probably using them already • Maybe more comfortable in java land than in SQL / Pig Latin
  • 6. How to write and use? • Just extend / implement an interface • No need for administrator rights, just call your script • Very simple java, just think about your small problem Magical Powers not required
  • 7. Moving right along Now to the informative part of the talk
  • 8. Agenda 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 9. EvalFunc : probably what you need to do • Easiest to understand: takes one or more fields and spits back a generic object • Extend the EvalFunc abstract class and it practically writes itself • Let’s look at the UPPER example from the piggybank
  • 10. The UPPER EvalFunc (modified version from the piggybank SVN)
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.PigWarning;
        import org.apache.pig.data.Tuple;

        public class UPPER extends EvalFunc<String> {
            @Override
            public String exec(Tuple input) throws IOException {
                // Nothing to upper-case: return null so the record is skipped.
                if (input == null || input.size() == 0 || input.get(0) == null)
                    return null;
                try {
                    return ((String) input.get(0)).toUpperCase();
                } catch (ClassCastException e) {
                    warn("error msg", PigWarning.UDF_WARNING_1);
                } catch (Exception e) {
                    warn("Error", PigWarning.UDF_WARNING_1);
                }
                return null;
            }
        }
  • 11. The UPPER EvalFunc (same code as slide 10): the generic <String> tells Pig what class will be returned from this method
  • 12. The UPPER EvalFunc (same code as slide 10): the Tuple input contains the fields passed inside the parentheses of the call in the script
  • 13. The UPPER EvalFunc (same code as slide 10): check your inputs for empties or nulls
  • 14. The UPPER EvalFunc (same code as slide 10): you have to know that the 1st parameter inside the tuple is a String
  • 15. The UPPER EvalFunc (same code as slide 10): catch errors that are acceptable and return null so the record can be skipped over
  • 16. The UPPER EvalFunc
        public class UPPER extends EvalFunc<String> {
            public List<FuncSpec> getArgToFuncMapping() {
                List<FuncSpec> funcList = new ArrayList<FuncSpec>();
                funcList.add(new FuncSpec(this.getClass().getName(),
                    new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
                return funcList;
            }
        }
        Tells Pig what parameters this function takes
  • 17. Recap of UPPER • Generics outlines contract for return type • Schemas are preserved (chararray / String) • Check inputs for empty or null • Return null if item should be skipped • Throw an exception if deadly • Name “UPPER” can be used if known to PigContext’s packageImportList, otherwise need full classname • Cast items inside of the Tuple parameter
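The null-handling pattern from the recap can be tried without Pig on the classpath. The following is only a minimal sketch: `UpperSketch` and its `upper` method are hypothetical names, and a plain `List<Object>` stands in for Pig's `Tuple`. It is not the piggybank implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-alone sketch: a List<Object> plays the role of Pig's Tuple (assumption).
class UpperSketch {
    /** Returns the upper-cased first field, or null when the record should be skipped. */
    static String upper(List<Object> input) {
        // Check inputs for empties or nulls before touching them.
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        Object first = input.get(0);
        if (!(first instanceof String)) {
            // The real UDF would call warn(...) here; the sketch just skips the record.
            return null;
        }
        return ((String) first).toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(upper(Arrays.<Object>asList("red")));
        System.out.println(upper(Arrays.<Object>asList(42)));
        System.out.println(upper(Collections.<Object>emptyList()));
    }
}
```

Checking the type with `instanceof` instead of catching `ClassCastException` is a design choice of this sketch; the slide code uses the try/catch form.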
  • 18. Another simple EvalFunc: AstroDist • Two input files: planet names with coordinates and pairs of planets • Goal: find the distance between the pairs • Loading is slightly different: coords in a tuple • Input to EvalFunc is a Tuple that contains a Tuple
  • 19. AstroDist input files (image from xkcd.com)
        $ cat src/test/resources/cosmo
        aaa bbb
        aaa ccc
        ddd aaa
        $ cat src/test/resources/planets
        aaa (1,0,10)
        bbb (2,-5,15)
        ccc (-7,12,48)
        ddd (3,3,8)
  • 20. AstroDist pig script
        REGISTER target/pig-demo-1.0-SNAPSHOT.jar;
        planets = load '$dir/planets' as (name : chararray, l:tuple(x : int, y : int, z : int));
        cosmo = load '$dir/cosmo' as (planet1 : chararray, planet2 : chararray);
        A = JOIN cosmo BY planet1, planets BY name;
        B = JOIN A by planet2, planets BY name;
        locations = FOREACH B GENERATE $1 AS p1name:chararray, $2 AS p2name : chararray, AstroDist($3,$5) as distance;
        dump locations;
  • 21. AstroDist output
        $ pig -x local -f src/test/resources/distances.pig -param dir=src/test/resources/
        What B looks like:
        (ddd,aaa,ddd,(3,3,8),aaa,(1,0,10))
        (aaa,bbb,aaa,(1,0,10),bbb,(2,-5,15))
        (aaa,ccc,aaa,(1,0,10),ccc,(-7,12,48))
        Output:
        (aaa,ddd,4.123105625617661)
        (bbb,aaa,7.14142842854285)
        (ccc,aaa,40.64480286580315)
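  • 22. AstroDist program
        import java.io.IOException;
        import java.util.Arrays;
        import java.util.List;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.FuncSpec;
        import org.apache.pig.data.DataType;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.impl.logicalLayer.schema.Schema;

        public class AstroDist extends EvalFunc<Double> {
            @Override
            public Double exec(Tuple input) throws IOException {
                // Each argument is itself a tuple of (x, y, z) coordinates.
                Point3D astroPos1 = new Point3D((Tuple) input.get(0));
                Point3D astroPos2 = new Point3D((Tuple) input.get(1));
                return astroPos1.distance(astroPos2);
            }

            @Override
            public List<FuncSpec> getArgToFuncMapping() {
                // Accept exactly two tuple arguments.
                Schema s = new Schema();
                s.add(new Schema.FieldSchema(null, DataType.TUPLE));
                s.add(new Schema.FieldSchema(null, DataType.TUPLE));
                return Arrays.asList(new FuncSpec(this.getClass().getName(), s));
            }
        }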
  • 23. AstroDist program (cont)
        // Nested inside AstroDist. ERROR_CODE_BAD_TUPLE is a constant that is not shown on the slide.
        private static class Point3D {
            private final int x, y, z;

            private Point3D(Tuple tuple) throws ExecException {
                if (tuple.size() != 3) {
                    throw new ExecException("Received " + tuple.size() + " points in 3D tuple",
                        ERROR_CODE_BAD_TUPLE, PigException.BUG);
                }
                x = (Integer) tuple.get(0);
                y = (Integer) tuple.get(1);
                z = (Integer) tuple.get(2);
            }

            // Euclidean distance between two points.
            private double distance(Point3D other) {
                return Math.sqrt(Math.pow(x - other.x, 2) + Math.pow(y - other.y, 2)
                    + Math.pow(z - other.z, 2));
            }
        }
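The distance arithmetic can be checked on its own, outside Pig. This is a sketch under stated assumptions: `Point3DSketch` is a hypothetical stand-alone class that takes three ints directly instead of unpacking a Pig `Tuple`, and the coordinates are the ones from the planets file on slide 19.

```java
// Stand-alone sketch of the Euclidean distance used by AstroDist (no Pig types).
class Point3DSketch {
    final int x, y, z;

    Point3DSketch(int x, int y, int z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    /** Euclidean distance: square root of the summed squared coordinate differences. */
    double distance(Point3DSketch other) {
        double dx = x - other.x;
        double dy = y - other.y;
        double dz = z - other.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    public static void main(String[] args) {
        Point3DSketch aaa = new Point3DSketch(1, 0, 10);
        Point3DSketch bbb = new Point3DSketch(2, -5, 15);
        Point3DSketch ccc = new Point3DSketch(-7, 12, 48);
        Point3DSketch ddd = new Point3DSketch(3, 3, 8);
        // The three pairs joined in the script; compare with the output on slide 21.
        System.out.println(ddd.distance(aaa)); // sqrt(4 + 9 + 4)
        System.out.println(aaa.distance(bbb)); // sqrt(1 + 25 + 25)
        System.out.println(aaa.distance(ccc)); // sqrt(64 + 144 + 1444)
    }
}
```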
  • 24. Fun times when running this script • Looking through PigContext and Main found that /pig.properties in the classpath is parsed for the key/value “udf.import.list” • Put this into my jar (src/main/resources with maven) but it didn’t appear to load • Debug log should show what’s going on, except debug isn’t turned on till after this load • Ended up putting into ~/.pigrc but Pig warns that it should go into conf/pig.properties, a file that isn’t read • Schemas and UDFs are picky, use trial and error
  • 25. Agenda Point 3 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 26. Returning a Tuple from a UDF • Sometimes you want to return more than one thing from a function • For example an expensive calculation was done and its results can be reused • But what should be returned? • Of course a Tuple • “tuple” is the answer 92% of the time http://tuplemusic.org/ Tuple is dedicated to exploring and expanding the contemporary repertoire for two bassoons
  • 27. BestBook: returns the highest scored book
        $ cat src/test/resources/bookscores
        book1 aaa 1
        book1 bbb 3
        book1 ccc 12
        book2 aaa 4
        book2 bbb 1
        book3 ccc 1
        book3 bbb 5
        Want output of that for book3 reviewer bbb was the highest at 5
  • 28. BestBook EvalFunc
        public class BestBook extends EvalFunc<Tuple> {
            @Override
            public Tuple exec(Tuple p_input) throws IOException {
                Iterator<Tuple> bagReviewers = ((DataBag) p_input.get(0)).iterator();
                Iterator<Tuple> bagScores = ((DataBag) p_input.get(1)).iterator();
                int bestScore = -1;
                String bestReviewer = null;
                while (bagReviewers.hasNext() && bagScores.hasNext()) {
                    String reviewerName = (String) bagReviewers.next().get(0);
                    Integer score = (Integer) bagScores.next().get(0);
                    if (score.intValue() > bestScore) {
                        bestScore = score;
                        bestReviewer = reviewerName;
                    }
                }
                return TupleFactory.getInstance().newTuple(
                    Arrays.asList(bestReviewer, (Integer) bestScore));
            }
  • 29. BestBook EvalFunc (same code as slide 28): the inputs are bag “columns”
  • 30. BestBook EvalFunc (same code as slide 28): return a Tuple that’s just like the inputs
  • 31. BestBook EvalFunc
        public class BestBook extends EvalFunc<Tuple> {
            @Override
            public Schema outputSchema(Schema p_input) {
                try {
                    return Schema.generateNestedSchema(DataType.TUPLE,
                        DataType.CHARARRAY, DataType.INTEGER);
                } catch (FrontendException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        How to define the outbound schema inside the Tuple
  • 32. BestBook: returns the highest scored book
        REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
        A = LOAD '$dir/bookscores' as (name : chararray, reviewer : chararray, score : int);
        B = group A by name;
        describe B;
        dump B;
        C = FOREACH B GENERATE group, BestBook(A.reviewer, A.score) as reviewandscore;
        describe C;
        dump C;
  • 33. BestBook: returns the highest scored book
        B: {group: chararray,A: {name: chararray,reviewer: chararray,score: int}}
        (book1,{(book1,aaa,1),(book1,bbb,3),(book1,ccc,12)})
        (book2,{(book2,aaa,4),(book2,bbb,1)})
        (book3,{(book3,ccc,1),(book3,bbb,5)})
        C: {group: chararray,reviewandscore: (chararray,int)}
        (book1,(ccc,12))
        (book2,(aaa,4))
        (book3,(bbb,5))
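The selection loop inside BestBook can be exercised without Pig. The following is a rough sketch, assuming two parallel Java lists in place of the two bag "columns" and a `Map.Entry` in place of the returned Tuple; `BestBookSketch` and `best` are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Stand-alone sketch: parallel lists stand in for the reviewer and score bags.
class BestBookSketch {
    /** Walks both lists in step and keeps the reviewer with the highest score. */
    static Map.Entry<String, Integer> best(List<String> reviewers, List<Integer> scores) {
        Iterator<String> r = reviewers.iterator();
        Iterator<Integer> s = scores.iterator();
        int bestScore = -1;          // same sentinel as the slide code
        String bestReviewer = null;
        while (r.hasNext() && s.hasNext()) {
            String name = r.next();
            int score = s.next();
            if (score > bestScore) {
                bestScore = score;
                bestReviewer = name;
            }
        }
        return new SimpleEntry<String, Integer>(bestReviewer, bestScore);
    }

    public static void main(String[] args) {
        // The book1 rows from the bookscores file.
        System.out.println(best(Arrays.asList("aaa", "bbb", "ccc"), Arrays.asList(1, 3, 12)));
    }
}
```

Like the slide code, this relies on the two inputs being iterated in the same order, and on scores never being below zero because of the -1 sentinel.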
  • 34. BestBook: improve by implementing Algebraic •If EvalFunc can be run in stages and summed up consider implementing Algebraic •Three methods to override: •String getInitial(); •String getIntermed(); •String getFinal() •See COUNT and DoubleAvg
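To illustrate what "run in stages" means, here is a plain-Java sketch of a maximum computed in three steps. It does not use Pig's Algebraic interface (there the three getters return the names of EvalFunc classes); `AlgebraicMaxSketch` and its three static methods are hypothetical stand-ins, and the split of scores across two chunks is made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-alone sketch of a staged computation; no Pig classes involved.
class AlgebraicMaxSketch {
    /** Initial stage: turn one raw score into a partial result. */
    static int initial(int score) {
        return score;
    }

    /** Intermediate stage: combine partial results from one chunk of the data. */
    static int intermediate(List<Integer> partials) {
        return Collections.max(partials); // throws NoSuchElementException on an empty list
    }

    /** Final stage: combine the per-chunk results into the answer. */
    static int finalStage(List<Integer> partials) {
        return Collections.max(partials);
    }

    public static void main(String[] args) {
        // book1 scores split across two hypothetical chunks.
        int chunkA = intermediate(Arrays.asList(initial(1), initial(3)));
        int chunkB = intermediate(Arrays.asList(initial(12)));
        System.out.println(finalStage(Arrays.asList(chunkA, chunkB)));
    }
}
```

The point is that the maximum of the per-chunk maxima equals the maximum over all scores, which is the property that lets work be done before the final reduce step.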
  • 35. FilterFunc: a filter that’s an EvalFunc • For keeping and discarding entries write a filter • FilterFunc extends EvalFunc<Boolean> • Adds a method “void finish()” for cleanup • Example: keep only dates that are within 10 minutes of one another
  • 36. FilterFunc: DateWithinFilter
        public class DateWithinFilter extends FilterFunc {
            @Override
            public Boolean exec(Tuple input) throws IOException {
                if (input.size() != 3) {
                    throw new IOException("error msg");
                }
                Date[] startAndTryDates = getColumnDates(input);
                if (startAndTryDates == null)
                    return false;
                long dateDiff = startAndTryDates[1].getTime() - startAndTryDates[0].getTime();
                if (dateDiff < 0) {
                    return false; // maybe make optional
                }
                int maxDateDiff = (Integer) input.get(2);
                return dateDiff <= maxDateDiff;
            }
  • 37. FilterFunc: DateWithinFilter
        // df is a date format field of the class that is not shown on the slide.
        private Date[] getColumnDates(Tuple input) throws ExecException {
            String strDate1 = (String) input.get(0);
            String strDate2 = (String) input.get(1);
            if (strDate1 == null || strDate2 == null) {
                return null;
            }
            Date date1 = null;
            try {
                date1 = df.parse(strDate1);
            } catch (ParseException e) {
                warn("date format err", PigWarning.UDF_WARNING_1);
                return null;
            }
            Date date2 = null;
            try {
                date2 = df.parse(strDate2);
            } catch (ParseException e) {
                warn("date format err", PigWarning.UDF_WARNING_1);
                return null;
            }
            return new Date[] { date1, date2 };
        }
  • 38. FilterFunc: DateWithinFilter
        @Override
        public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
          List<FuncSpec> funcList = new ArrayList<FuncSpec>();
          Schema s = new Schema();
          s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
          s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
          s.add(new Schema.FieldSchema(null, DataType.INTEGER));
          funcList.add(new FuncSpec(this.getClass().getName(), s));
          return funcList;
        }
      This defines which inputs we accept. Stay tuned for what happens when that is violated.
  • 39. FilterFunc: DateWithinFilter
      $ cat src/test/resources/purchasetimes
      1234 2010-06-01 10:31:22 2010-06-01 10:32:22
      7121 2010-06-01 10:30:18 2010-06-01 11:02:59
      1234 2010-06-01 10:40:18 2010-06-01 10:45:32
      7681 lol wut
      4532 2010-06-01 11:37:18 2010-06-01 11:42:59
      $ cat src/test/resources/purchasetimes.pig
      REGISTER target/demo-pig-udf-1.0-SNAPSHOT.jar;
      purchasetimes = LOAD '$dir/purchasetimes' AS (userid: int, datein: chararray, dateout: chararray);
      quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout, 600000);
      DUMP quickybuyers;
      $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
      (1234,2010-06-01 10:31:22,2010-06-01 10:32:22)
      (1234,2010-06-01 10:40:18,2010-06-01 10:45:32)
      (4532,2010-06-01 11:37:18,2010-06-01 11:42:59)
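The date comparison at the heart of DateWithinFilter can be tried outside of Pig. The sketch below is a plain-Java approximation, not the code from the talk: the class name `DateWithinCheck` and the method `within` are hypothetical, and the pattern `yyyy-MM-dd HH:mm:ss` is an assumption inferred from the sample rows, since the slides never show how `df` is declared.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Plain-Java sketch of the "second date within N milliseconds of the first" check. */
final class DateWithinCheck {

    // Assumed pattern, inferred from the sample data; the slides do not show it.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    static boolean within(String start, String end, long maxMillis) {
        if (start == null || end == null) {
            return false;
        }
        // SimpleDateFormat is not thread-safe, so build one per call.
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setLenient(false);
        // Fixed zone so daylight-saving changes cannot affect the difference.
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date d1 = df.parse(start);
            Date d2 = df.parse(end);
            long diff = d2.getTime() - d1.getTime();
            // Negative differences are rejected, as in the exec() method on the slide.
            return diff >= 0 && diff <= maxMillis;
        } catch (ParseException e) {
            // Unparseable rows such as "lol wut" are simply filtered out.
            return false;
        }
    }
}
```

With a limit of 600000 ms (10 minutes) this keeps the same three rows as the run above and drops both the 32-minute purchase and the malformed row.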
  • 40. EvalFunc: not passing in the correct number of args
      $ cat src/test/resources/purchasetimes.pig
      quickybuyers = FILTER purchasetimes BY DateWithinFilter(datein, dateout);
      $ pig -x local -f src/test/resources/purchasetimes.pig -param dir=src/test/resources/
      2010-06-17 17:25:43,440 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.seattlehadoop.demo.pig.udf.DateWithinFilter as multiple or none of them fit. Please use an explicit cast.
      Details at logfile: /Users/cwilkes/Documents/workspace5/SeattleHadoop-demo-code/pig_1276820742917.log
      The log file has:
      at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1197)
      So the error is caught before any data is loaded
  • 41. Agenda 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 42. LoadFunc: definition
      • How does something get loaded into Pig?
      • A = load 'B';
      • But what is actually going on?
      • A = load 'B' using PigStorage();
      • PigStorage is a LoadFunc that reads off of disk and splits on tab to create a Tuple
  • 43. LoadFunc: definition
      • LoadFunc is an interface with a number of methods, the most interesting being
        • bindTo(fileName, inputStream, offset, end)
        • Tuple getNext()
      • Extend Utf8StorageConverter, as PigStorage does, to get defaults
      • Overview: PigStorage’s getNext() splits the line on tabs, collects the pieces into an array of objects, and puts those into a Tuple
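The split-on-tab step described for PigStorage can be sketched in a few lines of plain Java. This is not PigStorage's source: `TabSplitSketch` and `splitLine` are hypothetical names, a `List<Object>` stands in for the Tuple, and mapping empty fields to null is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of turning one tab-delimited line into a list of fields. */
final class TabSplitSketch {

    static List<Object> splitLine(String line) {
        List<Object> fields = new ArrayList<Object>();
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        for (String field : line.split("\t", -1)) {
            // Assumption for this sketch: an empty field is represented as null.
            fields.add(field.isEmpty() ? null : field);
        }
        return fields;
    }
}
```

Everything comes back as a string here. Turning a field into an int or a date is left to whatever schema the script declares.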
  • 45. LoadFunc: make our own
      • Have a lot of log files, some just contain a URL
        • http://example.com?use=mind+bullets&target=yak
      • Want to load URLs and do analysis
      • Write your own LoadFunc that takes a URL and returns a Map of the query parameters
      • Know which parameters you care about and only look for those
      • Goal:
        • A = LOAD 'urls' USING QuerystringLoader('query', 'userid') AS (query: chararray, userid : int);
  • 46. LoadFunc: QuerystringLoader
      • Passing in constructor arguments from the pig script is easy:
        • public QuerystringLoader(String... fieldNames)
      • bindTo is almost exactly the same as the PigStorage one, using the PigLineRecordReader to parse the InputStream
      • Tuple getNext() is where the action happens
        • parse the querystring into a Map
        • loop through the fields given in the constructor
        • return a Tuple built from a list of those values
  • 47. LoadFunc: QuerystringLoader getNext()
      @Override
      public Tuple getNext() throws IOException {
        if (in == null || in.getPosition() > end) {
          return null;
        }
        Text value = new Text();
        boolean notDone = in.next(value);
        if (!notDone) {
          return null;
        }
        Map<String, Object> parameters = getParameterMap(value.toString());
        List<String> output = new ArrayList<String>();
        for (String fieldName : m_fieldsInOrder) {
          Object object = parameters.get(fieldName);
          if (object == null) {
            output.add(null);
            continue;
          }
          if (object instanceof String) {
            output.add((String) object);
          } else {
            // repeated parameter: keep only the first value
            List<String> objectVal = (List<String>) object;
            output.add(objectVal.get(0));
          }
        }
        return mTupleFactory.newTupleNoCopy(output);
      }
  • 48. LoadFunc: notes
      • boolean okay = in.next(value) is how to get the next line from the reader
      • getParameterMap(url) splits the querystring into a Map<String,Object>
      • Pig handles type conversion for you, just hand back a Tuple
      • In this case the Tuple can be made up of anything, so the user specifies the schema in the script
        • AS (query:chararray, userid:int)
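The slides do not show the body of getParameterMap(url), so the following is a guess at what such a helper could look like, written in plain Java. It follows the convention implied by getNext(): a parameter seen once maps to a String, and a repeated parameter maps to a List<String>. The class name `QuerystringSketch` is hypothetical, and returning an empty map for input without a `?` is an assumption of this sketch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of splitting a URL's querystring into a parameter map. */
final class QuerystringSketch {

    static Map<String, Object> getParameterMap(String url) {
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // assumption: no '?' means no parameters
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = decode(eq >= 0 ? pair.substring(0, eq) : pair);
            String value = eq >= 0 ? decode(pair.substring(eq + 1)) : "";
            Object existing = params.get(key);
            if (existing == null) {
                params.put(key, value); // first occurrence: plain String
            } else if (existing instanceof String) {
                // second occurrence: upgrade the String to a List<String>
                List<String> values = new ArrayList<String>();
                values.add((String) existing);
                values.add(value);
                params.put(key, values);
            } else {
                @SuppressWarnings("unchecked")
                List<String> values = (List<String>) existing;
                values.add(value);
            }
        }
        return params;
    }

    /** Decodes '+' and %xx escapes; malformed escapes raise IllegalArgumentException. */
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

URL fragments and malformed escapes are not handled here. A real loader would probably want to warn and skip such lines rather than fail the job.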
  • 49. RegexLoader
      Same concept: pass in a Pattern for the constructor and have getNext() return only the matched parts
      @Override
      public Tuple getNext() throws IOException {
        Matcher m = m_linePattern.matcher(value.toString());
        if (!m.matches()) {
          return EmptyTuple.getInstance();
        }
        List<String> regexMatches = new ArrayList<String>();
        for (int i = 1; i <= m.groupCount(); i++) {
          regexMatches.add(m.group(i));
        }
        return mTupleFactory.newTupleNoCopy(regexMatches);
      }
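The matching step of the RegexLoader can be run on its own with java.util.regex. The sketch below uses a hypothetical class `RegexLineSketch` and returns a plain list in place of a Tuple. The pattern in the usage note is made up to fit the purchasetimes sample rows and is not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java sketch: return the capture groups of a line, or nothing if it does not match. */
final class RegexLineSketch {

    private final Pattern linePattern;

    RegexLineSketch(String regex) {
        // Compile once in the constructor, then reuse for every line.
        this.linePattern = Pattern.compile(regex);
    }

    List<String> parse(String line) {
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) { // matches() requires the whole line to fit the pattern
            return Collections.emptyList();
        }
        List<String> groups = new ArrayList<String>();
        // Group 0 is the whole match, so the capture groups start at 1.
        for (int i = 1; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return groups;
    }
}
```

For example, `new RegexLineSketch("(\\d+)\\s+(\\S+ \\S+)\\s+(\\S+ \\S+)")` splits a purchasetimes row into the user id and the two timestamps, while the "lol wut" row yields an empty list.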
  • 50. Agenda 1 What, Why, How 2 EvalFunc basics 3 More EvalFunc 4 LoadFunc 5 Piggybank
  • 51. Piggybank
      • SVN repository of common UDFs
      • Excited about it at first, but it doesn’t appear to be used that much
      • There needs to be an easier way of doing this
      • CPAN (Perl) for Pig would be great
        • register pigpan://Math::FFT
        • brings down the jars from a maven-like repository and tells pig where to load from
      • Any takers? Looking into it
  • 52. Bonus section: unit testing
      @Test
      public void testRepeatQueryParams() throws IOException {
        // three input lines: repeated "a", no known parameters, then both "a" and "b"
        String url = "http://localhost/foo?a=123&a=456\nx=y\nhttp://localhost/bar?a=761&b=hi";
        QuerystringLoader loader = new QuerystringLoader("a", "b");
        InputStream in = new ByteArrayInputStream(url.getBytes());
        loader.bindTo(null, new BufferedPositionedInputStream(in), 0, url.length());
        Tuple tuple = loader.getNext();
        assertEquals("123", (String) tuple.get(0));
        assertNull(tuple.get(1));
        tuple = loader.getNext();
        assertEquals(2, tuple.size());
        assertNull(tuple.get(0));
        assertNull(tuple.get(1));
        tuple = loader.getNext();
        assertEquals("761", (String) tuple.get(0));
        assertEquals("hi", (String) tuple.get(1));
      }
  • 53. Resources
      UDF reference: http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
      Code samples: http://github.com/seattlehadoop
      Presentation: http://www.slideshare.net/seattlehadoop

Editor's Notes