2. Accumulator
A = load 'clicks';
B = group A by user;
C = foreach B { C1 = order A by timestamp; generate user, sessionize(C1); }
Many aggregate operations cannot use the combiner, but they also do not need all records for a single key in memory at once.
New in 0.6: the Accumulator interface, which can be implemented by UDFs. Pig calls accumulate() multiple times with partial lists of tuples, then calls getValue() when the key changes.
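The accumulate-per-batch, getValue-on-key-change contract can be sketched in plain Java. This is an illustrative stand-in only: the real interface is org.apache.pig.Accumulator and operates on Pig Tuples, whereas the types below are simplified assumptions.

```java
import java.util.List;

// Simplified stand-in for Pig 0.6's Accumulator contract (the real interface
// is org.apache.pig.Accumulator and takes a Tuple wrapping a partial bag).
interface Accumulator<T> {
    void accumulate(List<Long> partialBatch); // called repeatedly for one key
    T getValue();                             // called when the key changes
    void cleanup();                           // reset state for the next key
}

// Example: a SUM that never needs the whole bag for a key in memory at once.
class LongSum implements Accumulator<Long> {
    private long sum = 0;

    public void accumulate(List<Long> partialBatch) {
        for (long v : partialBatch) sum += v;
    }
    public Long getValue() { return sum; }
    public void cleanup() { sum = 0; }
}
```

The point of the interface is the calling pattern: several accumulate() calls per key, one getValue(), then cleanup() before the next key.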
3. Also in 0.6: UDFContext, which allows UDFs to pass information from the frontend to the backend and to access the JobConf. There was also a lot of work on the memory manager to reduce the number of GC overhead and out-of-memory errors.
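The idea behind UDFContext is a per-UDF-class property bag that Pig serializes into the job configuration, so values set on the frontend are visible on the backend. Here is a minimal plain-Java sketch of that idea; the class and method names mirror the real org.apache.pig.impl.util.UDFContext, but the JobConf serialization step is omitted and the implementation is an assumption for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative stand-in for Pig's UDFContext: one Properties bag per UDF
// class. In real Pig the bag is serialized into the job configuration on
// the frontend and deserialized on the backend; that step is omitted here.
class MiniUDFContext {
    private static final MiniUDFContext INSTANCE = new MiniUDFContext();
    private final Map<Class<?>, Properties> props = new HashMap<>();

    static MiniUDFContext getUDFContext() { return INSTANCE; }

    // Each UDF keys its settings by its own class, so multiple UDFs in one
    // script do not clobber each other's properties.
    Properties getUDFProperties(Class<?> udfClass) {
        return props.computeIfAbsent(udfClass, k -> new Properties());
    }
}
```

A UDF would store settings in its frontend methods and read them back in its backend (getNext/exec) methods through the same lookup.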
4. New Load and Store Interfaces
0.6 and before:
- Want to write a LoadFunc that works on files and uses standard splits? Easy.
- Want to write a LoadFunc that works on something other than files, or uses non-standard splits? Hard; you have to write a Slicer (which mostly duplicates Hadoop's InputFormat).
- Want to write a StoreFunc that works on something other than files? Sorry.
0.7:
- LoadFunc now sits atop InputFormat, so if you have an InputFormat for your data, writing a LoadFunc is easy.
- StoreFunc now sits atop OutputFormat, …
Not backward compatible; will require a rewrite of custom LoadFuncs and StoreFuncs.
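The benefit of the 0.7 design is that a LoadFunc mostly wires an InputFormat's record reader to Pig's getNext(), instead of re-implementing splitting in a Slicer. The toy types below are simplified assumptions, not the real Hadoop or Pig interfaces; they only show the delegation shape.

```java
import java.util.Iterator;
import java.util.List;

// Toy stand-ins for Hadoop's RecordReader/InputFormat (heavily simplified).
interface MiniRecordReader { boolean nextKeyValue(); String getCurrentValue(); }
interface MiniInputFormat { MiniRecordReader createRecordReader(); }

// In the 0.7 model, a LoadFunc delegates record iteration to the reader
// supplied by an InputFormat rather than doing its own split handling.
class MiniLoadFunc {
    private MiniRecordReader reader;

    void prepareToRead(MiniInputFormat inputFormat) {
        this.reader = inputFormat.createRecordReader();
    }
    // Pig calls getNext() until it returns null (end of the split).
    String getNext() {
        return reader.nextKeyValue() ? reader.getCurrentValue() : null;
    }
}

// An in-memory "InputFormat" so the sketch runs without Hadoop.
class ListInputFormat implements MiniInputFormat {
    private final List<String> lines;
    ListInputFormat(List<String> lines) { this.lines = lines; }
    public MiniRecordReader createRecordReader() {
        Iterator<String> it = lines.iterator();
        return new MiniRecordReader() {
            private String current;
            public boolean nextKeyValue() {
                if (!it.hasNext()) return false;
                current = it.next();
                return true;
            }
            public String getCurrentValue() { return current; }
        };
    }
}
```

With this split of responsibilities, any data source that already has an InputFormat gets a LoadFunc almost for free, which is the slide's point.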
5. Also in 0.7
- Moved local mode to Hadoop's LocalJobRunner; this means the debugging environment is much closer to the runtime environment.
- More aggressive use of Hadoop's distributed cache for features such as replicated join and order by.
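The distributed-cache use for replicated (fragment-replicate) join can be sketched as follows: the small relation is shipped to every task and loaded into an in-memory map, and the large relation is streamed against it in the map phase, so no reduce is needed. All names and types here are illustrative assumptions, not Pig's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the replicated-join idea: hash the small (cached) relation in
// memory, then probe it once per record of the streamed large relation.
class ReplicatedJoin {
    static List<String> join(List<String[]> big, List<String[]> small) {
        // Build the lookup table from the small relation (key -> value),
        // as if it had been read from the distributed cache on this task.
        Map<String, String> cache = new HashMap<>();
        for (String[] row : small) cache.put(row[0], row[1]);

        // Stream the big relation, emitting joined rows on key matches.
        List<String> out = new ArrayList<>();
        for (String[] row : big) {
            String match = cache.get(row[0]);
            if (match != null) out.add(row[0] + "," + row[1] + "," + match);
        }
        return out;
    }
}
```

This only works when one side fits in task memory, which is exactly the case replicated join targets.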
6. What We Are Working On Now
- Runtime statistics: track what features your script used, how many records it processed, etc. Results are stored in Pig logs and job history files.
- Adding UDFs in scripting languages (Python initially) - PIG-928
- Allow users to set a custom partitioner in some cases - PIG-282
- Make Pig available in Maven repositories - PIG-1334
- Label interfaces for audience and stability - PIG-1311; part of Hadoop's compatibility plan, see the following blog post: http://bit.ly/9yRDlH