New features in Pig 0.11

  • Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It lets you confidently capture, process and share data in any format, at scale, on commodity hardware or in a cloud environment. As the foundation for the next-generation enterprise data architecture, HDP delivers all of the components needed to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.

New features in Pig 0.11: Presentation Transcript

  • Pig 0.11 - New Features
    Daniel Dai, Member of Technical Staff; Committer, VP of Apache Pig
    © Hortonworks Inc. 2011, Page 1
  • Pig 0.11 release plan
    • Branched on Oct 12, 2012
    • Will come in weeks
      – Fix tests: PIG-2972
      – Documentation: PIG-2756
      – Several last-minute fixes
    Architecting the Future of Big Data, Page 2, © Hortonworks Inc. 2011
  • New features
    • CUBE operator
    • Rank operator
    • Groovy UDFs
    • New data type: DateTime
    • SchemaTuple optimization
    • Works with JDK 7
    • Works with Windows?
  • New features
    • Faster local mode
    • Better stats/notification – Ambrose
    • Default scripts: pigrc
    • Integrate HCat DDL
    • Grunt enhancements: history/clear
    • UDF enhancements
      – New/enhanced UDFs
      – AvroStorage enhancements
  • CUBE operator

      rawdata = load 'input' as (ptype, pstore, number);
      cubed = cube rawdata by rollup(ptype, pstore);
      result = foreach cubed generate flatten(group), SUM(cube.number);
      dump result;

    Input:
      Ptype    Pstore   number
      Dog      Miami    12
      Cat      Miami    18
      Turtle   Tampa     4
      Dog      Tampa    14
      Cat      Naples    9
      Dog      Naples    5
      Turtle   Naples    1

    Output:
      Ptype    Pstore   Sum
      Cat      Miami    18
      Cat      Naples    9
      Cat               27
      Dog      Miami    12
      Dog      Tampa    14
      Dog      Naples    5
      Dog               31
      Turtle   Tampa     4
      Turtle   Naples    1
      Turtle             5
                        63
  • CUBE operator
    • Syntax:

      outalias = CUBE inalias BY { CUBE expression | ROLLUP expression },
                 [ CUBE expression | ROLLUP expression ] [PARALLEL n];

    • Umbrella Jira: PIG-2167
    • Non-distributed version will be in 0.11 (PIG-2765)
    • Distributed version still in progress (PIG-2831)
      – Push algebraic computation to the map/combiner
      – Reference: "Distributed cube materialization on holistic measures", Arnab Nandi et al., ICDE 2011
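The rollup semantics may be easier to see spelled out in ordinary code. This is a minimal Python sketch of what ROLLUP(ptype, pstore) computes with a SUM measure (the semantics only, not Pig's map/combiner implementation; the helper name `rollup_sums` is invented). It reproduces the totals from the CUBE example above:

```python
def rollup_sums(rows):
    """Sum `number` at each ROLLUP(ptype, pstore) grouping level:
    (ptype, pstore), (ptype, ALL) and (ALL, ALL)."""
    sums = {}
    for ptype, pstore, number in rows:
        # None stands in for the "all values" placeholder at each level
        for key in [(ptype, pstore), (ptype, None), (None, None)]:
            sums[key] = sums.get(key, 0) + number
    return sums

rows = [("Dog", "Miami", 12), ("Cat", "Miami", 18), ("Turtle", "Tampa", 4),
        ("Dog", "Tampa", 14), ("Cat", "Naples", 9), ("Dog", "Naples", 5),
        ("Turtle", "Naples", 1)]
sums = rollup_sums(rows)
# Matches the slide: Cat 27, Dog 31, Turtle 5, grand total 63
```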
  • Rank operator

      rawdata = load 'input' as (name, gpa: double);
      ranked = rank rawdata by gpa;
      dump ranked;

    Input:
      Name    Gpa
      Katie   3.5
      Fred    4.0
      Holly   3.7
      Luke    3.5
      Nick    3.7

    Output:
      Rank  Name    Gpa
      1     Katie   3.5
      5     Fred    4.0
      3     Holly   3.7
      1     Luke    3.5
      3     Nick    3.7

    (Ties share the same rank; the ranks following a tie are skipped.)
  • Rank operator

      rawdata = load 'input' as (name, gpa: double);
      ranked = rank rawdata by gpa desc dense;
      dump ranked;

    Input:
      Name    Gpa
      Katie   3.5
      Fred    4.0
      Holly   3.7
      Luke    3.5
      Nick    3.7

    Output:
      Rank  Name    Gpa
      3     Katie   3.5
      1     Fred    4.0
      2     Holly   3.7
      3     Luke    3.5
      2     Nick    3.7
  • Rank operator
    • Limitation
      – Only 1 reducer
    • Possible improvements
      – Provide a distributed implementation
    • PIG-2353
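For readers comparing the two rank modes, the dense variant from the previous slide can be sketched in a few lines of Python (illustrative only; `dense_rank` is an invented helper, not a Pig API). With DENSE, ties share a rank and no ranks are skipped:

```python
def dense_rank(rows, key, reverse=False):
    """Prepend a dense rank to each row: ties share a rank, no gaps."""
    distinct = sorted({key(r) for r in rows}, reverse=reverse)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [(rank_of[key(r)],) + r for r in rows]

rows = [("Katie", 3.5), ("Fred", 4.0), ("Holly", 3.7),
        ("Luke", 3.5), ("Nick", 3.7)]
# rank rawdata by gpa desc dense
ranked = dense_rank(rows, key=lambda r: r[1], reverse=True)
# Fred (4.0) gets rank 1, the 3.7 rows rank 2, the 3.5 rows rank 3
```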
  • Groovy UDFs

      register test.groovy using groovy as myfuncs;
      a = load '1.txt' as (a0, a1: long);
      b = foreach a generate myfuncs.square(a1);
      dump b;

    test.groovy:

      import org.apache.pig.builtin.OutputSchema;
      class GroovyUDFs {
        @OutputSchema('x:long')
        long square(long x) {
          return x*x;
        }
      }
  • Embed Pig into Groovy

      import org.apache.pig.scripting.Pig;

      public static void main(String[] args) {
        String input = "input"
        String output = "output"
        Pig P = Pig.compile("A = load $in; store A into $out;")
        result = P.bind([in:input, out:output]).runSingle()
        if (result.isSuccessful()) {
          print("Pig job succeeded")
        } else {
          print("Pig job failed")
        }
      }

    Command line: bin/pig -x local demo.groovy
  • New data type: DateTime

      a = load 'input' as (a0: datetime, a1: chararray, a2: long);
      b = foreach a generate a0,
                             ToDate(a1, 'yyyyMMdd HH:mm:ss'),
                             ToDate(a2),
                             CurrentTime();

    • Supports time zones
    • Millisecond precision
  • New data type: DateTime
    • DateTime UDFs
      – Field extraction: GetYear, GetMonth, GetDay, GetWeekYear, GetWeek, GetHour, GetMinute, GetSecond, GetMilliSecond
      – Intervals: YearsBetween, MonthsBetween, WeeksBetween, DaysBetween, HoursBetween, MinutesBetween, SecondsBetween, MilliSecondsBetween
      – Arithmetic: AddDuration, SubtractDuration
      – Conversion: ToDate, ToDateISO, ToMilliSeconds, ToString, ToUnixTime
      – CurrentTime
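As a rough illustration of what a few of these UDFs compute, here are approximate Python equivalents (an analogy only; Pig's versions are built on Joda-Time, and the sample timestamp is invented, mirroring the ToDate format string above):

```python
from datetime import datetime, timezone

# ~ ToDate(a1, 'yyyyMMdd HH:mm:ss'): parse a chararray into a datetime
dt = datetime.strptime("20121012 08:30:00", "%Y%m%d %H:%M:%S")

# ~ GetYear(dt)
year = dt.year

# ~ ToMilliSeconds(dt), treating the parsed timestamp as UTC
epoch_ms = int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)
```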
  • SchemaTuple optimization
    • Idea
      – Generate schema-specific tuple code when the schema is known
    • Benefits
      – Smaller memory footprint
      – Better performance
  • SchemaTuple optimization
    • When the tuple schema is known, e.g. (a0: int, a1: chararray, a2: double):

    Original Tuple:

      Tuple {
        List<Object> mFields;
        Object get(int fieldNum) {
          return mFields.get(fieldNum);
        }
        void set(int fieldNum, Object val) {
          mFields.set(fieldNum, val);
        }
      }

    Schema Tuple:

      SchemaTuple {
        int f0;
        String f1;
        double f2;
        Object get(int fieldNum) {
          switch (fieldNum) {
            case 0: return f0;
            case 1: return f1;
            case 2: return f2;
          }
        }
        void set(int fieldNum, Object val) { ... }
      }
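The memory saving comes from replacing a generic field container with fixed, schema-specific fields. As a loose analogy only (Python rather than Pig's generated Java; both class names are invented), a class with fixed `__slots__` drops the per-instance dictionary that a generic container carries:

```python
class GenericTuple:
    """Generic container: fields live in a per-instance list,
    and the instance also carries a __dict__."""
    def __init__(self, *fields):
        self.fields = list(fields)

class SchemaTuple3:
    """Schema-specific container: fixed named fields, no per-instance
    __dict__, so each tuple is smaller and field access is direct."""
    __slots__ = ("f0", "f1", "f2")
    def __init__(self, f0, f1, f2):
        self.f0, self.f1, self.f2 = f0, f1, f2

g = GenericTuple(1, "a", 2.0)
t = SchemaTuple3(1, "a", 2.0)
```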
  • Pig on new environments
    • JDK 7
      – All unit tests pass
      – Jira: PIG-2908
    • Hadoop 2.0.0
      – Jira: PIG-2791
    • Windows
      – No need for Cygwin
      – Jira: PIG-2793
      – Aiming to make it into 0.11
  • Faster local mode
    • Skip generating job.jar
      – PIG-2128
      – Also in 0.9 and 0.10, unadvertised
    • Remove the hardcoded 5-second wait for JobControl
      – PIG-2702
  • Better stats
    • Information on the aliases/lines of each map/reduce job
      – An information line for every map/reduce job, e.g.:

        detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4],C[5,4]

      – Explanation:
        Map contains:      alias A (line 1, column 4), alias A (line 3, column 4), alias B (line 2, column 4)
        Combiner contains: alias A (line 3, column 4), alias B (line 2, column 4)
        Reduce contains:   alias A (line 3, column 4), alias C (line 5, column 4)
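To make the notation concrete, here is a small Python sketch (an invented helper, assuming exactly the format shown on the slide) that parses such a line back into per-phase alias locations:

```python
import re

def parse_locations(line):
    """Parse 'M: A[1,4],... C: ... R: ...' into {phase: [(alias, line, col)]}.
    Assumes the slide's exact format; aliases named M, C or R would need care."""
    phases = {}
    for phase, body in re.findall(r"([MCR]):\s*(.*?)(?=\s[MCR]:|$)", line):
        phases[phase] = [(a, int(l), int(c))
                         for a, l, c in re.findall(r"(\w+)\[(\d+),\s*(\d+)\]", body)]
    return phases

locs = parse_locations("M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4],C[5,4]")
# locs["M"] holds the map-side locations, locs["R"] the reduce-side ones
```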
  • Better notification
    • Supports Ambrose
      – Check out "Twitter Ambrose", an open source project on GitHub
      – Monitors Pig job progress in a UI
  • Integrate HCat DDL
    • Embed HCat DDL commands in a Pig script
    • Run HCat DDL commands in Grunt:

      grunt> sql create table pig_test(name string, age int, gpa double) stored as textfile;
      grunt>

    • Embed HCat DDL in a scripting language:

      from org.apache.pig.scripting import Pig
      ret = Pig.sql("""drop table if exists table_1;""")
      if ret == 0:
          # success
  • Grunt enhancements
    • history

      grunt> a = load '1.txt';
      grunt> b = foreach a generate $0, $1;
      grunt> history
      1 a = load '1.txt';
      2 b = foreach a generate $0, $1;
      grunt>

    • clear
      – Clears the screen
  • New/enhanced UDFs
    • New UDFs: STARTSWITH, INVERSEMAP, KEYSET, VALUESET, VALUELIST, BagToString, BagToTuple
    • Enhanced UDFs
      – RANDOM: takes a seed
      – AvroStorage: supports recursive records, supports globs and commas, upgraded to Avro 1.7.1
    • EvalFunc enhancement
      – getInputSchema(): get the input schema of a UDF
  • Hortonworks Data Platform
    • Simplify deployment to get started quickly and easily
    • Monitor and manage any size cluster with familiar console and tools
    • Only platform to include data integration services to interact with any data
    • Metadata services open the platform for integration with existing applications
    • Dependable high-availability architecture
    – Reduce risks and cost of adoption
    – Lower the total cost to administer and provision
    – Tested at scale to future-proof your cluster growth
    – Integrate with your existing ecosystem
  • Hortonworks Training: the expert source for Apache Hadoop training & certification
    • Role-based Developer and Administration training
      – Coursework built and maintained by the core Apache Hadoop development team
      – The "right" course, with the most extensive and realistic hands-on materials
      – Provides an immersive experience into real-world Hadoop scenarios
      – Public and private courses available
    • Comprehensive Apache Hadoop certification
      – Become a trusted and valuable Apache Hadoop expert
  • Next Steps?
    1. Download Hortonworks Data Platform: hortonworks.com/download
    2. Use the getting started guide: hortonworks.com/get-started
    3. Learn more… get support
       – Hortonworks Training: expert role-based training; courses for admins, developers and operators; delivered by Apache Hadoop experts/committers; custom onsite options; certification program (hortonworks.com/training)
       – Hortonworks Support: full lifecycle technical support across four service levels; forward-compatible (hortonworks.com/support)
  • Thank You! Questions & Answers
    Follow: @hortonworks
    Read: hortonworks.com/blog