Big Data with Pig and Python
Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013

About Me

• Mathematician/physicist turned consultant
• Graduate student in CS at UNO
• Currently a software engineer at Sojern

Working with Big Data

What is Big Data?

Data Source               Size
Wikipedia Database Dump   9 GB
Open Street Map           19 GB
Common Crawl              81 TB
1000 Genomes              200 TB
Large Hadron Collider     15 PB annually

Gigabytes - normal size for relational databases
Terabytes - relational databases may start to experience scaling issues
Petabytes - relational databases struggle to scale without a lot of fine tuning

Working With Data

Expectation vs. Reality:
• Different File Formats
• Missing Values
• Inconsistent Schema
• Loosely Structured
• Lots of it

MapReduce

• Map - emit key/value pairs from data
• Reduce - collect data with common keys
• Tries to minimize moving data between nodes

Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview

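As a toy illustration of the model (plain in-memory Python, not Hadoop code), word count can be written as a map step that emits key/value pairs and a reduce step that combines the values sharing a key:

from collections import defaultdict

documents = ["big data with pig", "pig eats anything", "big pig"]

# Map: emit a (key, value) pair for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group pairs by key (Hadoop does this between the two phases)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values that share a key
counts = dict((word, sum(values)) for word, values in grouped.items())
print(counts)  # {'big': 2, 'pig': 3, 'data': 1, ...}
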
MapReduce Issues

• Very low-level abstraction
• Cumbersome Java API
• Unfamiliar to data analysts
• Rudimentary support for data pipelines

Pig

• Eats anything
• SQL-like, procedural data flow language
• Extensible with Java, Jython, Groovy, Ruby or JavaScript
• Provides opportunities to optimize workflows

Alternatives

• Java MapReduce API
• Hadoop Streaming (see the sketch below)
• Hive
• Spark
• Cascading
• Cascalog

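For a sense of the Hadoop Streaming alternative: the map and reduce steps are plain scripts that read stdin and write tab-separated key/value lines to stdout, with Hadoop sorting the mapper output by key in between. A minimal word-count pair might look like this (file names are illustrative):

# mapper.py
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

# reducer.py
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    if word != current_word:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))
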
Python

• Data analysis - pandas, numpy, networkx
• Machine learning - scikits.learn, milk
• Scientific - scipy, pyephem, astropysics
• Visualization - matplotlib, d3py, ggplot

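As a small example of the first bullet: once a slice of data fits on one machine, pandas covers the same extract-and-sort work the Pig session below performs on the census TSV. A sketch, assuming a reasonably recent pandas; the column names come from the header listing later in the deck:

import pandas as pd

# NAMELSAD10 is the CSA name, DP0010001 the total population
df = pd.read_csv('CSA_2010Census_DP1.tsv', sep='\t')
top = df[['NAMELSAD10', 'DP0010001']].sort_values('DP0010001', ascending=False)
print(top.head(10))
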
Pig Features

Input/Output

• HBase
• JDBC Database
• JSON
• CSV/TSV
• Avro
• ProtoBuff
• Sequence File
• Hive Columnar
• XML
• Apache Log
• Thrift
• Regex

Relational Operators

LIMIT, GROUP, FILTER, CROSS, COGROUP, JOIN, STORE, DISTINCT, FOREACH, LOAD, ORDER, UNION

Built In Functions

COS, SIN, AVG, SUM, COUNT, RANDOM, LOWER, UPPER, CONCAT, MAX, MIN, TOKENIZE

User Defined Functions

• Easy way to add arbitrary code to Pig
  • Eval - filter, aggregate, or evaluate
  • Storage - load/store data
• Full support for Java and Jython
• Experimental support for Groovy, Ruby and JavaScript

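For example, a minimal Jython eval UDF might compute a derived value per record. This is only a sketch: the field names and file name are made up, and @outputSchema is the same decorator Pig supplies to Jython scripts registered with USING jython, as shown later in the deck:

# density_udf.py - hypothetical example
@outputSchema('density:double')
def people_per_sqkm(population, land_area_m2):
    if population is None or land_area_m2 is None or float(land_area_m2) == 0:
        return None
    return float(population) / (float(land_area_m2) / 1.0e6)
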
Census Example

Getting Data

Convert to TSV

ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"

• Uses the Geospatial Data Abstraction Library (GDAL) to convert the shapefile to TSV
• TSV > CSV

Inspect Headers

f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))

[(0, 'WKT'),
 (1, 'GEOID10'),
 (2, 'NAMELSAD10'),
 (3, 'ALAND10'),
 (4, 'AWATER10'),
 (5, 'INTPTLAT10'),
 (6, 'INTPTLON10'),
 (7, 'DP0010001'),
 ...

Pig Quick Start

• Download the Pig distribution
• Untar the package
• Start Pig in local mode

pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1 1.dbf<r 1>    841818
file:/data/CSA_2010Census_DP1.prj<r 1>      167
file:/data/CSA_2010Census_DP1.shp<r 1>      76180308
file:/data/CSA_2010Census_DP1.shx<r 1>      3596
file:/data/CSA_2010Census_DP1.tsv<r 1>      111224058

http://pig.apache.org/releases.html
https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

Loading Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();

Extracting Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2
   AS name, $7 AS population;
grunt> describe extracted_no_types
extracted_no_types: {name: bytearray,population: bytearray}

Adding Schema

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2
   AS name:chararray, $7 AS population:int;
grunt> describe extracted;
extracted: {name: chararray,population: int}

Ordering

grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;

("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)

Storing Data

grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');

ls -a
.part-m-00035.crc   .part-m-00115.crc   .pig_header    part-m-00077   part-m-00157
.part-m-00036.crc   .part-m-00116.crc   .pig_schema    part-m-00078   part-m-00158
.part-m-00037.crc   .part-m-00117.crc   _SUCCESS       part-m-00079   part-m-00159
.part-m-00038.crc   .part-m-00118.crc   part-m-00000   part-m-00080   part-m-00160

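Because STORE writes one part file per task, it can be handy to stitch the output back into a single TSV for local tools. A small sketch: the merged file name is arbitrary, and .pig_header is the column-name file written by the '-schema' option, visible in the listing above:

# merge_parts.py - hypothetical helper, not part of Pig
import glob

with open('extracted_data.tsv', 'w') as merged:
    with open('extracted_data/.pig_header') as header:
        merged.write(header.read())
    for part in sorted(glob.glob('extracted_data/part-*')):
        with open(part) as chunk:
            merged.write(chunk.read())
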
Space Catalog Example

Space Catalog

• 14,000+ objects in the public catalog
• Use Two Line Element sets to propagate out positions and velocities
• Can generate over 100 million positions & velocities per day

Two Line Elements

ISS (ZARYA)
1 25544U 98067A  08264.51782528 -.00002182 00000-0 -11606-4 0 2927
2 25544 51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537

• Use a Python script to convert to a Pig-friendly TSV (a sketch follows below)
• Create a Python UDF to parse a TLE into parameters
• Use a Python UDF with Java libraries to propagate out positions

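A rough sketch of that conversion script (public TLE files list each object as a name line followed by element lines 1 and 2; the input file name is an assumption, while the output name matches the gps-ops.tsv used below):

# tle_to_tsv.py - sketch of the conversion step
with open('gps-ops.txt') as tle_file, open('gps-ops.tsv', 'w') as tsv_file:
    lines = [line.rstrip('\n') for line in tle_file if line.strip()]
    for i in range(0, len(lines), 3):
        name, line1, line2 = lines[i].strip(), lines[i + 1], lines[i + 2]
        tsv_file.write('%s\t%s\t%s\n' % (name, line1, line2))
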
Python UDFs

• Easy way to extend Pig with new functions
• Uses Jython, which is at Python 2.5
• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc.)
• Can use Java classes

TLE parsing

Columns 54-61 of line 1 hold the BSTAR drag term, e.g. -11606-4, with the decimal point assumed.

def parse_tle_number(tle_number_string):
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:
        new_number = '-' + str(split_string[1]) + 'e-' + str(int(split_string[2]) + 1)
    elif len(split_string) == 2:
        new_number = str(split_string[0]) + 'e-' + str(int(split_string[1]) + 1)
    elif len(split_string) == 1:
        new_number = '0.' + str(split_string[0])
    else:
        raise TypeError('Input is not in the TLE float format')

    return float(new_number)

Full parser at https://gist.github.com/shawnhermans/4569360

Simple UDF

import tleparser

@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params

Extract Parameters

grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage()
   AS (name:chararray, line1:chararray, line2:chararray);
grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,
inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,
eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,
epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),
mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])

Storing Results

grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',','-schema');

UDF with Java Import

from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time

    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))

        current_time += increment

    return ecef_positions

Propagate Positions

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage()
   AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2),
   myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> flattened = FOREACH propagated
   GENERATE params#'satellite_number', FLATTEN(propagated);
propagated: {params: map[],propagated: {positions:
   (time: double,x: double,y: double,z: double)}}
grunt> DESCRIBE flattened;
flattened: {bytearray,propagated::time: double,propagated::x: double,
   propagated::y: double,propagated::z: double}

Result

(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)

Pig on Amazon EMR

Pig with EMR

• SSH into the box to run an interactive Pig session
• Load data to/from S3 (see the boto sketch below)
• Run standalone Pig scripts on demand

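For the S3 step, a minimal sketch using the boto library of that era (bucket and key names are made up; assumes AWS credentials are already configured):

# upload_to_s3.py - hypothetical helper for staging input data
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-emr-bucket')      # hypothetical bucket
key = bucket.new_key('input/gps-ops.tsv')
key.set_contents_from_filename('gps-ops.tsv')

The same object can then be read directly from Pig on EMR, e.g. LOAD 's3://my-emr-bucket/input/gps-ops.tsv'.
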
Conclusion

Other Useful Tools

• python-dateutil: super-duper date parser
• Oozie: Hadoop workflow engine
• Piggybank and Elephant Bird: third-party Pig libraries
• chardet: character-detection library for Python

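Quick sanity checks for the two Python helpers above (input values are only illustrative):

from dateutil import parser
import chardet

print(parser.parse('April 8, 2013'))
# -> datetime.datetime(2013, 4, 8, 0, 0)

raw = open('CSA_2010Census_DP1.tsv', 'rb').read(10000)
print(chardet.detect(raw))
# -> something like {'encoding': 'ascii', 'confidence': 1.0}
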
Parting Thoughts

• Great ETL tool/language
• Flexible enough to write general-purpose MapReduce jobs
• Limited, but emerging, third-party libraries
• Jython for UDFs is extremely limiting (Spark?)

Twitter: @shawnhermans
Email: shawnhermans@gmail.com
