Perl on Amazon Elastic MapReduce
A Gentle Introduction to MapReduce
• Distributed computing model
• Mappers process the input and forward
  intermediate results to reducers.
• Reducers aggregate these intermediate
  results, and emit the final results.
$ map | sort | reduce
MapReduce
• Input data sent to mappers as (k, v) pairs.
• After processing, mappers emit (k_out, v_out).
• These pairs are sorted and sent to
  reducers.
• All (k_out, v_out) pairs for a given k_out are
  sent to a single reducer.
MapReduce


• Reducers get (k, [v_1, v_2, …, v_n]).
• After processing, the reducer emits a
  (k_f, v_f) per result.
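
For streaming jobs, those (k, v) pairs are just tab-separated lines read from
STDIN and written to STDOUT. A classic word-count mapper (illustrative only,
not part of the talk) shows the contract:

#!/usr/bin/env perl
# wc_mapper.pl - read lines on STDIN, emit one "word<TAB>1" pair per word
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    print "$_\t1\n" for split ' ', $line;
}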
MapReduce


 We wanted to have a world map showing
where people were starting our games (like
             Mozilla Glow)
Glowfish
MapReduce
• Input: ( epoch, IP address )
• Mappers group these into 5-minute blocks,
  and emit ( blockId, IP address )
• Reducers get ( blockId, [ ip_1, ip_2, …, ip_n ] )
• Do a geo lookup and emit
  ( epoch, [ ( lat1, lon1 ), ( lat2, lon2 ), … ] )
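
The 5-minute grouping is plain integer arithmetic on the timestamp; assuming
millisecond timestamps (as in the mapper shown later), the block id is:

use strict;
use warnings;

# 300 seconds per block; event timestamps arrive in milliseconds
my $epoch_ms = 1_320_244_200_000;                 # example timestamp (ms)
my $block_id = int( $epoch_ms / 1000 / 300 );     # 5-minute bucket
print "$block_id\n";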
$ map | sort | reduce
Apache Hadoop
• Distributed programming framework
• Implements MapReduce
• Does all the usual distributed programming
  heavy-lifting for you
• Highly fault-tolerant; automatic task re-
  assignment in case of failure
• You focus on mappers and reducers
Apache Hadoop
• Native Java API
• Streaming API which can use mappers and
  reducers written in any programming
  language.
• Distributed file system (HDFS)
• Distributed Cache
Amazon Elastic MapReduce
• On-demand Hadoop clusters running on
  EC2 instances.
• Improved S3 support for storage of input
  and output data.
• Build workflows by sending jobs to a
  cluster.
EMR Downsides
• No control over the machine images.
• Perl 5.8.8
• Ephemeral: when your cluster is shut down
  (or dies), HDFS is gone.
• HDFS not available at cluster-creation time.
• Debian
Streaming vs. Native


$ cat | map | sort | reduce
Streaming vs. Native

Instead of
               ( k, [ v1, v2, …, vn ] )
reducers get
 (( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, vm ))
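
In Perl this means watching for the key to change and flushing the previous
group yourself. The word-count reducer matching the earlier mapper sketch
(again illustrative, not from the talk) shows the pattern:

#!/usr/bin/env perl
# wc_reducer.pl - input arrives sorted by key, so a key change means the
# previous word's group is complete and can be written out
use strict;
use warnings;

my ( $current, $count ) = ( undef, 0 );
while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $word, $n ) = split /\t/, $line;
    if ( defined $current && $word ne $current ) {
        print "$current\t$count\n";    # flush the finished group
        $count = 0;
    }
    $current = $word;
    $count  += $n;
}
print "$current\t$count\n" if defined $current;    # flush the last group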
Composite Keys
• Reducers receive both keys and values
  sorted
• Merge 3 tables:
  userid, 0, … # customer info

  userid, 1, … # payments history

  userid, recordid1, … # clickstream

  userid, recordid2, … # clickstream
Streaming vs. Native

• Limited API
• About a 7-10% increase in run time
• About a 10× reduction in development time
  (as reported by a non-representative
  sample of developers)
Where’s My Towel?
• Tasks run chrooted in a non-deterministic
  location.
• It’s easy to store files in HDFS when
  submitting a job, but impossible to store
  directory trees.
• For native Java jobs, your dependencies get
  packaged in the JAR alongside your code.
Streaming’s Little Helpers

Define your inputs and outputs:
--input s3://events/2011-30-10

--output s3://glowfish/output/2011-30-10
Streaming’s Little Helpers
You can use any class in Hadoop’s classpath
for these roles; several come bundled:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

-partitioner
org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
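
For composite keys, these classes are usually combined with the key-field
options described in the Hadoop 0.20 streaming documentation. The exact values
below are illustrative (partition on the first field, sort on the first two):

-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options=-k1,1
-D mapred.text.key.comparator.options="-k1,1 -k2,2n"
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner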
Streaming’s Little Helpers
• Use S3 to store…
 • input data
 • output data
 • supporting data (e.g., Geo-IP)
 • your code
Mapper and Reducer

To specify the mapper and reducer to be
used in your streaming job, you can point
Hadoop to S3:
--mapper s3://glowfish/bin/mapper.pl

--reducer s3://glowfish/bin/reducer.pl
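
Putting the pieces together, the arguments for a streaming step look roughly
like this. The streaming-jar path varies by Hadoop/EMR version and the bucket
paths are just the examples used throughout; on EMR you pass the same
arguments when creating the streaming step:

hadoop jar /path/to/hadoop-streaming.jar \
  -input  s3://events/2011-30-10 \
  -output s3://glowfish/output/2011-30-10 \
  -mapper  s3://glowfish/bin/mapper.pl \
  -reducer s3://glowfish/bin/reducer.pl \
  -cacheFile    s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat \
  -cacheArchive s3://glowfish/lib/perllib.tgz#locallib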
Support Files

When specifying a file to store in the Distributed
Cache, a URI fragment is used as the name of the
symlink created in the task’s local filesystem:
 -cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
Dependencies


But if you store an archive (Zip, TGZ, or JAR)
in the Distributed Cache, …
   -cacheArchive s3://glowfish/lib/perllib.tgz
Dependencies


Adding a URI fragment names the symlink that
Hadoop creates for the extracted archive:
-cacheArchive s3://glowfish/lib/perllib.tgz#locallib
Dependencies


 Hadoop will uncompress it and create a link
to whatever directory it created, in the task’s
             working directory.
Dependencies


Which is where it stores your mapper and
                reducer.
Dependencies


use lib qw/ locallib /;
Mapper
#!/usr/bin/env perl

use strict;
use warnings;

use lib qw/ locallib /;

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
  chomp;
  next unless /load_complete/;
  my @line = split /\t/;
  # field 1 is a millisecond timestamp; 300 s = one 5-minute block
  my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
  my $json = $decoder->decode( $payload );
  if ( ! exists $json->{'ip'} ) {
    $missing_ip++;
    next;
  }
  print "$epoch\t$json->{'ip'}\n";
}

# Hadoop aggregates this counter from all tasks at the end of the job
print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
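
Since streaming mappers and reducers are just filters over STDIN/STDOUT, they
can be exercised locally with the same pipeline from the start of the talk,
before ever starting a cluster (file name illustrative):

$ cat events.log | ./mapper.pl | sort | ./reducer.pl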
Reducer
#!/usr/bin/env perl

use strict;
use warnings;
use lib qw/ locallib /;

use Geo::IP;                    # exports GEOIP_MEMORY_CACHE
use Regexp::Common qw/ net /;   # provides $RE{net}{IPv4}
use Readonly;

Readonly::Scalar my $TAB => "\t";
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
  or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my $time_slot;
my $previous_time_slot = -1;
Reducer
while ( <> ) {
  chomp;

  my @cols = split $TAB;
  if ( scalar @cols != 2 ) {
    $format_errors++;
    next;
  }
  # $time_slot was declared above the loop (no 'my' here), so the
  # final emit() after the loop can still see the last slot
  my $ip_addr;
  ( $time_slot, $ip_addr ) = @cols;
  if ( $previous_time_slot != -1 &&
       $time_slot != $previous_time_slot ) {
    # we've entered a new time slot, write the previous one out
    emit( $time_slot, $previous_time_slot );
  }

  if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
    $invalid_ip_address++;
    $previous_time_slot = $time_slot;
    next;
  }
Reducer
  my $geo_record = $geo->record_by_addr( $ip_addr );
  if ( ! defined $geo_record ) {
    $geo_lookup_errors++;
    $previous_time_slot = $time_slot;
    next;
  }

  # update entry for time slot with lat and lon

  $previous_time_slot = $time_slot;
} # while ( <> )

# flush the final time slot
emit( $time_slot + 1, $time_slot );

# Hadoop aggregates these counters from all tasks at the end of the job
print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
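
emit() is never shown on the slides; in a streaming reducer it is just a
tab-separated print to STDOUT, so a minimal sketch would be:

# minimal sketch of emit(): streaming output is one "key<TAB>value" line per record
sub emit {
    my ( $key, $value ) = @_;
    print "$key\t$value\n";
}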
( On data )
• Store it wisely, e.g., using a directory
  structure looking like the following to get
  free partitioning in Hive/others:
       s3://bucket/path/data/run_date=2011-11-12


• Don’t worry about getting the data out of
  S3; you can always write a simple job that
  does that and run it at the end of your
  workflow.
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
  reducer, sorted. Watch for the key
  changing.
• Use S3 for everything, and plan your
  dataflow ahead.
• Make carton a part of your life, and
  especially of your build tool’s.
( carton )
• Shipwright for humans
• Reads dependencies from Makefile.PL
• Installs them locally to your app
• Deploy your stuff, including carton.lock
• Run carton install --deployment
• Tar result and upload to S3
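
The bullets above map onto a short command sequence, roughly like this (the
tar layout and the S3 upload tool are assumptions; carton installs into
./local by default):

$ carton install                  # resolve dependencies from Makefile.PL
$ carton install --deployment     # on a clean checkout, replay carton.lock exactly
$ tar czf perllib.tgz -C local/lib/perl5 .
$ s3cmd put perllib.tgz s3://glowfish/lib/perllib.tgz

Packaged this way, the #locallib fragment from the Dependencies slide extracts
the archive next to the task and use lib qw/ locallib / finds the modules.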
URLs

• The MapReduce Paper
  http://labs.google.com/papers/mapreduce.html


• Apache Hadoop
  http://hadoop.apache.org/


• Amazon Elastic MapReduce
  http://aws.amazon.com/elasticmapreduce/
URLs

• Hadoop Streaming Tutorial (Apache)
  http://hadoop.apache.org/common/docs/r0.20.2/streaming.html



• Hadoop Streaming How-To (Amazon)
  http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/CreateJobFlowStreaming.html
URLs

• Amazon EMR Perl Client Library
  http://aws.amazon.com/code/Elastic-MapReduce/2309


• Amazon EMR Command-Line Tool
  http://aws.amazon.com/code/Elastic-MapReduce/2264
That’s All, Folks!

                Slides available at
http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce




   me@pedrofigueiredo.org


Editor's Notes

  2. Sort/shuffle between the two steps, guaranteeing that all mapper results for a single key go to the same reducer, and that workload is distributed evenly.
  4. The sorting guarantees that all values for a given key are sent to a single reducer.
  6. Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
  8. On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. 2 hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. 1 day to modify the Glow protocol, 1 day to build. Everything stored on S3.
  11. Serialisation, heartbeat, node management, directory, etc. Speculative task execution, first one to finish wins. Potentially very simple and contained code.
  12. You supply the mapper, reducer, and driver code.
  13. S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750MB of uncompressed data (110-byte rows -> ~7M rows/sec). All this is controlled using a REST API. Jobs are called 'steps' in EMR lingo.
  14. No way to customise the image and, e.g., install your own Perl. So it's a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.
  17. If you set a value to 0, you'll know that it's going to be the first (k, v) the reducer will see, 1 will be the second, etc. When the userid changes, it's a new user.
  18. E.g., no control over output file names, many of the API settings can't be configured programmatically (cmd-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won't be seeing any more of those. You might need to keep track of the current key, to use as the previous one.
  19. So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn't see it - no untar'ing!
  20. Can have multiple inputs.
  21. That -D is a Hadoop define, not a JVM system property definition.
  22. On a streaming job you specify the programs to use as mapper and reducer.
  25. In the unknown directory where the task is running, making it accessible to it.
  33. At the end of the job, Hadoop aggregates counters from all tasks.
  47. Hive partitioning.