Getting Started with Hadoop
                      with Amazon’s Elastic MapReduce

                           Scott Hendrickson
                           scott@drskippy.net
          http://drskippy.net/projects/EMR-HadoopMeetup.pdf

                                    Boulder/Denver Hadoop Meetup


                                           8 July 2010




Scott Hendrickson (Hadoop Meetup)            EMR-Hadoop            8 July 2010   1 / 43
Agenda


1   Amazon Web Services

2   Interlude: Solving problems with Map and Reduce

3   Running MapReduce on Amazon Elastic MapReduce
      Example 1: Streaming Work Flow with AWS Management Console
      Example 2 - Word count (Slightly more useful)
      Example 3 - elastic-mapreduce command line tool

4   References and Notes



Amazon Web Services


What is Amazon Web Services?


For a first Hadoop project on AWS, use these services:
       Elastic Compute Cloud (EC2)
       Amazon Simple Storage Service (S3)
       Elastic MapReduce (EMR)
For future projects, AWS is much more:
       SimpleDB, Relational Database Services
       Simple Queue Service (SQS), Simple Notification Service (SNS)
       Alexa
       Mechanical Turk
       ...





Signing up for AWS



   1   Create an AWS account - http://aws.amazon.com/
   2   Sign up for EC2 cloud compute services -
       http://aws.amazon.com/ec2/
   3   Set up Security Credentials (under menu Account|Security
       Credentials). There are 3 kinds of credentials; you need to create
       an “Access Key” and use it to access S3 storage
   4   Sign up for S3 storage services - http://aws.amazon.com/s3/
   5   Sign up for EMR - http://aws.amazon.com/elasticmapreduce/






What are S3 buckets?


Streaming EMR projects use Simple Storage Service (S3) Buckets for
data, code, logging and output.
            Bucket “A bucket is a container for objects stored in Amazon S3.
                   Every object is contained in a bucket.” Bucket names
                   must be globally unique.
            Object “Entities stored in Amazon S3. Objects consist of object
                   data and metadata.” Metadata consists of key-value pairs.
                   Object data is opaque.
    Object Key “An object is uniquely identified within a bucket by a key
                (name) and a version ID.”






Accessing objects in S3 buckets

Want to:
   1   Move data into and out of S3 buckets
   2   Set access privileges
Tools:
       S3 console in your AWS control panel is adequate for managing S3
       buckets and objects one at a time
       Other browser options: good for multiple file upload/download -
       Firefox S3
       https://addons.mozilla.org/en-US/firefox/addon/3247/ ; or
       minimal - S3 plug-in for Chrome
       https://chrome.google.com/extensions/detail/appeggcmoaojledegaonmdaakfhjhchf
       Programmatic options: Web Services (both SOAP-y and REST-ful):
       wget, curl, Python, Ruby, Java . . .



S3 Bucket Example 1 - RESTful GET

Example - Image object
Bucket: bsi-test
Key: image.jpg
Object: JPEG structured data from image.jpg
RESTful GET access, use URL:
http://s3.amazonaws.com/bsi-test/image.jpg

Example - Text file object
Bucket: bsi-test
Key: foobar
Object: text
RESTful GET access, use URL:
http://s3.amazonaws.com/bsi-test/foobar
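
Both examples follow the same pattern: endpoint, then bucket, then key. A minimal Python 3 sketch of that pattern is below. The helper name `s3_object_url` is made up for illustration, and an actual GET against the resulting URL only succeeds if the object is publicly readable.

```python
from urllib.parse import quote

S3_ENDPOINT = "http://s3.amazonaws.com"  # path-style endpoint used on the slide


def s3_object_url(bucket, key):
    """Build the path-style GET URL for an object (hypothetical helper)."""
    # Keys may contain characters that need percent-encoding; "/" is kept
    # because keys often look like directory paths.
    return "%s/%s/%s" % (S3_ENDPOINT, bucket, quote(key, safe="/"))


print(s3_object_url("bsi-test", "image.jpg"))
print(s3_object_url("bsi-test", "foobar"))
```

For a public object, `urllib.request.urlopen(url).read()` would then return the object data as bytes.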




S3 Bucket Example 2

Example - Python, Boto, Metadata
from boto.s3.connection import S3Connection

# Replace with your own AWS access key ID and secret access key
conn = S3Connection('key-id', 'secret-key')
bucket = conn.get_bucket('bsi-test')

# Image object: read one metadata value, then download the data
k = bucket.get_key('image.jpg')
print("Value for key 'x-amz-meta-s3fox-modifiedtime' is:")
print(k.get_metadata('s3fox-modifiedtime'))
k.get_contents_to_filename('deleteme.jpg')

# Text object: read the data and one metadata value
k = bucket.get_key('foobar')
print("Object value for key 'foobar' is:")
print(k.get_contents_as_string())
print("Value for key 'x-amz-meta-example-key' is:")
print(k.get_metadata('example-key'))


S3 Bucket Example 2



Example - Python, Boto, Metadata - Output

scott@mowgli-ubuntu:~/Dropbox/hadoop$ ./botoExample.py
Value for key 'x-amz-meta-s3fox-modifiedtime' is:
1273869756000
Object value for key 'foobar' is:
This is a test of S3
Value for key 'x-amz-meta-example-key' is:
This is an example value.






What is Elastic MapReduce?




                Hadoop   Hosted Hadoop framework running on EC2 and S3.
              Job Flow   Processing steps EMR “runs on a specified dataset
                         using a set of Amazon EC2 instances.”
          S3 Bucket(s)   Input data, output, scripts, jars, logs.






Controlling Job Flows

Want to:
   1   Configure jobs
   2   Start jobs
   3   Check status or stop jobs
Tools:
       AWS Management Console
       https://console.aws.amazon.com/elasticmapreduce/home
       Command Line Tools
       (requires Ruby [sudo apt-get install ruby libopenssl-ruby])
       http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262
       API calls defined by the service (REST-ful and SOAP-y)




EMR Example 1 - Running a simple Work Flow from the
AWS Management Console




EMR Example 1
                                        Hold up a minute. . . !

                                    What problem are we solving?




Interlude: Solving problems with Map and Reduce




Central MapReduce Ideas

       Operate on key-value pairs
       Data scientist provides map and reduce
       (input)  <k1, v1>  --map-->            <k2, v2>
                <k2, v2>  --combine, sort-->  <k2, v2>
                <k2, v2>  --reduce-->         <k3, v3>  (output)

       (Optional: a combine step provided in the map may significantly reduce
       bandwidth between workers)
       Efficient sort provided by the MapReduce library. Implies an efficient
       compare(k2a, k2b)
       “Implicit” parallelization - splitting and distributing data, starting
       maps, reduces, collecting output
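
The flow above can be sketched in a single process. This is only an illustration of the map, sort and reduce stages, not how Hadoop implements them. The function names and the word-length example are assumptions made for the sketch, and there is no combine step or parallelism here.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Single-process sketch: map, sort by key, then reduce each key group."""
    # Map: every input pair <k1, v1> may emit any number of <k2, v2> pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # Sort: bring equal keys together (the library's job on a real cluster).
    intermediate.sort(key=itemgetter(0))
    # Reduce: each key and its list of values yields <k3, v3> pairs.
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v2 for _, v2 in group]))
    return output


def length_mapper(line_number, line):
    # Emit <word length, 1> for every word on the line.
    return [(len(word), 1) for word in line.split()]


def count_reducer(length, ones):
    # Emit <word length, number of words with that length>.
    return [(length, sum(ones))]


lines = [(0, "map and reduce"), (1, "sort the keys")]
print(run_mapreduce(lines, length_mapper, count_reducer))
# → [(3, 3), (4, 2), (6, 1)]
```

Only `length_mapper` and `count_reducer` are specific to the problem; everything in `run_mapreduce` stands in for what the framework supplies.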


Key components of MapReduce framework


(wikipedia http://en.wikipedia.org/wiki/MapReduce)
The frozen part of the MapReduce framework is a large distributed sort.
The hot spots, which the application defines, are:
   1   input reader
   2   Map function
   3   partition function
   4   compare function
   5   Reduce function
   6   output writer






Google Tutorial View
   1   MapReduce library shards the input files and starts up many copies on
       a cluster.
   2   Master assigns work to workers. There are map and reduce tasks.
   3   Workers assigned map tasks read the contents of their input shard, parse
       key-value pairs and pass the pairs to the map function. Intermediate
       key-value pairs produced by the map function are buffered in memory.
   4   Periodically, buffered pairs are written to disk, partitioned into regions.
       Locations of buffered pairs on the local disk are passed to the master.
   5   When a reduce worker has read all intermediate data, it sorts by the
       intermediate keys. All occurrences of a key are grouped together.
   6   Reduce workers pass a key and the corresponding set of intermediate
       values to the reduce function.
   7   Output of the reduce function is appended to a final output file for
       each reduce partition.
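
Step 4 partitions the buffered pairs into regions, one per reduce task. A common choice is hash(key) mod R, which sends every occurrence of a key to the same reducer. The sketch below assumes that scheme. The function names are made up for illustration, and CRC32 is used only because it is stable across runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reduce partition for a key: hash(key) mod R."""
    # Python's built-in hash() of strings is randomized per process,
    # so a stable checksum is used here instead.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_into_regions(pairs, num_reducers):
    """Group intermediate pairs by the partition their key maps to."""
    regions = defaultdict(list)
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return dict(regions)


pairs = [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1)]
for region, members in sorted(split_into_regions(pairs, 2).items()):
    print(region, members)
```

Whatever the hash function, both `("World", 1)` pairs always land in the same region, which is what lets a single reduce worker see all values for a key.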


MapReduce Example 1 - Word Count - Data




(from Apache Hadoop tutorial)
Example: Word Count
file1:
Hello World Bye World
file2:
Hello Hadoop Goodbye Hadoop






MapReduce Example 1 - Word Count - Map


Example: Word Count
The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>





MapReduce Example 1 - Word Count - Sort and Combine


Example: Word Count
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>

The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>






MapReduce Example 1 - Word Count - Sort and Reduce



Example: Word Count
The Reducer method sums up the values for each key.

The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
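
The three stages of this word count can be replayed in a few lines of Python. This is a single-process sketch using the two files from the data slide; the function names are made up for illustration.

```python
from collections import Counter

files = {
    "file1": "Hello World Bye World",
    "file2": "Hello Hadoop Goodbye Hadoop",
}


def map_words(text):
    """Map: emit <word, 1> for each word."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combine: sum counts per word within one map's output, sorted by key."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(combined_outputs):
    """Reduce: sum the per-map counts for each word across all maps."""
    totals = Counter()
    for pairs in combined_outputs:
        for word, count in pairs:
            totals[word] += count
    return sorted(totals.items())


combined = [combine(map_words(text)) for text in files.values()]
for pairs in combined:
    print(pairs)
print(reduce_counts(combined))
```

The two combined lists match the sort-and-combine slide, and the final list matches the job output above.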






What problems is MapReduce good at solving?


Themes:
       Identify, transform, aggregate, filter, count, sort. . .
       Requirement of global knowledge of data is (a) “occasional” (vs. cost
       of map) (b) confined to ordinality
       Discovery tasks (vs. high repetition of similar transactional tasks,
       many reads)
       Unstructured data (vs. tabular, indexes!)
       Continuously updated data (indexing cost)
       Many, many, many machines (fault tolerance)






What problems is MapReduce good at solving?

Memes:
       MapReduce ⇔ SQL (read the comments too)
       http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
       MapReduce vs. Message Passing Interface (MPI) “MPI is good for
       task parallelism and Hadoop is good for Data Parallelism.” finite
       differences, finite elements, particle-in-cell. . .
       MapReduce vs. column-oriented DBs tabular data, indexes
       (cantankerous old farts!)
       http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
       and
       http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
       MapReduce vs. relational DBs
       http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php

Running MapReduce on Amazon Elastic MapReduce   Example 1: Streaming Work Flow with AWS Management Console




Example 1 - Add up integers

Data
3
4
-1
4
-3
1
1
...

Map
#!/usr/bin/env python
import sys
# Emit one "sum<TAB>integer" record per input line
for line in sys.stdin:
    print('%s%s%d' % ("sum", '\t', int(line)))



Example 1 - Add up integers

Reduce
#!/usr/bin/env python
import sys
sum_of_ints = 0
for line in sys.stdin:
    try:
        key, value = line.split('\t') # key is always the same
        sum_of_ints += int(value)
    except ValueError: # skip malformed records
        pass
try:
    print("%s%s%d" % (key, '\t', sum_of_ints))
except NameError: # No items processed
    pass




Example 1 - Add up integers



Shell test
cat ./input/ints.txt | ./mapper.py > ./inter
cat ./input/ints1.txt | ./mapper.py >> ./inter
cat ./input/ints2.txt | ./mapper.py >> ./inter
cat ./input/ints3.txt | ./mapper.py >> ./inter
echo "Intermediate output:"
cat ./inter
cat ./inter | sort | \
           ./reducer.py > ./output/cmdLineOutput.txt






Example 1 - Add up integers



What was that comment earlier about an optional combiner?
Combiner in map
#!/usr/bin/env python
import sys
# Sum this mapper's whole input split and emit a single record
sum_of_ints = 0
for line in sys.stdin:
    sum_of_ints += int(line)
print('%s%s%d' % ("sum", '\t', sum_of_ints))






Example 1 - Add up integers



Combiner shell test
cat ./input/ints.txt | ./mapper_combine.py > ./inter
cat ./input/ints1.txt | ./mapper_combine.py >> ./inter
cat ./input/ints2.txt | ./mapper_combine.py >> ./inter
cat ./input/ints3.txt | ./mapper_combine.py >> ./inter
echo "Intermediate output:"
cat ./inter
cat ./inter | sort | \
          ./reducer.py > ./output/cmdLineCombOutput.txt
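
The two shell tests should produce the same sum, but the combining mapper writes far fewer intermediate records. The sketch below shows this in a single process. The functions mirror the scripts' logic rather than calling them, and the two input splits are made-up stand-ins for the ints*.txt files.

```python
def mapper(lines):
    """Plain mapper: one intermediate record per input integer."""
    return ["sum\t%d" % int(line) for line in lines]


def mapper_combine(lines):
    """Combining mapper: one intermediate record per input split."""
    return ["sum\t%d" % sum(int(line) for line in lines)]


def reducer(records):
    """Reducer: add up the values of all intermediate records."""
    return sum(int(record.split("\t")[1]) for record in records)


# Two made-up input splits standing in for the ints*.txt files.
splits = [["3", "4", "-1", "4"], ["-3", "1", "1"]]

plain = [rec for split in splits for rec in mapper(split)]
combined = [rec for split in splits for rec in mapper_combine(split)]

print(len(plain), reducer(sorted(plain)))        # → 7 9
print(len(combined), reducer(sorted(combined)))  # → 2 9
```

On a cluster, the intermediate records are what travel between map and reduce workers, so the drop from one record per integer to one per split is where the bandwidth saving comes from.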






Example 1 - Add up integers - AWS Console


   1   Upload the oneCount directory with the Firefox S3 add-on (FFS3)
   2   Create a New Job Flow
       Name: “oneCount”
       Job Flow: Run own app
       Job Type: Streaming
   3   Input: bsi-test/oneCount/input
       Output: bsi-test/oneCount/outputConsole (must not exist)
       Mapper: bsi-test/oneCount/mapper.py
       Reducer: bsi-test/oneCount/reducer.py
       Extra Args: none






Example 1 - Add up integers - AWS Console



   4   Instances: 4
       Type: small
       Keypair: No (Yes allows ssh to Hadoop master)
       Log: yes
       Log Location: bsi-test/oneCount/log
       Hadoop Debug: no
   5   No bootstrap actions
   6   Start it, and wait. . .




Running MapReduce on Amazon Elastic MapReduce   Example 2 - Word count (Slightly more useful)




Example 2 - Word count


Map
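#!/usr/bin/env python
import string
import sys

def read_input(file):
    # Split each input line into whitespace-separated words
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Normalize case and strip surrounding punctuation
            lword = word.lower().strip(string.punctuation)
            print('%s%s%d' % (lword, separator, 1))

if __name__ == "__main__":
    main()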





Example 2 - Word count
Reduce
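#!/usr/bin/env python
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin,
                              separator=separator)
    # groupby only merges adjacent equal keys, so input must be sorted
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count)
                              for current_word, count in group)
            print("%s%s%d" % (current_word,
                              separator, total_count))
        except ValueError:
            # Skip groups whose count is not a number
            pass

if __name__ == "__main__":
    main()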


Example 2 - Word count




Shell test
echo "foo foo quux labs foo bar quux" | ./mapper.py
echo "foo foo quux labs foo bar quux" | ./mapper.py \
           | sort | ./reducer.py
cat ./input/alice.txt | ./mapper.py \
           | sort | ./reducer.py > ./output/cmdLineOutput.txt
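
The `sort` between mapper and reducer is not optional. The reducer relies on `itertools.groupby`, which only merges consecutive items that share a key. A small sketch of the difference follows; the helper name `count_groups` is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def count_groups(pairs):
    """Sum counts for each run of consecutive pairs that share a key."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, itemgetter(0))]


unsorted_pairs = [("foo", 1), ("quux", 1), ("foo", 1)]

print(count_groups(unsorted_pairs))          # "foo" shows up twice
print(count_groups(sorted(unsorted_pairs)))  # one total per word
```

On EMR the framework does this sort between the map and reduce phases, so the shell pipeline only needs it to imitate the cluster.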






Example 2 - Word count - AWS Console


   1   Upload the myWordCount directory with the Firefox S3 add-on (FFS3)
   2   Create a New Job Flow
       Name: “myWordCount”
       Job Flow: Run own app
       Job Type: Streaming
   3   Input: bsi-test/myWordCount/input
       Output: bsi-test/myWordCount/outputConsole (must not exist)
       Mapper: bsi-test/myWordCount/mapper.py
       Reducer: bsi-test/myWordCount/reducer.py
       Extra Args: none






Example 2 - Word count - AWS Console



   4   Instances: 4
       Type: small
       Keypair: No (Yes allows ssh to Hadoop master)
       Log: yes
       Log Location: bsi-test/myWordCount/log
       Hadoop Debug: no
   5   No bootstrap actions
   6   Start it, and wait. . .




Running MapReduce on Amazon Elastic MapReduce   Example 3 - elastic-mapreduce command line tool




Example 3 - elastic-mapreduce command line tool


Word count (again, only better)
/usr/local/emr-ruby/elastic-mapreduce --create \
      --stream \
      --num-instances 2 \
      --name from-elastic-mapreduce \
      --input s3n://bsi-test/myWordCount/input \
      --output s3n://bsi-test/myWordCount/outputRubyTool \
      --mapper s3n://bsi-test/myWordCount/mapper.py \
      --reducer s3n://bsi-test/myWordCount/reducer.py \
      --log-uri s3n://bsi-test/myWordCount/log

/usr/local/emr-ruby/elastic-mapreduce --list
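
When the same job flow is launched repeatedly, it can help to assemble the command from Python rather than retyping it. The sketch below only builds the argument list, using the flags and paths shown above. The helper name `build_create_command` and the assumption that all job files live under one S3 prefix are made up for illustration.

```python
def build_create_command(tool, name, bucket_prefix, num_instances=2):
    """Assemble the argument list for a streaming job flow (hypothetical helper)."""
    return [
        tool, "--create", "--stream",
        "--num-instances", str(num_instances),
        "--name", name,
        "--input", bucket_prefix + "/input",
        "--output", bucket_prefix + "/outputRubyTool",
        "--mapper", bucket_prefix + "/mapper.py",
        "--reducer", bucket_prefix + "/reducer.py",
        "--log-uri", bucket_prefix + "/log",
    ]


command = build_create_command(
    "/usr/local/emr-ruby/elastic-mapreduce",
    "from-elastic-mapreduce",
    "s3n://bsi-test/myWordCount",
)
print(" ".join(command))
```

Passing `command` to `subprocess.check_call` would start the job flow, assuming the tool is installed at that path and configured with credentials.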


References and Notes




MapReduce Concepts Links



       Google MapReduce Tutorial:
       http://code.google.com/edu/parallel/mapreduce-tutorial.html
       Apache Hadoop tutorial:
       http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
       Google Code University presentation on MapReduce:
       http://code.google.com/edu/submissions/mapreduce/listing.html
       MapReduce framework paper:
       http://labs.google.com/papers/mapreduce-osdi04.pdf






Amazon Web Services Links



       EMR Getting Started documentation:
       http://aws.amazon.com/documentation/elasticmapreduce/
       Getting started with Amazon S3:
       http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/
       PIG on EMR:
       http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html
       Boto Python library (multiple Amazon Services):
       http://code.google.com/p/boto/






Machine Learning




       Linear speedup (with processor number) for “locally weighted linear
       regression (LWLR), k-means, logistic regression (LR), naive Bayes
       (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM,
       and backpropagation (NN)”:
       http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
       Mahout framework: http://mahout.apache.org/






Examples Links



       Wordcount example/tutorial:
       http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
       CouchDB and MapReduce (interesting examples of MR
       implementations for common problems)
       http://wiki.apache.org/couchdb/View_Snippets
       This presentation:
       http://drskippy.net/projects/EMR-HadoopMeetup.pdf or
       presentation source, example files etc.:
       http://drskippy.net/projects/EMR-HadoopMeetup.zip




 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Amazon Elastic MapReduce -- Getting started with Hadoop

  • 1. Getting Started with Hadoop with Amazon’s Elastic MapReduce Scott Hendrickson scott@drskippy.net http://drskippy.net/projects/EMR-HadoopMeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson (Hadoop Meetup) EMR-Hadoop 8 July 2010 1 / 43
  • 2. Agenda
      1. Amazon Web Services
      2. Interlude: Solving problems with Map and Reduce
      3. Running MapReduce on Amazon Elastic MapReduce
           Example 1: Streaming Work Flow with AWS Management Console
           Example 2 - Word count (Slightly more useful)
           Example 3 - elastic-mapreduce command line tool
      4. References and Notes
  • 3. Amazon Web Services. What is Amazon Web Services?
      For a first Hadoop project on AWS, use these services:
      - Elastic Compute Cloud (EC2)
      - Amazon Simple Storage Service (S3)
      - Elastic MapReduce (EMR)
      For future projects, AWS offers much more:
      - SimpleDB, Relational Database Service (RDS)
      - Simple Queue Service (SQS), Simple Notification Service (SNS)
      - Alexa
      - Mechanical Turk
      - ...
  • 4. Amazon Web Services. Signing up for AWS
      1. Create an AWS account: http://aws.amazon.com/
      2. Sign up for EC2 cloud compute services: http://aws.amazon.com/ec2/
      3. Set up Security Credentials (under the menu Account | Security Credentials). There are 3 kinds of credentials; you need to create an “Access Key” and use it to access S3 storage.
      4. Sign up for S3 storage services: http://aws.amazon.com/s3/
      5. Sign up for EMR: http://aws.amazon.com/elasticmapreduce/
  • 5. Amazon Web Services. What are S3 buckets?
      Streaming EMR projects use Simple Storage Service (S3) buckets for data, code, logging and output.
      - Bucket: “A bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket.” Bucket names must be globally unique.
      - Object: “Entities stored in Amazon S3. Objects consist of object data and metadata.” Metadata consists of key-value pairs. Object data is opaque.
      - Object keys: “An object is uniquely identified within a bucket by a key (name) and a version ID.”
  • 6. Amazon Web Services. Accessing objects in S3 buckets
      Want to:
      1. Move data into and out of S3 buckets
      2. Set access privileges
      Tools:
      - The S3 console in your AWS control panel is adequate for managing S3 buckets and objects one at a time.
      - Other browser options: Firefox S3, good for multiple-file upload/download (https://addons.mozilla.org/en-US/firefox/addon/3247/); or, for a minimal option, the S3 plug-in for Chrome (https://chrome.google.com/extensions/detail/appeggcmoaojledegaonmdaakfhjhchf).
      - Programmatic options: Web Services (both SOAP-y and REST-ful) via wget, curl, Python, Ruby, Java...
  • 7. Amazon Web Services. S3 Bucket Example 1 - RESTful GET
      Example - Image object
        Bucket: bsi-test
        Key: image.jpg
        Object: JPEG-structured data from image.jpg
        RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/image.jpg
      Example - Text file object
        Bucket: bsi-test
        Key: foobar
        Object: text
        RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/foobar
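The URL pattern in the two examples above can be sketched in Python. This is only a minimal sketch: it assumes the path-style form shown on the slide (endpoint, then bucket, then key) and a publicly readable object. The helper names `s3_object_url` and `fetch_object` are hypothetical and not part of any AWS library.

```python
from urllib.parse import quote
from urllib.request import urlopen

S3_ENDPOINT = "http://s3.amazonaws.com"  # endpoint as shown on the slide


def s3_object_url(bucket, key):
    """Build the path-style GET URL for an object (hypothetical helper)."""
    # quote() percent-encodes characters that are unsafe in a URL path;
    # '/' is left alone so keys that look like paths still work.
    return "%s/%s/%s" % (S3_ENDPOINT, bucket, quote(key))


def fetch_object(bucket, key):
    """Return the object's bytes; assumes the object is publicly readable."""
    with urlopen(s3_object_url(bucket, key)) as response:
        return response.read()


# Only the URLs are printed here; fetch_object() would need network access.
print(s3_object_url("bsi-test", "image.jpg"))
print(s3_object_url("bsi-test", "foobar"))
```

Objects that are not public need a signed request, which is what the access key from the sign-up step is for.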
  • 8. Amazon Web Services. S3 Bucket Example 2. Example - Python, Boto, Metadata
        #!/usr/bin/env python
        from boto.s3.connection import S3Connection

        # Replace with your own access key ID and secret key
        conn = S3Connection('key-id', 'secret-key')
        bucket = conn.get_bucket('bsi-test')

        # Image object: read user metadata, then download the data
        k = bucket.get_key('image.jpg')
        print("Value for key 'x-amz-meta-s3fox-modifiedtime' is:")
        print(k.get_metadata('s3fox-modifiedtime'))
        k.get_contents_to_filename('deleteme.jpg')

        # Text object: read the data and user metadata
        k = bucket.get_key('foobar')
        print("Object value for key 'foobar' is:")
        print(k.get_contents_as_string())
        print("Value for key 'x-amz-meta-example-key' is:")
        print(k.get_metadata('example-key'))
  • 9. Amazon Web Services. S3 Bucket Example 2. Example - Python, Boto, Metadata - Output
        scott@mowgli-ubuntu:~/Dropbox/hadoop$ ./botoExample.py
        Value for key 'x-amz-meta-s3fox-modifiedtime' is:
        1273869756000
        Object value for key 'foobar' is:
        This is a test of S3
        Value for key 'x-amz-meta-example-key' is:
        This is an example value.
  • 10. Amazon Web Services. What is Elastic MapReduce?
      - Hadoop: Hosted Hadoop framework running on EC2 and S3.
      - Job Flow: Processing steps that EMR “runs on a specified dataset using a set of Amazon EC2 instances.”
      - S3 Bucket(s): Input data, output, scripts, jars, logs.
  • 11. Amazon Web Services. Controlling Job Flows
      Want to:
      1. Configure jobs
      2. Start jobs
      3. Check status or stop jobs
      Tools:
      - AWS Management Console: https://console.aws.amazon.com/elasticmapreduce/home
      - Command Line Tools (require Ruby [sudo apt-get install ruby libopenssl-ruby]): http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262
      - API calls defined by the service (REST-ful and SOAP-y)
  • 12. Amazon Web Services. EMR Example 1 - Running a simple Work Flow from the AWS Management Console
      Hold up a minute...! What problem are we solving?
  • 13. Interlude: Solving problems with Map and Reduce
  • 14. Interlude: Solving problems with Map and Reduce. Central MapReduce Ideas
      - Operate on key-value pairs.
      - The data scientist provides map and reduce:
          (input) <k1, v1> → map → <k2, v2>
          <k2, v2> → combine, sort → <k2, v2>
          <k2, v2> → reduce → <k3, v3> (output)
      - Optional: a combine step provided in map may significantly reduce bandwidth between workers.
      - An efficient sort is provided by the MapReduce library. This implies an efficient compare(k2a, k2b).
      - “Implicit” parallelization: splitting and distributing data, starting maps and reduces, collecting output.
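The three arrows above can be sketched as a single-process pipeline. This is a toy illustration under stated assumptions, not how Hadoop is implemented: `run_mapreduce`, `emit_length`, and `count_words` are hypothetical names, the input records are made up, and an in-memory sort stands in for the distributed sort the library provides.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map, sort by intermediate key, then reduce each key group."""
    # map: each <k1, v1> may emit any number of <k2, v2> pairs
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # sort: stands in for the sort supplied by the MapReduce library
    intermediate.sort(key=itemgetter(0))
    # reduce: each k2 with all of its values yields <k3, v3> pairs
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v2 for _, v2 in group]))
    return output


def emit_length(record_id, word):
    # <k1, v1> = <record id, word>  ->  <k2, v2> = <word length, word>
    return [(len(word), word)]


def count_words(length, words):
    # <k2, [v2...]>  ->  <k3, v3> = <word length, number of words>
    return [(length, len(words))]


# Hypothetical toy input: <record id, word>
records = [(0, "map"), (1, "sort"), (2, "reduce"), (3, "key")]
print(run_mapreduce(records, emit_length, count_words))
# → [(3, 2), (4, 1), (6, 1)]
```

Only `map_fn` and `reduce_fn` change from problem to problem; the sort and grouping in the middle stay fixed.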
  • 15. Interlude: Solving problems with Map and Reduce. Key components of the MapReduce framework (Wikipedia, http://en.wikipedia.org/wiki/MapReduce)
      The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
      1. input reader
      2. Map function
      3. partition function
      4. compare function
      5. Reduce function
      6. output writer
  • 16. Interlude: Solving problems with Map and Reduce. Google Tutorial View
      1. The MapReduce library shards the input files and starts up many copies of the program on a cluster.
      2. The master assigns work to workers. There are map tasks and reduce tasks.
      3. Workers assigned map tasks read the contents of their input shard, parse key-value pairs and pass the pairs to the map function. Intermediate key-value pairs produced by the map function are buffered in memory.
      4. Periodically, buffered pairs are written to local disk, partitioned into regions. The locations of these pairs on the local disk are passed to the master.
      5. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. All occurrences of a key are grouped together.
      6. Reduce workers pass each key and the corresponding set of intermediate values to the reduce function.
      7. Output of the reduce function is appended to a final output file for each reduce partition.
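Steps 4 and 5 above rely on a partition function that decides which region, and therefore which reduce task, receives each intermediate key. A common choice is a hash of the key modulo the number of reduce tasks. The sketch below assumes that choice and uses `zlib.crc32` as the hash; the function names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map an intermediate key to a region index in [0, num_reducers)."""
    # crc32 is used because it is stable across processes; Python's built-in
    # hash() for strings is randomized per interpreter run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Split intermediate <key, value> pairs into one list per reduce task."""
    regions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return regions


pairs = [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1)]
for index, region in enumerate(partition_pairs(pairs, 3)):
    print(index, region)
```

Because the region depends only on the key, every occurrence of a key reaches the same reduce worker, which is what lets step 5 group them.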
  • 17. Interlude: Solving problems with Map and Reduce. MapReduce Example 1 - Word Count - Data (from the Apache Hadoop tutorial)
        file1: Hello World Bye World
        file2: Hello Hadoop Goodbye Hadoop
  • 18. Interlude: Solving problems with Map and Reduce. MapReduce Example 1 - Word Count - Map
      The first map emits:
        <Hello, 1> <World, 1> <Bye, 1> <World, 1>
      The second map emits:
        <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
  • 19. Interlude: Solving problems with Map and Reduce. MapReduce Example 1 - Word Count - Sort and Combine
      The output of the first map, after sorting and combining:
        <Bye, 1> <Hello, 1> <World, 2>
      The output of the second map, after sorting and combining:
        <Goodbye, 1> <Hadoop, 2> <Hello, 1>
  • 20. Interlude: Solving problems with Map and Reduce. MapReduce Example 1 - Word Count - Sort and Reduce
      The Reducer method sums up the values for each key. The output of the job is:
        <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
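The map, combine, and reduce stages of this word count can be replayed in a few lines of Python. This is a minimal single-process sketch using the two input files from the example; the helper names `map_words` and `sum_by_key` are hypothetical. Because combining and reducing are the same operation here (sum the values per key), one function serves both stages.

```python
from itertools import groupby
from operator import itemgetter


def map_words(text):
    """Emit <word, 1> for every word, as the map step does."""
    return [(word, 1) for word in text.split()]


def sum_by_key(pairs):
    """Sort by key and sum the values of each key (combine or reduce)."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, sum(value for _, value in group))
            for key, group in groupby(ordered, key=itemgetter(0))]


file1 = "Hello World Bye World"
file2 = "Hello Hadoop Goodbye Hadoop"

combined1 = sum_by_key(map_words(file1))  # per-map combine
combined2 = sum_by_key(map_words(file2))
result = sum_by_key(combined1 + combined2)  # reduce over all map outputs

print(combined1)  # → [('Bye', 1), ('Hello', 1), ('World', 2)]
print(combined2)  # → [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]
print(result)
# → [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```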
  • 21. Interlude: Solving problems with Map and Reduce. What problems is MapReduce good at solving? Themes:
      - Identify, transform, aggregate, filter, count, sort...
      - Requirement of global knowledge of data is (a) “occasional” (vs. cost of map) (b) confined to ordinality
      - Discovery tasks (vs. high repetition of similar transactional tasks, many reads)
      - Unstructured data (vs. tabular, indexes!)
      - Continuously updated data (indexing cost)
      - Many, many, many machines (fault tolerance)
  • 22. Interlude: Solving problems with Map and Reduce. What problems is MapReduce good at solving? Memes:
      - MapReduce ⇔ SQL (read the comments too): http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
      - MapReduce vs. Message Passing Interface (MPI): “MPI is good for task parallelism and Hadoop is good for Data Parallelism.” Finite differences, finite elements, particle-in-cell...
      - MapReduce vs. column-oriented DBs: tabular data, indexes (cantankerous old farts!): http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/ and http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
      - MapReduce vs. relational DBs: http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
  • 23. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console
  • 24. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers
      Data (one integer per line): 3, 4, -1, 4, -3, 1, 1, ...
      Map:
        #!/usr/bin/env python
        import sys

        # Emit every integer under the single key "sum", tab-separated
        for line in sys.stdin:
            print('%s%s%d' % ("sum", '\t', int(line)))
  • 25. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers
      Reduce:
        #!/usr/bin/env python
        import sys

        sum_of_ints = 0
        for line in sys.stdin:
            key, value = line.split('\t')  # key is always the same
            try:
                sum_of_ints += int(value)
            except ValueError:
                pass  # skip values that are not integers
        try:
            print("%s%s%d" % (key, '\t', sum_of_ints))
        except NameError:  # No items processed, so key was never set
            pass
  • 26. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers
      Shell test:
        cat ./input/ints.txt | ./mapper.py > ./inter
        cat ./input/ints1.txt | ./mapper.py >> ./inter
        cat ./input/ints2.txt | ./mapper.py >> ./inter
        cat ./input/ints3.txt | ./mapper.py >> ./inter
        echo "Intermediate output:"
        cat ./inter
        cat ./inter | sort | ./reducer.py > ./output/cmdLineOutput.txt
  • 27. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers
      What was that comment earlier about an optional combiner?
      Combiner in map:
        #!/usr/bin/env python
        import sys

        # Sum locally and emit one pair per input shard instead of one per line
        sum_of_ints = 0
        for line in sys.stdin:
            sum_of_ints += int(line)
        print('%s%s%d' % ("sum", '\t', sum_of_ints))
  • 28. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers
      Combiner shell test:
        cat ./input/ints.txt | ./mapper_combine.py > ./inter
        cat ./input/ints1.txt | ./mapper_combine.py >> ./inter
        cat ./input/ints2.txt | ./mapper_combine.py >> ./inter
        cat ./input/ints3.txt | ./mapper_combine.py >> ./inter
        echo "Intermediate output:"
        cat ./inter
        cat ./inter | sort | ./reducer.py > ./output/cmdLineCombOutput.txt
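The effect of the combiner can be checked in-process. The sketch below is an illustration only: the first shard reuses the integers shown in the data example, the other two shards are made-up values, and the function names are hypothetical. It compares the number of intermediate records and the final sum with and without combining in the map.

```python
def plain_map(shard):
    """One <"sum", n> pair per input integer, like mapper.py."""
    return [("sum", n) for n in shard]


def combining_map(shard):
    """One pre-summed pair per shard, like mapper_combine.py."""
    return [("sum", sum(shard))]


def reduce_sum(pairs):
    """Add up all values; the key is always "sum"."""
    return sum(value for _, value in pairs)


# First shard from the data example; the other two are hypothetical.
shards = [[3, 4, -1, 4, -3, 1, 1], [10, -2, 5], [7, 7]]

plain = [pair for shard in shards for pair in plain_map(shard)]
combined = [pair for shard in shards for pair in combining_map(shard)]

print(len(plain), reduce_sum(plain))        # → 12 36
print(len(combined), reduce_sum(combined))  # → 3 36
```

The result is unchanged because addition is associative, while the number of pairs sent to the reducer drops from one per integer to one per shard.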
  • 29. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers - AWS Console
      1. Upload the oneCount directory with FFS3 (the Firefox S3 add-on)
      2. Create a New Job Flow
           Name: “oneCount”
           Job Flow: Run own app
           Job Type: Streaming
      3. Input: bsi-test/oneCount/input
         Output: bsi-test/oneCount/outputConsole (must not exist)
         Mapper: bsi-test/oneCount/mapper.py
         Reducer: bsi-test/oneCount/reducer.py
         Extra Args: none
  • 30. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console. Example 1 - Add up integers - AWS Console
      4. Instances: 4
         Type: small
         Keypair: No (Yes allows ssh to the Hadoop master)
         Log: yes
         Log Location: bsi-test/oneCount/log
         Hadoop Debug: no
      5. No bootstrap actions
      6. Start it, and wait...
  • 31. Running MapReduce on Amazon Elastic MapReduce. Example 2 - Word count (Slightly more useful)
  • 32. Running MapReduce on Amazon Elastic MapReduce. Example 2 - Word count (Slightly more useful). Example 2 - Word count
      Map:
        #!/usr/bin/env python
        import string
        import sys

        def read_input(file):
            # Split each input line into a list of words
            for line in file:
                yield line.split()

        def main(separator='\t'):
            data = read_input(sys.stdin)
            for words in data:
                for word in words:
                    # Normalize: lower-case, strip surrounding punctuation
                    lword = word.lower().strip(string.punctuation)
                    print('%s%s%d' % (lword, separator, 1))

        if __name__ == "__main__":
            main()
  • 33. Running MapReduce on Amazon Elastic MapReduce. Example 2 - Word count (Slightly more useful). Example 2 - Word count
      Reduce:
        #!/usr/bin/env python
        import sys
        from itertools import groupby
        from operator import itemgetter

        def read_mapper_output(file, separator='\t'):
            # Parse "word<TAB>count" lines into [word, count] pairs
            for line in file:
                yield line.rstrip().split(separator, 1)

        def main(separator='\t'):
            data = read_mapper_output(sys.stdin, separator=separator)
            # groupby relies on the input already being sorted by word
            for current_word, group in groupby(data, itemgetter(0)):
                try:
                    total_count = sum(int(count) for current_word, count in group)
                    print("%s%s%d" % (current_word, separator, total_count))
                except ValueError:
                    pass  # skip groups whose counts are not integers

        if __name__ == "__main__":
            main()
  • 34. Running MapReduce on Amazon Elastic MapReduce. Example 2 - Word count (Slightly more useful). Example 2 - Word count
      Shell test:
        echo "foo foo quux labs foo bar quux" | ./mapper.py
        echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
        cat ./input/alice.txt | ./mapper.py | sort | ./reducer.py > ./output/cmdLineOutput.txt
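The same pipeline as the shell test (mapper, sort, reducer) can be exercised without a shell. This is a rough sketch that re-implements the mapper and reducer logic as generators over tab-separated lines, mirroring the streaming text protocol; the names `mapper`, `reducer`, and `local_job` are hypothetical, and Python's `sorted` stands in for the `sort` command.

```python
import string
from itertools import groupby


def mapper(lines):
    """Yield "word<TAB>1" for each normalized word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower().strip(string.punctuation), 1)


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted, like reducer.py."""
    parsed = (line.rstrip().split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def local_job(lines):
    """mapper | sort | reducer, all in one process."""
    return list(reducer(sorted(mapper(lines))))


print("\n".join(local_job(["foo foo quux labs foo bar quux"])))
# Prints four tab-separated lines: bar 1, foo 3, labs 1, quux 2
```

A check like this catches parsing and grouping mistakes before any EC2 instances are started.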
  • 35. Running MapReduce on Amazon Elastic MapReduce. Example 2 - Word count (Slightly more useful). Example 2 - Word count - AWS Console
      1. Upload the myWordCount directory with FFS3 (the Firefox S3 add-on)
      2. Create a New Job Flow
           Name: “myWordCount”
           Job Flow: Run own app
           Job Type: Streaming
      3. Input: bsi-test/myWordCount/input
         Output: bsi-test/myWordCount/outputConsole (must not exist)
         Mapper: bsi-test/myWordCount/mapper.py
         Reducer: bsi-test/myWordCount/reducer.py
         Extra Args: none
  • 36. Running MapReduce on Amazon Elastic MapReduce. Example 2 - Word count (Slightly more useful). Example 2 - Word count - AWS Console
      4. Instances: 4
         Type: small
         Keypair: No (Yes allows ssh to the Hadoop master)
         Log: yes
         Log Location: bsi-test/myWordCount/log
         Hadoop Debug: no
      5. No bootstrap actions
      6. Start it, and wait...
  • 37. Running MapReduce on Amazon Elastic MapReduce. Example 3 - elastic-mapreduce command line tool
  • 38. Running MapReduce on Amazon Elastic MapReduce. Example 3 - elastic-mapreduce command line tool
      Word count (again, only better):
        /usr/local/emr-ruby/elastic-mapreduce --create --stream \
          --num-instances 2 \
          --name from-elastic-mapreduce \
          --input s3n://bsi-test/myWordCount/input \
          --output s3n://bsi-test/myWordCount/outputRubyTool \
          --mapper s3n://bsi-test/myWordCount/mapper.py \
          --reducer s3n://bsi-test/myWordCount/reducer.py \
          --log-uri s3n://bsi-test/myWordCount/log
        /usr/local/emr-ruby/elastic-mapreduce --list
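When the same job is launched repeatedly with different outputs, the command above can be assembled from a script. The sketch below only builds the argument list, using exactly the flags and install path shown on the slide; the helper names and the S3 layout convention (input, mapper.py, reducer.py, and log under one job prefix) are assumptions taken from this example, not features of the tool.

```python
import subprocess

EMR_TOOL = "/usr/local/emr-ruby/elastic-mapreduce"  # install path from the slide


def create_streaming_command(bucket, job, output_dir, num_instances=2,
                             name="from-elastic-mapreduce"):
    """Build the argument list for a streaming job flow (hypothetical helper)."""
    base = "s3n://%s/%s" % (bucket, job)
    return [
        EMR_TOOL, "--create", "--stream",
        "--num-instances", str(num_instances),
        "--name", name,
        "--input", base + "/input",
        "--output", "%s/%s" % (base, output_dir),  # must not exist yet
        "--mapper", base + "/mapper.py",
        "--reducer", base + "/reducer.py",
        "--log-uri", base + "/log",
    ]


def submit(command):
    """Run the tool; needs the Ruby client installed and configured."""
    return subprocess.check_output(command)


command = create_streaming_command("bsi-test", "myWordCount", "outputRubyTool")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems; `submit(command)` is left uncalled here because it starts billable instances.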
  • 39. References and Notes
  • 40. References and Notes. MapReduce Concepts Links
      - Google MapReduce Tutorial: http://code.google.com/edu/parallel/mapreduce-tutorial.html
      - Apache Hadoop tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
      - Google Code University presentation on MapReduce: http://code.google.com/edu/submissions/mapreduce/listing.html
      - MapReduce framework paper: http://labs.google.com/papers/mapreduce-osdi04.pdf
  • 41. References and Notes. Amazon Web Services Links
      - EMR Getting Started documentation: http://aws.amazon.com/documentation/elasticmapreduce/
      - Getting started with Amazon S3: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/
      - PIG on EMR: http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html
      - Boto Python library (multiple Amazon Services): http://code.google.com/p/boto/
  • 42. References and Notes. Machine Learning
      - Linear speedup (with processor number) for “locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN)”: http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
      - Mahout framework: http://mahout.apache.org/
  • 43. References and Notes. Examples Links
      - Wordcount example/tutorial: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
      - CouchDB and MapReduce (interesting examples of MR implementations for common problems): http://wiki.apache.org/couchdb/View_Snippets
      - This presentation: http://drskippy.net/projects/EMR-HadoopMeetup.pdf
      - Presentation source, example files etc.: http://drskippy.net/projects/EMR-HadoopMeetup.zip