This document provides an introduction to using Amazon's Elastic MapReduce (EMR) service for data intensive computing. It covers uploading data to S3 storage, writing mappers and reducers with Hadoop Streaming in languages such as Python, and executing a MapReduce job on EMR to process the data in parallel across a cluster of Amazon EC2 instances. The key steps are loading input data into S3, defining the mapper and reducer processing logic, running the job, and downloading the outputs from S3 upon completion.
Amazon EMR Streaming Guide
1. An Introduction to Data Intensive Computing
Appendix A: Amazon's Elastic MapReduce
Robert Grossman, University of Chicago, Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
2. Section A1: Hadoop Streaming
See http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
3. Basic Idea
• With Hadoop streams you can run any program as the Mapper and the Reducer.
• For example, you can run Python and Perl code.
• You can also run standard Unix utilities.
• With streams, Mappers and Reducers use standard input and standard output.
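As a minimal sketch of that contract (not from the slides), any executable that reads lines from standard input and writes lines to standard output can serve as a streaming Mapper or Reducer. The script below simply echoes its input unchanged:

#!/usr/bin/python
# Minimal identity mapper/reducer: copy stdin to stdout, one line at a time.
import sys

for line in sys.stdin:
    sys.stdout.write(line)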
4. Mappers for Streams
• As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process.
• The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab) is the value.
• This default can be changed.
5. Reducers for Streams
• As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.
• The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
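As a small illustration of the default convention described above (the sample line is made up, not taken from the slides), the split into key and value looks like this in Python:

# Everything before the first tab is the key; the rest of the line is the value.
line = "hello\t1"
key, _, value = line.partition("\t")
print(key)    # prints: hello
print(value)  # prints: 1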
6. Example

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

• Here the Unix utilities cat and wc are the Mapper and Reducer.
8. S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
• If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt
9. S3 Security
• AWS access key (user name). This functions as your S3 username: an alphanumeric text string that uniquely identifies users.
• AWS secret key (functions as your password).
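As a hedged sketch of how these two credentials are used from code (assuming the boto library, a common Python interface to AWS in 2011; the key strings below are placeholders):

from boto.s3.connection import S3Connection

# Both values come from your AWS account; never commit them to source control.
conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
print(conn.get_all_buckets())  # lists the buckets this account can access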
18. Step 1b. Upload Data Into the S3 Bucket
• This can be done from the AWS Console.
• This can also be done using command line tools (a scripted alternative is sketched below).
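A minimal upload sketch, assuming the boto library and the tutorial.osdc.org bucket used elsewhere in these slides (dataset1.txt is a placeholder local file, and the credentials are placeholders):

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")   # the bucket must already exist

k = Key(bucket)
k.key = "dataset1.txt"                          # name of the object inside the bucket
k.set_contents_from_filename("dataset1.txt")    # upload the local file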
19. Step 2a. Write a Mapper

#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                # Emit "LongValueSum:<word>", a tab, and a count of 1 for each word.
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
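To see what the Mapper emits, here is a small local check of the same logic on one made-up input line (the sample text is not from the slides); the Aggregate reducer will later sum the 1s for each key:

import re

pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
line = "Hello world, hello EMR"
for word in pattern.findall(line):
    print("LongValueSum:" + word.lower() + "\t" + "1")

# Expected output (tab-separated):
# LongValueSum:hello    1
# LongValueSum:world    1
# LongValueSum:hello    1
# LongValueSum:emr      1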
20. Step 2b. Upload the Mapper to S3
• This Mapper is already in S3 at this location:
  s3://elasticmapreduce/samples/wordcount/wordSplitter.py
  so we don't need to upload it.
21. Step 3a. Write a Reducer

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]             # drop the trailing newline
            fields = line.split("\t")    # the key is everything before the first tab
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except EOFError:
        return None
22. Step 3a. Write a Reducer (cont'd)

#!/usr/bin/python
import sys

def generateLongCountToken(id):
    # Prefix the key with "LongValueSum:" so the Aggregate library sums the counts.
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]             # drop the trailing newline
            fields = line.split("\t")    # the key is everything before the first tab
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
23. Step 3b. Upload Reducer to S3
myAggregatorForKeyCount.py
• This is a standard Reducer and part of a standard Hadoop library called Aggregate, so we don't need to upload it, just invoke it.
24. Hadoop Library Aggregate
To use Aggregate, simply specify "-reducer aggregate":

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py \
    -jobconf mapred.reduce.tasks=12
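The same kind of streaming job can also be submitted to Elastic MapReduce programmatically. The following is a hedged sketch, assuming boto's EMR support (boto 2.x); the sample mapper path is from Step 2b, the output path matches the one used in Step 6, and the credentials and input path are placeholders:

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# One streaming step: the sample word-splitting mapper plus the built-in
# aggregate reducer.
step = StreamingStep(
    name="Wordcount streaming step",
    mapper="s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
    reducer="aggregate",
    input="s3://elasticmapreduce/samples/wordcount/input",
    output="s3://tutorial.osdc.org/wordcount/output/2011-06-26",
)

conn = EmrConnection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
jobflow_id = conn.run_jobflow(
    name="Wordcount example",
    log_uri="s3://tutorial.osdc.org/wordcount/logs",
    steps=[step],
    num_instances=3,   # one master plus two core nodes
)
print(jobflow_id)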
32. Step 6. The Output Data is in S3
• The output is in files labeled part-00000, part-00001, etc.
• Recall we specified the bucket plus folders: tutorial.osdc.org/wordcount/output/2011-06-26
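A hedged sketch of listing those part files with boto (the bucket and folder names follow this slide; the credentials are placeholders):

from boto.s3.connection import S3Connection

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")
for key in bucket.list(prefix="wordcount/output/2011-06-26/"):
    print(key.name)  # e.g. wordcount/output/2011-06-26/part-00000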
33. Step 6. Download the Data From S3
• You can leave the data in S3 and work with it.
• You can download it with command line tools:
  aws get tutorial.osdc.org/wordcount/output/2011-06-26/part-00000 part00000
• You can download it with the S3 AWS Console.
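A scripted alternative to the command-line download above, again a hedged boto sketch with placeholder credentials:

from boto.s3.connection import S3Connection

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")
key = bucket.get_key("wordcount/output/2011-06-26/part-00000")
key.get_contents_to_filename("part00000")  # save the object to a local file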
34. Step 7. Remove Any Unnecessary Files
• You will be charged for all files that remain in S3, so remove any unnecessary ones.
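One way to do the cleanup is again with boto; this is a hedged sketch that deletes everything under the output folder used earlier (double-check the prefix before running anything like this):

from boto.s3.connection import S3Connection

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")
for key in bucket.list(prefix="wordcount/output/2011-06-26/"):
    bucket.delete_key(key.name)  # removes the object from S3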
35. Questions?
For the most current version of these notes, see rgrossman.com