This document provides an introduction to using Amazon's Elastic MapReduce (EMR) service for data intensive computing. It covers uploading data to S3 storage, writing mappers and reducers with Hadoop Streaming in languages such as Python, and executing a MapReduce job on EMR to process the data in parallel across a cluster of Amazon EC2 instances. The key steps are loading input data into S3, defining the mapper and reducer processing logic, running the job, and downloading the outputs from S3 upon completion.
Amazon EMR Streaming Guide
1. An Introduction to Data Intensive Computing
Appendix A: Amazon's Elastic MapReduce
Robert Grossman, University of Chicago, Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
2. Section A1: Hadoop Streaming
See http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
3. Basic Idea
• With Hadoop streams you can run any program as the Mapper and the Reducer.
• For example, you can run Python and Perl code.
• You can also run standard Unix utilities.
• With streams, Mappers and Reducers use standard input and standard output.
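As a minimal sketch of that contract (not from the slides), any executable that reads lines from standard input and writes lines to standard output can serve as a streaming Mapper or Reducer. The script below simply echoes its input unchanged:

#!/usr/bin/python
# Minimal identity mapper/reducer: copy stdin to stdout, one line at a time.
import sys

for line in sys.stdin:
    sys.stdout.write(line)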
4. Mappers for Streams
• As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process.
• The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab) is the value.
• This default can be changed.
5. Reducers for Streams
• As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.
• The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
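As a small illustration of the default convention described above (the sample line is made up, not taken from the slides), the split into key and value looks like this in Python:

# Everything before the first tab is the key; the rest of the line is the value.
line = "hello\t1"
key, _, value = line.partition("\t")
print(key)    # prints: hello
print(value)  # prints: 1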
6. Example

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

• Here the Unix utilities cat and wc are the Mapper and Reducer.
8. S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
• If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt
9. S3 Security
• AWS access key (user name). This functions as your S3 username: an alphanumeric text string that uniquely identifies users.
• AWS secret key (functions as your password).
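As a hedged sketch of how these two credentials are used from code (assuming the boto library, a common Python interface to AWS in 2011; the key strings below are placeholders):

from boto.s3.connection import S3Connection

# Both values come from your AWS account; never commit them to source control.
conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
print(conn.get_all_buckets())  # lists the buckets this account can access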
18. Step 1b. Upload Data Into the S3 Bucket
• This can be done from the AWS Console.
• This can also be done using command line tools (a scripted alternative is sketched below).
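A minimal upload sketch, assuming the boto library and the tutorial.osdc.org bucket used elsewhere in these slides (dataset1.txt is a placeholder local file, and the credentials are placeholders):

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")   # the bucket must already exist

k = Key(bucket)
k.key = "dataset1.txt"                          # name of the object inside the bucket
k.set_contents_from_filename("dataset1.txt")    # upload the local file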
19. Step 2a. Write a Mapper

#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                # Emit "LongValueSum:<word>", a tab, and a count of 1 for each word.
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
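To see what the Mapper emits, here is a small local check of the same logic on one made-up input line (the sample text is not from the slides); the Aggregate reducer will later sum the 1s for each key:

import re

pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
line = "Hello world, hello EMR"
for word in pattern.findall(line):
    print("LongValueSum:" + word.lower() + "\t" + "1")

# Expected output (tab-separated):
# LongValueSum:hello    1
# LongValueSum:world    1
# LongValueSum:hello    1
# LongValueSum:emr      1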
20. Step 2b. Upload the Mapper to S3
• This Mapper is already in S3 at this location:
  s3://elasticmapreduce/samples/wordcount/wordSplitter.py
  so we don't need to upload it.
21. Step 3a. Write a Reducer

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]             # drop the trailing newline
            fields = line.split("\t")    # the key is everything before the first tab
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except EOFError:
        return None
22. Step 3a. Write a Reducer (cont'd)

#!/usr/bin/python
import sys

def generateLongCountToken(id):
    # Prefix the key with "LongValueSum:" so the Aggregate library sums the counts.
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]             # drop the trailing newline
            fields = line.split("\t")    # the key is everything before the first tab
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
23. Step 3b. Upload Reducer to S3
myAggregatorForKeyCount.py
• This is a standard Reducer and part of a standard Hadoop library called Aggregate, so we don't need to upload it, just invoke it.
24. Hadoop Library Aggregate
To use Aggregate, simply specify "-reducer aggregate":

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py \
    -jobconf mapred.reduce.tasks=12
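The same kind of streaming job can also be submitted to Elastic MapReduce programmatically. The following is a hedged sketch, assuming boto's EMR support (boto 2.x); the sample mapper path is from Step 2b, the output path matches the one used in Step 6, and the credentials and input path are placeholders:

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# One streaming step: the sample word-splitting mapper plus the built-in
# aggregate reducer.
step = StreamingStep(
    name="Wordcount streaming step",
    mapper="s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
    reducer="aggregate",
    input="s3://elasticmapreduce/samples/wordcount/input",
    output="s3://tutorial.osdc.org/wordcount/output/2011-06-26",
)

conn = EmrConnection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
jobflow_id = conn.run_jobflow(
    name="Wordcount example",
    log_uri="s3://tutorial.osdc.org/wordcount/logs",
    steps=[step],
    num_instances=3,   # one master plus two core nodes
)
print(jobflow_id)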
32. Step 6. The Output Data is in S3
• The output is in files labeled part-00000, part-00001, etc.
• Recall we specified the bucket plus folders: tutorial.osdc.org/wordcount/output/2011-06-26
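A hedged sketch of listing those part files with boto (the bucket and folder names follow this slide; the credentials are placeholders):

from boto.s3.connection import S3Connection

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")
for key in bucket.list(prefix="wordcount/output/2011-06-26/"):
    print(key.name)  # e.g. wordcount/output/2011-06-26/part-00000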
33. Step 6. Download the Data From S3
• You can leave the data in S3 and work with it.
• You can download it with command line tools:
  aws get tutorial.osdc.org/wordcount/output/2011-06-26/part-00000 part00000
• You can download it with the S3 AWS Console.
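A scripted alternative to the command-line download above, again a hedged boto sketch with placeholder credentials:

from boto.s3.connection import S3Connection

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")
key = bucket.get_key("wordcount/output/2011-06-26/part-00000")
key.get_contents_to_filename("part00000")  # save the object to a local file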
34. Step 7. Remove Any Unnecessary Files
• You will be charged for all files that remain in S3, so remove any unnecessary ones.
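One way to do the cleanup is again with boto; this is a hedged sketch that deletes everything under the output folder used earlier (double-check the prefix before running anything like this):

from boto.s3.connection import S3Connection

conn = S3Connection("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")
bucket = conn.get_bucket("tutorial.osdc.org")
for key in bucket.list(prefix="wordcount/output/2011-06-26/"):
    bucket.delete_key(key.name)  # removes the object from S3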
35. Questions?
For the most current version of these notes, see rgrossman.com