The document discusses lessons learned from scaling Hadoop and big data processing on Amazon EMR. It describes how EMR provides a scalable and cost-effective way to run Hadoop jobs in the cloud without having to manage infrastructure. While EMR enables bootstrapping large clusters easily, performance can vary due to network issues and disk I/O constraints of different instance types. The document outlines best practices for optimizing Hadoop jobs and tuning configurations on EMR.
6. High Level Architecture
[Architecture diagram: multiple data sources → ingest → data storage → processing (Hadoop jobs) → output datasets, feeding a location API / intelligence layer]
7. Our Challenges
• Multiple data sources – social, retail, events, news, census, location etc.
• Spatial data analysis and querying – location overlay on data
• Temporal nature of the input datasets
• Large input data sets and hundreds of GB of compressed inputs for jobs
• Complex processing and business logic based on use cases
• Custom output data formats – JSON, XML, XLS, flat files etc.
8. Why Amazon EMR?
I am interested in using Hadoop to solve business problems
and not in building and managing Hadoop infrastructure!
Scalable Storage – S3
Flexible Computing – EC2
No Hadoop Management – EMR
10. How to move existing data to Cloud?
10's of GB – Direct upload using any S3 tools
100's of GB – Any S3 tools, Tsunami
> Terabyte – AWS Import/Export, Aspera, Tsunami
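As a minimal sketch of the "any S3 tools" option, a multipart upload through the AWS SDK for Java's TransferManager copes with multi-GB files without manual chunking. The bucket, key and file paths below are placeholders, not taken from the deck.

```java
import java.io.File;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class S3Uploader {
    public static void main(String[] args) throws Exception {
        // TransferManager splits large files into multipart uploads automatically.
        // Bucket/key/file names are illustrative placeholders.
        TransferManager tm = new TransferManager(new AmazonS3Client());
        Upload upload = tm.upload("my-ingest-bucket",
                                  "raw/events/events.json.gz",
                                  new File("/data/events.json.gz"));
        upload.waitForCompletion();   // blocks until the whole object is in S3
        tm.shutdownNow();
    }
}
```

For hundreds of GB or more, the same pattern is usually wrapped in a parallel driver or replaced by the dedicated tools listed above.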
11. Solution Architecture
[Architecture diagram: the same ingest → data storage → processing (Hadoop jobs) → output datasets flow, mapped onto AWS services – EC2 for ingest and the location API, S3 for data storage and output datasets, EMR for the processing layer]
13. EMR Map Reduce Jobs
Amazon EMR supports – streaming, custom JAR, Cascading, Pig and Hive.
Streaming – write MapReduce jobs in any scripting language.
Custom JAR – write using Java; good for speed/control.
Cascading, Hive and Pig – higher level of abstraction.
AWS EMR forums if you need help.
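To make the "custom JAR" option concrete, here is a minimal word-count style job against the Hadoop 1.x-era mapreduce API; class and path names are illustrative, not the presenters' code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

    // Mapper: emit (token, 1) for every whitespace-separated token in a line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reducer (also usable as combiner): sum the counts per token.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");   // EMR passes the S3 input/output paths as args
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. s3://bucket/input/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. s3://bucket/output/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The packaged JAR is uploaded to S3 and referenced from the EMR step definition, with the input and output locations passed as arguments.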
15. EMR – Good, Bad and Ugly
Great for bootstrapping large clusters and very cost-effective for transient clusters.
Most patches are applied and Amazon creates new AMIs with improvements – but not for everything.
Intermittent network issues – can sometimes cause serious degradation of performance.
Network/disk IO varies by instance type, and streaming jobs will be much more sluggish on EMR compared to a dedicated setup.
Be ready to face variable performance in the Cloud.
16. Hadoop and EMR – Jobs
Use a local Hadoop setup for debugging your jobs – there is no easy way on EMR (a local-run sketch follows below).
Capture EMR cluster metrics – always bootstrap with Ganglia.
High JVM memory allocation leads to long GC pauses.
Don't trust EMR's tuned Hadoop configuration settings.
Benchmark on a small cluster for data points.
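A minimal sketch of the local-debugging point, assuming the Hadoop 1.x property names of that era: forcing the local job runner lets you step through mappers and reducers in an IDE against a small sample file before shipping the same JAR to EMR.

```java
import org.apache.hadoop.conf.Configuration;

public class LocalDebugConfig {
    // Returns a Configuration that runs the whole job in-process,
    // reading from the local filesystem instead of HDFS/S3.
    public static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "local");   // use the LocalJobRunner
        conf.set("fs.default.name", "file:///");   // local filesystem as the default FS
        return conf;
    }
}
```

The same driver class (for example, the WordCountJob sketch above) can then be launched from the IDE with this configuration and a small sample of the real input.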
17. Hadoop and EMR – Jobs performance
GC overhead – increase memory and reduce the number of JVM-reuse tasks.
Avoid read contention at S3 – have at least as many files in S3 as there are available mappers.
Use mapred output compression to save storage, processing time and bandwidth costs.
Set the mapred task timeout to 0 if you have long-running jobs (> 10 mins), and disable speculative execution (a configuration sketch follows below).
Always benchmark third-party libraries used in your job code before putting them in production – there is too much sluggish stuff out there.
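A sketch of those job-level knobs using the Hadoop 1.x property names current when this deck was written; the values are illustrative, not the presenters' production settings.

```java
import org.apache.hadoop.conf.Configuration;

public class EmrJobTuning {
    // Illustrative values only – benchmark on a small cluster before trusting them.
    public static void applyTuning(Configuration conf) {
        // Compress job output to cut S3 storage and transfer costs.
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.codec",
                 "org.apache.hadoop.io.compress.GzipCodec");

        // Long-running tasks: disable the per-task timeout and speculative execution
        // so slow-but-healthy tasks are neither killed nor duplicated.
        conf.setLong("mapred.task.timeout", 0);
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        // Reuse each task JVM a few times; keep this small if GC pauses grow.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 5);

        // Give each task JVM enough heap to avoid GC-overhead failures.
        conf.set("mapred.child.java.opts", "-Xmx1024m");
    }
}
```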
18. Hadoop – High Level Tuning
Small files problem – avoid too many small files in S3 (a small-file combining sketch follows below).
Tune your settings – JVM reuse, sort buffer, sort factor, map/reduce tasks, parallel copies, mapred output compression etc.
Good thing is that you can use a small cluster and a sample input size for tuning.
Know what is limiting you at node level – CPU, memory, disk IO or network IN/OUT.
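The lesson-learned table later in the deck combines 1000 input files into 100. Here is a minimal sketch of that kind of pre-processing step for plain-text inputs, grouping many small files into fewer large ones with the Hadoop FileSystem API; paths and group size are illustrative.

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileCombiner {
    // Concatenate every `groupSize` small files into one larger file, so a job
    // sees roughly inputFiles/groupSize splits instead of thousands of tiny ones.
    public static void combine(Configuration conf, Path inputDir, Path outputDir,
                               int groupSize) throws Exception {
        FileSystem fs = FileSystem.get(inputDir.toUri(), conf);
        FileStatus[] files = fs.listStatus(inputDir);
        int group = 0;
        for (int i = 0; i < files.length; i += groupSize) {
            Path merged = new Path(outputDir, "part-" + group++);
            try (OutputStream out = fs.create(merged)) {
                for (int j = i; j < Math.min(i + groupSize, files.length); j++) {
                    try (InputStream in = fs.open(files[j].getPath())) {
                        // Copy bytes without closing the streams; try-with-resources does that.
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
        }
    }
}
```

Because FileSystem.get resolves the scheme from the URI, the same code can be pointed at an S3-backed input directory as well as local or HDFS paths.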
19. Performance Tuning Golden Rules
When you are operating at very large scale
even 10 ms makes a big difference!
Example: moving away from Simple-Json to Jackson
JSON parsing – 600 ms; optimized parsing – 500 ms
Number of input JSON records – 3 million
Time saved by this simple optimization – 84 hrs (≈ 100 ms saved per record × 3 million records)
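A sketch of the kind of switch described above – reusing a single Jackson ObjectMapper across records instead of re-parsing with a heavier library. Whether the older org.codehaus.jackson package or the newer com.fasterxml one is used, the pattern is the same; the field names here are made up for illustration.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RecordParser {
    // ObjectMapper is thread-safe once configured and expensive to construct,
    // so create it once per task JVM rather than once per record.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical record fields, just to show the access pattern.
    public static String extractVenueId(String jsonLine) throws Exception {
        JsonNode root = MAPPER.readTree(jsonLine);
        return root.path("venue").path("id").asText();
    }
}
```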
20. We have seen improvements from 10x to
100x in our production clusters –
significant money savings.
21. Lesson Learned – Saving Time
A Hadoop job with complex business logic operating on 350 MB of input

Job Language              | Cluster Size | Input Files               | Processing Time
Ruby                      | 6 m1.xlarge  | 1000                      | 184 mins
Java                      | 6 m1.xlarge  | 1000                      | 69 mins
Java                      | 6 m1.xlarge  | 100 (1000 files combined) | 39 mins
Java (EMR tuned)          | 6 m1.xlarge  | 100 (1000 files combined) | 25 mins
Java (EMR and code tuned) | 6 m1.xlarge  | 100 (1000 files combined) | 13 mins
22. Lesson Learned – Saving Cost
A data mining job in production with 50 GB of compressed input data

Job Language              | Cluster Size  | Processing Time | Cost per Job | Cost per Month (100 jobs)
Ruby                      | 50 m2.2xlarge | 240 mins        | $242         | $24,200
Java                      | 20 m1.xlarge  | 200 mins        | $68          | $6,800
Java (EMR tuned)          | 20 m1.xlarge  | 165 mins        | $50          | $5,000
Java (EMR and code tuned) | 20 m1.xlarge  | 50 mins         | $17          | $1,700
23. EMR Cost Optimization
Use a small dedicated/transient cluster.
Leverage spot instances for task nodes (a sketch follows below).
Optimize, profile and tune your code always – code first and config next.
Tune the EMR configuration based on historical job data.
Always benchmark third-party libraries.
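As a sketch of the spot-instance point, assuming the AWS SDK for Java's EMR client of that era: task-node instance groups can be requested on the spot market while master and core nodes stay on-demand, so HDFS and the job tracker survive spot terminations. Cluster sizes, instance types and the bid price are placeholders, not the presenters' configuration, and a real request would also include job flow steps.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceRoleType;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.MarketType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class SpotTaskNodeCluster {
    public static void main(String[] args) {
        // Master and core nodes on-demand; only the stateless task nodes bid on spot.
        InstanceGroupConfig master = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.MASTER)
                .withInstanceType("m1.xlarge").withInstanceCount(1)
                .withMarket(MarketType.ON_DEMAND);
        InstanceGroupConfig core = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.CORE)
                .withInstanceType("m1.xlarge").withInstanceCount(4)
                .withMarket(MarketType.ON_DEMAND);
        InstanceGroupConfig task = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.TASK)
                .withInstanceType("m1.xlarge").withInstanceCount(15)
                .withMarket(MarketType.SPOT)
                .withBidPrice("0.15");                        // illustrative bid, USD/hour

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("transient-processing-cluster")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceGroups(master, core, task)
                        .withKeepJobFlowAliveWhenNoSteps(false)); // transient: terminate after steps

        RunJobFlowResult result = new AmazonElasticMapReduceClient().runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}
```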