The document discusses lessons learned from scaling Hadoop and big data processing on Amazon EMR. It describes how EMR provides a scalable and cost-effective way to run Hadoop jobs in the cloud without having to manage infrastructure. While EMR enables bootstrapping large clusters easily, performance can vary due to network issues and disk I/O constraints of different instance types. The document outlines best practices for optimizing Hadoop jobs and tuning configurations on EMR.
6. High Level Architecture
[Architecture diagram: multiple data sources → ingest → data storage → processing (Hadoop jobs) → output datasets, feeding a location API / intelligence layer]
7. Our Challenges
• Multiple data sources – social, retail, events, news, census, location etc.
• Spatial data analysis and querying – location overlay on data
• Temporal nature of the input datasets
• Large input data sets and hundreds of GB of compressed inputs for jobs
• Complex processing and business logic based on use cases
• Custom output data formats – JSON, XML, XLS, flat files etc.
8. Why Amazon EMR?
I am interested in using Hadoop to solve business problems
and not in building and managing Hadoop infrastructure!
Scalable Storage – S3
Flexible Computing – EC2
No Hadoop Management – EMR
10. How to move existing data to Cloud?
10's of GB – Direct upload using any S3 tools
100's of GB – Any S3 tools, Tsunami
> Terabyte – AWS Import/Export, Aspera, Tsunami
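As a minimal sketch of the "any S3 tools" option, a multipart upload through the AWS SDK for Java's TransferManager copes with multi-GB files without manual chunking. The bucket, key and file paths below are placeholders, not taken from the deck.

```java
import java.io.File;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class S3Uploader {
    public static void main(String[] args) throws Exception {
        // TransferManager splits large files into multipart uploads automatically.
        // Bucket/key/file names are illustrative placeholders.
        TransferManager tm = new TransferManager(new AmazonS3Client());
        Upload upload = tm.upload("my-ingest-bucket",
                                  "raw/events/events.json.gz",
                                  new File("/data/events.json.gz"));
        upload.waitForCompletion();   // blocks until the whole object is in S3
        tm.shutdownNow();
    }
}
```

For hundreds of GB or more, the same pattern is usually wrapped in a parallel driver or replaced by the dedicated tools listed above.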
11. Solution Architecture
[Architecture diagram: the same ingest → data storage → processing (Hadoop jobs) → output datasets flow, mapped onto AWS services – EC2 for ingest and the location API, S3 for data storage and output datasets, EMR for the processing layer]
13. EMR Map Reduce Jobs
Amazon EMR supports – streaming, custom JAR, Cascading, Pig and Hive.
Streaming – write MapReduce jobs in any scripting language.
Custom JAR – write using Java; good for speed/control.
Cascading, Hive and Pig – higher level of abstraction.
AWS EMR forums if you need help.
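To make the "custom JAR" option concrete, here is a minimal word-count style job against the Hadoop 1.x-era mapreduce API; class and path names are illustrative, not the presenters' code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

    // Mapper: emit (token, 1) for every whitespace-separated token in a line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reducer (also usable as combiner): sum the counts per token.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");   // EMR passes the S3 input/output paths as args
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. s3://bucket/input/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. s3://bucket/output/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The packaged JAR is uploaded to S3 and referenced from the EMR step definition, with the input and output locations passed as arguments.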
15. EMR – Good, Bad and Ugly
Great for bootstrapping large clusters and very cost-effective for transient clusters.
Most patches are applied and Amazon creates new AMIs with improvements – but not for everything.
Intermittent network issues – can sometimes cause serious degradation of performance.
Network/disk IO varies by instance type, and streaming jobs will be much more sluggish on EMR compared to a dedicated setup.
Be ready to face variable performance in the Cloud.
16. Hadoop and EMR – Jobs
Use a local Hadoop setup for debugging your jobs – there is no easy way on EMR (a local-run sketch follows below).
Capture EMR cluster metrics – always bootstrap with Ganglia.
High JVM memory allocation leads to long GC pauses.
Don't trust EMR's tuned Hadoop configuration settings.
Benchmark on a small cluster for data points.
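A minimal sketch of the local-debugging point, assuming the Hadoop 1.x property names of that era: forcing the local job runner lets you step through mappers and reducers in an IDE against a small sample file before shipping the same JAR to EMR.

```java
import org.apache.hadoop.conf.Configuration;

public class LocalDebugConfig {
    // Returns a Configuration that runs the whole job in-process,
    // reading from the local filesystem instead of HDFS/S3.
    public static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "local");   // use the LocalJobRunner
        conf.set("fs.default.name", "file:///");   // local filesystem as the default FS
        return conf;
    }
}
```

The same driver class (for example, the WordCountJob sketch above) can then be launched from the IDE with this configuration and a small sample of the real input.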
17. Hadoop and EMR – Jobs performance
GC overhead – increase memory and reduce the number of JVM-reuse tasks.
Avoid read contention at S3 – have at least as many files in S3 as there are available mappers.
Use mapred output compression to save storage, processing time and bandwidth costs.
Set the mapred task timeout to 0 if you have long-running jobs (> 10 mins), and disable speculative execution (a configuration sketch follows below).
Always benchmark third-party libraries used in your job code before putting them in production – there is too much sluggish stuff out there.
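A sketch of those job-level knobs using the Hadoop 1.x property names current when this deck was written; the values are illustrative, not the presenters' production settings.

```java
import org.apache.hadoop.conf.Configuration;

public class EmrJobTuning {
    // Illustrative values only – benchmark on a small cluster before trusting them.
    public static void applyTuning(Configuration conf) {
        // Compress job output to cut S3 storage and transfer costs.
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.codec",
                 "org.apache.hadoop.io.compress.GzipCodec");

        // Long-running tasks: disable the per-task timeout and speculative execution
        // so slow-but-healthy tasks are neither killed nor duplicated.
        conf.setLong("mapred.task.timeout", 0);
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        // Reuse each task JVM a few times; keep this small if GC pauses grow.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 5);

        // Give each task JVM enough heap to avoid GC-overhead failures.
        conf.set("mapred.child.java.opts", "-Xmx1024m");
    }
}
```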
18. Hadoop – High Level Tuning
Small files problem – avoid too many small files in S3 (a small-file combining sketch follows below).
Tune your settings – JVM reuse, sort buffer, sort factor, map/reduce tasks, parallel copies, mapred output compression etc.
Good thing is that you can use a small cluster and a sample input size for tuning.
Know what is limiting you at node level – CPU, memory, disk IO or network IN/OUT.
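The lesson-learned table later in the deck combines 1000 input files into 100. Here is a minimal sketch of that kind of pre-processing step for plain-text inputs, grouping many small files into fewer large ones with the Hadoop FileSystem API; paths and group size are illustrative.

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileCombiner {
    // Concatenate every `groupSize` small files into one larger file, so a job
    // sees roughly inputFiles/groupSize splits instead of thousands of tiny ones.
    public static void combine(Configuration conf, Path inputDir, Path outputDir,
                               int groupSize) throws Exception {
        FileSystem fs = FileSystem.get(inputDir.toUri(), conf);
        FileStatus[] files = fs.listStatus(inputDir);
        int group = 0;
        for (int i = 0; i < files.length; i += groupSize) {
            Path merged = new Path(outputDir, "part-" + group++);
            try (OutputStream out = fs.create(merged)) {
                for (int j = i; j < Math.min(i + groupSize, files.length); j++) {
                    try (InputStream in = fs.open(files[j].getPath())) {
                        // Copy bytes without closing the streams; try-with-resources does that.
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
        }
    }
}
```

Because FileSystem.get resolves the scheme from the URI, the same code can be pointed at an S3-backed input directory as well as local or HDFS paths.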
19. Performance Tuning Golden Rules
When you are operating at very large scale
even 10 ms makes a big difference!
Example: moving away from Simple-Json to Jackson
JSON parsing – 600 ms; optimized parsing – 500 ms
Number of input JSON records – 3 million
Time saved by this simple optimization – 84 hrs (≈ 100 ms saved per record × 3 million records)
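A sketch of the kind of switch described above – reusing a single Jackson ObjectMapper across records instead of re-parsing with a heavier library. Whether the older org.codehaus.jackson package or the newer com.fasterxml one is used, the pattern is the same; the field names here are made up for illustration.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RecordParser {
    // ObjectMapper is thread-safe once configured and expensive to construct,
    // so create it once per task JVM rather than once per record.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical record fields, just to show the access pattern.
    public static String extractVenueId(String jsonLine) throws Exception {
        JsonNode root = MAPPER.readTree(jsonLine);
        return root.path("venue").path("id").asText();
    }
}
```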
20. We have seen improvements from 10x to
100x in our production clusters –
significant money savings.
21. Lesson Learned – Saving Time
A Hadoop job with complex business logic operating on 350 MB of input

Job Language              | Cluster Size | Input Files               | Processing Time
Ruby                      | 6 m1.xlarge  | 1000                      | 184 mins
Java                      | 6 m1.xlarge  | 1000                      | 69 mins
Java                      | 6 m1.xlarge  | 100 (1000 files combined) | 39 mins
Java (EMR tuned)          | 6 m1.xlarge  | 100 (1000 files combined) | 25 mins
Java (EMR and code tuned) | 6 m1.xlarge  | 100 (1000 files combined) | 13 mins
22. Lesson Learned – Saving Cost
A data mining job in production with 50 GB of compressed input data

Job Language              | Cluster Size  | Processing Time | Cost per Job | Cost per Month (100 jobs)
Ruby                      | 50 m2.2xlarge | 240 mins        | $242         | $24,200
Java                      | 20 m1.xlarge  | 200 mins        | $68          | $6,800
Java (EMR tuned)          | 20 m1.xlarge  | 165 mins        | $50          | $5,000
Java (EMR and code tuned) | 20 m1.xlarge  | 50 mins         | $17          | $1,700
23. EMR Cost Optimization
Use a small dedicated/transient cluster.
Leverage spot instances for task nodes (a sketch follows below).
Optimize, profile and tune your code always – code first and config next.
Tune the EMR configuration based on historical job data.
Always benchmark third-party libraries.
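As a sketch of the spot-instance point, assuming the AWS SDK for Java's EMR client of that era: task-node instance groups can be requested on the spot market while master and core nodes stay on-demand, so HDFS and the job tracker survive spot terminations. Cluster sizes, instance types and the bid price are placeholders, not the presenters' configuration, and a real request would also include job flow steps.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceRoleType;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.MarketType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class SpotTaskNodeCluster {
    public static void main(String[] args) {
        // Master and core nodes on-demand; only the stateless task nodes bid on spot.
        InstanceGroupConfig master = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.MASTER)
                .withInstanceType("m1.xlarge").withInstanceCount(1)
                .withMarket(MarketType.ON_DEMAND);
        InstanceGroupConfig core = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.CORE)
                .withInstanceType("m1.xlarge").withInstanceCount(4)
                .withMarket(MarketType.ON_DEMAND);
        InstanceGroupConfig task = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.TASK)
                .withInstanceType("m1.xlarge").withInstanceCount(15)
                .withMarket(MarketType.SPOT)
                .withBidPrice("0.15");                        // illustrative bid, USD/hour

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("transient-processing-cluster")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceGroups(master, core, task)
                        .withKeepJobFlowAliveWhenNoSteps(false)); // transient: terminate after steps

        RunJobFlowResult result = new AmazonElasticMapReduceClient().runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}
```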