Self-serve Hadoop Performance Tuning with Dr. Elephant
1.
2. Mark Wagner
Engineer, Hadoop Infrastructure
LinkedIn
Dr. Elephant: Self-serve performance tuning for Hadoop
3.
Hadoop @ LinkedIn
• Thousands of users of Hadoop infrastructure
• Tens of thousands of jobs a day
• Thousands of registered projects
• Multiple analytics, experimentation, and metrics platforms built on top
• Diverse backgrounds and levels of experience with Hadoop
4.
Hadoop team @ LinkedIn
• Roll our own distribution
• Build next generation systems
• Optimize our investment in hardware
• Enable our users to be productive
6.
Optimizing people
Workflow tooling: Gradle DSL for Hadoop
• Nobody writes one Hadoop job
• How do you structure Hadoop codebases?

    hadoop {
      buildPath 'conf/jobs'
      propertyFile('common') {
        set properties: [
          'user.to.proxy': 'mwagner'
        ]
      }
      workflow('my-first-workflow') {
        commandJob('start-job') {
          uses 'echo "Hello, World!"'
        }
        pigLiJob('vowels') {
          uses 'src/main/pig/vowels.pig'
          depends 'start-job'
        }
        targets 'vowels'
      }
    }
7. Easier tuning?
Optimizing people
• Large investment in hardware
• Cost(People) >> Cost(Machines)
• Can’t throw machines at the problem forever
• Some tuning needed to get things running
• Minimum effort gives the worst of both worlds
8.
Barriers to tuning
Problems are not obvious
• What’s wrong with this job? Anything?

    ...
    2015-06-09 05:57:56,281 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 12602.08 sec
    2015-06-09 05:58:17,821 Stage-1 map = 96%, reduce = 0%, Cumulative CPU 12688.5 sec
    2015-06-09 05:58:23,952 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 12705.91 sec
    2015-06-09 05:58:24,976 Stage-1 map = 99%, reduce = 0%, Cumulative CPU 12710.31 sec
    2015-06-09 05:58:26,000 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12712.08 sec
    2015-06-09 05:58:40,317 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 12714.17 sec
    MapReduce Total cumulative CPU time: 0 days 3 hours 31 minutes 54 seconds 170 msec
    Ended Job = job_1433389922983_133809
    MapReduce Jobs Launched:
    Job 0: Map: 35  Reduce: 1  Cumulative CPU: 12714.17 sec  HDFS Read: 23223452  HDFS Write: 18  SUCCESS
    Total MapReduce CPU Time Spent: 0 days 3 hours 31 minutes 54 seconds 170 msec
    OK
    1234567
    Time taken: 564.189 seconds, Fetched: 1 row(s)
    hive (default)>
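A tool can spot what a human scanning this log will miss. As a minimal sketch (hypothetical parsing logic, not Dr. Elephant's actual code), a heuristic can flag the suspicious shape of this job straight from the Hive summary line: 35 mappers funneling into a single reducer.

```python
import re

def check_parallelism(summary_line, min_reducers=2):
    """Flag MapReduce jobs whose reducer count suggests a bottleneck.

    Parses a Hive job-summary line such as:
      'Job 0: Map: 35 Reduce: 1 Cumulative CPU: 12714.17 sec ...'
    Returns None if the line doesn't look like a job summary.
    """
    m = re.search(r"Map:\s*(\d+)\s+Reduce:\s*(\d+)", summary_line)
    if not m:
        return None
    maps, reduces = int(m.group(1)), int(m.group(2))
    # Many mappers feeding very few reducers is a classic skew/fan-in smell.
    if reduces < min_reducers and maps > 10:
        return f"possible bottleneck: {maps} mappers feeding {reduces} reducer(s)"
    return "ok"

print(check_parallelism("Job 0: Map: 35 Reduce: 1 Cumulative CPU: 12714.17 sec"))
# → possible bottleneck: 35 mappers feeding 1 reducer(s)
```

The thresholds here are placeholders; the point is that the diagnosis can be mechanical rather than left to log-reading.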
10. Inter-related settings
Barriers to tuning
• What interface are you using?
• Did you set max split size?
• Did you set min split size?
• Did you have split combination enabled?
• How large are your files?
• Extend CombineFileInputFormat? CombineHiveInputFormat?
• What’s your maxCombinedSplitSize?
• What’s your block size?
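Even the basic (non-combined) path shows how these settings interlock: Hadoop's FileInputFormat computes each split as max(minSize, min(maxSize, blockSize)), so answers to the questions above change each other's effects. A small sketch of that rule:

```python
def compute_split_size(block_size, min_size, max_size):
    """FileInputFormat's split-size rule:
    split = max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024

# With the defaults (min split = 1 byte, max split = Long.MAX_VALUE),
# the split size is simply the HDFS block size.
print(compute_split_size(128 * MB, 1, 2**63 - 1) // MB)   # → 128

# Raising min split size above the block size forces larger, fewer splits.
print(compute_split_size(128 * MB, 256 * MB, 2**63 - 1) // MB)  # → 256

# Lowering max split size below the block size forces smaller, more splits.
print(compute_split_size(128 * MB, 1, 64 * MB) // MB)     # → 64
```

Split combination (pig.maxCombinedSplitSize and friends) adds another layer on top of this, which is exactly why the slide's question tree fans out so quickly.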
11. Large Parameter Space
Barriers to tuning
mapreduce.task.io.sort.mb
mapreduce.job.min.split.size
pig.maxcombinedsplitsize
hive.auto.convert.join
mapreduce.task.io.sort.factor
hive.exec.reducers.bytes.per.reducer
pig.exec.reducers.max
pig.exec.reducers.bytes.per.reducer
hive.map.aggr
hive.groupby.skewindata
hive.multigroupby.singlemr
mapreduce.map.memory.mb
pig.cachedbag.memusage
hive.optimize.correlation
hive.exec.orc.dictionary.key.size.threshold
pig.exec.mapPartAgg
pig.exec.mapPartAgg.minReduction
pig.skewedjoin.reduce.memusage
mapreduce.map.sort.spill.percent
mapreduce.job.max.split.locations
mapreduce.reduce.shuffle.parallelcopies
mapreduce.reduce.shuffle.merge.percent
mapreduce.map.speculative
mapreduce.reduce.speculative
mapreduce.map.output.compress
mapreduce.job.ubertask.maxmaps
mapreduce.ifile.readahead.bytes
hive.exec.compress.intermediate
hive.merge.mapfiles
200+ configuration settings in MapReduce
300+ more in Hive
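Many of these parameters interact. For example, mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent jointly decide when map output spills to disk (defaults 100 MB and 0.80). A back-of-envelope sketch, ignoring that collection and spilling overlap in the real implementation:

```python
import math

def spill_threshold_mb(io_sort_mb=100, spill_percent=0.80):
    """Map output begins spilling once the in-memory sort buffer
    (mapreduce.task.io.sort.mb) fills to mapreduce.map.sort.spill.percent."""
    return io_sort_mb * spill_percent

def estimated_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough count of spill files a map task would write."""
    return max(1, math.ceil(map_output_mb / spill_threshold_mb(io_sort_mb, spill_percent)))

# A mapper emitting 400 MB with default settings spills several times...
print(estimated_spills(400))                   # → 5
# ...but fits in a single spill with a larger sort buffer.
print(estimated_spills(400, io_sort_mb=512))   # → 1
```

Each extra spill means extra disk I/O and a later merge pass (governed by mapreduce.task.io.sort.factor), which is why tuning one knob in isolation rarely tells the whole story.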
14. Expert intervention
Things that don’t work
• Not enough support resources available
• Poor coverage
• Difficult to prioritize efforts
• Delays user development
15. Extensive training
Things that don’t work
• Too many users
• Diverse backgrounds
• Scope is large and evolving
• Other responsibilities are more important
16. Goals
Dr. Elephant
• Help every user get the best performance out of their jobs
• Impose minimal burden on the user
• Development burden
• Intellectual burden
• Provide a platform for other performance related tools
21. Internals
Dr. Elephant
• All completed jobs are monitored
• Diagnostic information collected automatically
• REST API for everything
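With everything behind a REST API, other tools can pull diagnostics programmatically. The endpoint path and parameter name below are illustrative assumptions, not Dr. Elephant's documented API; this just sketches what a client-side query helper might look like:

```python
from urllib.parse import urlencode

# NOTE: '/rest/job' and the 'id' parameter are hypothetical placeholders
# standing in for whatever the real Dr. Elephant API exposes.
def job_report_url(base_url, job_id):
    """Build a URL to fetch the diagnostic report for one completed job."""
    return f"{base_url}/rest/job?{urlencode({'id': job_id})}"

url = job_report_url("http://drelephant.example.com",
                     "job_1433389922983_133809")
print(url)
# → http://drelephant.example.com/rest/job?id=job_1433389922983_133809
```

A scheduler or CI hook could call such an endpoint after each run and fail a build on a poor grade.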
22.
Dr. Elephant
Monitoring scheduled workflows
• Performance characteristics change
• Data growth
• Data distribution change
• Hardware change
• Incremental software change
• Monitor performance on each execution
• Compare behavior across revisions
    ======TOP 20 BAD JOBS YESTERDAY======
    JobId                       Score
    job_1431576474881_181412    36035
    job_1431576474881_185548    27710
    ...
    ======TOP 20 BAD FLOWS YESTERDAY======
    FlowUrl                     Score
    https://prod-azkaban/...    45379
    ...
    ======TOP 10 FLOWS WITH SIGNIFICANT PERFORMANCE CHANGE======
    Project    Flow         ChangeScore    User
    myProject  score-daily  48755          mwagner
    ...
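Ranking flows by "significant performance change" implies scoring each execution against its own history. As a toy sketch (a hypothetical formula, not Dr. Elephant's actual scoring), one could measure how far the latest runtime sits above the baseline of previous executions:

```python
def change_score(prev_runtimes_s, curr_runtime_s):
    """Score a flow's latest execution against its history.

    Returns 0.0 when the run is at or below the historical mean,
    and grows with the relative regression (1.0 ~= runtime doubled).
    Hypothetical metric for illustration only.
    """
    baseline = sum(prev_runtimes_s) / len(prev_runtimes_s)
    return max(0.0, (curr_runtime_s - baseline) / baseline)

# Three prior runs around 10 minutes, then one taking ~20 minutes:
print(round(change_score([600, 620, 610], 1220), 2))  # → 1.0

# A faster-than-usual run scores zero rather than negative.
print(change_score([600, 620, 610], 500))             # → 0.0
```

A real detector would also account for data growth and variance across runs, which is exactly why the slide lists those as causes of performance change.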
23. Automated audits
Dr. Elephant
• Separate cluster for critical workloads
• Audit before deployment
• Improved accuracy
• Faster turnaround
• Higher throughput
24.
Dr. Elephant
As an operator utility
• Global view of performance issues
• Search and identify jobs for extra attention
• Dr. Elephant sign-off as a requirement for capacity requests
25. Results and experiences
• Dr. Elephant can grade itself
• Social pressures encourage good behavior
• Tuning degrades over time
[Chart: fraction of healthy jobs over time; y-axis “Fraction”, 0 to 1]
26.
Dr. Elephant for all
• Plugins for other execution engines
• Tez, Spark on the way
• Allow the user community to build a knowledge-base
27.
Dr. Elephant today
• Evaluating 60,000+ jobs a day across multiple clusters
• Open source release coming soon