1. PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs
MMath Thesis Presentation
by Mostafa Ead
Supervised by Prof. Ashraf Aboulnaga
4. Hadoop MapReduce
● Hadoop is an open-source Java implementation of the MapReduce model
● Hadoop configuration parameters
○ io.sort.mb = 100
○ mapred.compress.map.output = false
○ mapred.reduce.tasks = 1
● These parameters have a significant effect on the performance of the MR job
Dec 5, 2012 MMath Thesis Presentation 4
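The parameters above can be set per job in Hadoop's configuration files or via JobConf; a minimal mapred-site.xml fragment might look like the sketch below (the values are illustrative, not recommendations):

```xml
<!-- Illustrative per-job settings; tune the values for your workload and cluster. -->
<configuration>
  <!-- Size (MB) of the in-memory buffer used for sorting map output -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Compress intermediate map output, trading CPU for IO -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- Number of reduce tasks for the job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
</configuration>
```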
8. Hadoop Configuration Parameters
● Good settings of these parameters depend on:
○ Behaviour of the map and reduce functions
○ Cluster resources
● Cross-interactions between the configuration parameters:
○ io.sort.record.percent and io.sort.mb
[Figure: map-output buffer (io.sort.mb) split between record metadata and serialized intermediate records]
9. Rule-Based Optimizer
● An initial attempt captures the Hadoop administrator's expertise in a set of <rule, action> pairs
○ Intermediate data size > input data size => enable map-output compression
○ Reduce function is associative-commutative => enable the combiner
● This attempt achieved good runtime
speedups, but not for all MR jobs
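A <rule, action> optimizer of this kind can be sketched as a list of predicate/action pairs applied to observed job statistics. The class below is a hypothetical illustration: the JobStats fields, the `use.combiner` key, and the thresholds are assumptions, not the thesis' actual rule set.

```java
import java.util.*;
import java.util.function.*;

public class RuleBasedOptimizer {

    // Observed statistics from a previous run of the job (fields are illustrative).
    static class JobStats {
        final long inputBytes, intermediateBytes;
        final boolean reduceIsAssocCommutative;
        JobStats(long inputBytes, long intermediateBytes, boolean reduceIsAssocCommutative) {
            this.inputBytes = inputBytes;
            this.intermediateBytes = intermediateBytes;
            this.reduceIsAssocCommutative = reduceIsAssocCommutative;
        }
    }

    // A <rule, action> pair: if the condition holds, the action adds a setting.
    record Rule(Predicate<JobStats> condition, Consumer<Map<String, String>> action) {}

    static final List<Rule> RULES = List.of(
        // Intermediate data larger than the input => enable map-output compression.
        new Rule(s -> s.intermediateBytes > s.inputBytes,
                 c -> c.put("mapred.compress.map.output", "true")),
        // Associative-commutative reduce function => enable the combiner.
        new Rule(s -> s.reduceIsAssocCommutative,
                 c -> c.put("use.combiner", "true")));   // hypothetical key

    // Apply every matching rule and return the recommended settings.
    static Map<String, String> recommend(JobStats stats) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (Rule r : RULES)
            if (r.condition().test(stats)) r.action().accept(conf);
        return conf;
    }
}
```

Because the rules fire independently of each other and of the cluster state, a fixed rule set like this cannot account for parameter cross-interactions, which is consistent with the observation that it speeds up some but not all MR jobs.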
11. Feedback-Based Tuning Approach
● Another attempt captures the effect of the program's complexity and the cluster resources on the job's performance in an execution profile
● The profile is fed back to an optimizer to provide cost-based recommendations
● This attempt achieved better runtime speedups
13. Starfish
● Starfish is an automatic feedback-based
tuning system
[Figure: Starfish workflow for the first and subsequent submissions of a job]
14. Starfish
● Starfish execution profile:
○ General: IO, CPU, Memory
○ Domain specific: runtimes of every phase in the map/reduce tasks
● Tuning workflow:
○ Apply dynamic instrumentation code to the job
○ Run the instrumented job with the default parameter settings and collect the execution profile (profile collection overhead: 37% for the WCoP job)
○ For the next submission of the same job, make the tuning decisions based on its execution profile (no profile reuse across jobs)
○ Run the job with the tuned parameter settings
17. Profile Reuse
● MR jobs have a high likelihood of being similar:
○ MR jobs are generated from high-level query languages, e.g., Pig Latin and HiveQL
○ Code reuse and refactoring
● Execution profile composition for new jobs:
J1: map-profile, reduce-profile
J2: map-profile, reduce-profile
J3: map function similar to J1, reduce function similar to J2 => compose J3's profile from J1's map-profile and J2's reduce-profile
19. Profile Reuse Example
● Bigram Relative Frequency MR job:
○ Counts the frequency of each pair of consecutive words relative to the frequency of the first word in the pair
● Word Co-occurrence MR job:
○ Counts the co-occurrences of every pair of words in a
sliding window of length n
● At n=2:
○ Similar behaviour
○ Similar execution profiles
21. Challenge
Given a repository of execution profiles of previously executed MR jobs, how can we automatically compose an execution profile that is useful for tuning the configuration parameters of a newly submitted job?
23. PStorM: Profile Store and Matcher
● PStorM goals:
○ Extensible profile store
○ An accurate profile matcher that reuses the stored execution profiles to compose a matching profile for the submitted job, even for unseen jobs
○ The performance gains achieved by the feedback-based tuning system given the complete profile of the job should equal the gains achieved given the profile returned by PStorM
25. Profile Matcher
● Profile matching is a domain-specific pattern
recognition problem:
a. Feature selection
b. Similarity measures
c. Matching algorithm
27. Sample Profile
● Dataflow fields (D):
○ Number of input records to the map/reduce tasks
● Cost fields (C):
○ Map/reduce phase times in the map/reduce tasks
● Dataflow statistics (DS):
○ Selectivity of the map/reduce functions in terms of
size and number of records
● Cost statistics (CS):
○ CPU cost to process one input/intermediate record
in the map/reduce tasks
28. Feature Selection
Job D C DS CS
● Q: Given a MapReduce job and its sample profile, which features can distinguish the candidate matching profile from the other profiles stored in the Profile Store?
● Analytical models of the What-If engine
29. Feature Selection
[Figure: first and subsequent job submissions in the tuning workflow]
30. Feature Selection
Job D C DS CS
● Inputs to the analytical models:
○ Dataflow statistics
○ Cost statistics
○ Configuration parameter settings
■ Enumerated by the cost-based optimizer
● No need to find a matching profile whose D and
C fields are similar to the complete profile of the
submitted job
31. Feature Selection
Job DS CS
● The DS and CS features are obtained from the
sample profile
● The selected features should have the same values across different sample profiles of the same job, and different values across the profiles of other jobs
32. Feature Selection
Job DS CS
● Dataflow statistics are expected to have this
characteristic
● Map selectivity of the number of records:
○ Sort: = 1
○ Word Count: > 1
○ Word Co-occurrence Pairs: >>1
33. Feature Selection
Job DS CS
● CS features can vary between different
samples of the same job
● The map CPU cost of the same job can differ between a sample executed on an over-utilized node and one executed on an under-utilized node
34. Feature Selection
Job DS CS
● Which features can be extracted from the bytecode of the submitted job and be useful for the matcher?
35. Feature Selection
Job DS CS
● Differences between MR jobs appear in:
○ Map side: input formatter, input key/value types, mapper, intermediate key/value types
○ Reduce side: intermediate key/value types, reducer, output key/value types, output formatter
36. Feature Selection
Job DS CS
● We will refer to these features as the static
features
● A different input formatter results in a different IO cost to read the input records
37. Feature Selection
Job DS CS
● So far, the map/reduce functions have been treated as black boxes
● Static analysis of the bytecode of the map/reduce functions:
○ Control Flow Graphs (CFGs)
○ Different map/reduce CFGs result in different map/reduce CPU costs
38. CFG Example
[Figure: map-function CFGs of Word Co-occurrence Pairs and Word Count]
Different map CFGs => different map-phase times
42. Similarity Measures
Static CFG DS CS
● Matching the static features:
○ Feature values are all strings (categorical data)
○ Jaccard Similarity index
○ Score range: [0, 1]
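The Jaccard index over categorical feature values can be computed directly on string sets; a small self-contained sketch (the feature strings below are illustrative):

```java
import java.util.*;

public class JaccardSimilarity {
    // Jaccard index of two sets of categorical (string) features:
    // |A ∩ B| / |A ∪ B|, with a score range of [0, 1].
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;  // two empty feature sets match
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                          // intersection
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                             // union
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Two jobs sharing an input formatter and key type but
        // differing in value type: 2 shared features out of 4 total.
        Set<String> job1 = Set.of("TextInputFormat", "Text", "IntWritable");
        Set<String> job2 = Set.of("TextInputFormat", "Text", "LongWritable");
        System.out.println(jaccard(job1, job2)); // 0.5
    }
}
```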
43. Similarity Measures
Static CFG DS CS
● Matching CFGs:
○ Synchronized breadth-first search over the two CFGs; a pair of nodes matches if:
■ both are normal statements, or
■ both are branch statements (e.g., the condition of a loop)
○ Score range: {0, 1}
■ Conservative matching score
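The synchronized BFS can be sketched as follows. This is a simplified illustration in which nodes carry only a normal-vs-branch flag; the actual matcher works on real bytecode CFGs and may compare richer node information.

```java
import java.util.*;

public class CfgMatcher {
    static class Node {
        final boolean isBranch;                  // branch statement, e.g. a loop condition
        final List<Node> succ = new ArrayList<>(); // successor nodes in the CFG
        Node(boolean isBranch) { this.isBranch = isBranch; }
    }

    // Walk both CFGs in lockstep; return 1 only if every visited pair of
    // nodes agrees in kind and fan-out, else 0 (conservative score in {0, 1}).
    static int match(Node rootA, Node rootB) {
        Deque<Node[]> queue = new ArrayDeque<>();
        Set<Node> visited = new HashSet<>();     // guards against CFG cycles (loops)
        queue.add(new Node[]{rootA, rootB});
        while (!queue.isEmpty()) {
            Node[] pair = queue.poll();
            Node a = pair[0], b = pair[1];
            if (!visited.add(a)) continue;       // each node is compared once
            // Both nodes must be the same kind and have the same number of successors.
            if (a.isBranch != b.isBranch || a.succ.size() != b.succ.size())
                return 0;
            for (int i = 0; i < a.succ.size(); i++)
                queue.add(new Node[]{a.succ.get(i), b.succ.get(i)});
        }
        return 1;
    }
}
```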
44. Similarity Measures
Static CFG DS CS
● Matching DS and CS features:
○ Numerical features
○ Data normalization to bring all features to the same
scale
○ Euclidean distance
○ Score range: [0, ∞)
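Normalization followed by Euclidean distance can be sketched as below; min-max scaling is used here as one common normalization choice (an assumption — the thesis may normalize differently):

```java
public class ProfileDistance {
    // Min-max normalize each feature column to [0, 1] so no single feature
    // dominates the distance. Rows are profiles, columns are DS/CS features.
    static double[][] normalize(double[][] rows) {
        int nFeat = rows[0].length;
        double[][] out = new double[rows.length][nFeat];
        for (int j = 0; j < nFeat; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] r : rows) {
                min = Math.min(min, r[j]);
                max = Math.max(max, r[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows.length; i++)
                out[i][j] = range == 0 ? 0 : (rows[i][j] - min) / range;
        }
        return out;
    }

    // Plain Euclidean distance between two feature vectors; score range [0, ∞).
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```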
45. Matching Algorithm
● The feature vector is composed of features of mixed data types (categorical and numerical)
● Two possible matching algorithms:
○ Multi-stage matching
○ Machine learning approach
48. Multi-Stage Matching
● The job profile is composed of independent map and reduce profiles
● The multi-stage matcher is applied twice: once for the map profile and once for the reduce profile
● The matching map profile and reduce profile compose the final matching job profile
49. Machine Learning Approach
● Generalized distance function
○ A weighted sum of the distances/similarities calculated separately for each set of features of the same type
○ The weights must be learned
50. Machine Learning Approach
● Training data set generation:
○ For every job, Ji, in the profile store, pick its profile, Pi
○ Choose a random profile, Pj, from the profile store
○ Calculate the distances and similarities between Pi and Pj
○ Calculate T1: predicted runtime of the job Ji given the
profile Pi
○ Calculate T2: predicted runtime of the job Ji given the
profile Pj
○ D = |T1 - T2|
51. Machine Learning Approach
● Machine learning algorithm:
○ Gradient Boosted Regression Tree (GBRT)
○ Profile matching implementation in R
● Profile matching using the learned model:
○ Extract the profile, Ps, for the submitted MR job
○ Calculate the similarities/distances between Ps and
the profiles in PStorM, and the corresponding value
of D
○ Select the PStorM profile whose D is the minimum
● PStorM uses the multi-stage matching algorithm
55. Evaluation
● Objectives:
a. Profile matcher accuracy
b. Profile matcher efficiency
■ The profile returned by PStorM should result in speedups comparable to those achieved given the complete profile of the submitted job
56. Profile Matcher Accuracy
● Two content states of the profile store
● Same Data (SD) content state:
○ PStorM contains a profile of the submitted job collected during an execution on the same data set
● Different Data (DD) content state:
○ PStorM contains a profile of the submitted job collected during an execution on a different data set
57. Profile Matcher Accuracy
● Evaluation metric is the number of correct matches as a
fraction of the number of job submissions
● At the SD content state:
○ A correct match is the profile of the submitted job
collected during the execution on the same data set
● At the DD content state:
○ A correct match is the profile of the submitted job
collected during the execution on another data set
● Number of correct matches is calculated for the map
and reduce profiles, separately
58. Profile Matcher Accuracy
● The accuracy of PStorM will be compared to
the accuracy of the alternative solutions
● PStorM contributions at the matching level:
○ Feature selection:
■ New set of features: static and CFG
■ Feature selection based on our domain
knowledge
○ Multi-stage matching algorithm
59. Profile Matcher Accuracy:
Feature Selection
● Alternative feature selection approaches:
○ P-features:
■ Given the sample profile of the submitted job
○ SP-features:
■ Given the static features we proposed and the
sample profile of the submitted job
● For both approaches:
○ Rank the features by their information gain
○ Select the top F features, where F is the number of features used by PStorM
61. Profile Matcher Accuracy:
Matching Algorithm
● PStorM uses the multi-stage matching
algorithm
● The alternative is the machine learning approach:
○ GBRT has multiple configuration parameters
○ We tried four different parameter settings and kept the one that yielded the highest matching accuracy for GBRT
63. Profile Matcher Efficiency
● Runtime speedup is the main factor that matters
● A third content state, NJ:
○ The submitted job has not been executed before on
the cluster
○ Highlights the benefits of profile reuse
65. Conclusion
● Hadoop configuration parameters and their
effect on the performance of MR jobs
● Robustness and efficiency of the feedback-
based tuning approach
● Drawbacks: overhead and no profile reuse
● PStorM: profile storage and matcher that
leverages the idea of profile reuse
● PStorM resulted in significant speedups
even for new jobs