1. PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs
MMath Thesis Presentation
by Mostafa Ead
Supervised by Prof. Ashraf Aboulnaga
4. Hadoop MapReduce
● Hadoop is an open-source Java implementation of the MapReduce model
● Hadoop configuration parameters
○ io.sort.mb = 100
○ mapred.compress.map.output = false
○ mapred.reduce.tasks = 1
● These parameters have a significant effect on the performance of the MR job
Dec 5, 2012 MMath Thesis Presentation 4
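The parameters above can be set per job in Hadoop's configuration files or via JobConf; a minimal mapred-site.xml fragment might look like the sketch below (the values are illustrative, not recommendations):

```xml
<!-- Illustrative per-job settings; tune the values for your workload and cluster. -->
<configuration>
  <!-- Size (MB) of the in-memory buffer used for sorting map output -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Compress intermediate map output, trading CPU for IO -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- Number of reduce tasks for the job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
</configuration>
```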
8. Hadoop Configuration Parameters
● Good settings of these parameters depend on:
○ Behaviour of the map and reduce functions
○ Cluster resources
● Cross-interactions between the configuration parameters:
○ io.sort.record.percent and io.sort.mb
[Figure: map-output buffer (io.sort.mb) split between record metadata and serialized intermediate records]
9. Rule-Based Optimizer
● An initial attempt captures the Hadoop administrator's expertise in a set of <rule, action> pairs
○ Intermediate data size > input data size => enable map-output compression
○ Reduce function is associative-commutative => enable the combiner
● This attempt achieved good runtime
speedups, but not for all MR jobs
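A <rule, action> optimizer of this kind can be sketched as a list of predicate/action pairs applied to observed job statistics. The class below is a hypothetical illustration: the JobStats fields, the `use.combiner` key, and the thresholds are assumptions, not the thesis' actual rule set.

```java
import java.util.*;
import java.util.function.*;

public class RuleBasedOptimizer {

    // Observed statistics from a previous run of the job (fields are illustrative).
    static class JobStats {
        final long inputBytes, intermediateBytes;
        final boolean reduceIsAssocCommutative;
        JobStats(long inputBytes, long intermediateBytes, boolean reduceIsAssocCommutative) {
            this.inputBytes = inputBytes;
            this.intermediateBytes = intermediateBytes;
            this.reduceIsAssocCommutative = reduceIsAssocCommutative;
        }
    }

    // A <rule, action> pair: if the condition holds, the action adds a setting.
    record Rule(Predicate<JobStats> condition, Consumer<Map<String, String>> action) {}

    static final List<Rule> RULES = List.of(
        // Intermediate data larger than the input => enable map-output compression.
        new Rule(s -> s.intermediateBytes > s.inputBytes,
                 c -> c.put("mapred.compress.map.output", "true")),
        // Associative-commutative reduce function => enable the combiner.
        new Rule(s -> s.reduceIsAssocCommutative,
                 c -> c.put("use.combiner", "true")));   // hypothetical key

    // Apply every matching rule and return the recommended settings.
    static Map<String, String> recommend(JobStats stats) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (Rule r : RULES)
            if (r.condition().test(stats)) r.action().accept(conf);
        return conf;
    }
}
```

Because the rules fire independently of each other and of the cluster state, a fixed rule set like this cannot account for parameter cross-interactions, which is consistent with the observation that it speeds up some but not all MR jobs.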
11. Feedback-Based Tuning Approach
● Another attempt captures the effect of the program's complexity and the cluster resources on the job's performance in an execution profile
● The profile is fed back to an optimizer to provide cost-based recommendations
● This attempt achieved better runtime speedups
13. Starfish
● Starfish is an automatic feedback-based
tuning system
[Figure: Starfish workflow for the first and subsequent submissions of a job]
14. Starfish
● Starfish execution profile:
○ General: IO, CPU, Memory
○ Domain specific: runtimes of every phase in the map/reduce tasks
● Tuning workflow:
○ Apply dynamic instrumentation code to the job
○ Run the instrumented job with the default parameter settings and collect the execution profile (profile collection overhead: 37% for the WCoP job)
○ For the next submission of the same job, make the tuning decisions based on its execution profile (no profile reuse across jobs)
○ Run the job with the tuned parameter settings
17. Profile Reuse
● MR jobs have a high likelihood of being similar:
○ MR jobs are generated from high-level query languages, e.g., Pig Latin and HiveQL
○ Code reuse and refactoring
● Execution profile composition for new jobs:
J1: map-profile, reduce-profile
J2: map-profile, reduce-profile
J3: map function similar to J1, reduce function similar to J2 => compose J3's profile from J1's map-profile and J2's reduce-profile
19. Profile Reuse Example
● Bigram Relative Frequency MR job:
○ Counts the frequency of each pair of consecutive words relative to the frequency of the first word in the pair
● Word Co-occurrence MR job:
○ Counts the co-occurrences of every pair of words in a
sliding window of length n
● At n=2:
○ Similar behaviour
○ Similar execution profiles
21. Challenge
Given a repository of execution profiles of previously executed MR jobs, how can we automatically compose an execution profile that is useful for tuning the configuration parameters of a newly submitted job?
23. PStorM: Profile Store and Matcher
● PStorM goals:
○ Extensible profile store
○ An accurate profile matcher that reuses the stored execution profiles to compose a matching profile for the submitted job, even for unseen jobs
○ The performance gains achieved by the feedback-based tuning system given the complete profile of the job should equal the gains achieved given the profile returned by PStorM
25. Profile Matcher
● Profile matching is a domain-specific pattern
recognition problem:
a. Feature selection
b. Similarity measures
c. Matching algorithm
27. Sample Profile
● Dataflow fields (D):
○ Number of input records to the map/reduce tasks
● Cost fields (C):
○ Map/reduce phase times in the map/reduce tasks
● Dataflow statistics (DS):
○ Selectivity of the map/reduce functions in terms of
size and number of records
● Cost statistics (CS):
○ CPU cost to process one input/intermediate record
in the map/reduce tasks
28. Feature Selection
Job D C DS CS
● Q: Given a MapReduce job and its sample profile, which features can distinguish the candidate matching profile from the other profiles stored in the Profile Store?
● Analytical models of the What-If engine
29. Feature Selection
[Figure: first and subsequent job submissions in the tuning workflow]
30. Feature Selection
Job D C DS CS
● Inputs to the analytical models:
○ Dataflow statistics
○ Cost statistics
○ Configuration parameter settings
■ Enumerated by the cost-based optimizer
● No need to find a matching profile whose D and
C fields are similar to the complete profile of the
submitted job
31. Feature Selection
Job DS CS
● The DS and CS features are obtained from the
sample profile
● The selected features should have the same values across different sample profiles of the same job, and different values across the profiles of other jobs
32. Feature Selection
Job DS CS
● Dataflow statistics are expected to have this
characteristic
● Map selectivity of the number of records:
○ Sort: = 1
○ Word Count: > 1
○ Word Co-occurrence Pairs: >>1
33. Feature Selection
Job DS CS
● CS features can vary between different
samples of the same job
● The map CPU cost of the same job can differ between a sample executed on an over-utilized node and one executed on an under-utilized node
34. Feature Selection
Job DS CS
● Which features can be extracted from the bytecode of the submitted job and be useful for the matcher?
35. Feature Selection
Job DS CS
● Differences between MR jobs appear in:
○ Map side: input formatter, input key/value types, mapper, intermediate key/value types
○ Reduce side: intermediate key/value types, reducer, output key/value types, output formatter
36. Feature Selection
Job DS CS
● We will refer to these features as the static
features
● A different input formatter results in a different IO cost to read the input records
37. Feature Selection
Job DS CS
● So far, the map/reduce functions have been treated as black boxes
● Static analysis of the bytecode of the map/reduce functions:
○ Control Flow Graphs (CFGs)
○ Different map/reduce CFGs result in different map/reduce CPU costs
38. CFG Example
[Figure: map-function CFGs of Word Co-occurrence Pairs and Word Count]
Different map CFGs => different map-phase times
42. Similarity Measures
Static CFG DS CS
● Matching the static features:
○ Feature values are all strings (categorical data)
○ Jaccard Similarity index
○ Score range: [0, 1]
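The Jaccard index over categorical feature values can be computed directly on string sets; a small self-contained sketch (the feature strings below are illustrative):

```java
import java.util.*;

public class JaccardSimilarity {
    // Jaccard index of two sets of categorical (string) features:
    // |A ∩ B| / |A ∪ B|, with a score range of [0, 1].
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;  // two empty feature sets match
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                          // intersection
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                             // union
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Two jobs sharing an input formatter and key type but
        // differing in value type: 2 shared features out of 4 total.
        Set<String> job1 = Set.of("TextInputFormat", "Text", "IntWritable");
        Set<String> job2 = Set.of("TextInputFormat", "Text", "LongWritable");
        System.out.println(jaccard(job1, job2)); // 0.5
    }
}
```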
43. Similarity Measures
Static CFG DS CS
● Matching CFGs:
○ Synchronized breadth-first search over the two CFGs; a pair of nodes matches if:
■ both are normal statements, or
■ both are branch statements (e.g., the condition of a loop)
○ Score range: {0, 1}
■ Conservative matching score
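The synchronized BFS can be sketched as follows. This is a simplified illustration in which nodes carry only a normal-vs-branch flag; the actual matcher works on real bytecode CFGs and may compare richer node information.

```java
import java.util.*;

public class CfgMatcher {
    static class Node {
        final boolean isBranch;                  // branch statement, e.g. a loop condition
        final List<Node> succ = new ArrayList<>(); // successor nodes in the CFG
        Node(boolean isBranch) { this.isBranch = isBranch; }
    }

    // Walk both CFGs in lockstep; return 1 only if every visited pair of
    // nodes agrees in kind and fan-out, else 0 (conservative score in {0, 1}).
    static int match(Node rootA, Node rootB) {
        Deque<Node[]> queue = new ArrayDeque<>();
        Set<Node> visited = new HashSet<>();     // guards against CFG cycles (loops)
        queue.add(new Node[]{rootA, rootB});
        while (!queue.isEmpty()) {
            Node[] pair = queue.poll();
            Node a = pair[0], b = pair[1];
            if (!visited.add(a)) continue;       // each node is compared once
            // Both nodes must be the same kind and have the same number of successors.
            if (a.isBranch != b.isBranch || a.succ.size() != b.succ.size())
                return 0;
            for (int i = 0; i < a.succ.size(); i++)
                queue.add(new Node[]{a.succ.get(i), b.succ.get(i)});
        }
        return 1;
    }
}
```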
44. Similarity Measures
Static CFG DS CS
● Matching DS and CS features:
○ Numerical features
○ Data normalization to bring all features to the same
scale
○ Euclidean distance
○ Score range: [0, ∞)
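Normalization followed by Euclidean distance can be sketched as below; min-max scaling is used here as one common normalization choice (an assumption — the thesis may normalize differently):

```java
public class ProfileDistance {
    // Min-max normalize each feature column to [0, 1] so no single feature
    // dominates the distance. Rows are profiles, columns are DS/CS features.
    static double[][] normalize(double[][] rows) {
        int nFeat = rows[0].length;
        double[][] out = new double[rows.length][nFeat];
        for (int j = 0; j < nFeat; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] r : rows) {
                min = Math.min(min, r[j]);
                max = Math.max(max, r[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows.length; i++)
                out[i][j] = range == 0 ? 0 : (rows[i][j] - min) / range;
        }
        return out;
    }

    // Plain Euclidean distance between two feature vectors; score range [0, ∞).
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```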
45. Matching Algorithm
● The feature vector is composed of features of mixed data types (categorical and numerical)
● Two possible matching algorithms:
○ Multi-stage matching
○ Machine learning approach
48. Multi-Stage Matching
● The job profile is composed of independent map and reduce profiles
● The multi-stage matcher is applied twice: once for the map profile and once for the reduce profile
● The matching map profile and reduce profile compose the final matching job profile
49. Machine Learning Approach
● Generalized distance function
○ A weighted sum of the distances/similarities calculated separately for each set of features of the same type
○ The weights must be learned
50. Machine Learning Approach
● Training data set generation:
○ For every job, Ji, in the profile store, pick its profile, Pi
○ Choose a random profile, Pj, from the profile store
○ Calculate the distances and similarities between Pi and Pj
○ Calculate T1: predicted runtime of the job Ji given the
profile Pi
○ Calculate T2: predicted runtime of the job Ji given the
profile Pj
○ D = |T1 - T2|
51. Machine Learning Approach
● Machine learning algorithm:
○ Gradient Boosted Regression Tree (GBRT)
○ Profile matching implementation in R
● Profile matching using the learned model:
○ Extract the profile, Ps, for the submitted MR job
○ Calculate the similarities/distances between Ps and
the profiles in PStorM, and the corresponding value
of D
○ Select the PStorM profile whose D is the minimum
● PStorM uses the multi-stage matching algorithm
55. Evaluation
● Objectives:
a. Profile matcher accuracy
b. Profile matcher efficiency
■ The profile returned by PStorM should result in speedups comparable to those achieved given the complete profile of the submitted job
56. Profile Matcher Accuracy
● Two content states of the profile store
● Same Data (SD) content state:
○ PStorM contains a profile of the submitted job collected during an execution on the same data set
● Different Data (DD) content state:
○ PStorM contains a profile of the submitted job collected during an execution on a different data set
57. Profile Matcher Accuracy
● Evaluation metric is the number of correct matches as a
fraction of the number of job submissions
● At the SD content state:
○ A correct match is the profile of the submitted job
collected during the execution on the same data set
● At the DD content state:
○ A correct match is the profile of the submitted job
collected during the execution on another data set
● Number of correct matches is calculated for the map
and reduce profiles, separately
58. Profile Matcher Accuracy
● The accuracy of PStorM will be compared to
the accuracy of the alternative solutions
● PStorM contributions at the matching level:
○ Feature selection:
■ New set of features: static and CFG
■ Feature selection based on our domain
knowledge
○ Multi-stage matching algorithm
59. Profile Matcher Accuracy:
Feature Selection
● Alternative feature selection approaches:
○ P-features:
■ Given the sample profile of the submitted job
○ SP-features:
■ Given the static features we proposed and the
sample profile of the submitted job
● For both approaches:
○ Rank the features by their information gain
○ Select the top F features, where F is the number of features used by PStorM
61. Profile Matcher Accuracy:
Matching Algorithm
● PStorM uses the multi-stage matching
algorithm
● The alternative is the machine learning approach:
○ GBRT has multiple configuration parameters
○ We tried four different parameter settings and kept the one that yielded the highest matching accuracy for GBRT
63. Profile Matcher Efficiency
● Runtime speedup is the main factor that matters
● A third content state, NJ:
○ The submitted job has not been executed before on
the cluster
○ Highlights the benefits of profile reuse
65. Conclusion
● Hadoop configuration parameters and their
effect on the performance of MR jobs
● Robustness and efficiency of the feedback-
based tuning approach
● Drawbacks: overhead and no profile reuse
● PStorM: profile storage and matcher that
leverages the idea of profile reuse
● PStorM resulted in significant speedups
even for new jobs