PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs

MMath Thesis Presentation
by Mostafa Ead
Supervised by Prof. Ashraf Aboulnaga
Outline
   ● Hadoop MapReduce
   ● Tuning Hadoop Configuration Parameters
         ○ Rule-Based Approach
         ○ Feedback-Based Approach
   ● PStorM System Overview
   ● The Profile Matcher
         ○ Feature Selection
         ○ Similarity Measures
         ○ Matching Algorithm
   ● Evaluation
The MapReduce Programming Model

[Diagram: three map tasks (Map-1..Map-3) read input splits of <K1, V1> records and emit partitioned intermediate <K2, V2> records (P11..P32); the shuffle groups these into <K2, list(V2)>, which two reduce tasks (Red-1, Red-2) consume to write output splits of <K3, V3> records.]
Hadoop MapReduce
● Hadoop is an open-source Java implementation of the MapReduce model
● Hadoop configuration parameters, e.g.:
  ○ io.sort.mb = 100
  ○ mapred.compress.map.output = false
  ○ mapred.reduce.tasks = 1
● These parameters have a significant effect on the performance of an MR job (see the example below)
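For concreteness, here is a minimal sketch (ours, not from the thesis) of overriding these three parameters for one job submission through the standard Hadoop 1.x configuration API; the values shown are arbitrary examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: overriding the three parameters named above for a single job.
public class TunedJobLauncher {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                       // map-side sort buffer (MB)
        conf.setBoolean("mapred.compress.map.output", true);  // compress map output
        conf.setInt("mapred.reduce.tasks", 15);               // number of reduce tasks
        Job job = new Job(conf, "tuned-job");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```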
Hadoop Configuration Parameters

[Diagram, shown over three slide builds: anatomy of a map task. Map-1 reads Input Split-1 off HDFS, then serializes and partitions its output records into a memory buffer whose size is set by io.sort.mb. Buffer contents are sorted, optionally combined, and optionally compressed (mapred.compress.map.output) before each spill to disk; the spills are finally merged into the partitioned map output (P11, P12). Phases: Read/Map, Collect, Spill, Merge Spills.]
Hadoop Configuration Parameters
● A good setting of these parameters depends on:
  ○ The behaviour of the map and reduce functions
  ○ The cluster resources

● The configuration parameters also interact with each other:
  ○ io.sort.record.percent and io.sort.mb: io.sort.record.percent divides the io.sort.mb buffer between record meta-data and the serialized intermediate records (worked example below)
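A worked example of that interaction, assuming the Hadoop 1.x accounting of 16 bytes of buffer meta-data per record and the default io.sort.record.percent of 0.05 (the numbers are illustrative, not from the thesis):

```java
// Worked example of the io.sort.record.percent / io.sort.mb interaction,
// assuming Hadoop 1.x's 16 bytes of meta-data per intermediate record.
public class SortBufferMath {
    public static void main(String[] args) {
        int ioSortMb = 100;              // io.sort.mb
        double recordPercent = 0.05;     // io.sort.record.percent (default)
        long metaBytes = (long) (ioSortMb * 1024L * 1024 * recordPercent);
        long maxRecords = metaBytes / 16;
        // With small records, the meta-data region fills after ~327,680
        // records and forces a spill even though the 95 MB data region may
        // be mostly empty -- which is why the two parameters interact.
        System.out.println("records before meta-data-triggered spill: " + maxRecords);
    }
}
```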
Rule-Based Optimizer
● An initial attempt is to capture the Hadoop administrator's expertise in a set of <rule, action> pairs:
  ○ Intermediate data size > input data size => enable compression
  ○ Reduce function is associative and commutative => enable the combiner
● This attempt achieved good runtime speedups, but not for all MR jobs
Rule-Based Optimizer (RBO)

[Figure not preserved in the extraction.]
Feedback-Based Tuning Approach
● Another attempt is to capture the effect of the program's complexity and the cluster resources on the performance of the job in an execution profile
● The profile is fed back to an optimizer that provides cost-based recommendations
● This attempt achieved better runtime speedups
Feedback-Based Tuning Approach

[Figure: the feedback-based tuning workflow; not preserved in the extraction.]
Starfish
● Starfish is an automatic feedback-based tuning system

[Diagram: the Starfish workflow, with separate first-submission and subsequent-submission paths.]
Starfish
● Starfish execution profile:
  ○ General: IO, CPU, memory
  ○ Domain-specific: runtimes of every phase in the map/reduce tasks
● Tuning workflow:
  ○ Apply dynamic instrumentation code to the job
  ○ Run the instrumented job with the given parameter settings and collect the execution profile (profile collection overhead: 37% for the WCoP, Word Co-occurrence Pairs, job)
  ○ For the next submission of the same job, make the tuning decisions based on its execution profile (no reuse of profiles across jobs)
  ○ Run the job with the tuned parameter settings
Profile Reuse
● MR jobs have a high likelihood of being similar:
  ○ MR jobs are generated from a high-level query language, e.g. Pig Latin or HiveQL
  ○ Code reuse and refactoring
● Execution profile composition for new jobs:
  J1: map-profile + reduce-profile
  J2: map-profile + reduce-profile
  J3: map function similar to J1's and reduce function similar to J2's, so J3's profile can be composed from J1's map-profile and J2's reduce-profile
Profile Reuse Example
● Bigram Relative Frequency MR job:
  ○ Counts the frequency of each pair of consecutive words, relative to the frequency of the first word in that pair
● Word Co-occurrence MR job:
  ○ Counts the co-occurrences of every pair of words within a sliding window of length n
● At n = 2:
  ○ Similar behaviour
  ○ Similar execution profiles
Profile Reuse Example

[Figure: comparison of the two jobs' execution profiles; not preserved in the extraction.]
Challenge

Given a repository of execution profiles of previously executed MR jobs, how can we automatically compose an execution profile that is useful for tuning the configuration parameters of a newly submitted job?
Outline
   ● Hadoop MapReduce
   ● Tuning Hadoop Configuration Parameters
         ○ Rule-Based Approach
         ○ Feedback-Based Approach
   ● PStorM System Overview
   ● The Profile Matcher
         ○ Feature Selection
         ○ Similarity Measures
         ○ Matching Algorithm
   ● Evaluation
PStorM: Profile Store and Matcher
● PStorM goals:
  ○ An extensible profile store
  ○ An accurate profile matcher that reuses the stored execution profiles to compose a matching profile for the submitted job, even for previously unseen jobs
  ○ The performance gains achieved by the feedback-based tuning system given the profile returned by PStorM should equal the gains achieved given the complete profile of the job
System Overview

[Figure: the PStorM system architecture; not preserved in the extraction.]
Profile Matcher
● Profile matching is a domain-specific pattern recognition problem:
  a. Feature selection
  b. Similarity measures
  c. Matching algorithm
Profile Matcher

[Figure not preserved in the extraction.]
Sample Profile
● Dataflow fields (D):
  ○ Number of input records to the map/reduce tasks
● Cost fields (C):
  ○ Map/reduce phase times in the map/reduce tasks
● Dataflow statistics (DS):
  ○ Selectivity of the map/reduce functions in terms of size and number of records
● Cost statistics (CS):
  ○ CPU cost to process one input/intermediate record in the map/reduce tasks (a sketch of such a record follows)
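One plausible in-memory layout for such a record (an illustrative sketch; the thesis' actual schema, stored in HBase, may differ, and the field names here are ours):

```java
import java.util.Map;

// Hypothetical sketch of an execution profile grouped as on the slide:
// dataflow fields (D), cost fields (C), dataflow statistics (DS), and
// cost statistics (CS). All key names are illustrative only.
public class ExecutionProfile {
    // D: dataflow counters, e.g. "map.input.records" -> 1_000_000
    public Map<String, Long> dataflow;
    // C: phase timings in milliseconds, e.g. "map.spill.time" -> 4200
    public Map<String, Long> costs;
    // DS: derived ratios, e.g. "map.records.selectivity" -> 7.5
    public Map<String, Double> dataflowStats;
    // CS: per-record costs, e.g. "map.cpu.cost.per.record" -> 0.8 (microseconds)
    public Map<String, Double> costStats;
}
```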
Feature Selection
Features of a job: [ D | C | DS | CS ]

● Q: Given a MapReduce job and its sample profile, what features can distinguish the candidate matching profile among the other profiles stored in the Profile Store?

● Answered by examining the analytical models of the What-If engine
Feature Selection

[Diagram: the Starfish workflow again, with its first-submission and subsequent-submission paths.]
Feature Selection
Features of a job: [ D | C | DS | CS ]

● Inputs to the analytical models:
  ○ Dataflow statistics
  ○ Cost statistics
  ○ Configuration parameter settings
    ■ Enumerated by the cost-based optimizer
● Hence there is no need to find a matching profile whose D and C fields are similar to those of the complete profile of the submitted job
Feature Selection
Features of a job: [ DS | CS ]

● The DS and CS features are obtained from the sample profile
● The selected features should have the same values across different samples of the same job, and different values across the profiles of other jobs
Feature Selection
Features of a job: [ DS | CS ]

● Dataflow statistics are expected to have this characteristic
● Map selectivity in number of records:
  ○ Sort: = 1
  ○ Word Count: > 1
  ○ Word Co-occurrence Pairs: >> 1
Feature Selection
Features of a job: [ DS | CS ]

● CS features can vary between different samples of the same job
● The map CPU cost of the same job can differ between sample executions on over-utilized and under-utilized nodes
Feature Selection
Features of a job: [ DS | CS ]

● What features can be extracted from the bytecode of the submitted job that are useful for the matcher?
Feature Selection
Features of a job: [ DS | CS ]

● Differences between MR jobs are captured by:
  ○ Mapper: input formatter, input key type, input value type → intermediate key type, intermediate value type
  ○ Reducer: intermediate key type, intermediate value type → output formatter, output key type, output value type
Feature Selection
Features of a job: [ DS | CS ]

● We will refer to these features as the static features (see the sketch below)

● A different input formatter results in a different IO cost to read the input records
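For illustration, the static features can be read off a configured job through the standard Hadoop accessors; this sketch is ours, not the thesis', and uses class names as the categorical feature values:

```java
import org.apache.hadoop.mapreduce.Job;

// Sketch: collecting the slide's static features from a configured Job
// via the standard Hadoop accessors.
public class StaticFeatures {
    public static String[] of(Job job) throws ClassNotFoundException {
        return new String[] {
            job.getInputFormatClass().getName(),     // input formatter
            job.getMapOutputKeyClass().getName(),    // intermediate key type
            job.getMapOutputValueClass().getName(),  // intermediate value type
            job.getOutputFormatClass().getName(),    // output formatter
            job.getOutputKeyClass().getName(),       // output key type
            job.getOutputValueClass().getName()      // output value type
        };
    }
}
```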
Feature Selection
Features of a job: [ DS | CS ]

● So far, the map/reduce functions have been treated as black boxes
● Static analysis of the bytecode of the map/reduce functions:
  ○ Control Flow Graphs (CFGs)
  ○ Different map/reduce CFGs result in different map/reduce CPU costs
CFG Example

[Figures, shown over three slide builds: map-function CFGs of Word Co-occurrence Pairs (left) and Word Count (right).]

Different map CFGs => different map-phase times
Outline
   ● Hadoop MapReduce
   ● Tuning Hadoop Configuration Parameters
         ○ Rule-Based Approach
         ○ Feedback-Based Approach
   ● PStorM System Overview
   ● The Profile Matcher
         ○ Feature Selection
         ○ Similarity Measures
         ○ Matching Algorithm
   ● Evaluation
Similarity Measures
Features: [ Static | CFG | DS | CS ]

● Matching the static features (see the sketch below):
  ○ Feature values are all strings (categorical data)
  ○ Jaccard similarity index:

        J(A, B) = |A ∩ B| / |A ∪ B|

  ○ Score range: [0, 1]
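A minimal sketch of that computation over two sets of string-valued features:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| over two sets of
// categorical (string-valued) static features; the score lies in [0, 1].
public class Jaccard {
    public static double similarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;  // convention for two empty sets
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                          // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                             // A ∪ B
        return (double) inter.size() / union.size();
    }
}
```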
Similarity Measures
Features: [ Static | CFG | DS | CS ]

● Matching CFGs (see the sketch below):
  ○ Synchronized breadth-first search over both graphs; paired nodes must be:
    ■ Both normal statements, or
    ■ Both branch statements (e.g. the condition of a loop)
  ○ Score range: {0, 1}
    ■ A conservative matching score
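One plausible reading of the synchronized search is sketched below; the pairing rules (equal node kind and equal fan-out, successors paired positionally) are our assumptions, not necessarily the thesis' exact conditions:

```java
import java.util.*;

// Hypothetical sketch: walk both CFGs in lockstep from their entry nodes;
// paired nodes must have the same kind and the same fan-out, or the graphs
// score 0. Surviving the whole walk scores 1 (the conservative {0, 1} score).
public class CfgMatcher {
    public enum Kind { NORMAL, BRANCH }

    public static int match(Map<Integer, List<Integer>> succ1, Map<Integer, Kind> kind1,
                            Map<Integer, List<Integer>> succ2, Map<Integer, Kind> kind2) {
        Deque<int[]> queue = new ArrayDeque<>();
        Set<Long> seen = new HashSet<>();
        queue.add(new int[] {0, 0});                              // entry nodes assumed to be 0
        while (!queue.isEmpty()) {
            int[] p = queue.poll();
            if (!seen.add(((long) p[0] << 32) | p[1])) continue;  // don't revisit pairs (loops)
            if (kind1.get(p[0]) != kind2.get(p[1])) return 0;     // node kinds differ
            List<Integer> s1 = succ1.get(p[0]), s2 = succ2.get(p[1]);
            if (s1.size() != s2.size()) return 0;                 // fan-out differs
            for (int i = 0; i < s1.size(); i++)
                queue.add(new int[] {s1.get(i), s2.get(i)});      // pair successors in order
        }
        return 1;
    }
}
```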
Similarity Measures
Features: [ Static | CFG | DS | CS ]

● Matching DS and CS features (see the sketch below):
  ○ Numerical features
  ○ Data normalization to bring all features to the same scale
  ○ Euclidean distance
  ○ Score range: [0, ∞)
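A sketch combining both steps, assuming z-score normalization with per-feature means and standard deviations taken over the stored profiles (the slide does not pin down the normalization scheme, so this is one reasonable choice):

```java
// Sketch: z-score normalization followed by Euclidean distance, so that
// features on different scales contribute comparably to the score.
public class NumericDistance {
    public static double euclidean(double[] a, double[] b,
                                   double[] mean, double[] std) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double za = (a[i] - mean[i]) / std[i];   // normalize feature i of a
            double zb = (b[i] - mean[i]) / std[i];   // normalize feature i of b
            sum += (za - zb) * (za - zb);
        }
        return Math.sqrt(sum);                       // range: [0, +infinity)
    }
}
```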
Matching Algorithm
● The feature vector is composed of features of mixed data types (categorical and numerical)
● Two possible matching algorithms:
  ○ Multi-stage matching
  ○ Machine learning approach
Multi-Stage Matching

[Figures, shown over two slide builds: the stages of the multi-stage matching algorithm; details not preserved in the extraction.]
Multi-Stage Matching
● The job profile is composed of independent map and reduce profiles

● The multi-stage matcher is therefore applied twice

● The matching map profile and the matching reduce profile compose the final matching job profile (see the sketch below)
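Schematically (all types and method names below are illustrative stand-ins, not PStorM's API):

```java
// Hypothetical sketch of applying the matcher twice and composing the result.
public class TwoStageComposition {
    interface MapProfile {}
    interface ReduceProfile {}
    record JobProfile(MapProfile map, ReduceProfile reduce) {}
    interface ProfileStore {
        MapProfile bestMapMatch(MapProfile sample);          // multi-stage matcher, map side
        ReduceProfile bestReduceMatch(ReduceProfile sample); // multi-stage matcher, reduce side
    }

    static JobProfile match(JobProfile sample, ProfileStore store) {
        // The two best matches may come from different stored jobs.
        return new JobProfile(store.bestMapMatch(sample.map()),
                              store.bestReduceMatch(sample.reduce()));
    }
}
```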
Machine Learning Approach
● Generalized distance function (see the sketch below):
  ○ A weighted sum of the distances/similarities calculated separately for each set of features of the same type:

        dist(P1, P2) = Σ_k w_k · d_k(P1, P2)

  ○ The weights w_k must be learned
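A sketch of such a function with one weight per feature type; converting similarities to distances via 1 − s is our choice, not necessarily the thesis':

```java
// Sketch of the generalized distance: a weighted sum of per-type scores.
// The weights w are exactly what the learner below has to fit.
public class GeneralizedDistance {
    public static double of(double staticSim, double cfgSim,
                            double dsDist, double csDist, double[] w) {
        return w[0] * (1.0 - staticSim)   // static features (Jaccard similarity)
             + w[1] * (1.0 - cfgSim)      // CFG match score in {0, 1}
             + w[2] * dsDist              // dataflow-statistics distance
             + w[3] * csDist;             // cost-statistics distance
    }
}
```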
Machine Learning Approach
● Training data set generation (sketched below):
  ○ For every job Ji in the profile store, pick its profile Pi
  ○ Choose a random profile Pj from the profile store
  ○ Calculate the distances and similarities between Pi and Pj
  ○ Calculate T1: the predicted runtime of job Ji given profile Pi
  ○ Calculate T2: the predicted runtime of job Ji given profile Pj
  ○ Label: D = |T1 - T2|
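A sketch of this loop, where predictRuntime stands in for Starfish's What-If engine and featuresBetween for the similarity/distance computations above (both are hypothetical names):

```java
import java.util.*;

// Sketch of training-data generation. Only the |T1 - T2| labeling follows
// the slide; the interfaces are illustrative stand-ins.
public class TrainingData {
    record Row(double[] features, double label) {}
    interface WhatIf { double predictRuntime(String jobId, String profileId); }
    interface Matcher { double[] featuresBetween(String p1, String p2); }

    static List<Row> generate(List<String> jobIds, List<String> profileIds,
                              WhatIf whatIf, Matcher m, Random rnd) {
        List<Row> rows = new ArrayList<>();
        for (String job : jobIds) {
            String pi = job;  // assume a job's own profile shares its id
            String pj = profileIds.get(rnd.nextInt(profileIds.size()));
            double t1 = whatIf.predictRuntime(job, pi);
            double t2 = whatIf.predictRuntime(job, pj);
            rows.add(new Row(m.featuresBetween(pi, pj), Math.abs(t1 - t2)));
        }
        return rows;
    }
}
```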
Machine Learning Approach
● Machine learning algorithm:
  ○ Gradient Boosted Regression Trees (GBRT)
  ○ Profile matching implemented in R
● Profile matching using the learned model:
  ○ Extract the profile Ps of the submitted MR job
  ○ Calculate the similarities/distances between Ps and the profiles in PStorM, and the corresponding values of D
  ○ Select the PStorM profile whose D is minimal

● PStorM itself uses the multi-stage matching algorithm
Outline
   ● Hadoop MapReduce
   ● Tuning Hadoop Configuration Parameters
         ○ Rule-Based Approach
         ○ Feedback-Based Approach
   ● PStorM System Overview
   ● The Profile Matcher
         ○ Feature Selection
         ○ Similarity Measures
         ○ Matching Algorithm
   ● Evaluation
Infrastructure
   ● 16 x Amazon EC2 c1.medium nodes:
         ○ 2 x Virtual cores
         ○ 1.7 GB of RAM
         ○ 350 GB of instance storage
   ● Hadoop cluster:
         ○ 1 master + 15 workers
         ○ Each worker can run at most 2 map and 2 reduce
           tasks concurrently
   ● PStorM profile store:
         ○ HBase instance running on the master node

Benchmark

[Table: the benchmark MR jobs used in the evaluation; not preserved in the extraction.]
Evaluation
● Objectives:
  a. Profile matcher accuracy
  b. Profile matcher efficiency
    ■ The profile returned by PStorM should result in speedups comparable to those achieved given the complete profile of the submitted job
Profile Matcher Accuracy
● Two content states of the profile store:

● Same Data (SD) content state:
  ○ PStorM contains the profile collected during an execution on the same submitted data set

● Different Data (DD) content state:
  ○ PStorM contains the profile collected during an execution on a different data set
Profile Matcher Accuracy
● The evaluation metric is the number of correct matches as a fraction of the number of job submissions
● In the SD content state:
  ○ A correct match is the profile of the submitted job collected during an execution on the same data set
● In the DD content state:
  ○ A correct match is the profile of the submitted job collected during an execution on another data set
● The number of correct matches is calculated separately for the map and reduce profiles
Profile Matcher Accuracy
● The accuracy of PStorM is compared to the accuracy of alternative solutions
● PStorM contributions at the matching level:
  ○ Feature selection:
    ■ A new set of features: static and CFG
    ■ Feature selection based on our domain knowledge
  ○ The multi-stage matching algorithm
Profile Matcher Accuracy: Feature Selection
● Alternative feature selection approaches:
  ○ P-features:
    ■ Given only the sample profile of the submitted job
  ○ SP-features:
    ■ Given the static features we proposed plus the sample profile of the submitted job
● For both approaches:
  ○ Rank the features according to their information gain
  ○ Select the top F features, where F = the number of features used by PStorM
Profile Matcher Accuracy: Feature Selection

[Figure: matching accuracy of PStorM vs. the P-features and SP-features alternatives; not preserved in the extraction.]
Profile Matcher Accuracy: Matching Algorithm
● PStorM uses the multi-stage matching algorithm
● The alternative is the machine learning approach:
  ○ GBRT has multiple configuration parameters
  ○ We ran four trials with different parameter settings until we found the one that resulted in the highest matching accuracy for GBRT
Profile Matcher Accuracy: Matching Algorithm

[Figure: matching accuracy of multi-stage matching vs. the GBRT approach; not preserved in the extraction.]
Profile Matcher Efficiency
● Runtime speedup is the main factor that matters
● A third content state, NJ:
  ○ The submitted job has not been executed on the cluster before
  ○ Highlights the benefits of profile reuse
Profile Matcher Efficiency

[Figure: runtime speedup results; only one row of the underlying table survives the extraction — "Default: 12, 824, 100, 302" — without its column headers.]
Conclusion
● Hadoop configuration parameters and their effect on the performance of MR jobs
● Robustness and efficiency of the feedback-based tuning approach
● Drawbacks: overhead and no profile reuse
● PStorM: a profile store and matcher that leverages the idea of profile reuse
● PStorM resulted in significant speedups, even for new jobs
