Introduction
Naive Bayes and EM for software effort prediction
               Missing data handling strategies
                                   Experiments
                                        Threats.
                    Conclusion and future work




   Handling missing data in software effort
prediction with naive Bayes and EM algorithm

                   Wen Zhang                 Ye Yang         Qing Wang

                     Laboratory for Internet Software Technologies
                 Institute of Software, Chinese Academy of Sciences
                               Beijing 100190, P.R.China
                         {zhangwen,ye,wq}@itechs.iscas.ac.cn


    7th International Conference on Predictive Models in
           Software Engineering (PROMISE), 2011

               Wen Zhang, Ye Yang, Qing Wang        Software effort prediction with naive Bayes and EM algorithm


Outline
  1   Introduction
  2   Naive Bayes and EM for software effort prediction
  3   Missing data handling strategies
        Missing data toleration strategy.
        Missing data imputation strategy
  4   Experiments
        The datasets
        Experiment setup
        Experimental results
  5   Threats.
  6   Conclusion and future work



Effort prediction with missing data.



       The knowledge about software project effort stored in
       historical datasets can be used to develop predictive
       models, using statistical methods such as linear
       regression and correlation analysis, to predict the effort
       of new incoming projects.
       Most historical effort datasets, however, contain a large
       amount of missing data.






Effort prediction with missing data.



       Because most historical databases are small, the
       common practice of ignoring projects with missing data
       leads to biased and inaccurate prediction models.
       For these reasons, handling missing data in software
       effort datasets has become an important problem.






Sample data
       The historical effort data of projects are organized as
       shown in the following table.

             Table: The sample data in historical project dataset.

                    D     X1    ...   Xj    ...   Xn    H
                    D1    x11   ...   x1j   ...   x1n   h1
                    ...   ...   ...   ...   ...   ...   ...
                    Di    xi1   ...   xij   ...   xin   hi
                    ...   ...   ...   ...   ...   ...   ...
                    Dm    xm1   ...   xmj   ...   xmn   hm

       Xj (1 ≤ j ≤ n) denotes an attribute of project Di
       (1 ≤ i ≤ m). hi is the effort class label of Di and it is
       derived from the real effort of project Di.


Sample data.




       There are l effort classes for all the projects in a dataset,
       that is, hi is equal to one of the elements in {c1, ..., cl}.
       The attributes Xj are assumed to be independent of each
       other and Boolean-valued; in this basic setting there is no
       missing data, i.e. xij ∈ {0, 1}.






Formulation of the problem.


       An effort dataset Ycom containing m historical projects is
       written as Ycom = (D1, ..., Di, ..., Dm)^T, where Di (1 ≤ i ≤ m)
       is a historical project and Di = (xi1, ..., xij, ..., xin)^T is
       represented by n attributes Xj (1 ≤ j ≤ n).
       hi denotes the effort class label of project Di. Each xij,
       the value of attribute Xj (1 ≤ j ≤ n) on Di, is either
       observed or missing.
       Cross-validation on effort prediction is used to evaluate
       the performance of the missing data handling techniques.





Motivation.


       The EM (Expectation Maximization) algorithm is a method for
       finding maximum likelihood or maximum a posteriori
       estimates of parameters in statistical models.
       The motivation for applying EM to naive Bayes is to
       augment the labeled data set with the unlabeled projects
       and their estimated effort class labels.
       The classification performance is then expected to improve
       because more data is used to train the prediction model.





Labeled projects and unlabeled projects.


       For a labeled project Di^L, the effort class is known, so
       P(hi = ct ∣ Di^L) ∈ {0, 1} is determinate.
       For an unlabeled project Di^U, P(hi = ct ∣ Di^U) is
       unknown.
       However, if we can assign a predicted effort class to Di^U,
       then Di^U can also be used to update the estimates
       P(Xj = 0 ∣ ct), P(Xj = 1 ∣ ct) and P(ct), and further to refine
       the effort prediction model P(ct ∣ Di). This process is
       described in Equations 1, 2, 3 and 4.





Estimating $P^{(\tau+1)}(X_j = 1 \mid c_t)$.


       The likelihood of occurrence of $X_j$ with respect to $c_t$ at
       iteration $\tau+1$ is updated by Equation 1 using the
       estimates at iteration $\tau$.

       $$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}. \tag{1}$$

       In practice, we interpret $P^{(\tau+1)}(X_j = 1 \mid c_t)$ as the probability of
       attribute $X_j$ appearing in a project whose effort class is $c_t$.





Estimating $P^{(\tau+1)}(X_j = 0 \mid c_t)$.


       Accordingly, the likelihood of non-occurrence of $X_j$ with
       respect to $c_t$ at iteration $\tau+1$, $P^{(\tau+1)}(X_j = 0 \mid c_t)$, is
       estimated by Equation 2.

       $$P^{(\tau+1)}(X_j = 0 \mid c_t) = 1 - P^{(\tau+1)}(X_j = 1 \mid c_t). \tag{2}$$
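The updates in Equations 1 and 2 can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function name `update_likelihood` and the list-of-lists layout (`X[i][j]` for the value of attribute j on project i, `post[i][t]` for the class posterior of project i from the previous iteration) are assumptions made for the example.

```python
def update_likelihood(X, post):
    """Return (p1, p0): p1[j][t] estimates P(X_j = 1 | c_t), p0[j][t] = 1 - p1[j][t].

    X[i][j]    : 0/1 value of attribute j on project i (no missing values).
    post[i][t] : P(h_i = c_t | D_i) from the previous iteration.
    """
    m, n, l = len(X), len(X[0]), len(post[0])
    # w[j][t] = sum_i x_ij * P(h_i = c_t | D_i): the sum in the numerator of Equation 1
    w = [[sum(X[i][j] * post[i][t] for i in range(m)) for t in range(l)]
         for j in range(n)]
    # total[t] = sum_j w[j][t]: the double sum in the denominator of Equation 1
    total = [sum(w[j][t] for j in range(n)) for t in range(l)]
    p1 = [[(1 + w[j][t]) / (n + total[t]) for t in range(l)] for j in range(n)]
    p0 = [[1 - p for p in row] for row in p1]  # Equation 2
    return p1, p0


X = [[1, 0], [1, 1]]             # two projects, two Boolean attributes
post = [[1.0, 0.0], [0.0, 1.0]]  # both projects labeled: first class, second class
p1, p0 = update_likelihood(X, post)
```

Because the denominator sums over all attributes, the estimates of P(Xj = 1 ∣ ct) for a fixed class add up to one across the n attributes.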






Estimating $P^{(\tau+1)}(c_t)$.


  Second, the effort class prior probability, $P^{(\tau+1)}(c_t)$, is updated
  in the same manner by Equation 3 using the estimates at iteration
  $\tau$. In practice, we may regard $P^{(\tau+1)}(c_t)$ as the prior
  probability of class label $c_t$ appearing in all the software
  projects.

  $$P^{(\tau+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(\tau)}(h_i = c_t \mid D_i)}{l + m}. \tag{3}$$






Estimating $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$.

  Third, the posterior probability of an unlabeled project $D_{i'}$
  belonging to an effort class $c_t$ at iteration $\tau+1$,
  $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$, is updated using Equation 4.

  $$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}. \tag{4}$$
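As a hedged sketch of Equation 4, the class posterior of a single project can be computed as below. The function name `posterior` and the argument layout (`prior[t]` for P(ct), `p1[j][t]` for P(Xj = 1 ∣ ct)) are assumptions for illustration only.

```python
def posterior(x, prior, p1):
    """Return [P(h = c_t | x) for each class t] following Equation 4.

    x        : list of n Boolean (0/1) attribute values of one project.
    prior[t] : current estimate of P(c_t).
    p1[j][t] : current estimate of P(X_j = 1 | c_t).
    """
    joint = []
    for t in range(len(prior)):
        score = prior[t]
        for j, xj in enumerate(x):
            # P(x_j | c_t) is P(X_j = 1 | c_t) if x_j = 1, else P(X_j = 0 | c_t)
            score *= p1[j][t] if xj == 1 else 1 - p1[j][t]
        joint.append(score)
    total = sum(joint)  # denominator of Equation 4: sum over all classes
    return [s / total for s in joint]


example = posterior([1], prior=[0.5, 0.5], p1=[[0.8, 0.2]])
```

With one attribute, equal priors and likelihoods 0.8 and 0.2, the joint scores are 0.4 and 0.1, so the posterior is 0.8 for the first class and 0.2 for the second.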





Estimating $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$.

       Hereafter,
               for labeled projects, if $x_{ij} = 1$, then
               $P^{(\tau)}(x_{ij} \mid c_t) = P^{(\tau)}(X_j = 1 \mid c_t)$; otherwise ($x_{ij} = 0$),
               $P^{(\tau)}(x_{ij} \mid c_t) = P^{(\tau)}(X_j = 0 \mid c_t)$.
               for unlabeled projects, if $x_{i'j} = 1$, then
               $P^{(\tau)}(x_{i'j} \mid c_t) = P^{(\tau)}(X_j = 1 \mid c_t)$; otherwise ($x_{i'j} = 0$),
               $P^{(\tau)}(x_{i'j} \mid c_t) = P^{(\tau)}(X_j = 0 \mid c_t)$.
       Here, $P^{(0)}(X_j = 1 \mid c_t)$ and $P^{(0)}(c_t)$ are initially estimated from
       the labeled projects only, at the first step of the iteration. The
       unlabeled projects are added to the learning process once they
       have been assigned probabilistic effort classes by
       $P^{(1)}(h_{i'} = c_t \mid D_{i'})$.



Predicting the effort class of unlabeled projects.



       We iterate Equations 1, 2, 3 and 4 until their estimates
       converge to stable values.
       Then, $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$ is used to predict the effort class of
       $D_{i'}$.
       The $c_t \in \{c_1, ..., c_l\}$ that maximizes $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$ is
       regarded as the effort class of $D_{i'}$.
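The whole loop over Equations 1 to 4 can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the names `m_step`, `e_step` and `em_naive_bayes`, the zero-based class indices, the stopping rule (largest change in any unlabeled posterior below `tol`) and the `max_iter` cap are all assumptions.

```python
def m_step(X, post):
    """Equations 1 and 3: update P(X_j = 1 | c_t) and P(c_t) from posteriors."""
    m, n, l = len(X), len(X[0]), len(post[0])
    w = [[sum(X[i][j] * post[i][t] for i in range(m)) for t in range(l)]
         for j in range(n)]
    total = [sum(w[j][t] for j in range(n)) for t in range(l)]
    p1 = [[(1 + w[j][t]) / (n + total[t]) for t in range(l)] for j in range(n)]
    prior = [(1 + sum(post[i][t] for i in range(m))) / (l + m) for t in range(l)]
    return p1, prior


def e_step(x, prior, p1):
    """Equation 4: class posterior of one project (Equation 2 gives 1 - p1)."""
    joint = []
    for t in range(len(prior)):
        score = prior[t]
        for j, xj in enumerate(x):
            score *= p1[j][t] if xj == 1 else 1 - p1[j][t]
        joint.append(score)
    total = sum(joint)
    return [s / total for s in joint]


def em_naive_bayes(X_lab, y_lab, X_unl, l, tol=1e-6, max_iter=100):
    """Return (predicted class indices, posteriors) for the unlabeled projects."""
    # Labeled projects keep a fixed 0/1 posterior throughout.
    one_hot = [[1.0 if y == t else 0.0 for t in range(l)] for y in y_lab]
    # Iteration 0: estimate the parameters from the labeled projects only.
    p1, prior = m_step(X_lab, one_hot)
    post_unl = [e_step(x, prior, p1) for x in X_unl]
    for _ in range(max_iter):
        # Re-estimate from labeled plus probabilistically labeled projects.
        p1, prior = m_step(X_lab + X_unl, one_hot + post_unl)
        new_post = [e_step(x, prior, p1) for x in X_unl]
        change = max((abs(a - b) for old, new in zip(post_unl, new_post)
                      for a, b in zip(old, new)), default=0.0)
        post_unl = new_post
        if change < tol:  # assumed convergence test
            break
    labels = [max(range(l), key=lambda t: p[t]) for p in post_unl]
    return labels, post_unl


X_lab = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
y_lab = [0, 0, 1, 1]
X_unl = [[1, 0, 0], [0, 0, 1]]
labels, post_unl = em_naive_bayes(X_lab, y_lab, X_unl, l=2)
```

In this toy input the first unlabeled project matches the attribute pattern of the first class and the second matches the second class, so the predicted indices follow that pattern.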






Initial setting.
        Equation 1 estimates the likelihood of Xj with respect to
        ct, P(Xj = 1∣ct) or P(Xj = 0∣ct), without considering
        missing values among the xij (1 ≤ i ≤ m).
        For each Xj, we can divide the whole historical dataset D
        into two subsets, i.e. D = {Dobs,j ∣ Dmis,j}, where Dobs,j is the
        set of projects whose values on attribute Xj are observed
        and Dmis,j is the set of projects whose values on attribute
        Xj are unobserved.
        We may also divide the attributes in a project Di into two
        subsets, i.e. Di = {Xobs,i ∣ Xmis,i}, where Xobs,i is the set of
        attributes whose values are observed in project Di and
        Xmis,i denotes the set of attributes whose values are
        unobserved in project Di.


Missing data toleration strategy.

       This strategy is similar to the method adopted by
       C4.5 to handle missing data. That is, we ignore missing
       values when training the prediction model.
       To estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$ under this strategy, we
       rewrite Equation 1 as Equation 5.

       $$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}. \tag{5}$$
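A minimal sketch of Equation 5, assuming missing values are encoded as `None` (the slides do not specify an encoding) and using the hypothetical name `update_likelihood_tolerant`:

```python
def update_likelihood_tolerant(X, post):
    """Estimate P(X_j = 1 | c_t) using only projects observed on attribute j.

    X[i][j]    : 0, 1, or None when the value is missing (assumed encoding).
    post[i][t] : P(h_i = c_t | D_i) from the previous iteration.
    """
    m, n, l = len(X), len(X[0]), len(post[0])
    # For each attribute j, sum only over D_obs,j (rows where X[i][j] is observed).
    w = [[sum(X[i][j] * post[i][t] for i in range(m) if X[i][j] is not None)
          for t in range(l)]
         for j in range(n)]
    total = [sum(w[j][t] for j in range(n)) for t in range(l)]
    return [[(1 + w[j][t]) / (n + total[t]) for t in range(l)] for j in range(n)]


X = [[1, None], [1, 1], [None, 0]]
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p1 = update_likelihood_tolerant(X, post)
```

Rows with a missing value on attribute j simply contribute nothing to the sums for that attribute, while still contributing to the sums for the attributes on which they are observed.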





Missing data toleration strategy.



       The difference between Equations 1 and 5 is that only the
       projects with observed values on attribute $X_j$, i.e. $D_{obs,j}$, are used to
       estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$.
       Equation 2 is still used to estimate $P^{(\tau+1)}(X_j = 0 \mid c_t)$,
       and Equation 3 is still used to estimate $P^{(\tau+1)}(c_t)$.






Missing data toleration strategy.

       Accordingly, the prediction model should be adapted from
       Equation 4 to Equation 6.

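       $$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)}{\prod_{j=1}^{|X_{obs,i'}|} \sum_{t=1}^{l} P^{(\tau)}(c_t)\, P^{(\tau)}(x_{i'j} \mid c_t)}. \tag{6}$$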
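The toleration idea at prediction time can be sketched as below: only observed attributes contribute to the class score. The name `posterior_tolerant` and the `None` encoding for missing values are assumptions. As a further assumption, this sketch normalizes the scores over the classes so that they sum to one; the denominator does not depend on the class, so the class that maximizes the score is unaffected by this choice.

```python
def posterior_tolerant(x, prior, p1):
    """Class scores of one project using only its observed attributes.

    x        : list of 0, 1, or None (missing, assumed encoding).
    prior[t] : current estimate of P(c_t).
    p1[j][t] : current estimate of P(X_j = 1 | c_t).
    """
    scores = []
    for t in range(len(prior)):
        s = prior[t]
        for j, xj in enumerate(x):
            if xj is None:  # attribute in X_mis: skipped entirely
                continue
            s *= p1[j][t] if xj == 1 else 1 - p1[j][t]
        scores.append(s)
    total = sum(scores)  # assumed normalization over classes
    return [s / total for s in scores]


prior = [0.5, 0.5]
p1 = [[0.8, 0.2], [0.3, 0.9]]
partly_observed = posterior_tolerant([1, None], prior, p1)
fully_missing = posterior_tolerant([None, None], prior, p1)
```

When every attribute of a project is missing, the result falls back to the class prior.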






Missing data imputation strategy.




       The basic idea of this strategy is that unobserved values of
       attributes can be imputed using the observed values.
       Then, both observed values and imputed values are used
       to construct the prediction model.






Missing data imputation strategy.

       This strategy is embedded in naive Bayes and EM,
       and we may rewrite Equation 1 as Equation 7 to
       estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$.

       $$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s)}{n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(\tau)}(h_s = c_t \mid D_s) \right\}}. \tag{7}$$



Missing data imputation strategy.
       The missing value $x_{sj}$, which is the value of attribute $X_j$ on
       the project $D_s$, is imputed as $\tilde{x}_{sj}$ using Equation 8.

       $$\tilde{x}_{sj} = \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(\tau)}(h_i = c_t \mid D_i)}{\sum_{i=1}^{|D_{obs,j}|} P^{(\tau)}(h_i = c_t \mid D_i)}. \tag{8}$$

       $\tilde{x}_{sj}$ is a constant independent of $D_s$ given $c_t$.
       We round $\tilde{x}_{sj}$ to 1 if $\tilde{x}_{sj} \geq 0.5$
       and to 0 otherwise.
       Here, we also use Equation 3 to estimate $P^{(\tau+1)}(c_t)$.
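Equation 8 and the rounding rule can be sketched as follows. The function name `impute_value`, the `None` encoding for missing values, and the choice to return `None` when no observed project carries any weight for the class are assumptions of this example; the slides do not say how that last case is handled.

```python
def impute_value(X, post, j, t):
    """Imputed 0/1 value for attribute j given class t (Equation 8 plus rounding).

    X[i][j]    : 0, 1, or None when missing (assumed encoding).
    post[i][t] : P(h_i = c_t | D_i) from the previous iteration.
    """
    num = den = 0.0
    for xi, pi in zip(X, post):
        if xi[j] is not None:  # only projects in D_obs,j contribute
            num += xi[j] * pi[t]
            den += pi[t]
    if den == 0:
        return None  # assumption: no evidence for this class, leave unimputed
    return 1 if num / den >= 0.5 else 0


X = [[1], [1], [0], [None]]
post = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
value_for_first_class = impute_value(X, post, j=0, t=0)
value_for_second_class = impute_value(X, post, j=0, t=1)
```

The imputed value is a posterior-weighted mean of the observed values of the attribute, so it depends on the class and on the attribute but not on which project is being imputed.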


Missing data imputation strategy.
       The prediction model $P^{(\tau+1)}(c_t \mid D_i)$ is
       constructed as in Equation 9, taking the missing values
       into account.

       $$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t)\, P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\prod_{j=1}^{n} \sum_{t=1}^{l} P^{(\tau)}(c_t)\, P^{(\tau)}(x_{i'j} \mid c_t)}. \tag{9}$$

       Note that if $x_{i'j}$ is unobserved, its value is substituted
       with $\tilde{x}_{i'j}$ given by Equation 8.



The ISBSG dataset.


       The ISBSG data set (http://www.isbsg.org) has 70
       attributes, and many attribute values are missing.
       We extract 188 projects with 16 attributes using the criterion
       that each project has observed values for at least 2/3 of its
       attributes and each attribute is observed in at least 2/3 of
       the projects.
       Of the 16 attributes, 13 are nominal and 3 are continuous.





The ISBSG dataset.

       We use Equation 10 to normalize the efforts of projects
       into l (= 3) classes.

       $$c_t = \left\lfloor \frac{l \times (\mathit{effort}_{D_i} - \mathit{effort}_{min})}{\mathit{effort}_{max} - \mathit{effort}_{min}} \right\rfloor + 1 \tag{10}$$


                   Table: The effort classes in ISBSG data set.
                        Class No.             # of projects           Label
                            1                      85                  Low
                            2                      76                Medium
                            3                      27                 High
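The class assignment of Equation 10 can be sketched as below. The function name `effort_class` is hypothetical, and the final clamp is an assumption of this example: taken literally, Equation 10 maps the project with the maximum effort to l + 1, and the slides do not say how that boundary case is treated.

```python
import math


def effort_class(effort, effort_min, effort_max, l=3):
    """Map a raw effort value to an effort class number in 1..l (Equation 10)."""
    c = math.floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    # Assumption: clamp so that effort == effort_max falls into class l.
    return min(c, l)


classes = [effort_class(e, 0, 300) for e in (0, 99, 100, 150, 299, 300)]
```

With efforts ranging from 0 to 300 and l = 3, the three classes cover equal-width effort intervals of 100 each.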



The CSBSG dataset.
      The CSBSG dataset contains 1103 projects collected from 140
      organizations and 15 regions across China by the Chinese
      association of software industry.
      We extract 94 projects and 21 attributes (15 nominal
      attributes and 6 continuous attributes) with the same selection
      criterion as for the ISBSG data set. We use Equation 10 to
      normalize the efforts of projects into l (= 3) classes.

                  Table: The effort classes in CSBSG data set.
                           Class No.             # of projects           Label
                               1                      27                  Low
                               2                      31                Medium
                               3                      36                 High


Experiment setup.


       To evaluate the proposed method comparatively, we adopt
       MI and MINI to impute the missing values of the
       ISBSG and CSBSG datasets.
       BPNN is used to classify the projects in the datasets after
       imputation.
       Our experiments are conducted with the 10-fold
       cross-validation technique.
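
The 10-fold cross-validation protocol can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the helper names (`k_fold_indices`, `cross_validate`), the fixed random seed, and the majority-class stand-in classifier are all assumptions made for the example. Only the class sizes (27/31/36) come from the CSBSG table above.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Split range(n_items) into k shuffled, near-equal test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    return [indices[i::k] for i in range(k)]

def cross_validate(records, labels, train_fn, predict_fn, k=10, seed=0):
    """Return the mean accuracy over k train/test splits."""
    folds = k_fold_indices(len(records), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, records[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Hypothetical stand-in classifier: always predicts the majority class.
def train_majority(records, labels):
    return max(set(labels), key=labels.count)

def predict_majority(model, record):
    return model

# Toy data: only the class sizes are taken from the CSBSG table.
labels = ["Low"] * 27 + ["Medium"] * 31 + ["High"] * 36
records = [[0.0]] * len(labels)  # placeholder attribute vectors
folds = k_fold_indices(len(labels))
mean_acc = cross_validate(records, labels, train_majority, predict_majority)
```

Each project appears in exactly one test fold, so every project is predicted once by a model that never saw it during training.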




EM-T and EM-I on ISBSG dataset.



       The following figure illustrates the performance of the
       missing data toleration strategy (hereafter called EM-T)
       and the missing data imputation strategy (hereafter called
       EM-I) in handling missing data for effort prediction on the
       ISBSG dataset.




  Figure: Performances of naive Bayes with EM-I and EM-T in
  comparison with BPNN on effort prediction using ISBSG data set.
  [Plot: accuracy (0.6 to 0.8) against # of unlabeled projects (0 to 20);
  series: EM-I, EM-T, BPNN+MI, BPNN+MINI.]


  What we can see from the figure.
       Both EM-I and EM-T perform better than BPNN
       with either MI or MINI in classifying the projects in the ISBSG
       dataset.
       The performance of naive Bayes and EM improves
       when unlabeled projects are added. This outcome
       illustrates that semi-supervised learning can improve the
       prediction of software effort.
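
How unlabeled projects can enter a naive Bayes model through EM is sketched below. This is a generic semi-supervised naive Bayes for nominal attributes, written under stated assumptions: Laplace smoothing with `alpha=1.0`, a fixed number of iterations, and toy data invented for the example. It is not the authors' implementation, and it does not handle missing values.

```python
import math
from collections import defaultdict

def fit(records, weights, classes, values, alpha=1.0):
    """M-step: weighted, Laplace-smoothed naive Bayes estimates.

    weights[i][c] is the (possibly fractional) membership of record i
    in class c; values[j] lists the categories of attribute j.
    """
    total = {c: sum(w[c] for w in weights) for c in classes}
    n = sum(total.values())
    prior = {c: (total[c] + alpha) / (n + alpha * len(classes))
             for c in classes}
    cond = {}
    for c in classes:
        cond[c] = []
        for j, vals in enumerate(values):
            counts = defaultdict(float)
            for rec, w in zip(records, weights):
                counts[rec[j]] += w[c]
            denom = total[c] + alpha * len(vals)
            cond[c].append({v: (counts[v] + alpha) / denom for v in vals})
    return prior, cond

def posterior(model, record, classes):
    """E-step for one record: P(class | record) under the current model."""
    prior, cond = model
    log_p = {}
    for c in classes:
        log_p[c] = math.log(prior[c]) + sum(
            math.log(cond[c][j][v]) for j, v in enumerate(record))
    top = max(log_p.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(lp - top) for c, lp in log_p.items()}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

def em_naive_bayes(labeled, labels, unlabeled, classes, iterations=10):
    """Start from the labeled projects, then alternate E- and M-steps."""
    records = list(labeled) + list(unlabeled)
    values = [sorted({r[j] for r in records})
              for j in range(len(records[0]))]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for y in labels]
    model = fit(labeled, hard, classes, values)
    for _ in range(iterations):
        soft = [posterior(model, r, classes) for r in unlabeled]
        model = fit(records, hard + soft, classes, values)
    return model

# Toy data (hypothetical attributes: project size and language).
classes = ["Low", "High"]
labeled = [("small", "java"), ("small", "c"),
           ("large", "java"), ("large", "c")]
labels = ["Low", "Low", "High", "High"]
unlabeled = [("small", "java"), ("large", "c"), ("large", "java")]
model = em_naive_bayes(labeled, labels, unlabeled, classes)
probs = posterior(model, ("large", "c"), classes)
```

Labeled projects keep their known class throughout, while each unlabeled project contributes fractional counts in proportion to its current posterior.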




  What we can see from the figure (continued).
       If supervised learning is used for software effort
       prediction, the MINI method is preferable for imputing the missing
       values, whereas the missing data toleration strategy may not be a
       desirable way to handle them.
       The imputation strategy for missing data is more effective than
       the toleration strategy when naive Bayes and EM are used for
       predicting ISBSG software effort.
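
The difference between tolerating and imputing a missing value can be illustrated on a single project record. The sketch below is an assumption-laden simplification, not the authors' EM-T or EM-I: toleration is modelled as skipping the missing attribute in the naive Bayes product, imputation as filling in an assumed most frequent value before scoring, and all probabilities are hand-set toy numbers.

```python
import math

MISSING = None  # hypothetical marker for a missing attribute value

def log_score(prior, cond, record, cls):
    """Log joint score of a class, skipping attributes marked MISSING."""
    score = math.log(prior[cls])
    for j, value in enumerate(record):
        if value is MISSING:
            continue  # toleration: a missing attribute contributes nothing
        score += math.log(cond[cls][j][value])
    return score

def classify(prior, cond, record):
    """Return the class with the highest log joint score."""
    return max(prior, key=lambda cls: log_score(prior, cond, record, cls))

def impute_mode(record, modes):
    """Imputation: replace each missing value with a precomputed fill value."""
    return tuple(modes[j] if v is MISSING else v
                 for j, v in enumerate(record))

# Toy, hand-set parameters (assumptions, not values from the paper).
prior = {"Low": 0.5, "High": 0.5}
cond = {
    "Low":  [{"small": 0.8, "large": 0.2}, {"java": 0.8, "c": 0.2}],
    "High": [{"small": 0.3, "large": 0.7}, {"java": 0.1, "c": 0.9}],
}
modes = ["small", "c"]  # most frequent value per attribute, assumed known

record = ("small", MISSING)
tolerated = classify(prior, cond, record)       # scores the size attribute only
imputed_record = impute_mode(record, modes)
imputed = classify(prior, cond, imputed_record) # scores both attributes
```

In this toy case the two strategies disagree: the filled-in value shifts the prediction, which shows how an imputed value can influence the outcome in either direction.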




EM-T and EM-I on CSBSG dataset.
       EM-T and EM-I in handling missing data for effort
       prediction on the CSBSG dataset.

       Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different
       numbers of unlabeled projects using the CSBSG dataset.
       [Plot: accuracy (0.5 to 0.8) against # of unlabeled projects (0 to 8);
       series: EM-I, EM-T, BPNN+MI, BPNN+MINI.]


  What we can see from the above figure.
       EM-I also performs better than EM-T on the CSBSG
       dataset, as it does on the ISBSG dataset. This further
       supports our conjecture that EM-I outperforms EM-T in
       software effort prediction.
       However, EM-T performs better than EM-I when
       the number of unlabeled projects is larger than that of
       the "maxima", which differs from the ISBSG result. A
       possible explanation is the relatively small
       size of the CSBSG dataset, where the imputation strategy is
       more prone than the toleration strategy to introducing bias
       into the prediction.

More experiments and hypotheses testing.




  More experimental results with explanations are detailed in the
  paper. We also conduct hypothesis testing to examine the
  significance of the conclusions drawn from our experiments.
  Interested readers may refer to the paper.




Threats.

   The threat to external validity is primarily the degree to
   which the attributes we used describe the projects, and
   how representative the ISBSG and CSBSG sample
   datasets are.
   The threats to internal validity are measurement and data
   effects that can bias our results, arising from the use of
   accuracy as the performance measure.
   The threat to construct validity is that our experiments
   use only a clipped subset of the attributes and project data
   from both the ISBSG and CSBSG datasets.



Conclusion




       Semi-supervised learning, in the form of naive Bayes and EM,
       is employed to predict software effort.
       We propose two strategies embedded in naive Bayes and
       EM to handle missing data.




Future work

       We plan to compare the proposed techniques with other
       missing data imputation techniques, such as FIML and
       MSWR.
       We will develop more missing data techniques embedded
       in naive Bayes and EM for software effort prediction.
       We have already investigated the underlying mechanism of
       missingness (structural or unstructured missingness) in
       software effort data. Building on this, we will tailor the
       missing data handling strategies to the underlying
       missingness mechanism of software effort data.


Thanks




  Any further questions about the content of the slides and the
  paper can be sent to Mr. Wen Zhang.
  Email: zhangwen@itechs.iscas.ac.cn




Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 

PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"

  • 1. Introduction Naive Bayes and EM for software effort prediction Missing data handling strategies Experiments Threats. Conclusion and future work Handling missing data in software effort prediction with naive Bayes and EM algorithm Wen Zhang Ye Yang Qing Wang Laboratory for Internet Software Technologies Institute of Software, Chinese Academy of Sciences Beijing 100190, P.R.China {zhangwen,ye,wq}@itechs.iscas.ac.cn 7th International Conference on Predictive Models in Software Engineering (PROMISE), 2011 Wen Zhang, Ye Yang, Qing Wang Software effort prediction with naive Bayes and EM algorithm
  • 2. Outline. 1 Introduction. 2 Naive Bayes and EM for software effort prediction. 3 Missing data handling strategies: missing data toleration strategy; missing data imputation strategy. 4 Experiments: the datasets; experiment setup; experimental results. 5 Threats. 6 Conclusion and future work.
  • 3. Effort prediction with missing data. Knowledge about software project effort stored in historical datasets can be used to develop predictive models, for example with statistical methods such as linear regression and correlation analysis, to predict the effort of new incoming projects. Most historical effort datasets contain a large amount of missing data.
  • 4. Effort prediction with missing data. Because most historical databases are small, the common practice of ignoring projects with missing data leads to biased and inaccurate prediction models. For these reasons, handling missing data in software effort datasets has become an important problem.
  • 5. Sample data. The historical effort data of projects are organized as shown in the following table.
    Table: The sample data in a historical project dataset.
    D   | X1  | ... | Xj  | ... | Xn  | H
    D1  | x11 | ... | x1j | ... | x1n | h1
    ... | ... | ... | ... | ... | ... | ...
    Di  | xi1 | ... | xij | ... | xin | hi
    ... | ... | ... | ... | ... | ... | ...
    Dm  | xm1 | ... | xmj | ... | xmn | hm
    Xj (1 ≤ j ≤ n) denotes an attribute of project Di (1 ≤ i ≤ m). hi is the effort class label of Di, derived from the real effort of project Di.
  • 6. Sample data. There are l effort classes for all the projects in a dataset; that is, hi equals one of the elements of {c1, ..., cl}. The attributes Xj are assumed to be independent of each other and to take Boolean values without missing data, i.e. xij ∈ {0, 1}.
  • 7. Formulation of the problem. An effort dataset Ycom contains m historical projects, Ycom = (D1, ..., Di, ..., Dm)^T, where Di (1 ≤ i ≤ m) is a historical project and Di = (xi1, ..., xij, ..., xin)^T is represented by n attributes Xj (1 ≤ j ≤ n). hi denotes the effort class label of project Di. Each xij, the value of attribute Xj (1 ≤ j ≤ n) on Di, is either observed or missing. Cross-validation on effort prediction is used to evaluate the performance of the missing data handling techniques.
  • 8. Motivation. The EM (Expectation Maximization) algorithm is a method for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models. The motivation for applying EM to naive Bayes is to add the unlabeled projects, together with their estimated effort class labels, to the labeled datasets. Classification performance should then improve because more data are used to train the prediction model.
  • 9. Labeled projects and unlabeled projects. For a labeled project DiL, the effort class is determinate: P(hi = ct | DiL) ∈ {0, 1}. For an unlabeled project DiU, P(hi = ct | DiU) is unknown. However, if we can assign a predicted effort class to DiU, then DiU can also be used to update the estimates P(Xj = 0 | ct), P(Xj = 1 | ct) and P(ct), and further to refine the effort prediction model P(ct | Di). This process is described in Equations 1, 2, 3 and 4.
  • 10. Estimating $P^{(k+1)}(X_j = 1 \mid c_t)$. The likelihood of occurrence of Xj with respect to ct at iteration k+1 is updated by Equation 1 using the estimates at iteration k.
    $$P^{(k+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{m} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}. \quad (1)$$
    In practice, we interpret $P^{(k+1)}(X_j = 1 \mid c_t)$ as the probability of attribute Xj appearing in a project whose effort class is ct.
  • 11. Estimating $P^{(k+1)}(X_j = 0 \mid c_t)$. Accordingly, the likelihood of non-occurrence of Xj with respect to ct at iteration k+1 is estimated by Equation 2.
    $$P^{(k+1)}(X_j = 0 \mid c_t) = 1 - P^{(k+1)}(X_j = 1 \mid c_t). \quad (2)$$
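Equations 1 and 2 can be sketched in a few lines of plain Python. This is only an illustrative reading of the formulas, not the authors' code: the function name `estimate_attribute_likelihood` and the list-of-lists layout (`X[i][j]` for x_ij, `post[i][t]` for P(h_i = c_t | D_i)) are assumptions made for the example.

```python
def estimate_attribute_likelihood(X, post):
    """Sketch of Equation 1 (hypothetical helper, not from the slides).

    X[i][j]    -- Boolean value x_ij of attribute X_j on project D_i (0 or 1)
    post[i][t] -- current estimate of P(h_i = c_t | D_i)
    Returns p1 with p1[t][j] approximating P(X_j = 1 | c_t).
    """
    m, n, l = len(X), len(X[0]), len(post[0])
    p1 = []
    for t in range(l):
        # Posterior-weighted count of projects in which X_j occurs.
        weighted = [sum(X[i][j] * post[i][t] for i in range(m)) for j in range(n)]
        denom = n + sum(weighted)          # n + double sum over j and i
        p1.append([(1 + w) / denom for w in weighted])
    return p1


def complement(p1):
    """Equation 2: P(X_j = 0 | c_t) = 1 - P(X_j = 1 | c_t)."""
    return [[1 - p for p in row] for row in p1]
```

With two labeled projects `X = [[1, 0], [1, 1]]` and one-hot posteriors `[[1, 0], [0, 1]]`, the first class yields 2/3 and 1/3 for the two attributes.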
  • 12. Estimating $P^{(k+1)}(c_t)$. Second, the effort class prior probability $P^{(k+1)}(c_t)$ is updated in the same manner by Equation 3, using the estimates at iteration k. In practice, we may regard $P^{(k+1)}(c_t)$ as the prior probability of class label ct appearing among all the software projects.
    $$P^{(k+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(k)}(h_i = c_t \mid D_i)}{l + m}. \quad (3)$$
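A minimal sketch of Equation 3 follows, assuming the same hypothetical `post[i][t]` layout for P(h_i = c_t | D_i). Because each project's posteriors sum to 1, the smoothed priors returned here also sum to 1.

```python
def estimate_class_prior(post):
    """Sketch of Equation 3 (hypothetical helper, not from the slides).

    post[i][t] -- current estimate of P(h_i = c_t | D_i)
    Returns a list whose t-th entry approximates P(c_t).
    """
    m, l = len(post), len(post[0])
    return [(1 + sum(post[i][t] for i in range(m))) / (l + m) for t in range(l)]
```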
  • 13. Estimating $P^{(k+1)}(h_{i'} = c_t \mid D_{i'})$. Third, the posterior probability of an unlabeled project $D_{i'}$ belonging to an effort class ct at iteration k+1 is updated using Equation 4.
    $$P^{(k+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(k)}(c_t)\, P^{(k)}(D_{i'} \mid c_t)}{P^{(k)}(D_{i'})} = \frac{P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}. \quad (4)$$
  • 14. Estimating $P^{(k+1)}(h_{i'} = c_t \mid D_{i'})$. For labeled projects, if $x_{ij} = 1$ then $P^{(k)}(x_{ij} \mid c_t) = P^{(k)}(X_j = 1 \mid c_t)$; otherwise $x_{ij} = 0$ and $P^{(k)}(x_{ij} \mid c_t) = P^{(k)}(X_j = 0 \mid c_t)$. For unlabeled projects, if $x_{i'j} = 1$ then $P^{(k)}(x_{i'j} \mid c_t) = P^{(k)}(X_j = 1 \mid c_t)$; otherwise $x_{i'j} = 0$ and $P^{(k)}(x_{i'j} \mid c_t) = P^{(k)}(X_j = 0 \mid c_t)$. Here, $P^{(0)}(X_j = 1 \mid c_t)$ and $P^{(0)}(c_t)$ are initially estimated from the labeled projects only, at the first step of the iteration; the unlabeled projects are appended to the learning process after they have been assigned probabilistic effort classes by $P^{(1)}(h_{i'} = c_t \mid D_{i'})$.
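The posterior update of Equation 4, with the per-attribute lookup rule described above, might look like the following sketch. The function name and argument layout (`prior[t]` for P(c_t), `p1[t][j]` for P(X_j = 1 | c_t)) are assumptions for illustration.

```python
def posterior(x, prior, p1):
    """Sketch of Equation 4 (hypothetical helper, not from the slides).

    x        -- Boolean attribute vector of one project
    prior[t] -- current estimate of P(c_t)
    p1[t][j] -- current estimate of P(X_j = 1 | c_t)
    Returns the normalized posteriors P(h = c_t | x), one per class.
    """
    scores = []
    for t in range(len(prior)):
        score = prior[t]
        for j, value in enumerate(x):
            # x_j = 1 uses P(X_j = 1 | c_t); x_j = 0 uses its complement (Equation 2).
            score *= p1[t][j] if value == 1 else 1 - p1[t][j]
        scores.append(score)
    total = sum(scores)                    # plays the role of P(D_i') in Equation 4
    return [s / total for s in scores]
```

For long attribute vectors the product can underflow; summing logarithms instead is a common alternative, though the slides do not discuss it.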
  • 15. Predicting the effort class of unlabeled projects. We iterate Equations 1, 2, 3 and 4 until their estimates converge to stable values. Then $P^{(k+1)}(h_{i'} = c_t \mid D_{i'})$ is used to predict the effort class of $D_{i'}$: the $c_t \in \{c_1, ..., c_l\}$ that maximizes $P^{(k+1)}(h_{i'} = c_t \mid D_{i'})$ is regarded as the effort class of $D_{i'}$.
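The loop over Equations 1 to 4 can be sketched as below for fully observed Boolean data. This is one possible reading, not the authors' implementation: the name `em_naive_bayes`, the convergence test on the largest posterior change (`tol`) and the iteration cap (`max_iter`) are assumptions, since the slides only say that the estimates should converge to stable values.

```python
def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, max_iter=50, tol=1e-6):
    """Sketch of naive Bayes with EM (assumed stopping rule: tol / max_iter)."""
    n = len(X_lab[0])
    # Labeled projects keep fixed one-hot posteriors, P(h_i = c_t | D_i) in {0, 1}.
    post_lab = [[1.0 if y == t else 0.0 for t in range(n_classes)] for y in y_lab]

    def m_step(X, post):
        m = len(X)
        prior = [(1 + sum(p[t] for p in post)) / (n_classes + m)
                 for t in range(n_classes)]                       # Equation 3
        p1 = []
        for t in range(n_classes):
            w = [sum(X[i][j] * post[i][t] for i in range(m)) for j in range(n)]
            denom = n + sum(w)
            p1.append([(1 + wj) / denom for wj in w])             # Equation 1
        return prior, p1

    def e_step(x, prior, p1):
        scores = []
        for t in range(n_classes):
            s = prior[t]
            for j, v in enumerate(x):
                s *= p1[t][j] if v == 1 else 1 - p1[t][j]         # Equation 2
            scores.append(s)
        total = sum(scores)
        return [s / total for s in scores]                        # Equation 4

    # Iteration 0: parameters come from the labeled projects only.
    prior, p1 = m_step(X_lab, post_lab)
    post_unl = [e_step(x, prior, p1) for x in X_unl]
    for _ in range(max_iter):
        prior, p1 = m_step(X_lab + X_unl, post_lab + post_unl)
        new_post = [e_step(x, prior, p1) for x in X_unl]
        delta = max((abs(a - b) for pa, pb in zip(new_post, post_unl)
                     for a, b in zip(pa, pb)), default=0.0)
        post_unl = new_post
        if delta < tol:
            break
    # The class with the largest posterior is taken as the predicted effort class.
    labels = [max(range(n_classes), key=lambda t: p[t]) for p in post_unl]
    return labels, post_unl
```

On a toy dataset where class 0 projects have the first two attributes set and class 1 projects have only the third, two unlabeled projects following the same patterns are assigned to classes 0 and 1 respectively.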
  • 17. Initial setting. When we use Equation 1 to estimate the likelihood of Xj with respect to ct, P(Xj = 1 | ct) or P(Xj = 0 | ct), we do not consider missing values among the xij (1 ≤ i ≤ m). For each Xj, we can divide the whole historical dataset D into two subsets, D = {Dobs,j | Dmis,j}, where Dobs,j is the set of projects whose values on attribute Xj are observed and Dmis,j is the set of projects whose values on attribute Xj are unobserved. We may also divide the attributes of a project Di into two subsets, Di = {Xobs,i | Xmis,i}, where Xobs,i is the set of attributes whose values are observed in project Di and Xmis,i is the set of attributes whose values are unobserved in project Di.
  • 18. Missing data toleration strategy. This strategy is very similar to the method adopted by C4.5 to handle missing data: we ignore missing values when training the prediction model. To estimate $P^{(k+1)}(X_j = 1 \mid c_t)$ under this strategy, we rewrite Equation 1 as Equation 5.
    $$P^{(k+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}. \quad (5)$$
  • 19. Missing data toleration strategy. The difference between Equations 1 and 5 is that only the projects observed on attribute Xj, i.e. Dobs,j, are used to estimate $P^{(k+1)}(X_j = 1 \mid c_t)$. Equation 2 can still be used to estimate $P^{(k+1)}(X_j = 0 \mid c_t)$, and Equation 3 can still be used to estimate $P^{(k+1)}(c_t)$.
  • 20. Missing data toleration strategy. Accordingly, the prediction model should be adapted from Equation 4 to Equation 6, in which the product runs over the observed attributes only.
    $$P^{(k+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(k)}(c_t)\, P^{(k)}(D_{i'} \mid c_t)}{P^{(k)}(D_{i'})} = \frac{P^{(k)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(k)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(k)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(k)}(x_{i'j} \mid c_t)}. \quad (6)$$
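The toleration strategy of Equations 5 and 6 amounts to skipping unobserved cells in both the likelihood estimate and the posterior. The sketch below encodes a missing value as `None`; that encoding and both function names are assumptions made for illustration, not part of the slides.

```python
def likelihood_tolerating(X, post):
    """Sketch of Equation 5: only observed x_ij contribute (None = missing)."""
    m, n, l = len(X), len(X[0]), len(post[0])
    p1 = []
    for t in range(l):
        weighted = [
            sum(X[i][j] * post[i][t] for i in range(m) if X[i][j] is not None)
            for j in range(n)
        ]
        denom = n + sum(weighted)
        p1.append([(1 + w) / denom for w in weighted])
    return p1


def posterior_tolerating(x, prior, p1):
    """Sketch of Equation 6: the product runs over observed attributes only."""
    scores = []
    for t in range(len(prior)):
        score = prior[t]
        for j, value in enumerate(x):
            if value is None:              # unobserved attribute: skipped
                continue
            score *= p1[t][j] if value == 1 else 1 - p1[t][j]
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]
```

A project with every attribute missing simply falls back to the class priors under this reading.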
  • 22. Missing data imputation strategy. The basic idea of this strategy is that unobserved attribute values can be imputed using the observed values. Both observed and imputed values are then used to construct the prediction model.
  • 23. Missing data imputation strategy. This strategy is embedded in naive Bayes and EM, and we may rewrite Equation 1 as Equation 7 to estimate $P^{(k+1)}(X_j = 1 \mid c_t)$.
    $$P^{(k+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(k)}(h_s = c_t \mid D_s)}{n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj}\, P^{(k)}(h_s = c_t \mid D_s) \right\}}. \quad (7)$$
  • 24. Missing data imputation strategy. The missing value $x_{sj}$, which is the value of attribute Xj on the project Ds, is imputed as $\tilde{x}_{sj}$ with Equation 8.
    $$\tilde{x}_{sj} = \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij}\, P^{(k)}(h_i = c_t \mid D_i)}{\sum_{i=1}^{|D_{obs,j}|} P^{(k)}(h_i = c_t \mid D_i)}. \quad (8)$$
    $\tilde{x}_{sj}$ is a constant independent of Ds given ct. We round $\tilde{x}_{sj}$ to 1 if $\tilde{x}_{sj} \geq 0.5$, and to 0 otherwise. Here, we also use Equation 3 to estimate $P^{(k+1)}(c_t)$.
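Equation 8 is a posterior-weighted average of the observed values of one attribute, followed by thresholding at 0.5. A minimal sketch, assuming missing cells are encoded as `None` and using the hypothetical name `impute_value`; returning `None` when no observed project carries any weight for the class is an added guard, not something the slides specify.

```python
def impute_value(column, post_t):
    """Sketch of Equation 8 with the 0.5 threshold (hypothetical helper).

    column[i] -- value of attribute X_j on project D_i (0, 1, or None if missing)
    post_t[i] -- current estimate of P(h_i = c_t | D_i) for one fixed class c_t
    Returns the imputed Boolean value for the missing cells of this column.
    """
    observed = [(v, p) for v, p in zip(column, post_t) if v is not None]
    numerator = sum(v * p for v, p in observed)
    denominator = sum(p for _, p in observed)
    if denominator == 0:
        return None                        # assumption: nothing to average over
    ratio = numerator / denominator        # the raw value of Equation 8
    return 1 if ratio >= 0.5 else 0
```

For example, observed values 1, 1, 0 with posteriors 0.9, 0.8, 0.3 give a ratio of 0.85, so the missing cell is imputed as 1.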
  • 25. Missing data imputation strategy. The prediction model $P^{(k+1)}(c_t \mid D_i)$ can be constructed with Equation 9, taking the missing values into account.
    $$P^{(k+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(k)}(c_t)\, P^{(k)}(D_{i'} \mid c_t)}{P^{(k)}(D_{i'})} = \frac{P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(k)}(c_t) \prod_{j=1}^{n} P^{(k)}(x_{i'j} \mid c_t)}. \quad (9)$$
    Note that if $x_{i'j}$ is unobserved, its value is substituted with $\tilde{x}_{i'j}$ given by Equation 8.
  • 27. The ISBSG dataset. The ISBSG dataset (http://www.isbsg.org) has 70 attributes, and many attributes have missing values. We extract 188 projects with 16 attributes, using the criterion that each project has observed values for at least 2/3 of its attributes and each attribute is observed in at least 2/3 of the projects. 13 attributes are nominal and 3 are continuous.
  • 28. The ISBSG dataset. We use Equation 10 to normalize the efforts of projects into l (= 3) classes.
    $$c_t = \left\lfloor \frac{l \times (\mathit{effort}_{D_i} - \mathit{effort}_{\min})}{\mathit{effort}_{\max} - \mathit{effort}_{\min}} \right\rfloor + 1 \quad (10)$$
    Table: The effort classes in the ISBSG dataset.
    Class No. | # of projects | Label
    1         | 85            | Low
    2         | 76            | Medium
    3         | 27            | High
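Equation 10 is equal-width binning of the effort range. The sketch below is an illustration with a hypothetical function name; note that, read literally, the formula maps the project with the maximum effort to class l + 1, so the clamp to l is an assumption added here and is not stated in the slides.

```python
import math


def effort_class(effort, effort_min, effort_max, l=3):
    """Sketch of Equation 10: map a raw effort value to a class index in 1..l."""
    c = math.floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    # Assumption: clamp so that effort == effort_max falls in class l, not l + 1.
    return min(c, l)
```

With made-up efforts on a 0 to 300 scale, 99 falls in class 1, 100 in class 2, and both 250 and 300 in class 3.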
  • 29. The CSBSG dataset. The CSBSG dataset contains 1103 projects collected from 140 organizations and 15 regions across China by the Chinese association of software industry. We extract 94 projects and 21 attributes (15 nominal and 6 continuous) with the same selection criterion as for the ISBSG dataset. We use Equation 10 to normalize the efforts of projects into l (= 3) classes.
    Table: The effort classes in the CSBSG dataset.
    Class No. | # of projects | Label
    1         | 27            | Low
    2         | 31            | Medium
    3         | 36            | High
  • 31. Experiment setup. To evaluate the proposed method comparatively, we adopt MI and MINI to impute the missing values of the ISBSG and CSBSG datasets. BPNN is used to classify the projects in the datasets after imputation. Our experiments are conducted with 10-fold cross-validation.
  • 33. EM-T and EM-I on ISBSG dataset. The following figure illustrates the performance of the missing data toleration strategy (hereafter EM-T) and the missing data imputation strategy (hereafter EM-I) in handling missing data for effort prediction on the ISBSG dataset.
  • 34. EM-T and EM-I on ISBSG dataset. Figure: Performances of naive Bayes with EM-I and EM-T in comparison with BPNN on effort prediction using the ISBSG dataset. [Plot of accuracy (axis range 0.6 to 0.8) against the number of unlabeled projects (0 to 20), with series EM-I, EM-T, BPNN+MI and BPNN+MINI.]
  • 35. EM-T and EM-I on ISBSG dataset. What we can see from the figure: both EM-I and EM-T perform better than BPNN with either MI or MINI when classifying the projects in the ISBSG dataset. The performance of naive Bayes and EM improves when unlabeled projects are appended. This outcome illustrates that semi-supervised learning can improve the prediction of software effort.
  • 36. EM-T and EM-I on ISBSG dataset. What we can see from the figure: if supervised learning is used for software effort prediction, the MINI method is preferable for imputing the missing values, whereas the missing data toleration strategy may not be desirable. The imputation strategy is more effective than the toleration strategy when naive Bayes and EM are used to predict ISBSG software efforts.
  • 37. EM-T and EM-I on CSBSG dataset. EM-T and EM-I in handling missing data for effort prediction on the CSBSG dataset. Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different numbers of unlabeled projects using the CSBSG dataset. [Plot of accuracy (axis range 0.5 to 0.8) against the number of unlabeled projects (0 to 8), with series EM-I, EM-T, BPNN+MI and BPNN+MINI.]
  • 38. EM-T and EM-I on CSBSG dataset. What we can see from the above figure: EM-I also performs better than EM-T on the CSBSG dataset, as it does on the ISBSG dataset. This further supports our conjecture that EM-I outperforms EM-T in software effort prediction. However, EM-T performs better than EM-I when the number of unlabeled projects is larger than that at the "maxima", which differs from the ISBSG dataset. We attribute this result to the relatively small size of the CSBSG dataset, where the imputation strategy is more prone than the toleration strategy to introducing bias into the predictive model.
  • 39. More experiments and hypothesis testing. More experimental results with explanations are detailed in the paper. We also conduct hypothesis testing to examine the significance of the conclusions drawn from our experiments. Interested readers may refer to the paper.
  • 40. Threats. The primary threat to external validity is the degree to which the attributes we used describe the projects, together with how representative the ISBSG and CSBSG sample datasets are. The threats to internal validity are measurement and data effects that can bias our results, caused by using accuracy as the performance measure. The threat to construct validity is that our experiments use clipped attributes and clipped project data from both the ISBSG and CSBSG datasets.
  • 41. Conclusion. Semi-supervised learning, in the form of naive Bayes with EM, is employed to predict software effort. We propose two strategies embedded in naive Bayes and EM to handle the missing data.
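The two strategies named in the conclusion can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes integer-coded categorical attributes, a hypothetical `MISSING = -1` marker, Laplace smoothing, and a simple most-probable-value imputation rule, none of which are taken from the slides. The function names (`fit_nb`, `posterior`, `impute`, `em_naive_bayes`) are likewise illustrative. Under "tolerate", missing cells are skipped in both the likelihood and the counts; under "impute", they are filled from the current model before each M-step.

```python
import numpy as np

MISSING = -1  # assumed marker for a missing attribute value (not from the slides)


def fit_nb(X, R, n_values, alpha=1.0):
    """M-step: class priors and per-attribute value tables from soft labels R.

    Cells equal to MISSING match no value, so they add nothing to the counts.
    """
    n_classes = R.shape[1]
    prior = (R.sum(axis=0) + alpha) / (R.sum() + alpha * n_classes)
    cond = []
    for j, k in enumerate(n_values):
        counts = np.full((n_classes, k), alpha)  # Laplace smoothing
        for v in range(k):
            counts[:, v] += R[X[:, j] == v].sum(axis=0)
        cond.append(counts / counts.sum(axis=1, keepdims=True))
    return prior, cond


def posterior(X, prior, cond):
    """E-step: class posteriors; missing attributes are left out of the product."""
    log_p = np.tile(np.log(prior), (X.shape[0], 1))
    for j, table in enumerate(cond):
        seen = X[:, j] != MISSING
        log_p[seen] += np.log(table[:, X[seen, j]]).T
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)


def impute(X, R, cond):
    """Fill each missing cell with its most probable value under the model."""
    X = X.copy()
    for j, table in enumerate(cond):
        gone = X[:, j] == MISSING
        X[gone, j] = (R[gone] @ table).argmax(axis=1)
    return X


def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_values,
                   strategy="tolerate", n_iter=20):
    """Semi-supervised naive Bayes: labeled projects keep hard labels,
    unlabeled projects receive soft labels that are refined by EM."""
    X_lab = np.asarray(X_lab, dtype=int)
    X_unl = np.asarray(X_unl, dtype=int)
    n_lab = X_lab.shape[0]
    R_lab = np.eye(n_classes)[np.asarray(y_lab, dtype=int)]
    prior, cond = fit_nb(X_lab, R_lab, n_values)
    X_all = np.vstack([X_lab, X_unl])
    X_fit = X_all
    for _ in range(n_iter):
        R_all = np.vstack([R_lab, posterior(X_fit[n_lab:], prior, cond)])
        if strategy == "impute":
            X_fit = impute(X_all, R_all, cond)
        prior, cond = fit_nb(X_fit, R_all, n_values)
    return prior, cond


# Toy data: two effort classes, two binary attributes, some cells missing.
X_lab = [[0, 0], [0, MISSING], [1, 1], [MISSING, 1]]
y_lab = [0, 0, 1, 1]
X_unl = [[0, 0], [1, 1], [0, MISSING], [1, MISSING]]
X_new = np.array([[0, 0], [1, 1]])

for strategy in ("tolerate", "impute"):
    prior, cond = em_naive_bayes(X_lab, y_lab, X_unl, 2, [2, 2], strategy)
    print(strategy, posterior(X_new, prior, cond).argmax(axis=1))
```

In this sketch the only difference between the two strategies is whether the M-step sees the original matrix or a filled copy, which keeps the comparison easy to follow; the paper should be consulted for the actual definitions of EM-T and EM-I.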
  • 42. Future work. We plan to compare the proposed techniques with other missing data imputation techniques, such as FIML and MSWR. We will develop more missing data techniques embedded in naive Bayes and EM for software effort prediction. We have already investigated the underlying mechanism of missingness (structural or unstructured) in software effort data. Building on this progress, we will improve the missing data handling strategies so that they are oriented to the underlying missingness mechanism of software effort data.
  • 43. Thanks. Any further questions about the content of the slides and the paper can be sent to Mr. Wen Zhang. Email: zhangwen@itechs.iscas.ac.cn