PROMISE 2011: "Handling missing data in software effort prediction with naive Bayes and EM"
1. Introduction
Naive Bayes and EM for software effort prediction
Missing data handling strategies
Experiments
Threats.
Conclusion and future work
Handling missing data in software effort
prediction with naive Bayes and EM algorithm
Wen Zhang Ye Yang Qing Wang
Laboratory for Internet Software Technologies
Institute of Software, Chinese Academy of Sciences
Beijing 100190, P.R.China
{zhangwen,ye,wq}@itechs.iscas.ac.cn
7th International Conference on Predictive Models in
Software Engineering (PROMISE), 2011
Wen Zhang, Ye Yang, Qing Wang Software effort prediction with naive Bayes and EM algorithm
Outline
1 Introduction
2 Naive Bayes and EM for software effort prediction
3 Missing data handling strategies
Missing data toleration strategy.
Missing data imputation strategy
4 Experiments
The datasets
Experiment setup
Experimental results
5 Threats.
6 Conclusion and future work
Effort prediction with missing data.
The knowledge on software project effort stored in
historical datasets can be used to develop predictive
models, using statistical methods such as linear
regression and correlation analysis, to predict the effort of
new incoming projects.
Most historical effort datasets contain a large
amount of missing data.
Effort prediction with missing data.
Because most historical databases are small, the
common practice of ignoring projects with missing data
leads to biased and inaccurate prediction models.
For these reasons, handling missing data in software
effort datasets has become an important problem.
Sample data
The historical effort data of projects are organized as
shown in the following table.
Table: The sample data in historical project dataset.
D     X1    ...   Xj    ...   Xn    H
D1    x11   ...   x1j   ...   x1n   h1
...   ...   ...   ...   ...   ...   ...
Di    xi1   ...   xij   ...   xin   hi
...   ...   ...   ...   ...   ...   ...
Dm    xm1   ...   xmj   ...   xmn   hm
Xj (1 ≤ j ≤ n) denotes an attribute of project Di
(1 ≤ i ≤ m), and xij is the value of Xj on Di. hi is the
effort class label of Di, derived from the real effort of
project Di.
Sample data.
There are l effort classes for all the projects in a dataset,
that is, hi is equal to one of the elements in {c1 , ..., cl }.
The attributes Xj are assumed to be mutually independent and
to take Boolean values, with no missing data for now, i.e. xij ∈ {0, 1}.
Formulation of the problem.
An effort dataset Ycom contains m historical projects,
Ycom = (D1, ..., Di, ..., Dm)^T, where Di (1 ≤ i ≤ m) is a
historical project and Di = (xi1, ..., xij, ..., xin)^T is
represented by n attributes Xj (1 ≤ j ≤ n).
hi denotes the effort class label of project Di. Each xij,
the value of attribute Xj (1 ≤ j ≤ n) on Di, is either
observed or missing.
Cross-validation on effort prediction is used to evaluate
the performance of the missing data handling techniques.
Motivation.
The EM (Expectation Maximization) algorithm is a method for
finding maximum likelihood or maximum a posteriori
estimates of parameters in statistical models.
The motivation for applying EM to naive Bayes is to augment
the labeled dataset with the unlabeled projects and
their estimated effort class labels.
Thus, classification performance should improve because
more data are used to train the prediction model.
Labeled projects and unlabeled projects.
For a labeled project $D_i^L$, its effort class is
determinate: $P(h_i = c_t \mid D_i^L) \in \{0, 1\}$.
For an unlabeled project $D_i^U$, $P(h_i = c_t \mid D_i^U)$ is
unknown.
However, if we assign a predicted effort class to $D_i^U$,
then $D_i^U$ can also be used to update the estimates
$P(X_j = 0 \mid c_t)$, $P(X_j = 1 \mid c_t)$ and $P(c_t)$, and further
to refine the effort prediction model $P(c_t \mid D_i)$. This
process is described in Equations 1, 2, 3 and 4.
Estimating $P^{(\tau+1)}(X_j = 1 \mid c_t)$.
The likelihood of occurrence of $X_j$ with respect to $c_t$ at
iteration $\tau+1$ is updated by Equation 1 using the
estimates at iteration $\tau$.
$$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{m} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{m} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i)} \tag{1}$$
In practice, we interpret $P^{(\tau+1)}(X_j = 1 \mid c_t)$ as the probability of
attribute $X_j$ appearing in a project whose effort class is $c_t$.
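The update in Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `update_attribute_likelihood`, the list-of-lists layout, and the toy data are hypothetical and not taken from the paper. `X[i][j]` holds the Boolean value of attribute Xj on project Di, and `post[i][t]` holds the posterior of project Di belonging to class ct from the previous iteration.

```python
def update_attribute_likelihood(X, post):
    """Return p1[t][j], the Equation 1 estimate of P(Xj = 1 | ct).

    X    : m x n list of 0/1 attribute values (no missing data).
    post : m x l list, post[i][t] = P(hi = ct | Di) from the previous iteration.
    """
    m, n = len(X), len(X[0])
    l = len(post[0])
    p1 = []
    for t in range(l):
        # Posterior-weighted count of each attribute within class ct.
        counts = [sum(X[i][j] * post[i][t] for i in range(m)) for j in range(n)]
        total = sum(counts)  # double sum over attributes and projects
        p1.append([(1.0 + c) / (n + total) for c in counts])
    return p1


# Hypothetical toy data: 3 projects, 2 attributes, 2 effort classes.
X = [[1, 0], [1, 1], [0, 1]]
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p1 = update_attribute_likelihood(X, post)
```

The constant 1 in the numerator and n in the denominator act as smoothing terms, so no estimate is exactly zero even when an attribute never occurs in a class.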
Estimating $P^{(\tau+1)}(X_j = 0 \mid c_t)$.
Accordingly, the likelihood of non-occurrence of $X_j$ with
respect to $c_t$ at iteration $\tau+1$, $P^{(\tau+1)}(X_j = 0 \mid c_t)$, is
estimated by Equation 2.
$$P^{(\tau+1)}(X_j = 0 \mid c_t) = 1 - P^{(\tau+1)}(X_j = 1 \mid c_t) \tag{2}$$
Estimating $P^{(\tau+1)}(c_t)$.
Second, the effort class prior probability, $P^{(\tau+1)}(c_t)$, is updated
in the same manner by Equation 3 using the estimates at
iteration $\tau$. In practice, we may regard $P^{(\tau+1)}(c_t)$ as the prior
probability of class label $c_t$ appearing in all the software
projects.
$$P^{(\tau+1)}(c_t) = \frac{1 + \sum_{i=1}^{m} P^{(\tau)}(h_i = c_t \mid D_i)}{l + m} \tag{3}$$
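As a rough illustration of Equation 3, the prior of each class is the smoothed sum of the posteriors of all projects for that class. The function name `update_class_prior` and the toy posteriors below are assumptions made for this sketch only.

```python
def update_class_prior(post):
    """Return prior[t], the Equation 3 estimate of P(ct).

    post : m x l list, post[i][t] = P(hi = ct | Di) from the previous iteration.
    """
    m, l = len(post), len(post[0])
    return [(1.0 + sum(post[i][t] for i in range(m))) / (l + m) for t in range(l)]


# Hypothetical posteriors for 3 projects and 2 effort classes.
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
prior = update_class_prior(post)
```

Because each row of `post` sums to 1, the numerators sum to l + m, so the estimated priors sum to 1.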
Estimating $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$.
Third, the posterior probability of an unlabeled project $D_{i'}$
belonging to effort class $c_t$ at iteration $\tau+1$,
$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$, is updated using Equation 4.
$$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t) P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)} \tag{4}$$
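Equation 4 is Bayes' rule under the naive independence assumption. A minimal sketch follows, assuming the attribute likelihoods are stored as `p1[t][j]` for P(Xj = 1 | ct) and that the likelihood for a value of 0 is obtained via Equation 2. The function name `posterior` and the numbers are hypothetical.

```python
def posterior(x, prior, p1):
    """Return P(h = ct | D) for every class ct, following Equation 4.

    x     : list of n Boolean attribute values of one project.
    prior : list of l class priors P(ct).
    p1    : l x n list, p1[t][j] = P(Xj = 1 | ct).
    """
    joint = []
    for t in range(len(prior)):
        like = prior[t]
        for j, xj in enumerate(x):
            # Equation 2 gives the likelihood of a 0 as 1 - P(Xj = 1 | ct).
            like *= p1[t][j] if xj == 1 else 1.0 - p1[t][j]
        joint.append(like)
    z = sum(joint)  # P(D), the sum of the joint terms over all classes
    return [v / z for v in joint]


# Hypothetical parameters for 2 classes and 2 attributes.
prior = [0.6, 0.4]
p1 = [[0.6, 0.4], [1.0 / 3.0, 2.0 / 3.0]]
post = posterior([1, 0], prior, p1)
```

For datasets with many attributes, summing logarithms instead of multiplying probabilities would avoid numerical underflow; the product form is kept here to mirror the equation.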
Estimating $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$.
Hereafter,
for labeled projects, if $x_{ij} = 1$, then
$P^{(\tau)}(x_{ij} \mid c_t) = P^{(\tau)}(X_j = 1 \mid c_t)$; otherwise $x_{ij} = 0$ and
$P^{(\tau)}(x_{ij} \mid c_t) = P^{(\tau)}(X_j = 0 \mid c_t)$.
for unlabeled projects, if $x_{i'j} = 1$, then
$P^{(\tau)}(x_{i'j} \mid c_t) = P^{(\tau)}(X_j = 1 \mid c_t)$; otherwise $x_{i'j} = 0$ and
$P^{(\tau)}(x_{i'j} \mid c_t) = P^{(\tau)}(X_j = 0 \mid c_t)$.
Here, $P^{(0)}(X_j = 1 \mid c_t)$ and $P^{(0)}(c_t)$ are initially estimated
from the labeled projects only at the first step of the iteration, and
the unlabeled projects are appended to the learning
process after they have been assigned probabilistic effort classes
by $P^{(1)}(h_{i'} = c_t \mid D_{i'})$.
Predicting the effort class of unlabeled projects.
We iterate Equations 1, 2, 3 and 4 until their estimates
converge to stable values.
Then, $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$ is used to predict the effort class of
$D_{i'}$.
The $c_t \in \{c_1, ..., c_l\}$ that maximizes $P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'})$ is
regarded as the effort class of $D_{i'}$.
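The whole loop can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function names, the convergence tolerance `tol`, the iteration cap `max_iter`, and the toy data are all assumptions. Labeled projects keep fixed 0/1 posteriors, the parameters are first estimated from labeled projects only, and the unlabeled posteriors are then re-estimated until they stop changing.

```python
def m_step(X, post, n_classes):
    """Equations 1 and 3: re-estimate P(Xj = 1 | ct) and P(ct)."""
    m, n = len(X), len(X[0])
    prior = [(1.0 + sum(p[t] for p in post)) / (n_classes + m)
             for t in range(n_classes)]
    p1 = []
    for t in range(n_classes):
        counts = [sum(X[i][j] * post[i][t] for i in range(m)) for j in range(n)]
        total = sum(counts)
        p1.append([(1.0 + c) / (n + total) for c in counts])
    return prior, p1


def e_step(x, prior, p1):
    """Equations 2 and 4: posterior over classes for one project."""
    joint = []
    for t in range(len(prior)):
        like = prior[t]
        for j, xj in enumerate(x):
            like *= p1[t][j] if xj == 1 else 1.0 - p1[t][j]
        joint.append(like)
    z = sum(joint)
    return [v / z for v in joint]


def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, tol=1e-6, max_iter=100):
    # Labeled projects have determinate posteriors in {0, 1}.
    lab_post = [[1.0 if y == t else 0.0 for t in range(n_classes)] for y in y_lab]
    # Iteration 0: parameters estimated from labeled projects only.
    prior, p1 = m_step(X_lab, lab_post, n_classes)
    unl_post = [e_step(x, prior, p1) for x in X_unl]
    for _ in range(max_iter):
        prior, p1 = m_step(X_lab + X_unl, lab_post + unl_post, n_classes)
        new_post = [e_step(x, prior, p1) for x in X_unl]
        change = max((abs(a - b) for p, q in zip(new_post, unl_post)
                      for a, b in zip(p, q)), default=0.0)
        unl_post = new_post
        if change < tol:  # estimates have converged to stable values
            break
    # The class with the largest posterior is the predicted effort class.
    labels = [max(range(n_classes), key=lambda t: p[t]) for p in unl_post]
    return labels, unl_post


# Hypothetical data: 4 labeled and 2 unlabeled projects, 3 attributes, 2 classes.
X_lab = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y_lab = [0, 0, 1, 1]
X_unl = [[1, 1, 0], [0, 0, 1]]
labels, unl_post = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
```

Classes are indexed from 0 in the code, so class index t corresponds to the effort class written as c with subscript t + 1 in the slides.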
Initial setting.
When we use Equation 1 to estimate the likelihood of Xj
with respect to ct, P(Xj = 1|ct) or P(Xj = 0|ct), we do not
consider missing values among xij (1 ≤ i ≤ m).
For each Xj, we can divide the whole historical dataset D
into two subsets, i.e. D = {Dobs,j | Dmis,j}, where Dobs,j is the
set of projects whose values on attribute Xj are observed
and Dmis,j is the set of projects whose values on attribute
Xj are unobserved.
We may also divide the attributes of a project Di into two
subsets, i.e. Di = {Xobs,i | Xmis,i}, where Xobs,i is the set of
attributes whose values are observed in project Di and
Xmis,i denotes the set of attributes whose values are
unobserved in project Di.
Missing data toleration strategy.
This strategy is very similar to the method adopted by
C4.5 to handle missing data. That is, we ignore missing
values when training the prediction model.
To estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$ under this strategy, we
rewrite Equation 1 as Equation 5.
$$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i)}{n + \sum_{j=1}^{n} \sum_{i=1}^{|D_{obs,j}|} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i)} \tag{5}$$
Missing data toleration strategy.
The difference between Equations 1 and 5 is that only
projects with observed values on attribute $X_j$, i.e., $D_{obs,j}$, are used to
estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$.
Equation 2 can still be used to estimate
$P^{(\tau+1)}(X_j = 0 \mid c_t)$, and Equation 3 to estimate
$P^{(\tau+1)}(c_t)$.
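A small sketch of Equation 5, assuming missing values are encoded as `None` (an encoding chosen for this illustration, not specified in the slides). The only change from the complete-data update is that projects with a missing value on Xj are skipped when the weighted count for Xj is accumulated. The function name and data are hypothetical.

```python
def update_likelihood_tolerating(X, post):
    """Return p1[t][j], the Equation 5 estimate of P(Xj = 1 | ct).

    X    : m x n list with values 0, 1 or None (None marks a missing value).
    post : m x l list, post[i][t] = P(hi = ct | Di) from the previous iteration.
    """
    m, n = len(X), len(X[0])
    l = len(post[0])
    p1 = []
    for t in range(l):
        # Only projects in Dobs,j contribute to the count for attribute Xj.
        counts = [sum(X[i][j] * post[i][t]
                      for i in range(m) if X[i][j] is not None)
                  for j in range(n)]
        total = sum(counts)
        p1.append([(1.0 + c) / (n + total) for c in counts])
    return p1


# Hypothetical toy data with two missing entries.
X = [[1, None], [1, 1], [None, 1]]
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p1 = update_likelihood_tolerating(X, post)
```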
Missing data toleration strategy.
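Accordingly, the prediction model should be adapted from
Equation 4 to Equation 6, where the product runs only over the
attributes observed in $D_{i'}$.
$$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t) P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{|X_{obs,i'}|} P^{(\tau)}(x_{i'j} \mid c_t)} \tag{6}$$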
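Equation 6 can be illustrated by skipping unobserved attributes when the class-conditional likelihood is multiplied out. As before, `None` as the missing-value marker, the function name, and the parameter values are assumptions of this sketch.

```python
def posterior_tolerating(x, prior, p1):
    """Return P(h = ct | D) using only the observed attributes (Equation 6).

    x     : list of n values in {0, 1, None}; None marks a missing value.
    prior : list of l class priors P(ct).
    p1    : l x n list, p1[t][j] = P(Xj = 1 | ct).
    """
    joint = []
    for t in range(len(prior)):
        like = prior[t]
        for j, xj in enumerate(x):
            if xj is None:
                continue  # attribute is in Xmis, so it contributes no factor
            like *= p1[t][j] if xj == 1 else 1.0 - p1[t][j]
        joint.append(like)
    z = sum(joint)
    return [v / z for v in joint]


# Hypothetical parameters for 2 classes and 2 attributes.
prior = [0.5, 0.5]
p1 = [[0.8, 0.3], [0.2, 0.6]]
post = posterior_tolerating([1, None], prior, p1)
post_all_missing = posterior_tolerating([None, None], prior, p1)
```

When every attribute of a project is missing, the posterior falls back to the class prior, which is the behaviour one would expect from ignoring missing values.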
Missing data imputation strategy.
The basic idea of this strategy is that unobserved values of
attributes can be imputed using the observed values.
Then, both observed values and imputed values are used
to construct the prediction model.
Missing data imputation strategy.
This strategy is embedded in naive Bayes
and EM, and we may rewrite Equation 1 as Equation 7 to
estimate $P^{(\tau+1)}(X_j = 1 \mid c_t)$.
$$P^{(\tau+1)}(X_j = 1 \mid c_t) = \frac{1 + \sum_{i=1}^{|D_{obs,j}|} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj} P^{(\tau)}(h_s = c_t \mid D_s)}{n + \sum_{j=1}^{n} \left\{ \sum_{i=1}^{|D_{obs,j}|} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i) + \sum_{s=1}^{|D_{mis,j}|} \tilde{x}_{sj} P^{(\tau)}(h_s = c_t \mid D_s) \right\}} \tag{7}$$
Missing data imputation strategy.
The missing value $x_{sj}$, which is the value of attribute $X_j$ on
the project $D_s$, is imputed as $\tilde{x}_{sj}$ using Equation 8.
$$\tilde{x}_{sj} = \frac{\sum_{i=1}^{|D_{obs,j}|} x_{ij} P^{(\tau)}(h_i = c_t \mid D_i)}{\sum_{i=1}^{|D_{obs,j}|} P^{(\tau)}(h_i = c_t \mid D_i)} \tag{8}$$
$\tilde{x}_{sj}$ is a constant independent of $D_s$ given $c_t$.
We stipulate that $\tilde{x}_{sj}$ is rounded to 1 if $\tilde{x}_{sj} \geq 0.5$;
otherwise, $\tilde{x}_{sj}$ is rounded to 0.
Here, we also use Equation 3 to estimate $P^{(\tau+1)}(c_t)$.
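The imputation rule of Equation 8 amounts to a posterior-weighted mean of the observed values of Xj for class ct, thresholded at 0.5. The sketch below assumes `None` marks a missing value; the function name `impute_value`, the handling of a zero denominator, and the data are hypothetical choices for illustration.

```python
def impute_value(X, post, j, t):
    """Return the imputed 0/1 value of attribute Xj given class ct (Equation 8).

    X    : m x n list with values 0, 1 or None (None marks a missing value).
    post : m x l list, post[i][t] = P(hi = ct | Di) from the previous iteration.
    """
    observed = [i for i in range(len(X)) if X[i][j] is not None]
    num = sum(X[i][j] * post[i][t] for i in observed)
    den = sum(post[i][t] for i in observed)
    if den == 0:
        # Assumption of this sketch: leave the value missing when no observed
        # project lends any weight to class ct.
        return None
    return 1 if num / den >= 0.5 else 0


# Hypothetical data: 4 projects, 2 attributes, 2 classes.
X = [[1, 0], [1, None], [0, 1], [None, 1]]
post = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
imputed = [[impute_value(X, post, j, t) for t in range(2)] for j in range(2)]
```

Here `imputed[j][t]` is the value substituted for a missing Xj when the weighted count for class index t is accumulated, so the same missing cell may be imputed differently for different classes.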
Missing data imputation strategy.
As for the prediction model, $P^{(\tau+1)}(c_t \mid D_{i'})$ can be
constructed by Equation 9, taking the missing
values into account.
$$P^{(\tau+1)}(h_{i'} = c_t \mid D_{i'}) = \frac{P^{(\tau)}(c_t) P^{(\tau)}(D_{i'} \mid c_t)}{P^{(\tau)}(D_{i'})} = \frac{P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)}{\sum_{t=1}^{l} P^{(\tau)}(c_t) \prod_{j=1}^{n} P^{(\tau)}(x_{i'j} \mid c_t)} \tag{9}$$
Note that if $x_{i'j}$ is unobserved, its value is substituted
with $\tilde{x}_{i'j}$ given by Equation 8.
The ISBSG dataset.
The ISBSG data set (http://www.isbsg.org) has 70
attributes, many of which have missing values.
We extract 188 projects with 16 attributes under the criterion
that each project has observed values for at least 2/3 of its
attributes and each attribute is observed in at least 2/3 of
the projects.
Of the 16 attributes, 13 are nominal and 3 are
continuous.
The ISBSG dataset.
We use Equation 10 to normalize the efforts of projects
into l (= 3) classes.
$$c_t = \left\lfloor \frac{l \times (effort_{D_i} - effort_{min})}{effort_{max} - effort_{min}} \right\rfloor + 1 \tag{10}$$
Table: The effort classes in ISBSG data set.
Class No.   # of projects   Label
1           85              Low
2           76              Medium
3           27              High
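Equation 10 is an equal-width binning of effort into l classes. A minimal sketch follows; the function name and the effort values are hypothetical. Taken literally, the formula maps the project with the maximum effort to class l + 1, so the sketch clamps the result to l, which is an assumption of this example rather than something stated in the slides.

```python
import math


def effort_class(effort, effort_min, effort_max, l=3):
    """Map a raw effort value to a class number in 1..l (Equation 10)."""
    c = math.floor(l * (effort - effort_min) / (effort_max - effort_min)) + 1
    # Assumption: clamp so that the maximum-effort project falls in class l.
    return min(c, l)


# Hypothetical effort values (units unspecified).
efforts = [100, 400, 500, 900, 1000]
classes = [effort_class(e, min(efforts), max(efforts)) for e in efforts]
```

With l = 3, classes 1, 2 and 3 correspond to the Low, Medium and High labels in the table.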
The CSBSG dataset.
The CSBSG dataset contains 1103 projects collected from 140
organizations and 15 regions across China by the Chinese
association of software industry.
We extract 94 projects and 21 attributes (15 nominal
attributes and 6 continuous attributes) with the same selection
criterion as for the ISBSG data set. We use Equation 10 to
normalize the efforts of projects into l (= 3) classes.
Table: The effort classes in CSBSG data set.
Class No.   # of projects   Label
1           27              Low
2           31              Medium
3           36              High
Experiment setup.
To evaluate the proposed method comparatively, we adopt
MI and MINI to impute the missing values of the
ISBSG and CSBSG datasets.
BPNN is used to classify the projects in the data sets after
imputation.
Our experiments are conducted with the 10-fold
cross-validation technique.
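For readers unfamiliar with the protocol, 10-fold cross-validation can be sketched as below. The slides do not say how folds were formed, so the shuffling, the seed, and the function name `k_fold_indices` are assumptions of this illustration. The count of 94 matches the number of projects extracted from the CSBSG dataset.

```python
import random


def k_fold_indices(n_projects, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_projects))
    random.Random(seed).shuffle(idx)  # assumed: random assignment to folds
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test


splits = list(k_fold_indices(94, k=10))
```

Each project appears in exactly one test fold, and accuracy would be averaged over the 10 folds.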
EM-T and EM-I on ISBSG dataset.
The following figure illustrates the performance of the
missing data toleration strategy (hereafter called EM-T)
and the missing data imputation strategy (hereafter called
EM-I) in handling the missing data for effort prediction on
the ISBSG data set.
EM-T and EM-I on ISBSG dataset.
Figure: Performances of naive Bayes with EM-I and EM-T in
comparison with BPNN on effort prediction using ISBSG data set.
(Plot not recoverable. x-axis: # of unlabeled projects, 0 to 20; y-axis: accuracy, 0.6 to 0.8; series: EM-I, EM-T, BPNN+MI, BPNN+MINI.)
EM-T and EM-I on ISBSG dataset.
What we can see from the figure.
Both EM-I and EM-T perform better than BPNN
with either MI or MINI on classifying the projects in the ISBSG
data set.
The performance of naive Bayes and EM improves
when unlabeled projects are appended. This outcome
illustrates that semi-supervised learning can improve the
prediction of software effort.
EM-T and EM-I on ISBSG dataset.
What we can see from the figure.
If supervised learning is used for software effort
prediction, the MINI method is preferable for imputing the missing
values, whereas the missing data toleration strategy may not be
desirable for handling missing values.
The imputation strategy for missing data is more effective than
the toleration strategy when naive Bayes and EM are used for
predicting ISBSG software effort.
EM-T and EM-I on CSBSG dataset.
Performance of EM-T and EM-I in handling the missing data for effort
prediction on the CSBSG dataset.
Figure: Performances of EM-I and EM-T in comparison with BPNN on predicting effort with different
number of unlabeled projects using CSBSG dataset.
(Plot not recoverable. x-axis: # of unlabeled projects, 0 to 8; y-axis: accuracy, 0.5 to 0.8; series: EM-I, EM-T, BPNN+MI, BPNN+MINI.)
EM-T and EM-I on CSBSG dataset.
What we can see from the above figure.
EM-I also performs better than EM-T on the CSBSG data set,
as it does on the ISBSG dataset. This further supports our
conjecture that EM-I outperforms EM-T in software effort prediction.
However, EM-T performs better than EM-I when
the number of unlabeled projects is larger than that at the
"maxima", which differs from the ISBSG dataset. A possible
explanation is the relatively small
size of the CSBSG dataset, where the imputation strategy is
more prone than the toleration strategy to introduce bias
into the prediction model.
More experiments and hypotheses testing.
More experimental results with explanations are detailed in the
paper. We also conduct hypothesis testing to examine the
significance of the conclusions drawn from our experiments.
Interested readers may refer to the paper.
Threats.
The threat to external validity is primarily the degree to
which the attributes we used describe the projects, and
the representativeness of the ISBSG and CSBSG sample
datasets.
The threats to internal validity are measurement and data
effects that can bias our results, caused by using accuracy
as the performance measure.
The threat to construct validity is that our experiments
use a clipped subset of the attributes and project data
from both the ISBSG and CSBSG datasets.
Conclusion
Semi-supervised learning, in the form of naive Bayes and EM, is
employed to predict software effort.
We propose two strategies embedded in naive Bayes and
EM to handle the missing data.
Future work
We plan to compare the proposed techniques with other
missing data imputation techniques, such as FIML and
MSWR.
We will develop more missing data techniques embedded
with naive Bayes and EM for software effort prediction.
We have already investigated the underlying mechanism of
missingness (structural missing or unstructured missing) of
software effort data. With this progress, we will improve the
missing data handling strategies oriented to the underlying
missing mechanism of software effort data.
Thanks
Any further questions about the content of the slides and the
paper can be sent to Mr. Wen Zhang.
Email: zhangwen@itechs.iscas.ac.cn