Cross-project defect prediction is very appealing because (i) it allows predicting defects in projects for which little historical data is available, and (ii) it allows producing generalizable prediction models. However, existing research suggests that cross-project prediction is particularly challenging and, due to the heterogeneity of projects, prediction accuracy is often unsatisfactory. This paper proposes a novel, multi-objective approach for cross-project defect prediction, based on a multi-objective logistic regression model built using a genetic algorithm. Instead of providing the software engineer with a single predictive model, the multi-objective approach allows software engineers to choose predictors achieving a compromise between the number of likely defect-prone artifacts (effectiveness) and the LOC to be analyzed/tested (a proxy for the cost of code inspection). Results of an empirical evaluation on 10 datasets from the PROMISE repository indicate that the multi-objective approach is superior to, and more useful than, single-objective predictors. Also, the proposed approach outperforms an alternative cross-project prediction approach based on local prediction over clusters of similar classes.
6. Indicators of Defects
- Cached history information (Kim et al., ICSE 2007)
- Change metrics (Moser et al., ICSE 2008)
- A metrics suite for object-oriented design (Chidamber et al., TSE 1994)
9. Defect Prediction Methodology
[Diagram: within-project prediction; a predicting model trained on a project's training set classifies each test-set class (Class1 ... ClassN) as defect-prone YES/NO.]
Issue: the size of the training set.
10. Defect Prediction Methodology
[Diagram: the within-project setting repeated, extended with models trained on past projects and applied to a new project.]
Issue: the size of the training set.
11. Defect Prediction Methodology
[Diagram: within-project vs. cross-project prediction; in the cross-project setting, a model trained on Project A predicts defect-prone classes in Project B.]
Within-project issue: the size of the training set.
12. Defect Prediction Methodology
[Diagram: within-project vs. cross-project prediction, as in the previous slide.]
Within-project issue: the size of the training set.
Cross-project issue: the prediction accuracy can be lower.
13. Cost Effectiveness
1) Cross-project prediction does not necessarily work worse than within-project prediction
2) Better precision (accuracy) does not imply lower inspection cost
3) Traditional predicting model: logistic regression
"Recalling the 'Imprecision' of Cross-Project Defect Prediction", Rahman et al., FSE 2012
15.-19. Cost Effectiveness: an example
[Diagram: a system with Class A (100 LOC), Class B (10,000 LOC), Class C (100 LOC), and Class D (100 LOC); bug icons mark the classes that actually contain defects.]
- Predicting model 1 flags Class A and Class B: precision = 50%, inspection cost = 10,100 LOC.
- Predicting model 2 flags Class A, Class C, and Class D: precision = 33%, inspection cost = 300 LOC.
- Precision does not mirror the inspection cost.
- All the existing predicting models work on precision, not on cost.
- We need COST-oriented models.
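The arithmetic of the example above can be reproduced in a few lines. The class sizes come from the slides; the exact location of the bugs is an assumption chosen to match the reported precision values (one true positive per model):

```python
# Class sizes in LOC (from the example) and which classes actually
# contain a bug (assumed here for illustration).
loc = {"A": 100, "B": 10_000, "C": 100, "D": 100}
buggy = {"B", "C"}  # assumption: each model flags exactly one buggy class

def precision_and_cost(predicted, loc, buggy):
    """Precision = truly buggy flagged classes / all flagged classes;
    inspection cost = total LOC of the flagged classes."""
    hits = len(set(predicted) & buggy)
    precision = hits / len(predicted)
    cost = sum(loc[c] for c in predicted)
    return precision, cost

p1, c1 = precision_and_cost(["A", "B"], loc, buggy)       # 0.5, 10,100 LOC
p2, c2 = precision_and_cost(["A", "C", "D"], loc, buggy)  # ~0.33, 300 LOC
```

The less precise model is the far cheaper one to act on: inspecting model 2's predictions costs 300 LOC instead of 10,100.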
30. Multi-objective Genetic Algorithm
The model is a logistic regression whose coefficients form the chromosome \((a, b, c, \dots)\):
\[
\mathit{Pred}_i = \frac{e^{a + b\,m_{i1} + c\,m_{i2} + \dots}}{1 + e^{a + b\,m_{i1} + c\,m_{i2} + \dots}}
\]
Fitness function:
\[
\begin{cases}
\max\ \mathit{Effectiveness} = \sum_i \mathit{Pred}_i \cdot \mathit{Actual}_i \\
\min\ \mathit{InspectionCost} = \sum_i \mathit{Pred}_i \cdot \mathit{Cost}_i
\end{cases}
\]
Multiple objectives are optimized using Pareto-efficient approaches.
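A sketch of this bi-objective fitness, assuming the chromosome directly encodes the logistic coefficients (function and variable names are illustrative, not from the paper):

```python
import math

def predict(chromosome, metrics):
    """Logistic model: Pred = e^z / (1 + e^z), z = a + b*m1 + c*m2 + ..."""
    z = chromosome[0] + sum(w * m for w, m in zip(chromosome[1:], metrics))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z/(1+e^z)

def fitness(chromosome, classes):
    """classes: list of (metrics, actual_buggy 0/1, loc).
    Returns (effectiveness to maximize, inspection cost to minimize)."""
    preds = [predict(chromosome, m) for m, _, _ in classes]
    effectiveness = sum(p * actual for p, (_, actual, _) in zip(preds, classes))
    cost = sum(p * loc for p, (_, _, loc) in zip(preds, classes))
    return effectiveness, cost

# Usage: with all coefficients zero, every prediction is 0.5.
classes = [([1.0, 2.0], 1, 100), ([0.0, 0.0], 0, 200)]
eff, cost = fitness((0.0, 0.0, 0.0), classes)
```

A genetic algorithm such as NSGA-II would evolve a population of such chromosomes against these two objectives.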
31. Multi-objective Genetic Algorithm
Pareto Optimality: all solutionsthat are not dominated by anyother solutions form the Paretooptimal set.
Multiple otpimal solutions (models)
can be found
Cost
Effectiveness
The frontier allows to make a
well-informed decision that
balances the trade-offs
between the two objectives
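The Pareto-optimal set described above can be computed with a straightforward non-dominated filter. This is a minimal sketch for the two objectives (minimize cost, maximize effectiveness), with made-up model scores:

```python
def dominates(a, b):
    """a dominates b if a is no worse on both objectives and strictly
    better on at least one (cost is minimized, effectiveness maximized)."""
    cost_a, eff_a = a
    cost_b, eff_b = b
    return (cost_a <= cost_b and eff_a >= eff_b) and \
           (cost_a < cost_b or eff_a > eff_b)

def pareto_front(solutions):
    """All solutions not dominated by any other form the Pareto-optimal set."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Illustrative (cost, effectiveness) scores for candidate models.
models = [(100, 0.9), (300, 0.9), (50, 0.4), (60, 0.4), (200, 0.95)]
front = pareto_front(models)  # (300, 0.9) and (60, 0.4) are dominated
```

The engineer then picks a point on the front matching the inspection budget at hand.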
33.-36. Research Questions
RQ1: How does the multi-objective (MO) prediction perform, compared to single-objective (SO) prediction?
- Cross-project MO vs. cross-project SO vs. within-project SO
RQ2: How does the proposed approach perform, compared to the local prediction approach by Menzies et al.?
- Cross-project MO vs. local prediction
38.-40. Experiment Outline
- 10 Java projects from the PROMISE dataset: different sizes, different application contexts
- Cross-project defect prediction (RQ1): train the model on nine projects and test on the remaining one (10 times)
- Within-project defect prediction (RQ1): 10-fold cross-validation
- Local prediction (RQ2): k-means clustering algorithm, Silhouette coefficient
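The cross-project protocol above amounts to a leave-one-project-out loop. A minimal sketch with placeholder names (the actual PROMISE projects are not listed here):

```python
# Leave-one-project-out: train on nine projects, test on the remaining
# one, repeated once per project (10 runs in total).
projects = [f"project_{i}" for i in range(10)]  # placeholder names

def leave_one_project_out(projects):
    """Yield (training_projects, test_project) pairs, one per project."""
    for test in projects:
        training = [p for p in projects if p != test]
        yield training, test

runs = list(leave_one_project_out(projects))
# 10 runs, each training on 9 projects and testing on the held-out one
```

Within-project prediction instead splits a single project's classes into 10 folds, so the two settings differ only in where the training data comes from.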
43.-44. Cross-project MO vs. Cross-project SO
[Chart: inspection cost (KLOC, 0-300) for cross-project SO vs. cross-project MO across the projects.]
The proposed multi-objective model outperforms the single-objective one.
45.-48. Cross-project MO vs. Within-project SO
[Charts: inspection cost (KLOC, 0-350) and precision (0-100%) for within-project SO vs. cross-project MO.]
Cross-project prediction is worse than within-project prediction in terms of PRECISION, but it is better than within-project predictors in terms of COST-EFFECTIVENESS.
49.-50. Cross-project MO vs. Local Prediction
[Chart: inspection cost (KLOC, 0-300) for local prediction vs. cross-project MO.]
The multi-objective predictor outperforms the local predictor.