2. Software Defect Prediction
[Figure: labeled instances from Project A (metric values labeled buggy or clean) train a model, which then predicts the labels of unlabeled instances.]
Related Work
Munson@TSE`92, Basili@TSE`95, Menzies@TSE`07,
Hassan@ICSE`09, Bird@FSE`11, D’Ambros@EMSE`12,
Lee@FSE`11, ...
3. What if labeled instances do not exist?
[Figure: Project X provides only an unlabeled dataset (metric values without labels), so no prediction model can be trained.]
• New projects
• Projects lacking historical data
7. Key Idea
• Consistent defect-proneness tendency of metrics
– Defect prediction metrics measure the complexity of software and its development process.
• e.g.
– The number of developers touching a source code file (Bird@FSE`11)
– The number of methods in a class (D’Ambros@EMSE`12)
– The number of operands (Menzies@TSE`08)
More complexity implies more defect-proneness (Rahman@ICSE`13)
• Source and target metric distributions should be similar to build a strong prediction model.
Match source and target metrics that have similar distributions
9. Metric Selection
• Why? (Guyon@JMLR`03)
– Select informative metrics
• Remove redundant and irrelevant metrics
– Decrease the complexity of metric matching combinations
• Feature selection approaches (Gao@SPE`11, Shivaji@TSE`13); one of them is sketched below
– Gain Ratio
– Chi-square
– Relief-F
– Significance attribute evaluation
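A minimal sketch of the metric selection step, assuming the labeled source data is a NumPy array X (instances x metrics) with binary labels y. It uses scikit-learn's chi-square scorer; the 15% selection ratio and the function name select_metrics are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

def select_metrics(X, y, percentile=15):
    # chi2 requires non-negative feature values; size/complexity metrics usually are
    selector = SelectPercentile(chi2, percentile=percentile)
    X_selected = selector.fit_transform(X, y)
    kept = selector.get_support(indices=True)  # indices of the retained metrics
    return X_selected, kept
```

Gain Ratio, Relief-F, and significance attribute evaluation would plug in the same way by swapping the scoring function.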
10. Metric Matching
[Figure: source metrics X1, X2 are matched to target metrics Y1, Y2; e.g., X1-Y1 with matching score 0.8 and X2-Y2 with matching score 0.5.]
* We can apply different cutoff values for the matching score.
* It is possible that there is no matching at all.
11. Compute Matching Score
KSAnalyzer
• Use the p-value of the Kolmogorov-Smirnov test (Massey@JASA`51)
Matching score M of the i-th source metric and the j-th target metric:
M_ij = p_ij
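A minimal sketch of KSAnalyzer, assuming the source and target datasets are NumPy arrays of shape (instances, metrics). The matching score is the KS-test p-value; the greedy pairing below is a simplification of the actual matching procedure, and the names (matching_scores, match_metrics) and the 0.05 cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def matching_scores(source, target):
    """M[i, j] = KS-test p-value between source metric i and target metric j."""
    M = np.zeros((source.shape[1], target.shape[1]))
    for i in range(source.shape[1]):
        for j in range(target.shape[1]):
            M[i, j] = ks_2samp(source[:, i], target[:, j]).pvalue
    return M

def match_metrics(source, target, cutoff=0.05):
    """Greedily pair source/target metrics whose matching score exceeds the cutoff."""
    M = matching_scores(source, target)
    pairs, used_src, used_tgt = [], set(), set()
    # consider candidate pairs from highest to lowest matching score
    for i, j in sorted(np.ndindex(*M.shape), key=lambda ij: -M[ij]):
        if M[i, j] > cutoff and i not in used_src and j not in used_tgt:
            pairs.append((i, j, M[i, j]))
            used_src.add(i)
            used_tgt.add(j)
    return pairs  # may be empty, i.e., no matching at all
```

Raising the cutoff (e.g., to 0.90) keeps only strongly matched metric pairs, which is the setting varied in the cutoff results on a later slide.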
14. Baselines
• WPDP
• CPDP-CM (Turhan@EMSE`09, Ma@IST`12, He@IST`14)
– Cross-project defect prediction using only the common metrics between source and target datasets
• CPDP-IFS (He@CoRR`14)
– Cross-project defect prediction on an Imbalanced Feature Set (i.e., heterogeneous metric sets)
– 16 distributional characteristics of an instance's metric values as features (e.g., mean, std, maximum, ...)
15. Research Questions (RQs)
• RQ1
– Is heterogeneous defect prediction comparable to WPDP?
• RQ2
– Is heterogeneous defect prediction comparable to CPDP-CM?
• RQ3
– Is heterogeneous defect prediction comparable to CPDP-IFS?
16. Benchmark Datasets
Group     Dataset        Instances   Buggy (%)      # of metrics   Granularity
AEEEM     EQ             325         129 (39.7%)    61             Class
AEEEM     JDT            997         206 (20.7%)    61             Class
AEEEM     LC             399         64 (9.36%)     61             Class
AEEEM     ML             1862        245 (13.2%)    61             Class
AEEEM     PDE            1492        209 (14.0%)    61             Class
MORPH     ant-1.3        125         20 (16.0%)     20             Class
MORPH     arc            234         27 (11.5%)     20             Class
MORPH     camel-1.0      339         13 (3.8%)      20             Class
MORPH     poi-1.5        237         141 (59.5%)    20             Class
MORPH     redaktor       176         27 (15.3%)     20             Class
MORPH     skarbonka      45          9 (20.0%)      20             Class
MORPH     tomcat         858         77 (9.0%)      20             Class
MORPH     velocity-1.4   196         147 (75.0%)    20             Class
MORPH     xalan-2.4      723         110 (15.2%)    20             Class
MORPH     xerces-1.2     440         71 (16.1%)     20             Class
ReLink    Apache         194         98 (50.5%)     26             File
ReLink    Safe           56          22 (39.3%)     26             File
ReLink    ZXing          399         118 (29.6%)    26             File
NASA      cm1            327         42 (12.8%)     37             Function
NASA      mw1            253         27 (10.7%)     37             Function
NASA      pc1            705         61 (8.7%)      37             Function
NASA      pc3            1077        134 (12.4%)    37             Function
NASA      pc4            1458        178 (12.2%)    37             Function
SOFTLAB   ar1            121         9 (7.4%)       29             Function
SOFTLAB   ar3            63          8 (12.7%)      29             Function
SOFTLAB   ar4            107         20 (18.7%)     29             Function
SOFTLAB   ar5            36          8 (22.2%)      29             Function
SOFTLAB   ar6            101         15 (14.9%)     29             Function
600 prediction combinations in total!
17. Experimental Settings
• Logistic Regression
• HDP vs. WPDP, CPDP-CM, and CPDP-IFS (a sketch of the protocol follows)
[Figure: the target project (Project A) is randomly split into a training set (50%) and a test set (50%), repeated 1000 times. WPDP trains on Project A's training split; CPDP-CM, CPDP-IFS, and HDP train on the other projects (Project 1 ... Project n). All models are evaluated on Project A's test split.]
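A minimal sketch of the WPDP arm of this protocol, assuming the target project's metrics X and labels y are NumPy arrays. The use of scikit-learn, the function name wpdp_median_auc, and seeding each repetition by its index are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def wpdp_median_auc(X, y, repeats=1000):
    """Median AUC over `repeats` random 50%/50% training/test splits of one project."""
    aucs = []
    for seed in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed, stratify=y)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return float(np.median(aucs))
```

For the CPDP-CM, CPDP-IFS, and HDP arms, the model would instead be trained on (transformed) source-project data and evaluated on the same held-out 50% test split of the target project.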
25. Different Feature Selections (median AUCs, Win/Tie/Loss)
Approach       WPDP AUC   Win%    CPDP-CM AUC   Win%    CPDP-IFS AUC   Win%    HDP AUC
Gain Ratio     0.657      63.7%   0.645         63.2%   0.536          80.2%   0.720
Chi-Square     0.657      64.7%   0.651         66.4%   0.556          82.3%   0.727
Significance   0.657      66.2%   0.636         66.2%   0.553          82.0%   0.724
Relief-F       0.670      57.0%   0.657         63.1%   0.543          80.5%   0.709
None           0.657      47.3%   0.624         50.3%   0.536          66.3%   0.663
26. Results in Different Cutoffs
Cutoff   WPDP AUC   Win%    CPDP-CM AUC   Win%    CPDP-IFS AUC   Win%    HDP AUC   Target Coverage
0.05     0.657      66.2%   0.636         66.2%   0.553          82.4%   0.724*    100%
0.90     0.657      100%    0.761         71.4%   0.624          100%    0.852*    21%
27. Conclusion
• HDP
– Potential for CPDP across datasets with different metric sets
• Future work
– Filtering out noisy metric matches
– Determining the best probability threshold
Here is Project A with some software entities. Let's say these entities are source code files.
I want to predict whether these files are buggy or clean.
To do this, we need a prediction model.
Since defect prediction models are trained by machine learning algorithms, we need labeled instances collected from previous releases.
This is a labeled instance. An instance consists of features and a label.
Various software metrics, such as LOC, # of functions in a file, and # of authors touching a source file, are used as features for machine learning.
Software metrics measure the complexity of software and its development process.
Each instance can be labeled by past bug information.
Software metrics and past bug information can be collected from software archives such as version control systems and bug report systems.
With these labeled instances, we can build a prediction model and predict the unlabeled instances.
This prediction is conducted within the same project, so we call this within-project defect prediction (WPDP).
There are many studies on WPDP, and they have shown good prediction performance (e.g., prediction accuracy around 0.7).
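To make the WPDP setup above concrete, here is a minimal sketch: software metrics are the features, past bug information is the label, and a model trained on the labeled instances predicts the unlabeled ones. The metric names and values are made up for illustration; logistic regression is used because it is the classifier named in the experimental settings.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

labeled = pd.DataFrame({
    "loc": [120, 45, 300, 15],        # lines of code
    "num_functions": [10, 3, 25, 2],  # number of functions in the file
    "num_authors": [4, 1, 7, 1],      # developers touching the file
    "buggy": [1, 0, 1, 0],            # label from past bug information
})
unlabeled = pd.DataFrame({
    "loc": [200, 30],
    "num_functions": [18, 2],
    "num_authors": [5, 1],
})

model = LogisticRegression().fit(labeled.drop(columns="buggy"), labeled["buggy"])
print(model.predict(unlabeled))  # predicted labels: 1 = buggy, 0 = clean
```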
What if there are no labeled instances? This can happen in new projects and in projects lacking historical data.
New projects do not have past defect information to label instances.
Some projects also do not have defect information because they lack historical data in their software archives.
When I participated in an industrial project for Samsung Electronics, it was really difficult to generate labeled instances because their software archives were not well managed by developers.
So, in some real industrial projects, we may not be able to generate labeled instances to build a prediction model.
Without labeled instances, we cannot build a prediction model.
After experiencing this limitation in industry, I decided to address this problem.
There are existing solutions to build a prediction model for unlabeled datasets.
The first solution is cross-project defect prediction. We can reuse labeled instances from other projects.
Various feature selection approaches can be applied.
By doing that, we can investigate how higher matching scores can impact defect prediction performance.
16 distributional characteristics: mode, median, mean, harmonic mean, minimum, maximum, range, variation ratio, first quartile, third quartile, interquartile range, variance, standard deviation, coefficient of variance, skewness, and kurtosis.
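A minimal sketch of the CPDP-IFS instance representation listed above, assuming each instance is a vector of its metric values. The original authors' exact definitions (e.g., for variation ratio) are not given here, so the formulas below are reasonable assumptions; the small epsilon terms guard against zeros.

```python
import numpy as np
from scipy import stats

def distributional_features(row, eps=1e-9):
    """Map one instance (its vector of metric values) to 16 distributional characteristics."""
    row = np.asarray(row, dtype=float)
    vals, counts = np.unique(row, return_counts=True)
    mode = vals[np.argmax(counts)]
    q1, med, q3 = np.percentile(row, [25, 50, 75])
    return np.array([
        mode,                               # mode
        med,                                # median
        row.mean(),                         # mean
        stats.hmean(np.abs(row) + eps),     # harmonic mean (assumes positive values)
        row.min(),                          # minimum
        row.max(),                          # maximum
        row.max() - row.min(),              # range
        1.0 - counts.max() / row.size,      # variation ratio (assumed: 1 - mode frequency)
        q1,                                 # first quartile
        q3,                                 # third quartile
        q3 - q1,                            # interquartile range
        row.var(),                          # variance
        row.std(),                          # standard deviation
        row.std() / (row.mean() + eps),     # coefficient of variance
        stats.skew(row),                    # skewness
        stats.kurtosis(row),                # kurtosis
    ])
```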
AEEEM: object-oriented (OO) metrics, previous-defect metrics, entropy metrics of change and code, and churn-of-source-code metrics [4].
MORPH: McCabe’s cyclomatic metrics, CK metrics, and other OO metrics [36].
ReLink: code complexity metrics
NASA: Halstead metrics and McCabe's cyclomatic metrics, plus additional complexity metrics such as parameter count and percentage of comments.
SOFTLAB: Halstead metrics and McCabe’s cyclomatic metrics
all 222 prediction combinations among 600 predictions