Software analytics (for software quality purposes) is a statistical or machine-learning classifier that is trained to identify defect-prone software modules. The goal of software analytics is to help software engineers prioritize their software testing effort on the riskiest modules and understand the past pitfalls that lead to defective code. While the adoption of software analytics enables software organizations to distill actionable insights, there are still many barriers to the broad and successful adoption of such analytics systems. Indeed, even when software organizations can access invaluable software artifacts and toolkits for data analytics, researchers and practitioners often have little knowledge of how to develop analytics systems properly. Thus, the accuracy of the predictions and insights that are derived from analytics systems is one of the most important challenges of data science in software engineering.
In this work, we conduct a series of empirical investigations to better understand the impact of experimental components (i.e., class mislabelling, parameter optimization of classification techniques, and model validation techniques) on the performance and interpretation of software analytics. To accelerate a large number of compute-intensive experiments, we leverage the High-Performance Computing (HPC) resources of the Centre for Advanced Computing (CAC) at Queen's University, Canada. Through case studies of systems that span both proprietary and open-source domains, we demonstrate that (1) realistic noise does not impact the precision of software analytics; (2) automated parameter optimization for classification techniques substantially improves the performance and stability of software analytics; and (3) the out-of-sample bootstrap validation technique produces a good balance between the bias and variance of performance estimates. Our results lead us to conclude that the experimental components of analytics modelling impact the predictions and the associated insights that are derived from software analytics. Empirical investigations of the impact of overlooked experimental components are needed to derive practical guidelines for analytics modelling.
10. Software defects are prevalent in today's software development life cycle
The software team delivers a software product to customers; the customers then encounter defects: crashes, errors, freezes, and software that stops working.
13. Questions arise during a team meeting
Tester: Where are the top-ten risky software modules?
Developer: Why do defects happen?
Manager: When will we be ready to ship the next release?
14. Software analytics enables software practitioners to make informed and empirically-supported decisions
Goals of software analytics:
• Improve software quality
• Improve developer productivity
• Improve user experience
17. Two major innovations of software analytics for software quality purposes:
• Predict which modules are risky
• Understand what makes software fail, which informs the quality improvement plan
20. Software analytics is used to predict software modules that are likely to be defective in the future
[Diagram: in the pre-release period, a software quality analytics model classifies Modules A–D as clean or defect-prone; the post-release period reveals which modules actually turned out to be clean or defect-prone.]
Lewis et al., ICSE'13; Mockus et al., BLTJ'00; Ostrand et al., TSE'05; Kim et al., FSE'15; Nagappan et al., ICSE'06; Zimmermann et al., FSE'09; Caglayan et al., ICSE'15; Tan et al., ICSE'15; Shimagaki et al., ICSE'16
21. Software analytics is used to understand software defect characteristics
• Code properties (lines of code, complexity): larger and more complex files are more defect-prone
• Developer (author experience): files that are written by experienced developers are less defect-prone
• Development process (#prior defects, #commits, #churn): files that were changed many times are more defect-prone
• Organization (#authors, #major authors): files that are changed by many authors are more defect-prone
22. A big picture of software analytics modelling
Software repositories → dataset (data preparation) → analytics model (model construction) → accuracy (model validation)
23. In reality, software analytics modelling is detailed and complex
[Pipeline diagram: (1) data collection from a software repository into a software dataset; (2) data cleaning and filtration into a clean dataset; (3) metrics extraction and normalization into the studied dataset (outcome, studied metrics, control metrics); (4) descriptive analytics relating each metric (X) to the outcome (Y), with positive or negative relationships; data sampling into a training corpus and a testing corpus; (7) model construction of a statistical model with classifier parameters; (8) model validation with performance measures, producing predictions and performance estimates; (9) model analysis and interpretation, producing importance scores and patterns that feed predictive and prescriptive analytics.]
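A minimal end-to-end sketch of this pipeline in R; the synthetic metrics table and the random-forest classifier are illustrative assumptions that stand in for the mined repository data:

```r
library(randomForest)

# (1)-(3) Data collection, cleaning, and metrics extraction, stood in for
# by a small synthetic table of per-module metrics.
set.seed(1)
n <- 500
dataset <- data.frame(loc = rpois(n, 300), churn = rpois(n, 25),
                      n_authors = rpois(n, 3))
dataset$defective <- factor(
  ifelse(dataset$churn + 3 * dataset$n_authors + rnorm(n, 0, 5) > 35,
         "buggy", "clean"))

# (4) Descriptive analytics: how does each metric relate to the outcome?
aggregate(cbind(loc, churn, n_authors) ~ defective, data = dataset,
          FUN = median)

# Data sampling into a training corpus and a testing corpus.
train <- sample(n, 0.7 * n)

# (7) Model construction.
fit <- randomForest(defective ~ ., data = dataset[train, ])

# (8) Model validation: a performance estimate on the testing corpus.
pred <- predict(fit, dataset[-train, ])
mean(pred == dataset$defective[-train])  # accuracy

# (9) Model analysis and interpretation: importance scores.
importance(fit)
```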
24. While today's research toolkits (e.g., R, Weka) are easily accessible, they come with risks
https://en.wikipedia.org/wiki/All_models_are_wrong
26. Such challenges ultimately have a negative impact on developers, managers, and software companies
Wrong predictions and misleading insights mean that testers waste time and resources, developers initiate the wrong plan for quality improvement, and the operating cost of the software company grows.
27. (1) What is the best pattern for software analytics modelling? (2) What is the impact of experimental factors on its accuracy?
28. There are various experimental factors involved in software analytics modelling
• Data preparation: metrics collection, defect labelling
• Model construction: classification technique, classifier parameters
• Model validation: validation techniques, performance measures
29. Leveraging HPC Resources to Improve the Experimental Design of Software Analytics
1. Experimental Factors Analysis (TSE'16): experimental factors
2. Noise in Defect Datasets (ICSE'15): defect labelling
3. Parameters Optimization (ICSE'16): classifier parameters
4. Model Validation Techniques (TSE'17): validation techniques
30. Which factors (i.e., researchers or experimental components) have the largest impact on the accuracy of software analytics?
We extract the experimental factors (dataset family, metric family, classifier family, research group) and the reported accuracy from 42 defect prediction studies.
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: Comments on "Researcher Bias: The Use of Machine Learning in Software Defect Prediction". IEEE Trans. Software Eng. 42(11): 1092-1094 (2016)
31. Using an ANOVA analysis to investigate the relationship between the reported accuracy and the experimental factors
Studied factors: dataset family, metric family, classifier family, and research group. Outcome: reported accuracy. The ANOVA analysis quantifies the impact of each factor on the outcome.
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: Comments on "Researcher Bias: The Use of Machine Learning in Software Defect Prediction". IEEE Trans. Software Eng. 42(11): 1092-1094 (2016)
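A minimal sketch of this kind of ANOVA in R, assuming a hypothetical data frame in which each row is one result reported by a surveyed study; the column names and values here are illustrative placeholders, not the paper's actual data:

```r
set.seed(42)

# Hypothetical survey data: one row per reported study result.
studies <- expand.grid(
  dataset_family    = c("NASA", "Eclipse", "Proprietary"),
  metric_family     = c("code", "process", "organizational"),
  classifier_family = c("statistical", "tree", "ensemble"),
  research_group    = c("G1", "G2", "G3")
)
studies$reported_accuracy <- runif(nrow(studies), 0.6, 0.9)

# ANOVA: partition the variation in reported accuracy across the factors.
fit <- aov(reported_accuracy ~ dataset_family + metric_family +
             classifier_family + research_group, data = studies)
summary(fit)

# Share of the total sum of squares per factor, i.e., the "influence (%)"
# style of result shown on the next slide.
ss <- summary(fit)[[1]][["Sum Sq"]]
round(100 * ss / sum(ss), 1)
```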
32. Experimental design matters more than who performs the research
[Influence (%) on the reported accuracy: metric family 23%, research group 13%, classifier family 13%.]
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: Comments on "Researcher Bias: The Use of Machine Learning in Software Defect Prediction". IEEE Trans. Software Eng. 42(11): 1092-1094 (2016)
33. Leveraging HPC Resources to Improve the Experimental Design of Software Analytics (recap)
1. Experimental Factors Analysis (TSE'16), experimental factors: experimental design matters more than who performs the research.
2. Noise in Defect Datasets (ICSE'15): defect labelling
3. Parameters Optimization (ICSE'16): classifier parameters
4. Model Validation Techniques (TSE'17): validation techniques
36. The accuracy of software analytics depends on the quality of the data from which it was trained
A noisy dataset yields an analytics model that produces inaccurate predictions and inaccurate insights. Such inaccurate predictions and insights could lead to missteps in practice.
37. Investigating the impact of realistic noise on the accuracy and interpretation of software analytics
Control group: build a model from the clean samples of a clean dataset and compute its accuracy. Treatment group: generate noise to obtain realistic noisy samples, build a model from them, and compute its accuracy.
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto: The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models. ICSE (1) 2015: 812-823
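A minimal sketch of this control/treatment design in R. The synthetic dataset, the 20% label-flip rate, and the choice of random forest are illustrative assumptions, not the paper's actual noise-generation procedure:

```r
library(randomForest)

set.seed(1)
n <- 1000
dataset <- data.frame(loc = rpois(n, 200), churn = rpois(n, 20))
dataset$defective <- factor(
  ifelse(dataset$loc + 5 * dataset$churn + rnorm(n, 0, 50) > 320,
         "buggy", "clean"))

# Treatment group: flip a fraction of labels to mimic mislabelling noise.
noisy <- dataset
flip  <- sample(n, size = 0.2 * n)  # assumed 20% noise rate
noisy$defective[flip] <- ifelse(noisy$defective[flip] == "buggy",
                                "clean", "buggy")

train <- sample(n, 0.7 * n)
fit_clean <- randomForest(defective ~ ., data = dataset[train, ])
fit_noisy <- randomForest(defective ~ ., data = noisy[train, ])

# Evaluate both models against the same clean test labels.
truth    <- dataset$defective[-train]
measures <- function(fit) {
  pred <- predict(fit, dataset[-train, ])
  c(precision = mean(truth[pred == "buggy"] == "buggy"),
    recall    = mean(pred[truth == "buggy"] == "buggy"))
}
measures(fit_clean)
measures(fit_noisy)  # the ratio of the two quantifies the impact of noise
```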
40. Leveraging HPC resources to improve the experimental design of software analytics
18 datasets × 26 classification techniques × 1,000 validation samples × 2 settings (control and treatment) ≈ 1 million analytics models (18 × 26 × 1,000 × 2 = 936,000). If one model requires 1 minute of computation time, the experiment needs about 2 years to finish (936,000 minutes ≈ 650 days). An HPC cluster of 40 CPUs accelerates the experiment from about 2 years to about 17 days.
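A minimal sketch of how such an experimental grid can be spread across the cores of one HPC node in R with the base parallel package; the worker function is a placeholder, and mclapply assumes a Unix-alike system (a scheduler such as SLURM would distribute chunks of the same grid across nodes):

```r
library(parallel)

# One row per analytics model to train and validate.
grid <- expand.grid(dataset    = paste0("dataset", 1:18),
                    classifier = paste0("classifier", 1:26),
                    sample     = 1:1000,
                    setting    = c("control", "treatment"))
nrow(grid)  # 936,000 models, i.e., roughly one million

run_one <- function(i) {
  config <- grid[i, ]
  # ... train and validate one model for this configuration
  #     (about one minute of real computation each) ...
  config$setting  # placeholder result
}

# Spread the grid across 40 cores.
results <- mclapply(seq_len(nrow(grid)), run_one, mc.cores = 40)
```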
43. While the recall is often impacted, the precision is rarely impacted
[Boxplots of the ratio (realistic noisy / clean) of precision and recall; a ratio of 1 means there is no impact.]
Precision is rarely impacted by realistic noise, but models trained on noisy data achieve only 56% of the recall of models trained on clean data.
44. Leveraging HPC Resources to Improve the Experimental Design of Software Analytics (recap)
1. Experimental Factors Analysis (TSE'16), experimental factors: experimental design matters more than who performs the research.
2. Noise in Defect Datasets (ICSE'15), defect labelling: researchers can rely on the accuracy of modules labelled as defective by analytics models that are trained using such noisy data.
3. Parameters Optimization (ICSE'16): classifier parameters
4. Model Validation Techniques (TSE'17): validation techniques
46. Software analytics is trained using classification techniques
Model construction: a classification technique learns from a defect dataset to produce an analytics model.
49. Such classification techniques often require parameter settings
26 of the 30 most commonly used classification techniques require at least one parameter setting; during model construction, these classifier parameters configure the classification technique that is applied to the defect dataset.
50. Different toolkits have different default settings for the same classification technique
For example, the randomForest and bigrf R packages ship different default settings for the number of trees in a random forest (the slide contrasts candidate defaults of 10, 50, 100, and 500 trees).
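A minimal sketch in R of making such a parameter explicit; ntree is the randomForest package's real argument for the number of trees (its default is 500), and the iris data is a stand-in for a defect dataset:

```r
library(randomForest)
data(iris)  # stand-in dataset for illustration

# Implicit default: the number of trees is whatever the toolkit chose.
fit_default  <- randomForest(Species ~ ., data = iris)

# Explicit setting: the choice is visible and reproducible, and does not
# silently change when switching to a toolkit with a different default.
fit_explicit <- randomForest(Species ~ ., data = iris, ntree = 100)

fit_default$ntree   # 500 (the randomForest package default)
fit_explicit$ntree  # 100
```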
51. Investigating the impact of automated parameter optimization on the accuracy of software analytics
Control group: build a model from the defect dataset with the default setting and compute its accuracy. Treatment group: apply automated parameter optimization to find an optimal setting, build a model with it, and compute its accuracy.
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: Automated parameter optimization of classification techniques for defect prediction models. ICSE 2016: 321-332
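A minimal sketch of automated parameter optimization in R using the caret package's grid search; the iris data and the tuning budget are illustrative assumptions, not the paper's actual setup:

```r
library(caret)
data(iris)  # stand-in for a defect dataset

# Evaluate each candidate parameter setting with repeated cross-validation.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Grid search over candidate settings of a random forest; tuneLength
# controls how many candidate values per tunable parameter are tried.
fit <- train(Species ~ ., data = iris,
             method     = "rf",
             trControl  = ctrl,
             tuneLength = 5)

fit$bestTune  # the optimal setting found by the search
```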
52–53. Automated parameter optimization can substantially improve the accuracy of software analytics
[Boxplots of the AUC performance improvement (roughly 0.0–0.4) per classification technique, grouped into large, medium, and small improvements; each boxplot presents the performance improvement across all 18 studied datasets. The techniques include C5.0, AdaBoost, AVNNet, CART, PCANNet, NNet, FDA, MLPWeightDecay, MLP, LMT, GPLS, LogitBoost, KNN, xGBTree, GBM, NB, RBF, SVMRadial, and GAM.]
9 of the 26 studied classification techniques have a large performance improvement.
54. Leveraging HPC Resources to Improve the Experimental Design of Software Analytics (recap)
1. Experimental Factors Analysis (TSE'16), experimental factors: experimental design matters more than who performs the research.
2. Noise in Defect Datasets (ICSE'15), defect labelling: researchers can rely on the accuracy of modules labelled as defective by analytics models that are trained using such noisy data.
3. Parameters Optimization (ICSE'16), classifier parameters: researchers should apply automated parameter optimization in order to improve the performance and reliability of software analytics.
4. Model Validation Techniques (TSE'17): validation techniques
58. Estimating model accuracy requires the use of Model Validation Techniques (MVTs)
Model validation splits a defect dataset into a training corpus, which is used to build the analytics model, and a testing corpus, which is used to compute the performance estimates.
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: An Empirical Comparison of Model Validation Techniques for Defect Prediction Models. IEEE Trans. Software Eng. 43(1): 1-18 (2017)
59. Prior studies apply the 12 most commonly used model validation techniques, which come from 3 families
• Holdout validation (a single training/testing split, e.g., 70%/30%): 50% holdout, 70% holdout, repeated 50% holdout, repeated 70% holdout
• k-fold cross-validation (train on k−1 folds, test on the remaining fold, repeated k times): leave-one-out CV, 2-fold CV, 10-fold CV, repeated 10-fold CV
• Bootstrap validation (train on a bootstrap sample, test, and repeat N times; the out-of-sample variant tests on the observations left out of the bootstrap sample): ordinary bootstrap, optimism-reduced bootstrap, out-of-sample bootstrap, .632 bootstrap
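A minimal sketch of the out-of-sample bootstrap in R; the iris data and the random-forest classifier are stand-ins, and the essential idea is to train on a bootstrap sample and test on the rows that the sample left out:

```r
library(randomForest)
data(iris)  # stand-in for a defect dataset

set.seed(1)
n_boot <- 100
estimates <- replicate(n_boot, {
  # Draw a bootstrap sample (with replacement) for training.
  idx <- sample(nrow(iris), replace = TRUE)
  # The out-of-sample rows are those never drawn into the sample.
  oob <- setdiff(seq_len(nrow(iris)), idx)
  fit <- randomForest(Species ~ ., data = iris[idx, ])
  pred <- predict(fit, iris[oob, ])
  mean(pred == iris$Species[oob])  # performance on unseen rows
})
mean(estimates)  # the estimate, averaged over the bootstrap repetitions
```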
61. Model validation techniques may produce different performance estimates
Constructing and evaluating a model on the same defect dataset using the ordinary bootstrap yields AUC = 0.73, while using 50% holdout validation yields AUC = 0.58. It is not clear which model validation techniques provide the most accurate performance estimates.
67. Examining the bias and variance of the performance estimates that are produced by model validation techniques (MVTs)
Bias measures the difference between the performance estimates and the ground truth. Variance measures the variation of the performance estimates when an experiment is repeated.
[Design: the defect dataset is split into a sample dataset and an unseen dataset. Each MVT trains and tests an analytics model within the sample dataset to produce performance estimates; a model trained on the sample dataset and tested on the unseen dataset yields the ground-truth performance. Comparing the two gives the bias and the variance.]
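A minimal sketch, in R, of how the bias and variance are computed once an MVT's repeated performance estimates and the ground-truth performance are in hand; the numbers below are placeholders:

```r
# Placeholder values: estimates from 10 repetitions of one MVT, plus the
# ground-truth performance of the same model on the unseen dataset.
estimates    <- c(0.71, 0.74, 0.69, 0.73, 0.75, 0.70, 0.72, 0.74, 0.68, 0.73)
ground_truth <- 0.72

bias     <- mean(estimates - ground_truth)  # accuracy of the estimates
variance <- var(estimates)                  # stability across repetitions
c(bias = bias, variance = variance)
```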
68. The bias and variance of the performance estimates that are produced by MVTs are statistically different
[Scatter plot of the mean rank of bias (x-axis) versus the mean rank of variance (y-axis) for the 12 MVTs, grouped by family (bootstrap, cross-validation, holdout): Holdout 0.5, Holdout 0.7, Rep. Holdout 0.5/0.7, 2-fold and 10-fold CV, Rep. 10-fold CV, ordinary bootstrap, optimism-reduced bootstrap, out-of-sample bootstrap, and .632 bootstrap.]
72. The single-repetition holdout family produces the least accurate and stable performance estimates
[Same scatter plot as the previous slide, with two callouts: single-repetition holdout validation produces the least accurate and stable performance estimates, while out-of-sample bootstrap validation produces the most accurate and stable performance estimates.]
Out-of-sample bootstrap should be used in future defect prediction studies.
73. Leveraging HPC Resources to Improve the Experimental Design of Software Analytics (summary)
1. Experimental Factors Analysis (TSE'16), experimental factors: experimental design matters more than who performs the research.
2. Noise in Defect Datasets (ICSE'15), defect labelling: researchers can rely on the accuracy of modules labelled as defective by analytics models that are trained using such noisy data.
3. Parameters Optimization (ICSE'16), classifier parameters: researchers should apply automated parameter optimization in order to improve the performance and reliability of software analytics.
4. Model Validation Techniques (TSE'17), validation techniques: researchers should avoid the single-repetition holdout validation and instead opt to use the out-of-sample bootstrap model validation technique.
77. Future Research Agenda of Software Analytics Research
(1: Education) Lack of modelling skills: practitioners often treat research toolkits as a black box. Integrate data science courses into the computer science curriculum.
(2: Management) Using inappropriate techniques: practitioners often borrow techniques from other domains that may not work best in our domain. Use experimental science to select the most appropriate technique.
(3: Monitoring) Outdated training data and statistical models: practitioners rarely check whether their training data and models are up to date. Leverage HPC resources to develop real-time software analytics.