This video contains the presentation at TCCE 2020 by Sayed Mohsin Reza on his paper titled "Performance Analysis of Machine Learning Approaches in Software Complexity Prediction".
Keywords: Software Complexity, Software Quality, Machine Learning, Software Design, Software Reliability, etc.
Authors:
1. Sayed Mohsin Reza, Ph.D. Student, University of Texas
2. Mahfujur Rahman, Lecturer, Daffodil International University
3. Hasnat Parvez, Student, Jahangirnagar University
4. Omar Badreddin, Professor, University of Texas
5. Shamim Al Mamun, Professor, Jahangirnagar University
Abstract: Software design is one of the core concepts in software engineering. It covers insights and intuitions about software evolution, reliability, and maintainability. Effective software design facilitates software reliability and better quality management during development, which reduces software development cost. It is therefore necessary to detect and address design issues early. Class complexity is one indicator of software quality. The objective of this paper is to predict class complexity from source code metrics using Machine Learning (ML) approaches and to compare the performance of those approaches. To do so, we collect ten popular, well-maintained open-source repositories and extract 18 source code metrics related to complexity for class-level analysis. First, we apply statistical correlation to identify the source code metrics that have the greatest impact on class complexity. Second, we apply five alternative ML techniques to build complexity predictors and compare their performance. The results show that the following source code metrics have the greatest impact on class complexity: Depth of Inheritance Tree (DIT), Response For Class (RFC), Weighted Method Count (WMC), Lines of Code (LOC), and Coupling Between Objects (CBO). We also evaluate the performance of the techniques, and the results show that Random Forest (RF) significantly improves accuracy without introducing additional false negatives or false positives, which act as false alarms in complexity prediction.
Performance Analysis of Machine Learning Approaches in Software Complexity Prediction, by Sayed Mohsin Reza at the TCCE 2020 conference
1. 2nd International Conference on Trends in Computational and Cognitive Engineering (TCCE)
Performance Analysis of Machine Learning Approaches in Software Complexity Prediction
Sayed Reza¹, Mahfujur Rahman², Hasnat Parvez³, Omar Badreddin¹, and Shamim Al Mamun³
¹ University of Texas, ² Daffodil International University, and ³ Jahangirnagar University
Paper ID: 410
2. Introduction
• Software complexity is an undesirable characteristic of software
• Increasing complexity reduces maintainability and sustainability
• Class-level complexity
• Method-level complexity
• Complexity can be affected by many factors related to code structures, object-oriented properties, and source code metrics
• Machine learning techniques can automate the detection of class complexity, replacing manual processes and hand-written code rules
3. Research Objectives
• Use machine learning techniques to build complexity classifiers
• Machine learning is used to replace the manual processes and code rules otherwise needed to detect class complexity
• Compare the performance of the ML classifiers
• Report the best technique based on performance metrics
4. Motivation
• Early detection of software complexity enables better software maintenance
• Effective software maintenance leads to better quality over time
• And high-quality software helps to:
• Enhance future software maintainability
• Ensure sustainable software over time
• Minimize software development effort over time
• Reduce software development costs
5. Research Questions & Study Design
• RQ1: How are source code metrics correlated with the quality attribute class complexity?
• This question reveals the relationships between complexity and source code metrics
• RQ2: How accurately can machine learning approaches predict class complexity from source code metrics?
• This question aims to determine the accuracy of machine learning approaches in class-level complexity detection
Figure: Study Design (Dataset Collection → Dataset Preparation → Correlation Analysis (RQ1) → Training → Performance Evaluation (RQ2) → Report Best Technique)
6. Dataset Collection
• A dataset for complexity prediction needs a diverse set of repositories
• We search code repositories using the ModelMine tool [1] with the following criteria:
• a repository whose primary language is Java
• a minimum of 5000 commits (proxy for maintenance)
• at least 100 active contributors
• a minimum of 3000 stars and 500 forks (proxy for popularity)
• In total, 10 repositories containing 38,778 classes are selected
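The selection criteria above amount to a simple filter over repository metadata. The following is a minimal sketch, assuming each repository is described by a record with hypothetical field names (`language`, `commits`, `contributors`, `stars`, `forks`). It does not reproduce ModelMine's actual interface, and the sample records are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical metadata record; field names are assumptions,
    # not ModelMine's real schema.
    name: str
    language: str
    commits: int
    contributors: int
    stars: int
    forks: int


def meets_criteria(repo: Repo) -> bool:
    """Apply the selection thresholds listed on the slide."""
    return (
        repo.language == "Java"
        and repo.commits >= 5000       # proxy for maintenance
        and repo.contributors >= 100   # active contributors
        and repo.stars >= 3000         # proxy for popularity
        and repo.forks >= 500
    )


# Invented sample records, for illustration only.
candidates = [
    Repo("repo-a", "Java", 12000, 250, 9000, 2100),
    Repo("repo-b", "Java", 800, 40, 15000, 3000),       # too few commits
    Repo("repo-c", "Python", 20000, 500, 30000, 8000),  # wrong language
]

selected = [r.name for r in candidates if meets_criteria(r)]
print(selected)
```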
[1] Sayed Mohsin Reza, Omar Badreddin, and Khandoker Rahad. ModelMine: A tool to facilitate mining models from open-source repositories. In 2020 ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems (MODELS). ACM, 2020.
Figure: Class distribution among repositories
7. Dataset Collection (Continued)
• Input variables: Extract 18 unique source code metrics from each class in the code repositories using a static analyzer tool
• Target variable: Extract the current complexity of each class in the code repositories using the CODEMR tool [2]
• The variables are then joined on class name to create a dataset for the complexity classifier
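Joining the input and target variables on class name can be sketched as an inner join of two mappings. This is an illustrative sketch only: the class names, metric values, and the "high"/"low" label encoding are invented, and the real tool outputs may be structured differently.

```python
# Hypothetical extractor outputs, keyed by fully qualified class name.
# Metric values and complexity labels are invented for illustration.
metrics_by_class = {
    "org.example.Parser": {"WMC": 42, "RFC": 87, "CBO": 12},
    "org.example.Token": {"WMC": 5, "RFC": 9, "CBO": 2},
    "org.example.Orphan": {"WMC": 7, "RFC": 11, "CBO": 1},  # no label available
}
complexity_by_class = {
    "org.example.Parser": "high",
    "org.example.Token": "low",
}


def join_on_class_name(metrics, labels):
    """Inner-join the two mappings; classes missing from either side are dropped."""
    rows = []
    for class_name in sorted(metrics.keys() & labels.keys()):
        row = {
            "class": class_name,
            **metrics[class_name],
            "complexity": labels[class_name],
        }
        rows.append(row)
    return rows


dataset = join_on_class_name(metrics_by_class, complexity_by_class)
for row in dataset:
    print(row)
```

An inner join is one possible choice here. It silently drops classes that appear in only one tool's output, so in practice it is worth counting how many rows are lost.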
[2] Asma Shaheen, Usman Qamar, Aiman Nazir, Raheela Bibi, Munazza Ansar, and Iqra Zafar. OOCQM: Object oriented code quality meter. In International Conference on Computational Science/Intelligence & Applied Informatics, pages 149–163. Springer, 2019.
Table: Source Code Metrics
… … …
8. Dataset Preparation
• Remove duplicate observations
• Identify and remove outliers that would bias the data
• Perform exploratory data analysis with visualizations of the input and target variables
• Split the dataset into training (80%) and testing (20%) sets
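The preparation steps above can be sketched with the standard library alone. The slides do not say which outlier criterion was used, so the 1.5 × IQR rule below is an assumption, as are the row layout `(class_name, loc, label)` and the sample data.

```python
import random
import statistics


def prepare(rows, value_index, seed=0):
    """Deduplicate, drop outliers on one metric, and split 80/20."""
    # 1. Remove exact duplicate observations while preserving order.
    unique = list(dict.fromkeys(rows))

    # 2. Drop outliers with the 1.5 * IQR rule (an assumed criterion).
    values = [r[value_index] for r in unique]
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    kept = [r for r in unique if q1 - spread <= r[value_index] <= q3 + spread]

    # 3. Shuffle reproducibly, then split into 80% training and 20% testing.
    random.Random(seed).shuffle(kept)
    cut = len(kept) * 8 // 10
    return kept[:cut], kept[cut:]


# Invented rows: (class_name, lines_of_code, complexity_label).
rows = [(f"C{i}", 10 + i, "low") for i in range(20)]
rows.append(("C0", 10, "low"))       # exact duplicate of the first row
rows.append(("Huge", 5000, "high"))  # extreme value, removed as an outlier

train, test = prepare(rows, value_index=1)
print(len(train), len(test))
```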
Figure: Relationship of some input variables with the target variable
9. Correlation Results
• RQ1: How are source code metrics correlated with the quality attribute class complexity?
• The results of the Pearson correlation reveal the impact of source code metrics on complexity.
• The following source code metrics have a moderately high impact on complexity: DIT, SRFC, RFC, WMC, CMLOC, and CBO ***
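For reference, the Pearson coefficient used in this step can be computed as below. This is a minimal sketch: the metric values are invented, and encoding complexity as an ordinal number (1 = low, 3 = high) is an assumption rather than the paper's stated encoding.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Invented per-class metric values and an assumed ordinal complexity encoding.
metrics = {
    "WMC": [3, 8, 15, 22, 40, 55],
    "CBO": [1, 4, 3, 9, 12, 10],
    "DIT": [1, 3, 2, 1, 3, 2],
}
complexity = [1, 1, 2, 2, 3, 3]

# Rank metrics by the strength (absolute value) of their correlation.
ranking = sorted(
    ((name, pearson(values, complexity)) for name, values in metrics.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```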
Figure: Correlation between source code metrics and complexity
*** DIT = Depth of Inheritance Tree, RFC = Response for a Class, WMC = Weighted Method Count, CMLOC = Class-Method Lines of Code, CBO = Coupling Between Objects
10. Training & Testing
• In training, we choose 5 different machine learning techniques to classify complexity:
1. Naive Bayes (NB)
2. Logistic Regression (LR)
3. Decision Tree (DT)
4. Random Forest (RF)
5. AdaBoost (AB)
• These are well-known classifiers in machine learning and have been used in several similar studies [3,4]
• Perform 10-fold cross-validation to reduce the variability of the performance results
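One way to set up this comparison is with scikit-learn, sketched below. The slides do not name the library, the Naive Bayes variant, or any hyperparameters, so all of those are assumptions here, and the data is synthetic (18 features stand in for the 18 source code metrics).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset; not the paper's data.
X, y = make_classification(
    n_samples=400, n_features=18, n_informative=6, random_state=0
)

# Assumed configurations; the original settings are not given in the slides.
classifiers = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
}

mean_accuracy = {}
for name, clf in classifiers.items():
    # cv=10 runs 10-fold cross-validation and returns one score per fold.
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the per-fold standard deviation next to the mean makes the variability that cross-validation is meant to control visible.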
[3] Istehad Chowdhury and Mohammad Zulkernine. Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. Journal of Systems Architecture, 57(3):294–313, 2011.
[4] Yun Zhang, David Lo, Xin Xia, Bowen Xu, Jianling Sun, and Shanping Li. Combining software metrics and text features for vulnerable file prediction. In 2015 20th International Conference on Engineering of Complex Computer Systems (ICECCS), pages 40–49. IEEE, 2015.
11. Performance Evaluation
• RQ2: How accurately can machine learning approaches predict class complexity from source code metrics?
• The Decision Tree and Random Forest classifiers have the highest accuracy and precision compared to the other classifiers.
• Random Forest has the highest recall and F1 score
• Is that enough to declare the best technique?
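The four metrics compared on this slide all derive from the confusion-matrix counts. The sketch below computes them for a binary case, treating "high" complexity as the positive class. That encoding and the sample labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive="high"):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = ["high", "high", "high", "low", "low", "low", "low", "high"]
y_pred = ["high", "low", "high", "low", "high", "low", "low", "high"]

results = binary_metrics(y_true, y_pred)
print(results)
```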
Figure: Relative performance of ML classifiers
12. Performance Evaluation (Continued)
• We focus on the false negative (FN) rate, because a false negative means a highly complex class goes undetected
• Higher FN rate -> a large number of highly complex classes are detected as Low [Very Risky Model]
• Lower FN rate -> a small number of highly complex classes are detected as Low [Less Risky Model]
• Random Forest (RF) still shows a lower FN rate than the other classifiers
• We attribute this to RF's use of bootstrap random resampling and its focus on significant elements, which works much better in prediction.
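The FN rate discussed above is FN / (FN + TP): the share of truly high-complexity classes that a model labels as low. A minimal sketch of comparing models on it follows. The model names and predictions are hypothetical and do not reproduce the paper's results.

```python
def false_negative_rate(y_true, y_pred, positive="high"):
    """Fraction of truly positive items that the model failed to flag."""
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return fn / (fn + tp) if fn + tp else 0.0


y_true = ["high"] * 4 + ["low"] * 4

# Hypothetical predictions from two models.
predictions = {
    "model_a": ["high", "high", "low", "low", "low", "low", "low", "low"],    # misses 2 of 4
    "model_b": ["high", "high", "high", "low", "low", "low", "high", "low"],  # misses 1 of 4
}

rates = {
    name: false_negative_rate(y_true, y_pred)
    for name, y_pred in predictions.items()
}
least_risky = min(rates, key=rates.get)
print(rates, least_risky)
```

Note that `model_b` also raises one false positive. The FN rate ignores that by design, which is why it complements accuracy and precision rather than replacing them.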
Figure: Relative FN rate of ML classifiers
13. Conclusion
• Problem in quality management: It is necessary to take proper action before classes become more complex
• Research Objective & Results
• We compare the performance of machine learning techniques in predicting class complexity
• Our results show that the Random Forest model performs better than the other models
• We also identify the source code metrics that have the most impact on class complexity
• Industrial Usage: Automatic ML-based prediction of code quality will allow quality managers and practitioners to take preventive action against highly complex classes
• Long-term Outcome: Ensure sustainable software, minimize software development effort, and reduce software development costs over time
If you have any questions, email me at sreza3@miners.utep.edu