Today I would like to present my master thesis on the topic "The impact of design complexity on software cost and quality". The thesis was performed under the direct supervision of Marcus Ciolkowski and the general supervision of Professor Dieter Rombach.
Here is the agenda for the presentation. First, I will present the motivation for the research topic, including its importance for software practice and the research community. Then, the research problem is formally stated as research questions. In the research methodology, I will present the approaches used to answer these questions. The next two parts are the research results and their interpretation. Finally, I will discuss some significant threats to the validity of our research, as well as future work.
It is a common hypothesis that the structural features of a design, such as coupling, cohesion, and inheritance, have an impact on external quality attributes. The reasoning is that a complex design structure can take a developer or tester more effort to understand, implement, and maintain. It could therefore lead to undesired software quality, such as increased fault proneness or reduced maintainability. Although many studies investigate the relationship between design complexity and cost and quality, it is unclear what we have learned from these studies, because no systematic synthesis exists to date.
This master thesis addresses the main research question: What is the impact of design complexity on software quality? This question (RQ) is divided into five sub-questions (SQ). In particular, we would like to know:
We use four research methods to answer these five sub-questions, as shown in the diagram. A literature review is used to get a quick impression of what types of cost and quality attributes are investigated. Then a systematic literature review is performed, focusing on the most common quality attributes in the literature. The data extracted from the systematic literature review is used as input for the synthesis methods. Two available quantitative synthesis methods are vote counting and meta-analysis. Vote counting is selected to answer sub-question 3: a design metric is a potential predictor of software quality if the majority of studies that investigate their relationship vote for it. Meta-analysis is used to synthesize and quantify the global impact of a design metric on an external quality attribute, which answers SQ4. The meta-analysis procedure also includes an explanation of disagreements between studies, which answers SQ5.
This slide presents the result of the study search and selection process. After searching three electronic databases, namely Scopus, IEEE Xplore, and the ACM Digital Library, we found 39 papers. The reference scan and the search for grey literature then yielded 18 more papers. In total, the systematic search resulted in 57 primary studies. These two figures show the distribution of primary studies over publication year and publication channel. They reveal that the number of papers on the topic has been increasing over the last 5 years. Moreover, the selected papers mainly come from high-quality sources such as book chapters, international journals, and conferences.
From this slide on, I present the results for answering the research questions. The diagram shows the cost and quality attributes that are investigated in design-complexity studies. The external quality attributes fall into three categories: reliability attributes such as fault proneness, fault density, and vulnerability; maintainability and its sub-categories such as testability and changeability; and development effort such as implementation cost, debugging effort, and refactoring effort. Notably, the main portion of studies focuses on fault proneness, with 45% of the studies, and maintainability, with 25% of the studies. Fault proneness is the probability that a class is faulty. Maintainability involves the effort necessary to maintain a class. Since only these two attributes are investigated in a sufficient number of studies, fault proneness and maintainability are considered for SQ3, SQ4, and SQ5.
This slide presents the results for SQ2. The most frequently proposed and used design metrics focus on the coupling, cohesion, inheritance, scale, and polymorphism aspects. Coupling metrics form the largest group, followed by scale, inheritance, and cohesion metrics. Interestingly, this order is the same for both fault-proneness and maintainability studies. In terms of individual design metrics, the C&K metric set is the most commonly used. Here I explain the definitions of those metrics: NOC is the number of children of a class, … DIT is the depth of a class in the inheritance tree.
In this slide, we recall some basic concepts related to the topic. How do we measure the impact? How do we know whether the impact is strong or weak? How do we know the impact did not happen by chance? The impact of a design complexity metric on cost and quality is quantified by statistical correlation. Correlation analysis investigates the extent to which changes in the value of one variable (such as the value of a complexity metric in a class) are associated with changes in another variable (such as the number of defects in a class). The intensity of the correlation is called the effect size. Three effect sizes are common in correlational studies: the Spearman coefficient, the Pearson coefficient, and the odds ratio. For the purpose of demonstration, in the coming slides we consider the impact in terms of the Spearman correlation coefficient. The impact can be positive or negative. A positive impact means that an increase in the value of one variable leads to an increase in the value of the other; a negative impact means that an increase in one variable leads to a decrease in the other. The absolute value of the Spearman coefficient ranges from 0 to 1. Cohen classified a coefficient below 0.1 as trivial, up to 0.3 as small, up to 0.5 as medium, and above 0.5 as large. To know whether the impact happened by chance, we use a statistical index called the p-value. A p-value of 0.05, or a significance level of 5%, means there is only a 5% probability that the measured impact happened by chance. Note that correlation does not imply causation, due to confounding factors. However, it is still an effective method for selecting candidate variables for cause-effect relationships.
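As a rough illustration of these concepts, the Spearman coefficient can be computed as the Pearson correlation of the rank vectors. The sketch below uses plain Python; the `cbo` and `defects` arrays are invented class-level data for demonstration, not values from the primary studies:

```python
from math import sqrt

def ranks(xs):
    """Average ranks (tied values get the mean of their rank positions)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

# Hypothetical per-class data: a coupling metric (CBO) vs. defect counts.
cbo     = [2, 5, 7, 9, 12, 15, 20]
defects = [0, 1, 1, 3, 4, 6, 9]
rho = spearman(cbo, defects)

# A t-statistic approximation can then be used to derive a p-value:
# a large |t| corresponds to a small p-value.
t = rho * sqrt((len(cbo) - 2) / (1 - rho ** 2))
```

Here the monotonically increasing defect counts yield a rho close to 1, i.e. a large positive impact in Cohen's classification.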
To determine whether a design metric is a potential predictor of external attributes, we test each design metric with the following hypothesis: H0: there is no positive impact of metric X on quality attribute Y. Under vote counting, H0 is rejected if the ratio of the number of reported positive significant effect sizes to the total number of reported effect sizes is larger than 0.5. The table shows the result of the hypothesis test for some metrics in fault-proneness studies. The procedure is performed analogously for the hypothesis of negative impact.
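The vote-counting decision rule can be sketched in a few lines of Python. The `reported` list below is hypothetical data standing in for the effect sizes extracted from the primary studies:

```python
# Vote counting over reported effect sizes for one metric (hypothetical data).
# Each entry: (spearman_coefficient, p_value) reported by one study.
reported = [(0.42, 0.001), (0.31, 0.02), (0.05, 0.40),
            (0.55, 0.003), (-0.10, 0.35)]

ALPHA = 0.05
positive_significant = sum(1 for r, p in reported if r > 0 and p < ALPHA)
ratio = positive_significant / len(reported)

# H0 ("no positive impact") is rejected when the majority of studies
# report a significant positive effect size.
reject_h0 = ratio > 0.5
```

With 3 of 5 studies reporting a significant positive effect, the ratio is 0.6 and H0 is rejected, i.e. the metric is voted a potential predictor.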
A high level of heterogeneity indicates that the effect sizes come from a heterogeneous population. In other words, there may exist subgroups within the population whose true effects differ. In this case the aggregation should take the between-subgroup variation into account as well. The calculation method for this is called the random-effects model. The table shows the results of aggregating the Spearman coefficients for six design metrics and LOC. We found a high level of heterogeneity for all of these metrics and therefore use a random-effects model in all cases. This diagram compares the 95% confidence intervals of the effect sizes of the seven metrics.
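A minimal sketch of random-effects aggregation is given below. It uses the DerSimonian–Laird estimator on Fisher-z-transformed correlations, which is one standard approach; the thesis may use a different estimator, and the `studies` data here are invented:

```python
from math import atanh, tanh

# Hypothetical per-study data: (spearman_r, sample_size).
studies = [(0.45, 120), (0.30, 250), (0.60, 80), (0.25, 400)]

# Fisher z-transform; the variance of z is approximately 1 / (n - 3).
z = [atanh(r) for r, n in studies]
v = [1.0 / (n - 3) for r, n in studies]
w = [1.0 / vi for vi in v]                      # fixed-effect weights

zbar_fe = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
Q = sum(wi * (zi - zbar_fe) ** 2 for wi, zi in zip(w, z))
k = len(studies)
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)              # between-study variance (DerSimonian-Laird)

w_re = [1.0 / (vi + tau2) for vi in v]          # random-effects weights
zbar_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
pooled_r = tanh(zbar_re)                        # back-transform to a correlation
```

When tau² is greater than zero, the random-effects weights are more uniform than the fixed-effect weights, so smaller studies contribute relatively more to the pooled estimate.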
The significance level can tell us whether a metric is theoretically correlated with an external quality attribute. But to be practically meaningful, the strength of the impact should also be large enough. Meta-analysis is applied here to quantify and synthesize the Spearman coefficients reported in different studies. An example of the global Spearman coefficient estimation for RFC in fault-proneness studies is shown in the diagram. Each reported Spearman coefficient is weighted by the size of its data set. The rectangle represents the weight of the effect size, and its position on the axis is its magnitude. The line is the confidence interval of the effect size, and the diamond is the aggregated effect size. We can see that all reported Spearman coefficients are larger than 0, which indicates a positive impact. I² is an index that represents the heterogeneity among the reported effect sizes; an I² larger than 70% indicates a high level of heterogeneity.
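The I² index can be derived from Cochran's Q statistic, as sketched below. The reported coefficients and sample sizes are hypothetical placeholders, not the actual RFC data from the thesis:

```python
from math import atanh

# Hypothetical reported Spearman coefficients with sample sizes.
studies = [(0.52, 90), (0.35, 300), (0.61, 60), (0.40, 150)]

z = [atanh(r) for r, n in studies]          # Fisher z effect sizes
w = [n - 3 for r, n in studies]             # inverse-variance weights, var(z) = 1/(n-3)

zbar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
Q = sum(wi * (zi - zbar) ** 2 for wi, zi in zip(w, z))   # Cochran's Q
df = len(studies) - 1

# I^2: percentage of observed variation due to heterogeneity rather than chance.
I2 = max(0.0, (Q - df) / Q) * 100
```

An I² of 0% means all variation is attributable to sampling error; values above roughly 70% are conventionally read as high heterogeneity.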
In the previous questions, we found high heterogeneity in the populations of all investigated metrics, and we want to find an explanation for it. One available approach is subgroup analysis: we attempt to find a moderator variable that can account for a significant part of the observed variation. The heterogeneity test is performed for each subgroup. One minus the ratio of the within-subgroup heterogeneity to the whole-population heterogeneity gives ve, the percentage of variance explained by the moderator variable. We calculate the ve value for each suspected moderator variable and each design metric. The moderator variables here are the characteristics of the data sets that we extracted earlier. The results show that the defect collection phase can explain more than 50% of the observed variance for 5 out of 7 investigated metrics. The domain can explain 76% of the variance in the case of NOC. In some cases, for example RFC and WMC, the defect collection phase separates the 95% confidence intervals of pre-release and post-release defects. The correlation between metrics and pre-release defects is stronger than with post-release defects. The number of post-release defects is likely smaller than the number of pre-release defects because of the testing process; therefore, a faulty class is less likely to be correlated with design complexity, due to the smaller probability of being detected.
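Assuming ve is computed from Q statistics as one minus the ratio of within-subgroup to total heterogeneity, the subgroup analysis can be sketched as follows. The studies and the "pre"/"post" moderator labels are hypothetical:

```python
from math import atanh

# Hypothetical (spearman_r, sample_size, moderator) tuples; the moderator
# is the defect-collection phase of each study's data set.
studies = [(0.55, 100, "pre"), (0.50, 150, "pre"),
           (0.20, 200, "post"), (0.25, 120, "post")]

def q_stat(subset):
    """Cochran's Q for a set of studies, on Fisher-z effect sizes."""
    z = [atanh(r) for r, n, _ in subset]
    w = [n - 3 for r, n, _ in subset]
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    return sum(wi * (zi - zbar) ** 2 for wi, zi in zip(w, z))

q_total = q_stat(studies)
q_within = sum(q_stat([s for s in studies if s[2] == g])
               for g in {"pre", "post"})

# Share of the observed heterogeneity accounted for by the moderator.
ve = 1 - q_within / q_total
```

In this constructed example the pre-release correlations are clearly stronger than the post-release ones, so the moderator explains most of the heterogeneity and ve is close to 1.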
In this slide, we compare our results with the perceptions in the literature. The results from vote counting and meta-analysis statistically confirm the common claims about the relationship between design metrics and software fault proneness. In general, our results agree with the intuitive perception of the relationship for the C&K metrics, except for DIT and LCOM. It was surprising to us that the programming language cannot explain the differences in the effect of the C&K metrics on fault proneness.
Threats to validity could come from the systematic review and meta-analysis procedures. Bias in study selection is one threat, due to the use of a single reviewer. The varying quality of the selected studies is a trade-off against the desire to capture all reported effect sizes. The limitations of the research designs in observational and historical methods are a shortcoming of the research area. Threats to conclusion validity include the lack of information reported in studies, such as raw data for univariate logistic regression and moderator variables. This suggests that the information reported in primary studies should be improved to support aggregation.
This slide summarizes the results of our research.
Compare before and after rework. Influence of context setting.