Promise 2011: "Customization Support for CBR-Based Defect Prediction"
1. Customization Support for CBR-Based Defect Prediction
Elham Paikari, Department of Electrical and Computer Engineering, University of Calgary, 2500 University Drive, NW, Calgary, AB, Canada, epaikari@ucalgary.ca
Bo Sun, Department of Computer Science, University of Calgary, 2500 University Drive, NW, Calgary, AB, Canada, sbo@ucalgary.ca
Guenther Ruhe, Department of Computer Science & Department of Electrical and Computer Engineering, University of Calgary, 2500 University Drive, NW, Calgary, AB, Canada, ruhe@ucalgary.ca
Emadoddin Livani, Department of Electrical and Computer Engineering, University of Calgary, 2500 University Drive, NW, Calgary, AB, Canada, elivani@ucalgary.ca
2. Agenda
  - Parameters of a CBR Model
  - Parameters Instantiation
  - Weighting Method SANN
  - Frequency Analysis
  - Dependency Network and the Customization Support Rules
  - Transferability
  - Conclusions and Future Work
4. Instantiation of the general CBR-based prediction method
The prediction performance of a CBR model depends on four parameters:
  - Solution Algorithm
  - Similarity Function
  - Number of Nearest Neighbor Cases
  - Weighting Technique used for Attributes
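The four parameters named on this slide can be sketched as arguments of a single prediction function. This is a minimal illustration, not the authors' implementation: the function names, the weighted Euclidean distance, and the linear rank weights are all assumptions made for the example.

```python
import math

def weighted_euclidean(a, b, weights):
    """Similarity function (assumed): attribute-weighted Euclidean distance."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def unweighted_average(neighbors):
    """Solution algorithm: plain mean of the neighbors' defect counts."""
    return sum(defects for _, defects in neighbors) / len(neighbors)

def rank_weighted_average(neighbors):
    """Solution algorithm (assumed form): nearest case gets weight k, farthest gets 1."""
    k = len(neighbors)
    ranks = range(k, 0, -1)
    return sum(r * defects for r, (_, defects) in zip(ranks, neighbors)) / sum(ranks)

def cbr_predict(case_base, query, weights, k,
                similarity=weighted_euclidean, solution=unweighted_average):
    """Predict defects for `query` from the k most similar stored cases.

    case_base: list of (attribute_tuple, defect_count) pairs.
    """
    scored = sorted(
        ((similarity(attrs, query, weights), defects) for attrs, defects in case_base),
        key=lambda pair: pair[0],
    )
    return solution(scored[:k])
```

Swapping `similarity`, `weights`, `k`, or `solution` yields a different instantiation of the same general method.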
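6. Sensitivity Analysis Based on Neural Network (SANN)
[Figure: a neural network (NN) is trained on the data set attributes (CC, ..., LOC). Feeding the minimum value Xmin(A1) of attribute A1 to the NN gives OUTPUTmin(A1); feeding the maximum value Xmax(A1) gives OUTPUTmax(A1).]
∆1 = |OUTPUTmin(A1) - OUTPUTmax(A1)|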
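The sensitivity computation on this slide can be sketched as follows. The slide does not say how the other attributes are held while one attribute moves from its minimum to its maximum, nor how the ∆ values become weights; holding the others at their mean and normalizing the ∆ values to sum to 1 are assumptions of this sketch. The `predict` argument is a stand-in for any trained neural network exposed as a callable.

```python
def sann_weights(predict, rows):
    """Derive attribute weights from the output swing of a trained model.

    predict: callable taking a list of attribute values and returning a number
             (stand-in for the trained neural network).
    rows:    list of attribute tuples from the data set.
    """
    columns = list(zip(*rows))
    # Assumption: attributes not under test are fixed at their mean value.
    baseline = [sum(col) / len(col) for col in columns]
    deltas = []
    for i, col in enumerate(columns):
        low = list(baseline)
        low[i] = min(col)            # Xmin(Ai)
        high = list(baseline)
        high[i] = max(col)           # Xmax(Ai)
        deltas.append(abs(predict(low) - predict(high)))  # delta_i
    total = sum(deltas)
    if total == 0:
        # No attribute moves the output: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    # Assumption: normalize the deltas so the weights sum to 1.
    return [d / total for d in deltas]
```

An attribute whose minimum-to-maximum swing changes the model output more receives a larger weight in the similarity function.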
7. Questions addressed
  - How do the evaluation results compare with existing un-weighted methods?
  - How do the evaluation results compare with existing methods based on MLR?
  - How do different numbers of nearest neighbors affect the results?
9. Data Repository
  - PROMISE Repository
  - 120 different CBR instantiations were created and applied to 11 data sets from the PROMISE repository
  - Characterization of data sets
11. Experimental Design for Frequency Analysis
[Figure: for each of the 11 different data sets, 120 different instantiations are evaluated, and the instantiations with Min(MMRE) and Max(Pred(0.25)) are selected.]
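The two selection criteria can be made concrete with the standard definitions: MRE is |actual - predicted| / actual, MMRE is its mean over all cases, and Pred(0.25) is the fraction of cases with MRE at most 0.25. The sketch below assumes every actual value is greater than zero; the function names and the `results` layout are hypothetical.

```python
def mre(actual, predicted):
    """Magnitude of relative error for one case (requires actual > 0)."""
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Mean magnitude of relative error; lower is better."""
    return sum(mre(a, p) for a, p in zip(actuals, predictions)) / len(actuals)

def pred(actuals, predictions, level=0.25):
    """Fraction of cases whose MRE is at most `level`; higher is better."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)

def best_instantiations(results):
    """Pick the winners for one data set.

    results: dict mapping an instantiation label to (mmre_value, pred25_value).
    Returns (label with minimum MMRE, label with maximum Pred(0.25)).
    """
    by_mmre = min(results, key=lambda name: results[name][0])
    by_pred = max(results, key=lambda name: results[name][1])
    return by_mmre, by_pred
```

Running `best_instantiations` once per data set gives the winners that the frequency analysis then counts.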
12. Frequency Analysis
Frequency of the best performance in single attribute analysis:
  - Neural network based sensitivity analysis (as the weighting technique)
  - Un-weighted average (as the solution algorithm)
  - Maximum number of nearest neighbors (as the number of nearest neighbors)
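Counting how often each parameter value appears among the best-performing instantiations could look like the sketch below. The dictionary layout is hypothetical; the parameter keys reuse the attribute names from the deck (`WeightingTech`, `SolutionAlgorithm`, and so on), and any values passed in are the caller's own data.

```python
from collections import Counter

def parameter_frequencies(best_instantiations):
    """Count, per parameter, how often each value occurs among the winners.

    best_instantiations: list of dicts, one per winning instantiation,
    mapping a parameter name to the value used.
    """
    frequencies = {}
    for instantiation in best_instantiations:
        for parameter, value in instantiation.items():
            frequencies.setdefault(parameter, Counter())[value] += 1
    return frequencies

def most_frequent(frequencies):
    """Return the most common value for each parameter."""
    return {param: counts.most_common(1)[0][0] for param, counts in frequencies.items()}
```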
14. Customization Support Using DNA (Dependency Network Analysis)
Data set:
  - Eight attributes defined as condition attributes
  - Four data set-related attributes: (NumOfModule), (DefectRatio), (Language), (LOC)
  - Four CBR-related attributes: (SimFunc), (WeightingTech), (NumOfNN), (SolutionAlgorithm)
  - The decision attributes: Pred(0.25) and MMRE
[Figure: data sets described by (a1,a2,a3,a4) and CBR models instantiated by (p1,p2,p3,p4) feed DNA and rule induction, which produce a rule set. For new data (a1,a2,a3,a4), customization support uses the rule set to give a recommendation f(a1,a2,a3,a4): a CBR model instantiated by (p1,p2,p3,p4).]
15. Application of DNA Results
Generation of the Decision Trees
Given:
  - NumOfModule = High
  - DefectRatio = High
  - LOC = Medium
  - Language = JAVA
Question: How should a CBR defect prediction model be customized to achieve high prediction accuracy, measured in MMRE?
Recommendation: Customize the CBR model by means of:
  - WeightingTech = SANN
  - NumOfNN ≥ 10
  - SolutionAlgorithm = Rank-weighted Average
Justification: Based on the data set characteristics, the assumptions of rules 3, 4, 5, 11 and 12 are fulfilled. Comparing the probability distributions of MMRE across these rules, rule No. 11 is the best because it has the highest probability (69.2%) of achieving "Low" MMRE.
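The selection step in the justification (find the rules whose assumptions hold, then take the one with the highest probability of "Low" MMRE) can be sketched as below. The rule layout and the helper names are hypothetical, and the deck does not list the actual rule contents, so any rules passed in are the caller's own.

```python
def matches(conditions, dataset):
    """True if every condition of a rule holds for the data set characteristics."""
    return all(dataset.get(attr) == value for attr, value in conditions.items())

def recommend(rules, dataset):
    """Return the applicable rule with the highest probability of low MMRE.

    rules: list of dicts with hypothetical keys
           "if" (conditions), "then" (CBR parameter settings),
           "p_low_mmre" (probability of achieving "Low" MMRE).
    Returns None when no rule applies.
    """
    applicable = [rule for rule in rules if matches(rule["if"], dataset)]
    if not applicable:
        return None
    return max(applicable, key=lambda rule: rule["p_low_mmre"])
```

The `then` part of the returned rule plays the role of the recommendation f(a1,a2,a3,a4).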
18. Validation and Limitations
  - Tools used for attribute selection and modeling tasks: neural network, regression analysis, CBR, and dependency network analysis
  - Only four parameters of the CBR instantiation
  - The composition of the training and testing data sets
  - The definition of classification intervals for dependency networks
  - Two discretization algorithms
  - Sensitivity analysis
19. Conclusions and Future Work
  - Starting with 11 data sets from the PROMISE repository
  - Calculating the prediction performance of 120 instantiations of the CBR-based defect prediction model based on the values of MMRE and Pred(0.25)
  - The frequency analysis on the top performances
  - Generating the DNA to provide customization support for a new data set
  - The compatibility of rule sets extracted from different contexts
  - Enhancement of the validity with inclusion of further data sets
  - Comparing the performance against other measures
  - Other methods for rule induction