Abstract
Defect prediction is a technique introduced to optimize the testing phase of the software development pipeline by predicting which components of the software are likely to contain defects. A classifier is trained on a set of features measured for each component of the target software project to predict whether that component is defective. However, when defect information is not available for the target project, an alternative approach trains the classifier on data from external projects; this approach is called cross-project defect prediction. Bad code smells are patterns of poor development practice that indicate flaws in the design and implementation of the code. They have been explored in within-project defect prediction and shown to be good predictors of defects, but they have not been studied as features for cross-project defect prediction. In our experiment, we train defect prediction models for 100 projects to evaluate the predictive performance of bad code smells. We implemented four cross-project approaches known in the literature and compared the performance of 37 smells with that of 56 code metrics commonly used for defect prediction. The results show that cross-project defect prediction models trained with code smells achieved a significant improvement of \(6.50\%\) in ROC AUC over models trained with code metrics.
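The cross-project setup described above can be sketched as follows: a classifier is fit on a labeled external (source) project and evaluated on the unlabeled target project using ROC AUC. This is a minimal illustration with synthetic stand-in data, not the study's actual pipeline; the feature generator, classifier choice, and data shapes are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_project(n_components, n_features, shift):
    """Synthetic stand-in for per-component feature vectors
    (e.g., smell counts or code metrics measured on each class/file)."""
    X = rng.normal(shift, 1.0, size=(n_components, n_features))
    # Defect label correlated with the first feature, so the
    # classifier has signal to learn from.
    y = (X[:, 0] + rng.normal(0.0, 1.0, n_components) > shift).astype(int)
    return X, y

# Cross-project setting: train on an external "source" project and
# evaluate on the "target" project, whose labels are unavailable at
# training time (they are used here only for scoring).
X_src, y_src = make_project(300, 10, shift=0.0)
X_tgt, y_tgt = make_project(200, 10, shift=0.2)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_src, y_src)

# ROC AUC on the target project: the metric used to compare
# smell-based and metric-based models in the study.
auc = roc_auc_score(y_tgt, clf.predict_proba(X_tgt)[:, 1])
print(f"cross-project ROC AUC: {auc:.3f}")
```

In the study the source/target pairing is governed by one of the four cross-project approaches; this sketch shows only the simplest case of a single external training project.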





Availability of data and material
The data sets generated and analyzed during the current study are available in the public data repository https://zenodo.org/record/4697491.
Funding
This work was supported by the Cyber Security Research Center at the Ben-Gurion University of the Negev.
Author information
Authors and Affiliations
Contributions
Conceptualization: BSM, MK; Funding acquisition: MK; Investigation: BSM, MK; Methodology: BSM, MK; Supervision: MK; Visualization: BSM; Writing—original draft: BSM; Writing—review & editing: BSM, MK.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing interests or personal relationships that could have influenced the work reported in this paper.
Code availability
The software developed during the current study is available in the public repository at https://github.com/Bruno81930/smells.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Sotto-Mayor, B., Kalech, M. Cross-project smell-based defect prediction. Soft Comput 25, 14171–14181 (2021). https://doi.org/10.1007/s00500-021-06254-7