Abstract
Software defect prediction aims to find potential defects based on historical data and software features. Software features reflect the characteristics of software modules. However, some of these features are more relevant to the class (defective or non-defective), while others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and candidate feature subsets are selected from the ranking list in sequence. Finally, each feature subset is evaluated with a k-nearest neighbor (KNN) model and measured by the area under the receiver operating characteristic curve (AUC) for classification performance. Experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach outperforms or is comparable to the compared feature selection approaches in terms of classification performance.
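The pipeline described above (similarity-based feature weighting, descending ranking, sequential subset evaluation with KNN and AUC) can be summarized in a short sketch. The exact SM weight-update rule is not given in the abstract, so the sketch below substitutes a simple Relief-style surrogate (nearest same-class and different-class neighbors) as a placeholder, and uses synthetic data in place of the NASA datasets; scikit-learn's KNeighborsClassifier and roc_auc_score stand in for the KNN model and AUC metric. It is a minimal illustration under these assumptions, not the paper's implementation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def sm_like_weights(X, y):
    # Placeholder for the paper's SM update: reward features whose values are
    # close for nearest same-class samples and far apart for nearest
    # different-class samples (a Relief-style surrogate, assumed here).
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dist[same])]    # nearest sample of the same class
        miss = diff[np.argmin(dist[diff])]   # nearest sample of the other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w

# Synthetic stand-in for a defect dataset: rows are modules, columns are metrics,
# labels mark defective (1) vs. non-defective (0) modules.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Step 1: weight features with the (placeholder) similarity measure.
weights = sm_like_weights(X_tr, y_tr)
# Step 2: rank features by weight in descending order.
ranking = np.argsort(weights)[::-1]
# Step 3: evaluate nested subsets (top-1, top-2, ...) with KNN, scored by AUC.
best_auc, best_subset = -1.0, None
for k in range(1, len(ranking) + 1):
    subset = ranking[:k]
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, subset], y_tr)
    proba = knn.predict_proba(X_te[:, subset])[:, 1]
    auc = roc_auc_score(y_te, proba)
    if auc > best_auc:
        best_auc, best_subset = auc, subset
print("best AUC:", round(best_auc, 3), "selected features:", best_subset)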
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 61673384 and 61502497), the Guangxi Key Laboratory of Trusted Software (No. kx201530), the China Postdoctoral Science Foundation (No. 2015M581887), and the Scientific Research Innovation Project for Graduate Students of Jiangsu Province, China (No. KYLX15_1443)
Cite this article
Yu, Q., Jiang, Sj., Wang, Rc. et al. A feature selection approach based on a similarity measure for software defect prediction. Frontiers Inf Technol Electronic Eng 18, 1744–1753 (2017). https://doi.org/10.1631/FITEE.1601322