Abstract
Software defect prediction aims to find potential defects based on historical data and software features. Software features reflect the characteristics of software modules. However, some of these features are more relevant to the class (defective or non-defective), while others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and candidate feature subsets are selected from the ranking list in sequence. Finally, each feature subset is evaluated with a k-nearest neighbor (KNN) model and measured by the area under the receiver operating characteristic curve (AUC) for classification performance. Experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach outperforms or is comparable to the compared feature selection approaches in terms of classification performance.
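The pipeline described above (similarity-based feature weighting, descending ranking, sequential subset evaluation with KNN and AUC) can be summarized in a short sketch. The exact SM weight-update rule is not given in the abstract, so the sketch below substitutes a simple Relief-style surrogate (nearest same-class and different-class neighbors) as a placeholder, and uses synthetic data in place of the NASA datasets; scikit-learn's KNeighborsClassifier and roc_auc_score stand in for the KNN model and AUC metric. It is a minimal illustration under these assumptions, not the paper's implementation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def sm_like_weights(X, y):
    # Placeholder for the paper's SM update: reward features whose values are
    # close for nearest same-class samples and far apart for nearest
    # different-class samples (a Relief-style surrogate, assumed here).
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dist[same])]    # nearest sample of the same class
        miss = diff[np.argmin(dist[diff])]   # nearest sample of the other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w

# Synthetic stand-in for a defect dataset: rows are modules, columns are metrics,
# labels mark defective (1) vs. non-defective (0) modules.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Step 1: weight features with the (placeholder) similarity measure.
weights = sm_like_weights(X_tr, y_tr)
# Step 2: rank features by weight in descending order.
ranking = np.argsort(weights)[::-1]
# Step 3: evaluate nested subsets (top-1, top-2, ...) with KNN, scored by AUC.
best_auc, best_subset = -1.0, None
for k in range(1, len(ranking) + 1):
    subset = ranking[:k]
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, subset], y_tr)
    proba = knn.predict_proba(X_te[:, subset])[:, 1]
    auc = roc_auc_score(y_te, proba)
    if auc > best_auc:
        best_auc, best_subset = auc, subset
print("best AUC:", round(best_auc, 3), "selected features:", best_subset)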
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 61673384 and 61502497), the Guangxi Key Laboratory of Trusted Software (No. kx201530), the China Postdoctoral Science Foundation (No. 2015M581887), and the Scientific Research Innovation Project for Graduate Students of Jiangsu Province, China (No. KYLX15_1443)
Cite this article
Yu, Q., Jiang, Sj., Wang, Rc. et al. A feature selection approach based on a similarity measure for software defect prediction. Frontiers Inf Technol Electronic Eng 18, 1744–1753 (2017). https://doi.org/10.1631/FITEE.1601322