
A feature selection approach based on a similarity measure for software defect prediction

Published in: Frontiers of Information Technology & Electronic Engineering

Abstract

Software defect prediction aims to find potential defects based on historical data and software features. Software features can reflect the characteristics of software modules; however, some of these features are more relevant to the class (defective or non-defective), while others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and candidate feature subsets are selected from the top of the ranking list in sequence. Finally, each feature subset is evaluated with a k-nearest neighbor (KNN) model and measured by the area under the curve (AUC) metric for classification performance. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than, or is comparable to, the compared feature selection approaches in terms of classification performance.
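The pipeline described in the abstract (weight features by how samples of different classes relate, rank the features, then score growing prefixes of the ranking with a KNN classifier under AUC) can be illustrated with the following sketch. This is not the paper's implementation: the weight-update rule below is a Relief-style nearest-hit/nearest-miss contrast standing in for the paper's similarity measure (SM), and the hold-out split, k value, and function names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def feature_weights(X, y):
    """Relief-style stand-in for the similarity-based weight update (assumption)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for i in range(n_samples):
        same = X[y == y[i]]                      # samples of the same class (incl. self)
        diff = X[y != y[i]]                      # samples of the other class
        d_same = np.abs(same - X[i]).sum(axis=1)
        d_same[d_same == 0] = np.inf             # ignore the sample itself
        d_diff = np.abs(diff - X[i]).sum(axis=1)
        hit = same[np.argmin(d_same)]            # nearest same-class neighbour
        miss = diff[np.argmin(d_diff)]           # nearest other-class neighbour
        # features that vary more across classes than within a class gain weight
        w += np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return w / n_samples


def select_features(X, y, k=5, seed=0):
    """Rank features by weight, keep the ranking prefix with the best KNN AUC."""
    ranking = np.argsort(feature_weights(X, y))[::-1]        # descending feature weights
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    best_auc, best_subset = 0.0, ranking[:1]
    for m in range(1, len(ranking) + 1):                     # growing subsets, in rank order
        subset = ranking[:m]
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr[:, subset], y_tr)
        scores = clf.predict_proba(X_te[:, subset])[:, 1]    # assumes defective is labeled 1
        auc = roc_auc_score(y_te, scores)
        if auc > best_auc:
            best_auc, best_subset = auc, subset
    return best_subset, best_auc
```

A more faithful evaluation would estimate AUC with cross-validation rather than a single hold-out split; the split is used here only to keep the sketch short.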



Author information


Corresponding author

Correspondence to Shu-juan Jiang.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61673384 and 61502497), the Guangxi Key Laboratory of Trusted Software (No. kx201530), the China Postdoctoral Science Foundation (No. 2015M581887), and the Scientific Research Innovation Project for Graduate Students of Jiangsu Province, China (No. KYLX15_1443)


About this article


Cite this article

Yu, Q., Jiang, Sj., Wang, Rc. et al. A feature selection approach based on a similarity measure for software defect prediction. Frontiers Inf Technol Electronic Eng 18, 1744–1753 (2017). https://doi.org/10.1631/FITEE.1601322

