
Random Subspace Sampling for Classification with Missing Data

  • Regular Paper
  • Published in the Journal of Computer Science and Technology

Abstract

Missing values are an unavoidable issue in many real-world datasets, so classification with missing data must be handled carefully: inadequate treatment of missing values can cause large errors. In this paper, we propose a random subspace sampling method, RSS, which samples missing items from the corresponding feature histogram distributions in random subspaces and is effective and efficient at different levels of missing data. Unlike most established approaches, RSS does not train on a fixed imputed dataset. Instead, we design a dynamic training strategy in which the filled values change dynamically through resampling during training. Moreover, thanks to the sampling strategy, we design an ensemble testing strategy that combines the results of multiple runs of a single model and is more efficient and resource-saving than previous ensemble methods. Finally, we combine these two strategies with the random subspace method, which makes our estimations more robust and accurate. The effectiveness of the proposed RSS method is validated by experimental studies.
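The abstract outlines enough of the mechanics for a rough illustration. Below is a minimal Python sketch of those ideas under our own assumptions, not the authors' implementation: the synthetic data, the SGD base classifier, and every name and hyperparameter are ours, and the random subspace component (giving each learner a random subset of features) is omitted for brevity. Each missing entry is filled with a draw from that feature's histogram of observed values, the fill is redrawn at every training step (the dynamic training idea), and test-time predictions average several independently resampled runs of one model (the ensemble testing idea).

```python
# Hypothetical sketch of histogram-based sampling for missing values,
# dynamic resampling during training, and single-model ensemble testing.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def fit_histograms(X, n_bins=10):
    """Per-feature histogram (bin edges, bin probabilities) of observed values."""
    hists = []
    for j in range(X.shape[1]):
        obs = X[~np.isnan(X[:, j]), j]
        counts, edges = np.histogram(obs, bins=n_bins)
        hists.append((edges, counts / counts.sum()))
    return hists

def sample_fill(X, hists):
    """Replace each NaN with a draw from that feature's histogram."""
    Xf = X.copy()
    for j, (edges, probs) in enumerate(hists):
        miss = np.isnan(Xf[:, j])
        if miss.any():
            b = rng.choice(len(probs), size=miss.sum(), p=probs)  # sample a bin
            Xf[miss, j] = rng.uniform(edges[b], edges[b + 1])     # uniform within it
    return Xf

# Synthetic data with 30% of the entries missing completely at random.
n, d = 1000, 8
X = rng.normal(size=(n, d))
y = (X[:, :4].sum(axis=1) > 0).astype(int)
X[rng.random((n, d)) < 0.3] = np.nan

hists = fit_histograms(X)
clf = SGDClassifier(loss="log_loss", random_state=0)

# Dynamic training: the filled values are redrawn at every step, so the
# classifier never sees one fixed imputation of the training set.
for step in range(200):
    idx = rng.choice(n, size=64)
    clf.partial_fit(sample_fill(X[idx], hists), y[idx], classes=[0, 1])

# Ensemble testing: average several stochastic runs of the same single
# model, each with an independently resampled fill of the missing entries.
proba = np.mean([clf.predict_proba(sample_fill(X, hists))[:, 1]
                 for _ in range(10)], axis=0)
print("accuracy on the (resampled) training data:", ((proba > 0.5) == y).mean())
```

Because each test-time run redraws the fills, averaging the runs behaves like an ensemble while training and storing only a single model.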



Author information


Corresponding author

Correspondence to Jian-Xin Wu (吴建鑫).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61772256 and 61921006.

Yun-Hao Cao is currently a Ph.D. candidate in the Department of Computer Science and Technology at Nanjing University, Nanjing. He received his B.S. degree in computer science and technology from Nanjing University, Nanjing, in 2018. His research interests are computer vision and machine learning.

Jian-Xin Wu is currently a professor in the School of Artificial Intelligence at Nanjing University, Nanjing, and is associated with the State Key Laboratory for Novel Software Technology, Nanjing. He received his B.S. and M.S. degrees from Nanjing University, Nanjing, in 1999 and 2002, respectively, and his Ph.D. degree from the Georgia Institute of Technology, Atlanta, in 2009, all in computer science. He has served as a senior area chair for CVPR, ICCV, ECCV, AAAI, and IJCAI, and as an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence. His research interests are computer vision and machine learning.



About this article


Cite this article

Cao, YH., Wu, JX. Random Subspace Sampling for Classification with Missing Data. J. Comput. Sci. Technol. 39, 472–486 (2024). https://doi.org/10.1007/s11390-023-1611-9
