
Random Subspace Sampling for Classification with Missing Data

  • Regular Paper
  • Published in the Journal of Computer Science and Technology

Abstract

Missing values are an unavoidable issue in many real-world datasets, so classification with missing data must be handled carefully: inadequate treatment of missing values can cause large errors. In this paper, we propose a random subspace sampling method, RSS, which samples missing items from the corresponding feature histogram distributions in random subspaces and is effective and efficient at different levels of missing data. Unlike most established approaches, RSS does not train on a fixed imputed dataset. Instead, we design a dynamic training strategy in which the filled values change dynamically through resampling during training. Moreover, thanks to the sampling strategy, we design an ensemble testing strategy that combines the results of multiple runs of a single model and is more efficient and resource-saving than previous ensemble methods. Finally, we combine these two strategies with the random subspace method, which makes our estimations more robust and accurate. The effectiveness of the proposed RSS method is validated by experimental studies.
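The abstract outlines enough of the mechanics for a rough illustration. Below is a minimal Python sketch of those ideas under our own assumptions, not the authors' implementation: the synthetic data, the SGD base classifier, and every name and hyperparameter are ours, and the random subspace component (giving each learner a random subset of features) is omitted for brevity. Each missing entry is filled with a draw from that feature's histogram of observed values, the fill is redrawn at every training step (the dynamic training idea), and test-time predictions average several independently resampled runs of one model (the ensemble testing idea).

```python
# Hypothetical sketch of histogram-based sampling for missing values,
# dynamic resampling during training, and single-model ensemble testing.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def fit_histograms(X, n_bins=10):
    """Per-feature histogram (bin edges, bin probabilities) of observed values."""
    hists = []
    for j in range(X.shape[1]):
        obs = X[~np.isnan(X[:, j]), j]
        counts, edges = np.histogram(obs, bins=n_bins)
        hists.append((edges, counts / counts.sum()))
    return hists

def sample_fill(X, hists):
    """Replace each NaN with a draw from that feature's histogram."""
    Xf = X.copy()
    for j, (edges, probs) in enumerate(hists):
        miss = np.isnan(Xf[:, j])
        if miss.any():
            b = rng.choice(len(probs), size=miss.sum(), p=probs)  # sample a bin
            Xf[miss, j] = rng.uniform(edges[b], edges[b + 1])     # uniform within it
    return Xf

# Synthetic data with 30% of the entries missing completely at random.
n, d = 1000, 8
X = rng.normal(size=(n, d))
y = (X[:, :4].sum(axis=1) > 0).astype(int)
X[rng.random((n, d)) < 0.3] = np.nan

hists = fit_histograms(X)
clf = SGDClassifier(loss="log_loss", random_state=0)

# Dynamic training: the filled values are redrawn at every step, so the
# classifier never sees one fixed imputation of the training set.
for step in range(200):
    idx = rng.choice(n, size=64)
    clf.partial_fit(sample_fill(X[idx], hists), y[idx], classes=[0, 1])

# Ensemble testing: average several stochastic runs of the same single
# model, each with an independently resampled fill of the missing entries.
proba = np.mean([clf.predict_proba(sample_fill(X, hists))[:, 1]
                 for _ in range(10)], axis=0)
print("accuracy on the (resampled) training data:", ((proba > 0.5) == y).mean())
```

Because each test-time run redraws the fills, averaging the runs behaves like an ensemble while training and storing only a single model.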



Author information


Corresponding author

Correspondence to Jian-Xin Wu (吴建鑫).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61772256 and 61921006.

Yun-Hao Cao is currently a Ph.D. candidate in the Department of Computer Science and Technology at Nanjing University, Nanjing. He received his B.S. degree in computer science and technology from Nanjing University, Nanjing, in 2018. His research interests are computer vision and machine learning.

Jian-Xin Wu is currently a professor in the School of Artificial Intelligence at Nanjing University, Nanjing, and is associated with the State Key Laboratory for Novel Software Technology, Nanjing. He received his B.S. and M.S. degrees from Nanjing University, Nanjing, in 1999 and 2002, respectively, and his Ph.D. degree from the Georgia Institute of Technology, Atlanta, in 2009, all in computer science. He has served as a senior area chair for CVPR, ICCV, ECCV, AAAI, and IJCAI, and as an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence. His research interests are computer vision and machine learning.



About this article


Cite this article

Cao, YH., Wu, JX. Random Subspace Sampling for Classification with Missing Data. J. Comput. Sci. Technol. 39, 472–486 (2024). https://doi.org/10.1007/s11390-023-1611-9
