Abstract
Many data mining algorithms cannot handle incomplete datasets, in which some data samples are missing attribute values. To solve this problem, missing value imputation is usually performed: estimated replacements for the missing values are inferred from the observed or complete data. In general, missing value imputation methods can be classified into statistical and machine learning methods. The statistical methods are usually based on the mean for continuous attributes or the mode for discrete attributes, whereas the machine learning methods are based on supervised learning techniques. However, which machine learning method performs best for missing value imputation is unknown. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, to examine their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that CART outperforms the other methods for categorical datasets, whereas the MLP performs best for numerical and mixed datasets in terms of classification accuracy. However, when computational cost is a factor, CART is preferable to the MLP because it provides reasonably accurate imputation results and requires the least time to perform missing value imputation. Moreover, CART generates the lowest root-mean-squared error of all methods.
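The distinction between statistical and supervised-learning imputation can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dataset, the function names (`mean_mode_impute`, `knn_impute`, `distance`), the unscaled mixed distance, and the choice of k = 2 are all assumptions made for illustration. The first function fills each gap with the column mean or mode; the second fills it from the k most similar complete samples, in the spirit of k-nearest neighbor imputation.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: (age, income, colour); None marks a missing value.
rows = [
    (25.0, 30.0, "red"),
    (27.0, 32.0, "red"),
    (26.0, None, "red"),
    (52.0, 80.0, "blue"),
    (55.0, 85.0, "blue"),
    (54.0, 83.0, None),
]

def mean_mode_impute(rows):
    """Statistical baseline: column mean for numbers, column mode for categories."""
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        if isinstance(observed[0], float):
            fills.append(sum(observed) / len(observed))
        else:
            fills.append(Counter(observed).most_common(1)[0][0])
    return [tuple(f if v is None else v for v, f in zip(row, fills)) for row in rows]

def distance(a, b, skip):
    """Mixed distance over the attributes not in `skip` (assumed, unscaled):
    squared difference for numbers, 0/1 mismatch for categories."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in skip:
            continue
        if isinstance(x, float):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return sqrt(total)

def knn_impute(rows, k=2):
    """Fill each gap from the k nearest complete rows (mean or majority vote)."""
    complete = [r for r in rows if None not in r]
    result = []
    for row in rows:
        missing = {j for j, v in enumerate(row) if v is None}
        if not missing:
            result.append(row)
            continue
        neighbours = sorted(complete, key=lambda c: distance(row, c, missing))[:k]
        filled = list(row)
        for j in missing:
            values = [n[j] for n in neighbours]
            if isinstance(values[0], float):
                filled[j] = sum(values) / len(values)
            else:
                filled[j] = Counter(values).most_common(1)[0][0]
        result.append(tuple(filled))
    return result

baseline = mean_mode_impute(rows)
knn = knn_impute(rows, k=2)
print(baseline[2], baseline[5])
print(knn[2], knn[5])
```

On this toy data the baseline fills the missing income with the global mean 62.0 and the missing colour with the global mode "red", whereas the neighbor-based version fills them with 31.0 and "blue", the values shared by the most similar complete samples. A real experiment would also scale the numerical attributes before computing distances.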
Notes
The computing environment is a PC with an Intel® Core™ i7-2600 CPU @ 3.40 GHz and 4 GB RAM.
Acknowledgements
This work was supported by the Ministry of Science and Technology of Taiwan (MOST 105-2410-H-008-043-MY3).
About this article
Cite this article
Tsai, CF., Hu, YH. Empirical comparison of supervised learning techniques for missing value imputation. Knowl Inf Syst 64, 1047–1075 (2022). https://doi.org/10.1007/s10115-022-01661-0
DOI: https://doi.org/10.1007/s10115-022-01661-0