Multiple imputation using nearest neighbor methods

Authors:

Gerhard Tutz

Volume 570, Issue C

Pages 500–516

https://doi.org/10.1016/j.ins.2021.04.009

Published: 01 September 2021

Abstract

Missing values are a major problem in medical research. Because complete case analysis discards useful information, estimation and inference may suffer strongly. Multiple imputation has been shown to be a useful strategy for handling missing data and accounting for the uncertainty of imputation. In the presence of high-dimensional data (p ≫ n), missing values raise even more serious problems, as existing software packages tend to fail. We present multiple imputation methods based on nearest neighbors. The distances are computed using the correlation between the target and the candidate predictors, so that only the relevant predictors contribute to the distances. The method also imputes missing values successfully in high-dimensional settings. Using a variety of simulated data with MCAR and MAR missing patterns, the proposed algorithm is compared to existing methods. Various measures are used to compare the performance of the methods, including the MSE of imputation, the MSE of the estimated regression coefficients, their standard errors, confidence intervals, and coverage probabilities. The simulation results, for both n < p and n > p, show that sequential imputation using weighted nearest neighbors can be applied successfully to a wide range of data settings and outperforms, or is close to, the best of the existing methods.
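The idea of correlation-weighted nearest neighbor imputation can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the distance, the weighting function, or the way donors are selected, so the function name `weighted_nn_impute`, the weight `|correlation| ** power`, the number of neighbors `k`, and the random draw of one donor among the `k` nearest rows are all assumptions made for illustration.

```python
import numpy as np


def weighted_nn_impute(X, k=5, power=2.0, rng=None):
    """Return one completed copy of X (n x p) with NaNs filled in.

    Assumptions of this sketch (not taken from the paper):
    - predictor weights are |Pearson correlation| ** power, computed
      from rows where both columns are observed;
    - one donor is drawn at random among the k nearest rows, so repeated
      calls with different seeds give different completed data sets.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n, p = X.shape
    for j in range(p):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        donors = np.flatnonzero(~miss)
        # Weight of each candidate predictor s for the target column j.
        w = np.zeros(p)
        for s in range(p):
            if s == j:
                continue
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, s])
            if both.sum() > 2 and X[both, j].std() > 0 and X[both, s].std() > 0:
                w[s] = abs(np.corrcoef(X[both, j], X[both, s])[0, 1]) ** power
        for i in np.flatnonzero(miss):
            # Squared differences, NaN where either row lacks the value.
            diff = (X[donors] - X[i]) ** 2
            seen = ~np.isnan(diff)
            num = np.where(seen, diff * w, 0.0).sum(axis=1)
            den = (seen * w).sum(axis=1)
            # Weighted mean squared difference; inf if nothing is comparable.
            dist = np.where(den > 0, num / np.where(den > 0, den, 1.0), np.inf)
            nearest = donors[np.argsort(dist)[:k]]
            out[i, j] = X[rng.choice(nearest), j]
    return out
```

Calling the function several times with different seeds, for example `[weighted_nn_impute(X, rng=s) for s in range(5)]`, yields several completed data sets, which is the input that multiple imputation analyses expect. Because weakly correlated columns receive weights near zero, they barely affect the distances, which is one plausible reading of the claim that only relevant predictors contribute.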


Published In

Information Sciences: an International Journal Volume 570, Issue C

Sep 2021

849 pages

ISSN:0020-0255

Elsevier Inc.

Publisher

Elsevier Science Inc.

United States
