research-article

Multiple imputation using nearest neighbor methods

Published: 01 September 2021

Abstract

Missing values are a major problem in medical research. Because complete case analysis discards useful information, estimation and inference can suffer considerably. Multiple imputation has been shown to be a useful strategy for handling missing data and accounting for the uncertainty of imputation. With high-dimensional data (p ≫ n), missing values raise even more serious problems because existing software packages tend to fail. We present multiple imputation methods based on nearest neighbors. Distances are computed using the correlation between the target and the candidate predictors, so only relevant predictors contribute to the distances. The method successfully imputes missing values in high-dimensional settings as well. Using a variety of simulated data with MCAR (missing completely at random) and MAR (missing at random) missingness patterns, the proposed algorithm is compared with existing methods. Performance is assessed with several measures, including the MSE of imputation, the MSE of the estimated regression coefficients, their standard errors, confidence intervals, and coverage probabilities. The simulation results, for both n < p and n > p, show that sequential imputation using weighted nearest neighbors can be applied successfully to a wide range of data settings and either outperforms the existing methods or comes close to the best of them.
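To make the idea in the abstract concrete, the following is a minimal Python sketch of correlation-weighted nearest neighbor imputation with bootstrap-based multiple imputation. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`pairwise_corr`, `wnn_impute`, `multiple_impute`), the weight |correlation|^power, the scale-free Gaussian kernel, the default `k=5`, and the use of a bootstrap donor pool are all hypothetical choices made for this example. It also imputes each entry from the originally observed data rather than sequentially.

```python
import numpy as np


def pairwise_corr(X):
    """Correlation matrix where each pair of columns uses rows observed in both."""
    p = X.shape[1]
    C = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            ok = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            if ok.sum() > 2 and X[ok, a].std() > 0 and X[ok, b].std() > 0:
                C[a, b] = C[b, a] = np.corrcoef(X[ok, a], X[ok, b])[0, 1]
    return C


def wnn_impute(X, pool=None, k=5, power=2.0):
    """Fill NaNs in X using the k nearest donor rows taken from `pool`.

    Assumed design (not from the source): predictor l contributes to the
    distance for target column j with weight |corr(j, l)| ** power, and the
    k nearest donors are averaged with Gaussian kernel weights.
    """
    X = np.asarray(X, dtype=float)
    pool = X if pool is None else np.asarray(pool, dtype=float)
    out = X.copy()
    W = np.abs(pairwise_corr(pool)) ** power
    for i, j in zip(*np.where(np.isnan(X))):
        w = W[j].copy()
        w[j] = 0.0  # the target column itself is never a predictor
        donors = np.where(~np.isnan(pool[:, j]))[0]
        diff2 = (pool[donors] - X[i]) ** 2       # NaN where either row is missing
        shared = ~np.isnan(diff2)
        num = np.where(shared, diff2, 0.0) @ w   # weighted squared differences
        den = shared.astype(float) @ w           # total weight of shared columns
        usable = den > 0
        if not usable.any():
            if donors.size:                      # fall back to the donor mean
                out[i, j] = pool[donors, j].mean()
            continue
        donors, dist = donors[usable], np.sqrt(num[usable] / den[usable])
        nearest = np.argsort(dist)[:k]
        d = dist[nearest]
        scale = d.max() if d.max() > 0 else 1.0  # rescale distances to [0, 1]
        kw = np.exp(-0.5 * (d / scale) ** 2)
        out[i, j] = kw @ pool[donors[nearest], j] / kw.sum()
    return out


def multiple_impute(X, m=5, seed=0, **kwargs):
    """Return m completed copies of X, each imputed from a bootstrap donor pool."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    return [wnn_impute(X, pool=X[rng.integers(0, n, size=n)], **kwargs)
            for _ in range(m)]
```

Calling `multiple_impute(X, m=5)` on an array with `np.nan` entries returns five completed arrays; an analysis would typically be run on each and the results pooled. Raising `power` pushes the weights of weakly correlated predictors toward zero, which is one simple way to keep irrelevant predictors out of the distance when p is large.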

Published In

Information Sciences: an International Journal  Volume 570, Issue C
Sep 2021
849 pages

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Missing values
  2. Multiple imputation
  3. Sequential imputation
  4. Bootstrapping
  5. Kernel function

Qualifiers

  • Research-article

Cited By

  • (2024) Do We Really Need Imputation in AutoML Predictive Modeling? ACM Transactions on Knowledge Discovery from Data 18:6 (1-64). DOI: 10.1145/3643643. Online publication date: 12-Apr-2024
  • (2024) Two‐stage nonparametric framework for missing data imputation, uncertainty quantification, and incorporation in system identification. Computer-Aided Civil and Infrastructure Engineering 39:19 (2881-2902). DOI: 10.1111/mice.13237. Online publication date: 11-Sep-2024
  • (2024) Incomplete data evidential classification with inconsistent distribution. Information Sciences: an International Journal 676:C. DOI: 10.1016/j.ins.2024.120824. Online publication date: 1-Aug-2024
  • (2024) Generative adversarial networks for multi-fidelity matrix completion with massive missing entries. Information Fusion 111:C. DOI: 10.1016/j.inffus.2024.102541. Online publication date: 1-Nov-2024
  • (2024) An air quality forecasting method using fuzzy time series with butterfly optimization algorithm. Microsystem Technologies 30:5 (613-623). DOI: 10.1007/s00542-023-05591-x. Online publication date: 1-May-2024
  • (2023) A novel graph-based missing values imputation method for industrial lubricant data. Computers in Industry 150:C. DOI: 10.1016/j.compind.2023.103937. Online publication date: 26-Jul-2023
  • (2022) Nearest neighbor imputation for categorical data by weighting of attributes. Information Sciences: an International Journal 592:C (306-319). DOI: 10.1016/j.ins.2022.01.056. Online publication date: 1-May-2022
  • (2021) Imputation methods for high-dimensional mixed-type datasets by nearest neighbors. Computers in Biology and Medicine 135:C. DOI: 10.1016/j.compbiomed.2021.104577. Online publication date: 1-Aug-2021
