Abstract
Imputing missing values is a key operation in building data analysis models. In this paper, we target numerical and categorical values in tabular data. While previous studies have demonstrated the effectiveness of state-of-the-art methods, a major limitation is that these methods lack robustness: their performance varies significantly across datasets and missing rates, which imposes considerable overhead for selecting and tuning models in real-world scenarios. To tackle this problem, we propose the Column Attention Generative Adversarial Imputation Network (CAGAIN), an imputation model that employs a generative adversarial network (GAN) and the attention mechanism. The generator of CAGAIN mimics the distribution of the original data and generates imputed samples similar to real ones. The discriminator of CAGAIN distinguishes real samples from generated ones, thereby improving the quality of the imputed data. At the same time, the attention mechanism captures the correlations between attributes and focuses on the most significant attributes that determine the values at the missing positions. By inheriting the advantages of GANs and the attention mechanism, our model is robust to shifting datasets and missing rates, as demonstrated by experiments on 9 real datasets.
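To make the two ingredients named in the abstract more concrete, the following is a minimal NumPy sketch, not the CAGAIN architecture itself. It assumes a binary mask marking observed entries, uses randomly initialized (hypothetical) column embeddings in place of learned ones, and replaces the trained generator with a simple attention-weighted average of the observed values in each row. Adversarial training with a discriminator is omitted entirely. All function names here are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def column_attention(col_emb):
    # Scaled dot-product attention weights between columns.
    # col_emb has shape (n_cols, d); the result has shape (n_cols, n_cols)
    # and row j holds the weights column j assigns to every column.
    d = col_emb.shape[1]
    scores = col_emb @ col_emb.T / np.sqrt(d)
    return softmax(scores, axis=1)

def stand_in_generator(x_filled, mask, attn):
    # Hypothetical stand-in for a trained generator: for each cell (i, j),
    # average the observed values of row i, weighted by attn[j, :].
    num = (x_filled * mask) @ attn.T
    den = mask @ attn.T  # positive when a row has at least one observed value
    return num / den

def combine(x_filled, mask, generated):
    # Keep observed entries and take generated values only at missing positions.
    return mask * x_filled + (1.0 - mask) * generated

rng = np.random.default_rng(0)
x = np.array([[1.0, np.nan, 3.0],
              [np.nan, 5.0, 6.0]])
mask = (~np.isnan(x)).astype(float)   # 1 = observed, 0 = missing
x_filled = np.nan_to_num(x)           # NaN -> 0 so arithmetic stays finite

col_emb = rng.normal(size=(x.shape[1], 4))  # assumed embeddings, not learned
attn = column_attention(col_emb)
generated = stand_in_generator(x_filled, mask, attn)
imputed = combine(x_filled, mask, generated)
print(imputed)
```

In a trained model the attention weights and the generator output would both be learned, and a discriminator would try to tell observed cells from imputed ones. The mask-based combination step is the part most likely to carry over unchanged, since it guarantees that observed values are never overwritten.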
Acknowledgements
This work is mainly supported by NEC Corporation, and partially supported by JSPS Kakenhi 22H03903 and CREST JPMJCR22M2.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kawagoshi, J., Dong, Y., Nozawa, T., Xiao, C. (2023). CAGAIN: Column Attention Generative Adversarial Imputation Networks. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14147. Springer, Cham. https://doi.org/10.1007/978-3-031-39821-6_21
DOI: https://doi.org/10.1007/978-3-031-39821-6_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39820-9
Online ISBN: 978-3-031-39821-6
eBook Packages: Computer Science, Computer Science (R0)