research-article

"Deep" Learning for Missing Value Imputation in Tables with Non-Numerical Data

Authors: Felix Biessmann, Sebastian Schelter, Philipp Schmidt, Dustin Lange

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management

Pages 2017 - 2025

https://doi.org/10.1145/3269206.3272005

Published: 17 October 2018

Abstract

The success of applications that process data critically depends on the quality of the ingested data. Completeness of a data source is essential in many cases. Yet, most missing value imputation approaches suffer from severe limitations. They are almost exclusively restricted to numerical data, and they either offer only simple imputation methods or are difficult to scale and maintain in production. Here we present a robust and scalable approach to imputation that extends to tables with non-numerical values, including unstructured text data in diverse languages. Experiments on public data sets as well as data sets sampled from a large product catalog in different languages (English and Japanese) demonstrate that the proposed approach is both scalable and yields more accurate imputations than previous approaches. Training on data sets with several million rows is a matter of minutes on a single machine. With a median imputation F1 score of 0.93 across a broad selection of data sets, our approach achieves on average a 23-fold improvement compared to mode imputation. While our system allows users to apply state-of-the-art deep learning models if needed, we find that simple linear n-gram models often perform on par with deep learning methods at a much lower operational cost. The proposed method learns all parameters of the entire imputation pipeline automatically in an end-to-end fashion, rendering it attractive as a generic plugin both for engineers in charge of data pipelines where data completeness is relevant and for practitioners without expertise in machine learning who need to impute missing values in tables with non-numerical data.
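To make the contrast between mode imputation and a linear n-gram model concrete, the following is a minimal sketch and not the system described in the abstract. It assumes a toy table with one free-text column and one categorical column to impute, uses word n-grams, and replaces trained weights with simple co-occurrence shares. The function names (word_ngrams, fit_mode, fit_ngram_scorer, impute) and the example rows are hypothetical.

```python
from collections import Counter, defaultdict

def word_ngrams(text, n_max=2):
    """Return word n-grams (n = 1..n_max) of a lowercased, whitespace-split string."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_mode(labels):
    """Mode imputation baseline: the single most frequent observed label."""
    return Counter(labels).most_common(1)[0][0]

def fit_ngram_scorer(texts, labels):
    """Estimate weights[ngram][label] as the share of that n-gram's
    occurrences that were observed together with the label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for gram in word_ngrams(text):
            counts[gram][label] += 1
    return {gram: {lab: c / sum(by_label.values()) for lab, c in by_label.items()}
            for gram, by_label in counts.items()}

def impute(text, weights, fallback):
    """Linear scoring: sum per-label weights over the n-grams of the text.
    If no n-gram was seen during fitting, return the fallback label."""
    scores = Counter()
    for gram in word_ngrams(text):
        for label, w in weights.get(gram, {}).items():
            scores[label] += w
    if not scores:
        return fallback
    # Iterate labels in sorted order so ties are broken deterministically.
    return max(sorted(scores), key=scores.get)

# Hypothetical toy table: (product title, color); color is the column to impute.
observed = [
    ("red cotton t-shirt", "red"),
    ("blue denim jeans", "blue"),
    ("dark blue wool sweater", "blue"),
    ("bright red sneakers", "red"),
    ("blue running shoes", "blue"),
    ("black leather belt", "black"),
]
texts = [title for title, _ in observed]
labels = [color for _, color in observed]

mode = fit_mode(labels)
weights = fit_ngram_scorer(texts, labels)

# Rows whose color value is missing.
incomplete = ["red silk scarf", "black suede boots", "grey felt hat"]
imputed = [impute(title, weights, mode) for title in incomplete]
print(mode, imputed)
```

On this toy table the mode baseline would fill every missing color with "blue", whereas the n-gram scorer can use the text column and falls back to the mode only when no n-gram of the row was seen during fitting. A production system would learn the weights from data and evaluate with a held-out set, for example using the F1 score mentioned in the abstract.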

Index Terms

1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Incomplete data
    2. Information integration
      1. Data cleaning
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Incomplete, inconsistent, and uncertain databases

Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management

October 2018, 2362 pages

ISBN: 9781450360142

DOI: 10.1145/3269206

General Chair: Alfredo Cuzzocrea (University of Trieste, Italy)

Program Chairs: James Allan (University of Massachusetts, USA); Norman Paton (University of Manchester, United Kingdom); Divesh Srivastava (AT&T Labs Research, USA); Rakesh Agrawal (Data Insights Lab, USA); Andrei Broder (Google Research, USA); Mohammed Zaki (Rensselaer Polytechnic Institute, USA); Selcuk Candan (Arizona State University, USA); Alexandros Labrinidis (University of Pittsburgh, USA); Assaf Schuster (Technion, Israel); Haixun Wang (Google Research, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '18: The 27th ACM International Conference on Information and Knowledge Management

October 22 - 26, 2018

Torino, Italy

Acceptance Rates

CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
