DOI: 10.1145/3269206.3272005
research-article

"Deep" Learning for Missing Value Imputation in Tables with Non-Numerical Data

Published: 17 October 2018

Abstract

The success of applications that process data critically depends on the quality of the ingested data. Completeness of a data source is essential in many cases. Yet, most missing value imputation approaches suffer from severe limitations. They are almost exclusively restricted to numerical data, and they either offer only simple imputation methods or are difficult to scale and maintain in production. Here we present a robust and scalable approach to imputation that extends to tables with non-numerical values, including unstructured text data in diverse languages. Experiments on public data sets as well as data sets sampled from a large product catalog in different languages (English and Japanese) demonstrate that the proposed approach is both scalable and yields more accurate imputations than previous approaches. Training on data sets with several million rows is a matter of minutes on a single machine. With a median imputation F1 score of 0.93 across a broad selection of data sets our approach achieves on average a 23-fold improvement compared to mode imputation. While our system allows users to apply state-of-the-art deep learning models if needed, we find that often simple linear n-gram models perform on par with deep learning methods at a much lower operational cost. The proposed method learns all parameters of the entire imputation pipeline automatically in an end-to-end fashion, rendering it attractive as a generic plugin both for engineers in charge of data pipelines where data completeness is relevant, as well as for practitioners without expertise in machine learning who need to impute missing values in tables with non-numerical data.
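
The abstract's remark that simple linear n-gram models often match deep models can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy product table, the CRC32 feature hashing, the perceptron learner, and the macro-averaged F1 are all choices made here for brevity. It imputes a categorical column from a free-text column and sets up the mode-imputation baseline that the abstract compares against.

```python
from collections import Counter, defaultdict
import zlib


def char_ngrams(text, n_min=1, n_max=3):
    """All character n-grams of a lowercased string, shortest first."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def featurize(text, num_buckets=4096):
    """Hash n-grams into a sparse count vector {bucket index: count}."""
    counts = Counter()
    for gram in char_ngrams(text):
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def predict(weights, features):
    """Return the class with the highest linear score (ties: alphabetical)."""
    scores = {label: sum(w.get(i, 0.0) * v for i, v in features.items())
              for label, w in weights.items()}
    return max(sorted(scores), key=scores.get)


def train(texts, labels, epochs=20):
    """Fit a multiclass perceptron, i.e. one linear scorer per class."""
    weights = {label: defaultdict(float) for label in set(labels)}
    examples = [(featurize(t), y) for t, y in zip(texts, labels)]
    for _ in range(epochs):
        for features, y in examples:
            guess = predict(weights, features)
            if guess != y:  # update weights only on mistakes
                for i, v in features.items():
                    weights[y][i] += v
                    weights[guess][i] -= v
    return weights


def mode_impute(labels):
    """Baseline: always fill in the most frequent observed value."""
    return Counter(labels).most_common(1)[0][0]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (an assumed averaging choice)."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy table: a free-text column and a categorical column.
observed = [
    ("red cotton t-shirt", "red"),
    ("bright red running shoes", "red"),
    ("navy blue denim jeans", "blue"),
    ("light blue summer dress", "blue"),
    ("dark red wool scarf", "red"),
    ("blue leather wallet", "blue"),
    ("red ceramic mug", "red"),
]
texts, colors = zip(*observed)
model = train(texts, colors)

# Rows whose categorical value is missing and must be imputed.
missing = ["sky blue baseball cap", "deep red lipstick"]
imputed = [predict(model, featurize(t)) for t in missing]
baseline = [mode_impute(colors)] * len(missing)
print("n-gram model:", imputed)
print("mode baseline:", baseline)
```

Character n-grams need no tokenizer, which is one plausible reason such features carry over to languages like Japanese. Given held-out true values, `macro_f1` can score both the model's imputations and the mode baseline on the same rows.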


Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
October 2018
2362 pages
ISBN:9781450360142
DOI:10.1145/3269206
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data cleaning
  2. missing value imputation

Qualifiers

  • Research-article

Conference

CIKM '18
Acceptance Rates

CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 91
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 10 Oct 2024

Cited By

  • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
  • (2024) Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records. IEEE Transactions on Big Data 10:2 (122-131). DOI: 10.1109/TBDATA.2023.3328433. Online publication date: Apr-2024
  • (2024) Mitigating Data Sparsity in Integrated Data through Text Conceptualization. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3490-3504). DOI: 10.1109/ICDE60146.2024.00269. Online publication date: 13-May-2024
  • (2024) Prediction of Stress-Dependent Soil Water Retention Using Machine Learning. Geotechnical and Geological Engineering 42:5 (3939-3966). DOI: 10.1007/s10706-024-02767-8. Online publication date: 12-Mar-2024
  • (2024) Automized Quick Prediction of Skin Cancer Diagnosis by Enhanced Deep Convolutional Neural Network. Advances in Artificial Intelligence and Machine Learning in Big Data Processing (292-302). DOI: 10.1007/978-3-031-73065-8_24. Online publication date: 1-Oct-2024
  • (2023) A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality. Data 8:6 (102). DOI: 10.3390/data8060102. Online publication date: 4-Jun-2023
  • (2023) Deep Learning Imputation for Asymmetric and Incomplete Likert-Type Items. Journal of Educational and Behavioral Statistics 49:2 (241-267). DOI: 10.3102/10769986231176014. Online publication date: 13-Jun-2023
  • (2023) A data centric AI framework for automating exploratory data analysis and data quality tasks. Journal of Data and Information Quality. DOI: 10.1145/3603709. Online publication date: 26-Jun-2023
  • (2023) Explainable Data Imputation using Constraints. Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD) (128-132). DOI: 10.1145/3570991.3571009. Online publication date: 4-Jan-2023
  • (2023) The Many Facets of Data Equity. Journal of Data and Information Quality 14:4 (1-21). DOI: 10.1145/3533425. Online publication date: 7-Feb-2023
