DOI: 10.1145/3269206.3272005
research-article

"Deep" Learning for Missing Value Imputation in Tables with Non-Numerical Data

Published: 17 October 2018

Abstract

    The success of applications that process data critically depends on the quality of the ingested data. Completeness of a data source is essential in many cases. Yet, most missing value imputation approaches suffer from severe limitations. They are almost exclusively restricted to numerical data, and they either offer only simple imputation methods or are difficult to scale and maintain in production. Here we present a robust and scalable approach to imputation that extends to tables with non-numerical values, including unstructured text data in diverse languages. Experiments on public data sets as well as data sets sampled from a large product catalog in different languages (English and Japanese) demonstrate that the proposed approach is both scalable and yields more accurate imputations than previous approaches. Training on data sets with several million rows is a matter of minutes on a single machine. With a median imputation F1 score of 0.93 across a broad selection of data sets, our approach achieves on average a 23-fold improvement compared to mode imputation. While our system allows users to apply state-of-the-art deep learning models if needed, we find that often simple linear n-gram models perform on par with deep learning methods at a much lower operational cost. The proposed method learns all parameters of the entire imputation pipeline automatically in an end-to-end fashion, rendering it attractive as a generic plugin both for engineers in charge of data pipelines where data completeness is relevant and for practitioners without expertise in machine learning who need to impute missing values in tables with non-numerical data.
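
To make the contrast between mode imputation and an n-gram model concrete, here is a minimal sketch. It is not the authors' pipeline: as a stand-in for their linear n-gram models it uses multinomial naive Bayes (a log-linear classifier) over unhashed character n-grams, and the product rows, column meanings, and the names `char_ngrams`, `mode_impute`, and `NgramNaiveBayes` are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams (n_min..n_max) of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]


def mode_impute(labels):
    """Baseline: always predict the most frequent observed label."""
    return Counter(labels).most_common(1)[0][0]


class NgramNaiveBayes:
    """Toy stand-in for a linear n-gram imputer (assumption, not the paper's model)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.ngram_counts[label]
            # Laplace smoothing over the n-gram vocabulary seen in training.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for g in char_ngrams(text):
                if g in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical table: a free-text description column and a color column
# in which None marks a missing value to be imputed.
rows = [
    ("red cotton t-shirt", "red"),
    ("navy blue denim jeans", "blue"),
    ("light blue summer dress", "blue"),
    ("dark red leather wallet", "red"),
    ("blue wool scarf", "blue"),
    ("blue rain jacket", None),
    ("red baseball cap", None),
]

observed = [(t, y) for t, y in rows if y is not None]
model = NgramNaiveBayes().fit([t for t, _ in observed], [y for _, y in observed])
baseline = mode_impute([y for _, y in observed])

# Fill only the missing cells; observed values are left untouched.
imputed = [(t, y if y is not None else model.predict(t)) for t, y in rows]
for text, color in imputed:
    print(f"{text!r}: {color}")
print("mode baseline:", baseline)
```

On these toy rows, the mode baseline assigns the majority color to every missing cell, while the n-gram model uses the text column to pick a color per row. A production system along the lines described in the abstract would additionally hash n-grams into a fixed-size feature space and learn the weights end to end.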

    Published In

    CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
    October 2018
    2362 pages
    ISBN: 9781450360142
    DOI: 10.1145/3269206
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data cleaning
    2. missing value imputation

    Qualifiers

    • Research-article

    Conference

    CIKM '18

    Acceptance Rates

    CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
    • (2024) Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records. IEEE Transactions on Big Data 10:2 (122-131). DOI: 10.1109/TBDATA.2023.3328433. Online publication date: Apr-2024
    • (2024) Prediction of Stress-Dependent Soil Water Retention Using Machine Learning. Geotechnical and Geological Engineering 42:5 (3939-3966). DOI: 10.1007/s10706-024-02767-8. Online publication date: 12-Mar-2024
    • (2023) A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality. Data 8:6 (102). DOI: 10.3390/data8060102. Online publication date: 4-Jun-2023
    • (2023) Deep Learning Imputation for Asymmetric and Incomplete Likert-Type Items. Journal of Educational and Behavioral Statistics 49:2 (241-267). DOI: 10.3102/10769986231176014. Online publication date: 13-Jun-2023
    • (2023) A data centric AI framework for automating exploratory data analysis and data quality tasks. Journal of Data and Information Quality. DOI: 10.1145/3603709. Online publication date: 26-Jun-2023
    • (2023) Explainable Data Imputation using Constraints. Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD) (128-132). DOI: 10.1145/3570991.3571009. Online publication date: 4-Jan-2023
    • (2023) The Many Facets of Data Equity. Journal of Data and Information Quality 14:4 (1-21). DOI: 10.1145/3533425. Online publication date: 7-Feb-2023
    • (2023) Differentiable and Scalable Generative Adversarial Models for Data Imputation. IEEE Transactions on Knowledge and Data Engineering (1-13). DOI: 10.1109/TKDE.2023.3293129. Online publication date: 2023
    • (2023) Towards Enhanced Deep CNN For Early And Precise Skin Cancer Diagnosis. 2023 International Conference on Networking and Communications (ICNWC) (1-7). DOI: 10.1109/ICNWC57852.2023.10127521. Online publication date: 5-Apr-2023
