Open access

Certain and Approximately Certain Models for Statistical Learning

Published: 30 May 2024


Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead.
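As a rough illustration of what checking the necessity of imputation can look like, the sketch below tests one sufficient condition for ordinary least-squares regression. It is not the authors' algorithm. The function name `certain_linear_model`, the tolerance `tol`, and the use of NaN to mark missing entries are assumptions made for this example. The idea: fit the model on the fully observed columns only and give every incomplete column a weight of zero, so the residual does not depend on how the missing cells are filled. That model is optimal for every possible imputation if, for each incomplete column, the residual is zero on the rows where the column is missing and is orthogonal to the column's observed entries, because the least-squares gradient then vanishes no matter which values are imputed.

```python
import numpy as np


def certain_linear_model(X, y, tol=1e-8):
    """Return weights that are least-squares optimal for every imputation
    of the NaN cells in X, or None if this sufficient check fails.

    Hypothetical helper for illustration only; NaN marks a missing value.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(X)
    incomplete_cols = missing.any(axis=0)
    complete_cols = ~incomplete_cols

    # Fit on the fully observed columns; incomplete columns keep weight 0.
    w = np.zeros(X.shape[1])
    if complete_cols.any():
        w_c, *_ = np.linalg.lstsq(X[:, complete_cols], y, rcond=None)
        w[complete_cols] = w_c

    # With zero weights on incomplete columns, the residual is the same
    # for every way of filling in the missing cells.
    residual = X[:, complete_cols] @ w[complete_cols] - y

    for j in np.flatnonzero(incomplete_cols):
        rows_missing = missing[:, j]
        # An imputed cell multiplies the residual of its row in the
        # gradient, so that residual must be zero.
        if np.any(np.abs(residual[rows_missing]) > tol):
            return None
        # The observed part of the column must be orthogonal to the residual.
        observed = ~rows_missing
        if abs(X[observed, j] @ residual[observed]) > tol:
            return None
    return w


if __name__ == "__main__":
    X = np.array([
        [1.0, 0.0, np.nan],
        [0.0, 1.0, 5.0],
        [1.0, 1.0, np.nan],
        [2.0, 1.0, 1.0],
    ])
    # Labels that depend only on the two complete columns.
    y = np.array([2.0, 3.0, 5.0, 7.0])
    print(certain_linear_model(X, y))
```

When the function returns None, the check is inconclusive for this particular candidate model. The absolute tolerance is a simplification; data on very different scales would need a relative one.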





Published In

Proceedings of the ACM on Management of Data  Volume 2, Issue 3
June 2024
1953 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Association for Computing Machinery

New York, NY, United States

Publication History

Published in PACMMOD Volume 2, Issue 3



Author Tags

  1. data preparation
  2. data quality
  3. uncertainty quantification


  • Research-article

Funding Sources

  • NSF grant and the Industry-University Cooperative Research Center on Pervasive Personalized Intelligence

