research-article
Open access

Certain and Approximately Certain Models for Statistical Learning

Published: 30 May 2024

Abstract

Real-world data is often incomplete, with missing values. To train accurate models over real-world datasets, users need to spend substantial time and resources imputing proper values for missing data items. In this paper, we demonstrate that, for certain training data and target models, it is possible to learn accurate models directly from data with missing values. We propose a unified approach for checking whether data imputation is necessary to learn accurate models across various widely used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and to return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that the proposed algorithms significantly reduce the time and effort needed for data imputation without imposing considerable computational overhead.
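
To make the idea of checking whether imputation is necessary concrete, here is a minimal sketch for least-squares linear regression. It reads "certain model" as a model that is optimal for every possible way of filling in the missing values; this reading is an assumption, since the abstract does not define the term. The function name `is_certain_linear_model`, the use of `None` for missing cells, and the toy data are all hypothetical. The check is a sufficient condition derived from first-order optimality of the convex squared loss, and it is not claimed to be the algorithm proposed in the paper.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing value (assumed encoding)


def is_certain_linear_model(X: Sequence[Row], y: Sequence[float],
                            w: Sequence[float], tol: float = 1e-9) -> bool:
    """Sufficient check that w minimizes squared loss for every imputation."""
    residuals = []
    for row, target in zip(X, y):
        pred = 0.0
        for x_ij, w_j in zip(row, w):
            if x_ij is None:
                # A nonzero weight on a missing cell makes the prediction
                # depend on the imputed value, so the check fails.
                if abs(w_j) > tol:
                    return False
                continue
            pred += w_j * x_ij
        e = target - pred
        # A row with a missing cell must be fitted exactly; otherwise the
        # gradient term e * x_ij changes with the imputed value.
        if any(v is None for v in row) and abs(e) > tol:
            return False
        residuals.append(e)
    # The gradient over observed cells must vanish for every feature.
    # Rows with missing cells have zero residual and contribute nothing.
    for j in range(len(w)):
        grad_j = sum(e * row[j] for row, e in zip(X, residuals)
                     if row[j] is not None)
        if abs(grad_j) > tol:
            return False
    return True


# Toy data: the second feature is missing in the last row.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [2.0, None]]
y = [3.0, -3.0, 1.0, -1.0, 4.0]
w = [2.0, 0.0]

certain = is_certain_linear_model(X, y, w)
# Changing the last target leaves a nonzero residual on the incomplete row.
not_certain = is_certain_linear_model(X, [3.0, -3.0, 1.0, -1.0, 5.0], w)
```

When the check succeeds, the squared-loss gradient at `w` is zero under every imputation, so `w` is a global minimizer in each case and no imputation is needed to obtain it. When it fails, this sketch is inconclusive, because the condition is only sufficient.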

Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 3
SIGMOD
June 2024
1953 pages
EISSN: 2836-6573
DOI: 10.1145/3670010
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data preparation
  2. data quality
  3. uncertainty quantification

Qualifiers

  • Research-article

Funding Sources

  • NSF grant and the Industry-University Cooperative Research Center on Pervasive Personalized Intelligence
