research-article

Open access

Certain and Approximately Certain Models for Statistical Learning

Authors:

Arash Termehchy,

Amandeep Singh Chabada

Proceedings of the ACM on Management of Data, Volume 2, Issue 3

Article No.: 126, Pages 1 - 25

https://doi.org/10.1145/3654929

Published: 30 May 2024

Abstract

Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely-used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead.
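
The idea of checking whether imputation is necessary can be illustrated with a small brute-force sketch. It is not the paper's algorithm: the abstract describes efficient checks with theoretical guarantees, whereas this sketch enumerates a finite set of candidate fill-in values and refits the model each time. It assumes that "imputation is unnecessary" means the same model is optimal under every way of filling in the missing cells, and it uses two-feature least squares without an intercept. The names `MISSING`, `fit_least_squares_2d`, `repairs`, and `models_agree`, as well as the toy data, are hypothetical.

```python
from itertools import product

MISSING = None  # hypothetical marker for a missing cell


def fit_least_squares_2d(rows, targets):
    """Closed-form least squares for two features, no intercept."""
    a = sum(x1 * x1 for x1, _ in rows)
    b = sum(x1 * x2 for x1, x2 in rows)
    d = sum(x2 * x2 for _, x2 in rows)
    p = sum(x1 * y for (x1, _), y in zip(rows, targets))
    q = sum(x2 * y for (_, x2), y in zip(rows, targets))
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("normal equations are singular for this repair")
    # Solve [[a, b], [b, d]] w = [p, q] by Cramer's rule.
    return ((d * p - b * q) / det, (a * q - b * p) / det)


def repairs(rows, candidates):
    """Yield every completion of rows, filling MISSING cells from candidates."""
    holes = [(i, j) for i, row in enumerate(rows)
             for j, value in enumerate(row) if value is MISSING]
    for fill in product(candidates, repeat=len(holes)):
        completed = [list(row) for row in rows]
        for (i, j), value in zip(holes, fill):
            completed[i][j] = value
        yield [tuple(row) for row in completed]


def models_agree(rows, targets, candidates, tol=1e-9):
    """Return (agree, model): whether all sampled repairs share one optimum."""
    models = [fit_least_squares_2d(r, targets) for r in repairs(rows, candidates)]
    first = models[0]
    agree = all(abs(w[0] - first[0]) <= tol and abs(w[1] - first[1]) <= tol
                for w in models)
    return agree, first


# Toy data (assumed): the target depends only on the first feature,
# and the second feature has one missing value.
rows = [(1, 1), (2, MISSING), (3, -1)]
targets = [2, 4, 6]
agree, model = models_agree(rows, targets, candidates=[-5, 0, 5])
```

For this toy data every sampled fill-in value yields the weights (2, 0), so the missing cell does not affect the learned model. Agreement over a finite candidate set is only evidence, not a proof over all possible values; disagreement, on the other hand, does show that no single model is optimal for every repair, provided each fit is unique.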

Index Terms

Certain and Approximately Certain Models for Statistical Learning
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
2. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 3

SIGMOD

June 2024

1953 pages

EISSN: 2836-6573

DOI: 10.1145/3670010

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF grant and the Industry-University Cooperative Research Center on Pervasive Personalized Intelligence
