Imputing Various Incomplete Attributes via Distance Likelihood Maximization

Authors:

Yu Sun

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 535 - 545

https://doi.org/10.1145/3394486.3403096

Published: 20 August 2020

Abstract

Missing values may appear in various attributes. By "various", we mean (1) different types of values in a tuple, such as numerical or categorical, and (2) different attributes in a tuple, either the dependent or determinant attributes of regression models or dependency rules. Such variety unfortunately prevents imputation from being performed. In this paper, we propose to study distance models that predict distances between tuples for missing data imputation. The immediate benefits are twofold: (1) the distances on all attributes, with their various types of values, are processed uniformly and utilized collaboratively, and (2) rather than enumerating the combinations of imputation candidates on various attributes, we can directly calculate the most likely distances of missing values to other complete ones and thus infer the corresponding imputations. Our major technical highlights include (1) introducing an imputation that is statistically explainable by the likelihood on distances, (2) proving the NP-hardness of finding the maximum likelihood imputation, and (3) devising an approximation algorithm with performance guarantees. Experiments over datasets with real missing values demonstrate the superiority of the proposed method compared to 11 existing approaches in 5 categories. Our proposal improves not only the imputation accuracy but also downstream applications such as classification, clustering and record matching.
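
The idea of inferring an imputation from distances, rather than from the values directly, can be illustrated with a toy sketch. This is not the paper's algorithm: the function names are hypothetical, the predicted distances are assumed to be given by some distance model, the candidates are restricted to values already observed in the complete tuples, and the likelihood is assumed to be Gaussian with equal variance, so that maximizing it amounts to minimizing squared error. The same routine then handles a numerical and a categorical attribute.

```python
from typing import Sequence, Union

Value = Union[float, str]


def attr_distance(a: Value, b: Value) -> float:
    """Absolute difference for numbers, 0/1 mismatch for any other type."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0


def impute_by_distance(neighbor_values: Sequence[Value],
                       predicted_distances: Sequence[float]) -> Value:
    """Pick the observed value whose distances to the complete neighbors
    best match the predicted distances (least squared error).

    Assumption: under equal-variance Gaussian noise on the distances,
    the least-squares candidate is also the maximum likelihood one.
    """
    if not neighbor_values or len(neighbor_values) != len(predicted_distances):
        raise ValueError("need one predicted distance per neighbor value")

    def squared_error(candidate: Value) -> float:
        return sum((attr_distance(candidate, v) - d) ** 2
                   for v, d in zip(neighbor_values, predicted_distances))

    # Distinct observed values, in first-seen order, serve as candidates.
    candidates = list(dict.fromkeys(neighbor_values))
    return min(candidates, key=squared_error)


# Numerical attribute: the predicted distances point at the middle value.
numeric_guess = impute_by_distance([10.0, 20.0, 30.0], [10.0, 0.0, 10.0])

# Categorical attribute: distance 0 to both "red" neighbors, 1 to "blue".
categorical_guess = impute_by_distance(["red", "blue", "red"], [0.0, 1.0, 0.0])
```

Because the candidate is scored only through its distances, no type-specific model is needed per attribute, which is the uniformity the abstract points to. Searching only over observed values is a simplification made for this sketch.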

Supplementary Material

MP4 File (3394486.3403096.mp4)

Presentation Video.


Cited By

Sun Y, Zhu J, Xu X, Xu X, Sun Y, Song S, Li X, Yuan X (2024). Win-Win: On Simultaneous Clustering and Imputing over Incomplete Data. Proceedings of the VLDB Endowment 17:11 (3045-3057). Online publication date: 30-Aug-2024.
https://doi.org/10.14778/3681954.3681982
Mei G, Guo Z, Pan L, Li Q, Li F, Liu S (2024). LIHAN: A Lattice-Guided Incomplete Heterogeneous Information Network Embedding Model for Node Classification. IEEE Transactions on Computational Social Systems 11:6 (7411-7420). Online publication date: Dec-2024.
https://doi.org/10.1109/TCSS.2024.3405569
Zhu J, Zhao X, Sun Y, Song S, Yuan X (2024). Relational Data Cleaning Meets Artificial Intelligence: A Survey. Data Science and Engineering. Online publication date: 20-Dec-2024.
https://doi.org/10.1007/s41019-024-00266-7

Index Terms

Imputing Various Incomplete Attributes via Distance Likelihood Maximization
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

August 2020

3664 pages

ISBN: 9781450379984

DOI: 10.1145/3394486

General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)

Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)

Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
National Key Research and Development Plan

Conference

KDD '20

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 6 - 10, 2020

CA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Citations

Cited By

Pan S, Chen S (2023). Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health. International Journal of Environmental Research and Public Health 20:2 (1524). Online publication date: 14-Jan-2023.
https://doi.org/10.3390/ijerph20021524
Fang C, Mei Y, Song S (2023). Matrix Factorization with Landmarks for Spatial Data. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1887-1899). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00147
Jiang T, Huang X, Song S, Wang C, Wang J, Li R, Sun J (2023). Non-Blocking Raft for High Throughput IoT Data. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1140-1152). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00092
Bou S, Amagasa T, Kitagawa H, Shaikh S, Matono A (2023). Efficient Missing Value Imputation by Maximum Distance Likelihood. 2023 IEEE International Conference on Big Data (BigData) (331-338). Online publication date: 15-Dec-2023.
https://doi.org/10.1109/BigData59044.2023.10386584
Batra S, Khurana R, Khan M, Boulila W, Koubaa A, Srivastava P (2022). A Pragmatic Ensemble Strategy for Missing Values Imputation in Health Records. Entropy 24:4 (533). Online publication date: 10-Apr-2022.
https://doi.org/10.3390/e24040533
Cubillos M, Wøhlk S, Wulff J (2022). A bi-objective k-nearest-neighbors-based imputation method for multilevel data. Expert Systems with Applications: An International Journal 204:C. Online publication date: 15-Oct-2022.
https://dl.acm.org/doi/10.1016/j.eswa.2022.117298
Bou S, Amagasa T, Kitagawa H, Shaikh S, Matono A (2022). PR-MVI: Efficient Missing Value Imputation over Data Streams by Distance Likelihood. Information Integration and Web Intelligence (338-351). Online publication date: 28-Nov-2022.
https://dl.acm.org/doi/10.1007/978-3-031-21047-1_28
