DOI: 10.1145/3394486.3403096
research-article

Imputing Various Incomplete Attributes via Distance Likelihood Maximization

Published: 20 August 2020

Abstract

Missing values may appear in various attributes. By "various", we mean (1) different types of values in a tuple, such as numerical or categorical, and (2) different attributes in a tuple, whether the dependent or the determinant attributes of regression models or dependency rules. Such variety unfortunately hinders imputation. In this paper, we study distance models that predict the distances between tuples for missing data imputation. The immediate benefits are twofold: (1) the distances on all attributes, whatever their value types, are processed uniformly and used collaboratively, and (2) rather than enumerating the combinations of imputation candidates over various attributes, we can directly calculate the most likely distances from missing values to complete ones and thus infer the corresponding imputations. Our major technical highlights include (1) introducing an imputation that is statistically explainable by the likelihood on distances, (2) proving the NP-hardness of finding the maximum likelihood imputation, and (3) devising an approximation algorithm with performance guarantees. Experiments over datasets with real missing values demonstrate the superiority of the proposed method over 11 existing approaches in 5 categories. Our proposal improves not only imputation accuracy but also downstream applications such as classification, clustering, and record matching.
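
To make the idea of imputing through distances concrete, the toy sketch below may help. It is not the paper's model or its approximation algorithm; every name in it (`attr_distance`, `predicted_distance`, `impute`, `rows`) is hypothetical. It assumes numeric attributes are scaled to [0, 1], uses a 0/1 mismatch distance for categorical values, takes the mean distance on the observed attributes as a stand-in for a learned distance model, and restricts candidates to values seen in complete tuples. Under an assumed Gaussian noise on distances with equal variance, maximizing the likelihood reduces to minimizing the squared error between candidate and predicted distances.

```python
from typing import Any, Sequence

Row = Sequence[Any]


def attr_distance(a: Any, b: Any) -> float:
    """Distance on one attribute: absolute difference for numbers,
    0/1 mismatch for anything else (e.g. categorical values)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0


def predicted_distance(incomplete: Row, other: Row, missing_idx: int) -> float:
    """Assumed stand-in for a distance model: predict the distance on the
    missing attribute as the mean distance on the observed attributes."""
    observed = [
        attr_distance(a, b)
        for i, (a, b) in enumerate(zip(incomplete, other))
        if i != missing_idx
    ]
    return sum(observed) / len(observed)


def impute(incomplete: Row, complete_rows: Sequence[Row], missing_idx: int) -> Any:
    """Pick the candidate whose distances to the complete tuples best match
    the predicted distances (least squares, i.e. maximum likelihood under
    the equal-variance Gaussian assumption stated above)."""
    predicted = [predicted_distance(incomplete, r, missing_idx) for r in complete_rows]
    candidates = [r[missing_idx] for r in complete_rows]

    def squared_error(candidate: Any) -> float:
        return sum(
            (attr_distance(candidate, r[missing_idx]) - p) ** 2
            for r, p in zip(complete_rows, predicted)
        )

    return min(candidates, key=squared_error)


# Toy data: (numeric, categorical, numeric), numeric values already in [0, 1].
rows = [
    (0.1, "red", 0.2),
    (0.2, "red", 0.1),
    (0.9, "blue", 0.8),
    (0.8, "blue", 0.9),
]
categorical_fill = impute((0.15, None, 0.15), rows, missing_idx=1)
numeric_fill = impute((None, "red", 0.15), rows, missing_idx=0)
```

On this toy data the same routine fills the categorical gap with "red" and the numeric gap with 0.1, which illustrates how working on distances lets one procedure handle both value types.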

Supplementary Material

MP4 File (3394486.3403096.mp4)
Presentation Video.

    Published In

    KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    August 2020
    3664 pages
    ISBN:9781450379984
    DOI:10.1145/3394486

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data imputation
    2. distance likelihood
    3. incomplete data

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key Research and Development Plan

    Conference

    KDD '20

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
