DOI: 10.1145/2463676.2463706
research-article

Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes

Published: 22 June 2013

Abstract

Various computational procedures and constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations, including limited scalability and the quality of the values used to replace the errors. In this paper, we propose a new data repairing approach based on maximizing the likelihood of the replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach that combines machine learning and likelihood methods to clean dirty databases by value modification. We develop a quality measure for repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic, scalable framework that follows this approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Because of data partitioning, several updates can be predicted for a single record, each based on the local view of one data partition. We therefore propose a mechanism that combines the local predictions into accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison with recent data cleaning approaches.
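
The idea of trading likelihood benefit against a bounded amount of change can be illustrated with a small sketch. This is not the SCARE algorithm itself: the abstract does not give its model or its quality measure, so the sketch assumes a smoothed frequency table as a stand-in for the learned data distribution, a single `context` column that is trusted, a single `target` column that may be dirty, and a simple cap (`budget`) on the number of changed cells. All function names, column names, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_conditional(rows, context, target, alpha=1.0):
    """Estimate P(target | context) from counts with add-alpha smoothing.

    Assumption: a frequency table stands in for the learned model.
    """
    counts = defaultdict(Counter)
    values = set()
    for r in rows:
        counts[r[context]][r[target]] += 1
        values.add(r[target])
    values = sorted(values)

    def prob(ctx, val):
        c = counts[ctx]
        return (c[val] + alpha) / (sum(c.values()) + alpha * len(values))

    return prob, values


def propose_repairs(rows, context, target, budget):
    """Return at most `budget` updates, ranked by likelihood benefit.

    The benefit of an update is the gain in log-likelihood obtained by
    replacing the current target value with the most likely one.
    """
    prob, values = fit_conditional(rows, context, target)
    candidates = []
    for i, r in enumerate(rows):
        best = max(values, key=lambda v: prob(r[context], v))
        benefit = math.log(prob(r[context], best)) - math.log(
            prob(r[context], r[target])
        )
        if best != r[target] and benefit > 0:
            candidates.append((benefit, i, best))
    # Keep the highest-benefit updates while respecting the change budget.
    candidates.sort(reverse=True)
    return [(i, best) for _, i, best in candidates[:budget]]


# Toy data (hypothetical): two rows disagree with the majority for their zip.
rows = (
    [{"zip": "47906", "city": "West Lafayette"}] * 5
    + [{"zip": "47906", "city": "Lafayette"}]
    + [{"zip": "10001", "city": "New York"}] * 4
    + [{"zip": "10001", "city": "Newark"}]
)
repairs = propose_repairs(rows, "zip", "city", budget=1)
print(repairs)
```

With a budget of one change, only the update with the largest log-likelihood gain is kept; raising the budget admits the next-best update as well. A partitioned variant would fit one such model per horizontal partition and then merge the per-partition proposals, which is the step the abstract describes as combining local predictions.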

    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN: 9781450320375
    DOI: 10.1145/2463676

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data cleaning
    2. inconsistent data

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
