Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes

Authors: Mohamed Yakout, Laure Berti-Équille, Ahmed K. Elmagarmid

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 553 - 564

https://doi.org/10.1145/2463676.2463706

Published: 22 June 2013

Abstract

Various computational procedures and constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations, including scalability and the quality of the values used to replace the errors. In this paper, we propose a new data repairing approach based on maximizing the likelihood of the replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach that combines machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure for the repairing updates based on the likelihood benefit and the amount of change applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic, scalable framework that follows this approach. SCARE relies on a robust mechanism for horizontal data partitioning and on a combination of machine learning techniques to predict the set of possible updates. Because of data partitioning, several updates can be predicted for a single record, each based on the local view of one data partition. We therefore propose a mechanism that combines the local predictions into accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison with recent data cleaning approaches.
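To make the idea in the abstract concrete, the following is a minimal illustrative sketch, not the SCARE algorithm itself. It assumes a toy table of `(zip, city)` pairs in which `zip` is trusted and `city` may be dirty, uses simple smoothed frequency counts as the "learner" for each horizontal partition, averages the local predictions, and applies only the updates with the largest likelihood gain up to a change budget. The names `local_model`, `prob`, `repair`, `num_partitions`, and `budget` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def local_model(rows):
    """Count city values per zip code inside one partition."""
    counts = defaultdict(Counter)
    for zip_code, city in rows:
        counts[zip_code][city] += 1
    return counts


def prob(counts, zip_code, city, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(city | zip) from one local model."""
    c = counts.get(zip_code, Counter())
    return (c[city] + alpha) / (sum(c.values()) + alpha * vocab_size)


def repair(rows, num_partitions=2, budget=1):
    """Return a copy of rows with at most `budget` city values replaced."""
    cities = sorted({city for _, city in rows})
    # Horizontal partitioning: each partition holds a subset of the rows.
    parts = [rows[i::num_partitions] for i in range(num_partitions)]
    models = [local_model(p) for p in parts]

    def combined(zip_code, city):
        # Combine the local predictions by averaging their probabilities
        # (an assumption made for this sketch).
        total = sum(prob(m, zip_code, city, len(cities)) for m in models)
        return total / len(models)

    candidates = []
    for idx, (zip_code, city) in enumerate(rows):
        best = max(cities, key=lambda c: combined(zip_code, c))
        # Likelihood benefit of replacing the current value with the best one.
        gain = math.log(combined(zip_code, best)) - math.log(combined(zip_code, city))
        if best != city and gain > 0:
            candidates.append((gain, idx, best))

    # Bounded changes: keep only the `budget` most beneficial updates.
    candidates.sort(reverse=True)
    repaired = list(rows)
    for _, idx, best in candidates[:budget]:
        repaired[idx] = (rows[idx][0], best)
    return repaired


rows = (
    [("47906", "West Lafayette")] * 4
    + [("47906", "Chicago")]
    + [("60601", "Chicago")] * 4
)
print(repair(rows, num_partitions=2, budget=1)[4])
```

In this toy input the fifth row disagrees with the other rows sharing its zip code, so it is the only candidate update. Setting `budget=0` leaves the table unchanged, which shows how the bound on changes limits the repair regardless of the predicted gains.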

Index Terms

Information systems → Information systems applications → Data mining

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 83
Total Downloads: 1,298
Downloads (Last 12 months): 84
Downloads (Last 6 weeks): 4

Reflects downloads up to 30 Aug 2024

Cited By

J. Qin, S. Huang, Y. Wang, J. Zhu, Y. Zhang, Y. Miao, R. Mao, M. Onizuka, and C. Xiao. BClean: A Bayesian Data Cleaning System. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 3407-3420, 2024. Online publication date: 13-May-2024. https://doi.org/10.1109/ICDE60146.2024.00263

M. Ravikanth, S. Korra, G. Mamidisetti, M. Goutham, and T. Bhaskar. An efficient learning based approach for automatic record deduplication with benchmark datasets. Scientific Reports, 14:1, 2024. Online publication date: 15-Jul-2024. https://doi.org/10.1038/s41598-024-63242-1

K. Goyle, Q. Xie, and V. Goyle. DataAssist: A Machine Learning Approach to Data Cleaning and Preparation. In Intelligent Systems and Applications, pages 476-486, 2024. Online publication date: 31-Jul-2024. https://doi.org/10.1007/978-3-031-66431-1_33

S. Siddiqi, R. Kern, and M. Boehm. SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications. Proceedings of the ACM on Management of Data, 1:3, pages 1-26, 2023. Online publication date: 13-Nov-2023. https://dl.acm.org/doi/10.1145/3617338

H. Patel, S. Guttula, N. Gupta, S. Hans, R. Mittal, and L. N. A Data-centric AI Framework for Automating Exploratory Data Analysis and Data Quality Tasks. Journal of Data and Information Quality, 15:4, pages 1-26, 2023. Online publication date: 1-Nov-2023. https://dl.acm.org/doi/10.1145/3603709

A. Narechania, F. Du, A. Sinha, R. Rossi, J. Hoffswell, S. Guo, E. Koh, S. Navathe, and A. Endert. DataPilot: Utilizing Quality and Usage Information for Subset Selection during Visual Data Preparation. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1-18, 2023. Online publication date: 19-Apr-2023. https://dl.acm.org/doi/10.1145/3544548.3581509

A. Pointner and M. Harrer. A Rule Based Data Cleansing Pipeline for Automated Data Import in the Context of Social Clubs. In 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), pages 1-6, 2023. Online publication date: 19-Jul-2023. https://doi.org/10.1109/ICECCME57830.2023.10253136

V. Singh, K. Sharma, and S. Sur. A survey on preprocessing and classification techniques for acoustic scene. Expert Systems with Applications: An International Journal, 229:PA, 2023. Online publication date: 13-Jul-2023. https://dl.acm.org/doi/10.1016/j.eswa.2023.120520

H. Wang, A. Zhang, S. Song, and J. Wang. Streaming data cleaning based on speed change. The VLDB Journal, 33:1, pages 1-24, 2023. Online publication date: 3-May-2023. https://doi.org/10.1007/s00778-023-00796-y

X. Ding, H. Wang, G. Li, H. Li, Y. Li, and Y. Liu. IoT data cleaning techniques: A survey. Intelligent and Converged Networks, 3:4, pages 325-339, 2022. Online publication date: Dec-2022. https://doi.org/10.23919/ICN.2022.0026
