research-article

Unifying Data and Constraint Repairs

Published: 17 August 2016

Abstract

Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data by finding minimal or lowest cost changes to the data that make it consistent with the constraints. Such techniques are appropriate for applications where only the data changes, but schemas and their constraints remain fixed. In many modern applications, however, constraints may evolve over time as application or business rules change, as data are integrated with new data sources or as the underlying semantics of the data evolves. In such settings, when an inconsistency occurs, it is no longer clear if there is an error in the data (and the data should be repaired) or if the constraints have evolved (and the constraints should be repaired). In this work, we present a novel unified cost model that allows data and constraint repairs to be compared on an equal footing. We consider repairs over a database that is inconsistent with respect to a set of rules, modeled as functional dependencies (FDs). FDs are the most common type of constraint and are known to play an important role in maintaining data quality. We propose modifications to the data and to the FDs such that the data and the constraints are better aligned. We evaluate the quality and scalability of our repair algorithms over synthetic and real datasets. The results show that our repair algorithms not only scale well for large datasets but also are able to accurately capture and correct inconsistencies and accurately decide when a data repair versus a constraint repair is best.
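
To make the choice between a data repair and a constraint repair concrete, here is a minimal Python sketch. It is not the paper's cost model or repair algorithm; the toy relation, the attribute names (zip, city, country), and the helper fd_violations are all hypothetical. It checks an FD X → Y by grouping tuples on X and flagging any group with more than one distinct Y value. It then shows one possible constraint modification (adding an attribute to the FD's left-hand side) and one possible data modification (changing a single cell).

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Hypothetical helper for illustration only: keys are tuples of the
    `lhs` values, values are the conflicting rows in that group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    return {
        key: grp
        for key, grp in groups.items()
        if len({tuple(r[a] for a in rhs) for r in grp}) > 1
    }


# Hypothetical toy relation (not taken from the paper's datasets).
rows = [
    {"zip": "10001", "city": "New York", "country": "US"},
    {"zip": "10001", "city": "New York", "country": "US"},
    {"zip": "10001", "city": "NYC", "country": "US"},
    {"zip": "75001", "city": "Paris", "country": "FR"},
    {"zip": "75001", "city": "Addison", "country": "US"},
]

# Original FD: zip -> city. Both zip groups contain conflicting cities.
original = fd_violations(rows, ["zip"], ["city"])

# Constraint repair (assumed form): extend the left-hand side to
# country, zip -> city. The 75001 conflict disappears because the two
# tuples belong to different countries; the 10001 conflict remains.
extended = fd_violations(rows, ["country", "zip"], ["city"])

# Data repair: overwrite the minority value "NYC" with "New York".
repaired_rows = [dict(r) for r in rows]
repaired_rows[2]["city"] = "New York"
after_both = fd_violations(repaired_rows, ["country", "zip"], ["city"])

print(sorted(original))    # [('10001',), ('75001',)]
print(sorted(extended))    # [('US', '10001')]
print(sorted(after_both))  # []
```

In this toy case the 75001 conflict looks like a constraint that no longer matches the data, while the 10001 conflict looks like a data error. A unified cost model, as the abstract describes, is what lets the two kinds of repair be weighed against each other rather than chosen by hand.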

    Published In

    Journal of Data and Information Quality  Volume 7, Issue 3
    Research Paper, Challenge Papers and Experience Paper
    September 2016
    62 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/2988525

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 August 2016
    Accepted: 01 January 2016
    Revised: 01 December 2015
    Received: 01 April 2015
    Published in JDIQ Volume 7, Issue 3

    Author Tags

    1. Data quality
    2. constraint repair
    3. data repair

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Natural Sciences and Engineering Research Council of Canada
