research-article

Pattern functional dependencies for data cleaning

Published: 01 January 2020

Abstract

    Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like "1980-" might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as "F-9-107", "F" is sufficient to determine the finance department.
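
    The two examples in this paragraph can be made concrete with a minimal Python sketch. It is an illustration only, not the formalism or code of the paper: the ID shape `[A-Z]-\d-\d{3}`, the function names, and the sample rows are all assumptions introduced here.

```python
import re
from collections import defaultdict

# Single-column pattern from the abstract: a Year value is exactly four digits.
YEAR_PATTERN = re.compile(r"\d{4}")

def is_valid_year(value: str) -> bool:
    """Return True when the whole value consists of four digits."""
    return YEAR_PATTERN.fullmatch(value) is not None

# Assumed shape for IDs like "F-9-107" (hypothetical; the abstract gives only
# one example). The capture group isolates the leading letter, i.e. the
# partial attribute value that is supposed to determine the department.
ID_PATTERN = re.compile(r"([A-Z])-\d-\d{3}")

def prefix_violations(rows):
    """Map each ID prefix to its departments and keep prefixes with more than one."""
    departments = defaultdict(set)
    for emp_id, dept in rows:
        match = ID_PATTERN.fullmatch(emp_id)
        if match:
            departments[match.group(1)].add(dept)
    return {prefix: depts for prefix, depts in departments.items() if len(depts) > 1}

# Made-up sample data: the third row breaks "prefix F determines Finance".
sample_rows = [
    ("F-9-107", "Finance"),
    ("F-2-315", "Finance"),
    ("F-4-220", "Marketing"),
    ("E-1-001", "Engineering"),
]
violations = prefix_violations(sample_rows)
```

    Here `is_valid_year("1980-")` is false, and `violations` holds a single entry for prefix "F". A classical functional dependency over the whole ID column would not flag this, because every ID value is distinct.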
    Inspired by the above observation, we propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong's axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types of ICs.
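
    The abstract does not describe the discovery algorithm, so the following Python sketch should not be read as the authors' method. It only illustrates the general idea of accepting a prefix-based dependency when it holds for most rows and then flagging the rows that disagree. The fixed prefix length, the `min_support` threshold, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def prefix_dependency_support(rows, prefix_len=1):
    """For each left-column prefix, return (majority right value, support ratio)."""
    counts = defaultdict(Counter)
    for left, right in rows:
        counts[left[:prefix_len]][right] += 1
    result = {}
    for prefix, counter in counts.items():
        value, hits = counter.most_common(1)[0]
        result[prefix] = (value, hits / sum(counter.values()))
    return result

def suspected_errors(rows, min_support=0.75, prefix_len=1):
    """Flag rows that contradict a prefix rule holding on at least min_support of its rows."""
    support = prefix_dependency_support(rows, prefix_len)
    rules = {p: v for p, (v, ratio) in support.items() if ratio >= min_support}
    return [
        (left, right)
        for left, right in rows
        if left[:prefix_len] in rules and rules[left[:prefix_len]] != right
    ]

# Made-up dirty data: four of five "F" rows agree on Finance.
dirty_rows = [
    ("F-9-107", "Finance"),
    ("F-2-315", "Finance"),
    ("F-3-118", "Finance"),
    ("F-7-402", "Finance"),
    ("F-4-220", "Marketing"),
    ("E-1-001", "Engineering"),
    ("E-5-930", "Engineering"),
]
flagged = suspected_errors(dirty_rows)
```

    With these rows, the rule for prefix "F" has support 0.8, which clears the assumed threshold of 0.75, so the single Marketing row is flagged. Raising `min_support` above 0.8 would reject the rule and flag nothing, which shows how the tolerance trades missed rules against false alarms.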


      Published In

      Proceedings of the VLDB Endowment, Volume 13, Issue 5
      January 2020
      195 pages
      ISSN: 2150-8097

      Publisher

      VLDB Endowment


