research-article

Pattern functional dependencies for data cleaning

Authors:

Abdulhakim Qahtan,

Mourad Ouzzani,

Michael Stonebraker

Proceedings of the VLDB Endowment, Volume 13, Issue 5

Pages 684 - 697

https://doi.org/10.14778/3377369.3377377

Published: 01 January 2020

Abstract

Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like "1980-" might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as "F-9-107", "F" is sufficient to determine the finance department.
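
The two examples in this paragraph can be made concrete with a minimal Python sketch. It is only an illustration: the function names, the prefix-to-department table, and the assumption that the department code is the part of the ID before the first hyphen are hypothetical and are not taken from the paper.

```python
import re

# Format constraint from the abstract: a Year value is exactly four digits.
YEAR_PATTERN = re.compile(r"\d{4}")

def violates_year_format(value: str) -> bool:
    """Return True when the value does not consist of exactly four digits."""
    return YEAR_PATTERN.fullmatch(value) is None

# Hypothetical lookup table: the leading code of an employee ID determines
# the department. Only the "F" entry is suggested by the abstract.
DEPARTMENT_BY_PREFIX = {"F": "Finance"}

def department_from_id(employee_id: str):
    """Derive the department from the partial value before the first hyphen."""
    prefix = employee_id.split("-", 1)[0]
    return DEPARTMENT_BY_PREFIX.get(prefix)  # None when the prefix is unknown
```

Here `violates_year_format("1980-")` flags the malformed year, while `department_from_id("F-9-107")` uses only part of the attribute value, which is the kind of dependency that a constraint over whole values cannot express.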

Inspired by the above observation, we propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong's axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types of ICs.
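
To give a rough sense of what discovering such a dependency from dirty data could involve, the sketch below groups rows by an ID prefix and accepts a rule when one department dominates the group. This is a deliberately naive illustration and not the discovery algorithm of the paper: the prefix extraction, the `min_support` and `min_confidence` thresholds, and all function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def discover_prefix_rules(rows, min_support=3, min_confidence=0.8):
    """Map each ID prefix to its dominant department.

    rows is a list of (employee_id, department) pairs. A rule is kept only
    if the prefix occurs at least min_support times and the dominant
    department covers at least min_confidence of those rows, so that a few
    erroneous rows do not block the rule. Both thresholds are hypothetical.
    """
    groups = defaultdict(Counter)
    for employee_id, department in rows:
        prefix = employee_id.split("-", 1)[0]
        groups[prefix][department] += 1

    rules = {}
    for prefix, counts in groups.items():
        total = sum(counts.values())
        department, hits = counts.most_common(1)[0]
        if total >= min_support and hits / total >= min_confidence:
            rules[prefix] = department
    return rules

def find_violations(rows, rules):
    """Return the rows whose department disagrees with the rule for their prefix."""
    violations = []
    for employee_id, department in rows:
        expected = rules.get(employee_id.split("-", 1)[0])
        if expected is not None and department != expected:
            violations.append((employee_id, department))
    return violations
```

With five "F-..." rows labelled Finance and one labelled Marketing, the rule for the "F" prefix would still be accepted under these thresholds, and the single disagreeing row would be reported as a likely error.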

Published In

Proceedings of the VLDB Endowment Volume 13, Issue 5

January 2020

195 pages

ISSN:2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher

VLDB Endowment
