Dynamic constraints for record matching

Authors:

Shuai Ma

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 20, Issue 4

Pages 495 - 520

https://doi.org/10.1007/s00778-010-0206-6

Published: 01 August 2011

Abstract

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, MDs are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring MDs. (d) We provide a quadratic-time algorithm for inferring MDs and an effective algorithm for deducing a set of high-quality RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and, in addition, that the MD-based techniques effectively improve the quality and efficiency of various record matching methods.
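
The idea of a key that says which attributes to compare across two relations, and how to compare them, can be illustrated with a minimal Python sketch. It is not the paper's formalism or algorithm: the `MatchKey` class, the attribute names, the sample records, and the use of `difflib.SequenceMatcher` with a 0.7 threshold as the similarity predicate are all assumptions made for illustration. The two keys show how a record pair that fails on an error-prone attribute (the address) may still be matched through a more reliable one (the phone number).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Comparator = Callable[[str, str], bool]


def equal(a: str, b: str) -> bool:
    """Exact equality, the strictest comparison operator."""
    return a == b


def similar(threshold: float) -> Comparator:
    """Build a similarity predicate (an assumed stand-in for any similarity operator)."""
    def compare(a: str, b: str) -> bool:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    return compare


@dataclass(frozen=True)
class MatchKey:
    """Hypothetical key: (left attribute, right attribute, comparison operator) triples."""
    comparisons: Tuple[Tuple[str, str, Comparator], ...]

    def matches(self, left: Record, right: Record) -> bool:
        # Two records match under this key only if every comparison holds.
        return all(op(left[a], right[b]) for a, b, op in self.comparisons)


def find_matches(left_rows: List[Record], right_rows: List[Record],
                 keys: List[MatchKey]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) that match under at least one key."""
    return [(i, j)
            for i, left in enumerate(left_rows)
            for j, right in enumerate(right_rows)
            if any(key.matches(left, right) for key in keys)]


name_ok = similar(0.7)  # assumed threshold
key_by_address = MatchKey((("last_name", "surname", equal),
                           ("address", "post_address", equal),
                           ("first_name", "given_name", name_ok)))
key_by_phone = MatchKey((("last_name", "surname", equal),
                         ("phone", "phone", equal),
                         ("first_name", "given_name", name_ok)))

# Invented sample data over two relations with different schemas.
card = [{"first_name": "Mark", "last_name": "Smith",
         "address": "10 Oak Street", "phone": "3256777"}]
billing = [
    {"given_name": "Marx", "surname": "Smith",
     "post_address": "10 Oak Street", "phone": "0000000"},
    {"given_name": "Mark", "surname": "Smith",
     "post_address": "10 Oak St", "phone": "3256777"},
    {"given_name": "Mark", "surname": "Jones",
     "post_address": "10 Oak Street", "phone": "3256777"},
]

matches = find_matches(card, billing, [key_by_address, key_by_phone])
print(matches)  # [(0, 0), (0, 1)]
```

With `key_by_address` alone, only the first billing record is found; the second has an abbreviated address and is recovered only through `key_by_phone`. The third is rejected by both keys because the surnames differ.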

Reviews

Reviewer: Maulik A Dave

This research paper addresses the problem of matching records from different unreliable data sources. It introduces new concepts of matching dependencies (MDs) and relative candidate keys (RCKs), and presents the calculus of MDs. The calculus consists of formalization of MDs, a reasoning mechanism for deriving MDs from other MDs, and algorithms for MDs. The first algorithm is used to determine if an MD can be derived from given MDs; the second algorithm is used to deduce RCKs from MDs.

The first section introduces the problem and provides a list of contributions. It also shows some of the applications in which the work can be applied. The second section concerns related works and points out references for works on record matching. The third section formalizes MDs, the semantics of MDs, and the RCKs. The fourth section provides an inference system for MDs and rules for capturing dynamic semantics of MDs. The fifth section provides the MD deduction analysis algorithm with its complexity analysis. The sixth section covers the RCKs deduction algorithm. A detailed discussion on the experimental evaluation of the algorithms is given in the next section. The concluding section is followed by an appendix, which contains proofs related to the MDs inference system.

Knowledge of logic and complexity theory is required to understand the theory given in the paper.

Online Computing Reviews Service

Published In

The VLDB Journal — The International Journal on Very Large Data Bases Volume 20, Issue 4

August 2011

168 pages

ISSN:1066-8888

Publisher

Springer-Verlag

Berlin, Heidelberg
