Dynamic constraints for record matching

Published: 01 August 2011

Abstract

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, MDs are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring MDs. (d) We provide a quadratic-time algorithm for inferring MDs and an effective algorithm for deducing a set of high-quality RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking, or windowing and, in addition, that the MD-based techniques effectively improve the quality and efficiency of various record matching methods.
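To make the idea in (b) and (c) concrete, the following is a minimal sketch of matching two records across different relations with a key that says which attributes to compare and how to compare them. It is not the paper's formalism or algorithm: the attribute names, the sample records, the 0.8 threshold, and the use of Python's `difflib` ratio as a similarity predicate are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def equal(a: str, b: str) -> bool:
    """Strict comparison operator: the two values must be identical."""
    return a == b


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity predicate based on difflib's ratio.

    The threshold of 0.8 is an arbitrary choice for this sketch.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# A hypothetical key relating two relations with different schemas.
# Each entry: (attribute in relation 1, attribute in relation 2, operator).
KEY_ON_NAME_AND_ADDRESS = [
    ("last_name", "surname", equal),
    ("first_name", "given_name", similar),
    ("address", "post_addr", similar),
]

# A key that relies on an attribute that may contain errors.
KEY_ON_PHONE = [("phone", "tel", equal)]


def matches(r1: dict, r2: dict, key: list) -> bool:
    """Two records match under a key if every listed comparison holds."""
    return all(op(r1[a1], r2[a2]) for a1, a2, op in key)


# Invented sample records from two relations.
credit = {
    "first_name": "Jonathan",
    "last_name": "Smith",
    "address": "10 Oak Street, Edinburgh",
    "phone": "555-0101",
}
billing = {
    "given_name": "Jonathon",
    "surname": "Smith",
    "post_addr": "10 Oak Street, Edinburg",
    "tel": "555-0199",
}

by_phone = matches(credit, billing, KEY_ON_PHONE)
by_name_and_address = matches(credit, billing, KEY_ON_NAME_AND_ADDRESS)
```

In this sketch the phone numbers disagree, so the phone-based key fails, while the key built on the other attributes still identifies the pair. This mirrors, informally, the abstract's point about falling back on more reliable attributes.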

Reviews

Maulik A Dave

This research paper addresses the problem of matching records from different unreliable data sources. It introduces new concepts of matching dependencies (MDs) and relative candidate keys (RCKs), and presents the calculus of MDs. The calculus consists of formalization of MDs, a reasoning mechanism for deriving MDs from other MDs, and algorithms for MDs. The first algorithm is used to determine if an MD can be derived from given MDs; the second algorithm is used to deduce RCKs from MDs.

The first section introduces the problem and provides a list of contributions. It also shows some of the applications in which the work can be applied. The second section concerns related works and points out references for works on record matching. The third section formalizes MDs, the semantics of MDs, and the RCKs. The fourth section provides an inference system for MDs and rules for capturing dynamic semantics of MDs. The fifth section provides the MD deduction analysis algorithm with its complexity analysis. The sixth section covers the RCKs deduction algorithm. A detailed discussion on the experimental evaluation of the algorithms is given in the next section. The concluding section is followed by an appendix, which contains proofs related to the MDs inference system.

Knowledge of logic and complexity theory is required to understand the theory given in the paper.

Online Computing Reviews Service

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 20, Issue 4
August 2011
168 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Data cleaning
  2. Deduction
  3. Inference
  4. Matching dependencies
  5. Record matching

Qualifiers

  • Article
