Article

Adaptive duplicate detection using learnable string similarity measures

Authors:

Mikhail Bilenko,

Raymond J. Mooney

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 39 - 48

https://doi.org/10.1145/956750.956759

Published: 24 August 2003

Abstract

The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
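As a rough illustration of the per-field idea in the abstract, the sketch below computes an edit distance whose operation costs are parameters and applies a separate cost set to each database field. It is not the authors' model: the abstract describes an extended variant of learnable string edit distance without giving details, and the training step is omitted here entirely. The names `EditCosts`, `weighted_edit_distance`, and `field_distances`, and all cost values, are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditCosts:
    """Hypothetical per-operation costs. In a learnable measure these
    would be fit from labeled duplicate pairs; here they are set by hand."""
    substitute: float = 1.0
    insert: float = 1.0
    delete: float = 1.0


def weighted_edit_distance(a: str, b: str, costs: EditCosts = EditCosts()) -> float:
    """Edit distance from a to b under the given operation costs,
    computed row by row with the standard dynamic program."""
    # Row 0: turning the empty prefix of a into b[:j] takes j insertions.
    prev = [j * costs.insert for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Column 0: turning a[:i] into the empty string takes i deletions.
        curr = [i * costs.delete]
        for j, cb in enumerate(b, start=1):
            match_or_sub = prev[j - 1] + (0.0 if ca == cb else costs.substitute)
            delete = prev[j] + costs.delete
            insert = curr[j - 1] + costs.insert
            curr.append(min(match_or_sub, delete, insert))
        prev = curr
    return prev[-1]


def field_distances(rec_a: dict, rec_b: dict, costs_by_field: dict) -> dict:
    """Compare two records field by field, each field with its own costs."""
    return {
        field: weighted_edit_distance(rec_a[field], rec_b[field], costs)
        for field, costs in costs_by_field.items()
    }


if __name__ == "__main__":
    # Assumed cost sets: cheap insertions for names, expensive
    # substitutions for zip codes. These values are illustrative only.
    costs_by_field = {
        "name": EditCosts(insert=0.5),
        "zip": EditCosts(substitute=2.0),
    }
    rec_a = {"name": "Jon Smith", "zip": "78712"}
    rec_b = {"name": "John Smith", "zip": "78713"}
    print(field_distances(rec_a, rec_b, costs_by_field))
```

Keeping the costs in a small immutable object per field mirrors the abstract's point that each field can have its own notion of similarity: swapping in costs estimated from training data would change the distances without touching the dynamic program.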

Index Terms

Adaptive duplicate detection using learnable string similarity measures
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications

Information & Contributors

Information

Published In

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2003

736 pages

ISBN:1581137370

DOI:10.1145/956750

Conference Chair: Lise Getoor (University of Maryland, College Park)
General Chair: Ted Senator (DARPA)
Program Chairs: Pedro Domingos (University of Washington), Christos Faloutsos (Carnegie Mellon University)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD03

KDD03: The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2003

Washington, D.C.

Acceptance Rates

KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 616
Total Downloads: 3,858
Downloads (Last 12 months): 75
Downloads (Last 6 weeks): 8

Reflects downloads up to 30 Aug 2024

Citations

Cited By

Yang X, Rajbahadur G, Lin D, Wang S, Jiang Z (2024). SimClone: Detecting Tabular Data Clones using Value Similarity. ACM Transactions on Software Engineering and Methodology. Online publication date: 16-Jul-2024.
https://doi.org/10.1145/3676961
Li H, Li S, Hao F, Zhang C, Song Y, Chen L, Chua T, Ngo C, Kumar R, Lauw H, Ka-Wei Lee R (2024). BoostER: Leveraging Large Language Models for Enhancing Entity Resolution. Companion Proceedings of the ACM Web Conference 2024, 1043-1046. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589335.3651245
Rass S, König S, Ahmad S, Goman M (2024). Metricizing the Euclidean Space Toward Desired Distance Relations in Point Clouds. IEEE Transactions on Information Forensics and Security, 19, 7304-7319. Online publication date: 2024.
https://doi.org/10.1109/TIFS.2024.3420246
Xu Z, Wang N (2024). Low-resource entity resolution with domain generalization and active learning. Neurocomputing, 599, 128131. Online publication date: Sep-2024.
https://doi.org/10.1016/j.neucom.2024.128131
Zhang Z, Yang Y, Chen B (2024). Relation-aware heterogeneous graph neural network for entity alignment. Neurocomputing, 592, 127797. Online publication date: Aug-2024.
https://doi.org/10.1016/j.neucom.2024.127797
Sun R, Guo S, Guo J, Li W, Zhang X, Guo X, Pan Z (2024). GraphMoCo. Neurocomputing, 575:C. Online publication date: 28-Mar-2024.
https://dl.acm.org/doi/10.1016/j.neucom.2024.127273
Nananukul N, Sisaengsuwanchai K, Kejriwal M (2024). Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domain. Discover Artificial Intelligence, 4:1. Online publication date: 16-Aug-2024.
https://doi.org/10.1007/s44163-024-00159-8
Luca T, Paes A, Zaverucha G (2024). Word embeddings-based transfer learning for boosted relational dependency networks. Machine Learning, 113:3, 1269-1302. Online publication date: 1-Mar-2024.
https://dl.acm.org/doi/10.1007/s10994-023-06404-y
Nguyen-Trang T, Nguyen-Hoang Y, Vo-Van T (2024). A new semi-supervised clustering algorithm for probability density functions and applications. Neural Computing and Applications, 36:11, 5965-5980. Online publication date: 16-Jan-2024.
https://doi.org/10.1007/s00521-023-09404-0
Rabiei Zadeh A, Amirkhani H (2023). A survey on short text similarity measurement methods. Signal and Data Processing, 20:3, 103-126. Online publication date: 1-Dec-2023.
https://doi.org/10.61186/jsdp.20.3.103
