Article

Iterative record linkage for cleaning and integration

Authors: Indrajit Bhattacharya, Lise Getoor

DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

Pages 11 - 18

https://doi.org/10.1145/1008694.1008697

Published: 13 June 2004

Abstract

Record linkage, the problem of determining when two records refer to the same entity, has applications for both data cleaning (deduplication) and for integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples with similarity scores above a certain threshold are declared to be matches. While this method can perform quite well in many domains, particularly domains where there is not a large amount of noise in the data, in some domains looking only at tuple values is not enough. By also examining the context of the tuple, i.e. the other tuples to which it is linked, we can come up with a more accurate linkage decision. But this additional accuracy comes at a price. In order to correctly find all duplicates, we may need to make multiple passes over the data; as linkages are discovered, they may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of making use of join information when comparing records.
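
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the token-overlap attribute similarity, the Jaccard context measure, the weight `alpha`, the `threshold`, and all function names are assumptions chosen for illustration. The sketch scores each pair of records by combining attribute similarity with the overlap of the entities their linked records currently resolve to, and repeats passes until no new match is found.

```python
from itertools import combinations


def attr_sim(a, b):
    """Token-set Jaccard similarity between two attribute strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find(parent, x):
    """Return the representative of x's cluster (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def context_sim(parent, links, i, j):
    """Jaccard overlap of the clusters that the records linked to i and j belong to."""
    ci = {find(parent, k) for k in links[i]}
    cj = {find(parent, k) for k in links[j]}
    if not ci or not cj:
        return 0.0
    return len(ci & cj) / len(ci | cj)


def iterative_linkage(names, links, alpha=0.5, threshold=0.5, max_passes=10):
    """Merge records over repeated passes; return one cluster label per record.

    alpha weights attribute similarity against context similarity.
    Both alpha and threshold are hypothetical settings, not values from the paper.
    """
    parent = list(range(len(names)))
    for _ in range(max_passes):
        merged = False
        for i, j in combinations(range(len(names)), 2):
            ri, rj = find(parent, i), find(parent, j)
            if ri == rj:
                continue  # already declared a match
            score = (alpha * attr_sim(names[i], names[j])
                     + (1 - alpha) * context_sim(parent, links, i, j))
            if score >= threshold:
                parent[rj] = ri  # declare a match and merge the clusters
                merged = True
        if not merged:
            break  # no new linkages in this pass, so further passes change nothing
    return [find(parent, i) for i in range(len(names))]


# Toy data: records 0 and 1 are linked to each other, as are records 2 and 3.
names = ["john smith", "mary jones", "j smith", "mary jones"]
links = [{1}, {0}, {3}, {2}]
with_context = iterative_linkage(names, links, alpha=0.5, threshold=0.5)
attributes_only = iterative_linkage(names, links, alpha=1.0, threshold=0.5)
```

On this toy input, attribute values alone match only the two "mary jones" records. With context enabled, that match is found in the first pass, and in the second pass it lets "john smith" and "j smith" be linked because their linked records now resolve to the same entity. This is the multi-pass effect the abstract describes.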

Index Terms

Iterative record linkage for cleaning and integration
1. Information systems
  1. Data management systems

Published In

DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

June 2004

85 pages

ISBN: 158113908X

DOI: 10.1145/1008694

Program Chairs: Gautam Das (Microsoft Research), Bing Liu (University of Illinois at Chicago), Philip S. Yu (IBM T.J. Watson Research Center)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

DMKD04: 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2004

Sponsor: SIGMOD

13 June 2004

Paris, France

