Collective entity resolution in relational data

Authors: Indrajit Bhattacharya, Lise Getoor

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 1, Issue 1

Pages 5-es

https://doi.org/10.1145/1217299.1217304

Published: 01 March 2007

Abstract

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database that contains multiple references to the same entity. This can lead not only to data redundancy, but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through entity resolution, which involves discovering the underlying entities and mapping each database reference to one of them. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
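The idea of combining attribute and relational evidence can be illustrated with a small sketch. It is not the algorithm from the article. It assumes a similarity of the form (1 - alpha) * attribute similarity + alpha * relational similarity, with character-bigram Jaccard for names and Jaccard overlap of co-occurring clusters for the relational part. The function names, the weight `alpha`, and the toy references are all hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: all names, the alpha weight and the data are assumptions.

def bigrams(s):
    """Set of lowercase character bigrams of a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(x, y):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def attr_sim(name_a, name_b):
    """Attribute similarity: bigram Jaccard between two name strings."""
    return jaccard(bigrams(name_a), bigrams(name_b))

def neighbourhood(cluster, cluster_of, groups):
    """Labels of the clusters that co-occur with any reference in `cluster`."""
    out = set()
    for group in groups:
        labels = {cluster_of[r] for r in group}
        if cluster in labels:
            out |= labels - {cluster}
    return out

def combined_sim(c1, c2, names, cluster_of, groups, alpha=0.5):
    """Weighted mix of attribute and relational similarity of two clusters."""
    m1 = [r for r, c in cluster_of.items() if c == c1]
    m2 = [r for r, c in cluster_of.items() if c == c2]
    a = max(attr_sim(names[x], names[y]) for x in m1 for y in m2)
    r = jaccard(neighbourhood(c1, cluster_of, groups),
                neighbourhood(c2, cluster_of, groups))
    return (1 - alpha) * a + alpha * r

# Toy co-authorship data: three papers, each a group of co-occurring references.
names = {"a1": "A. Ansari", "a2": "A. Ansari",
         "w1": "W. Wang", "w2": "W. Wang", "w3": "W. Wang", "c1": "C. Chen"}
groups = [{"a1", "w1"}, {"a2", "w2"}, {"c1", "w3"}]

# Initially every reference is its own cluster.
cluster_of = {r: r for r in names}
before = combined_sim("w1", "w2", names, cluster_of, groups)

# Suppose the two Ansari references have been resolved to one entity "A".
cluster_of["a1"] = cluster_of["a2"] = "A"
after_same = combined_sim("w1", "w2", names, cluster_of, groups)
after_other = combined_sim("w1", "w3", names, cluster_of, groups)

print(before, after_same, after_other)
```

In this toy setting the three "W. Wang" references are indistinguishable by name alone. Once the "A. Ansari" references are merged, the two Wang references that share that co-author score higher than the pair that does not, which is the sense in which one resolution decision informs another.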

Index Terms

Collective entity resolution in relational data
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

ACM Transactions on Knowledge Discovery from Data Volume 1, Issue 1

March 2007

161 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1217299

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
