Collective entity resolution in relational data

Published: 01 March 2007

Abstract

    Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
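    The abstract does not spell out the similarity measures or the clustering procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that cluster similarity is a weighted combination of an attribute score (here a generic string ratio from Python's difflib) and a relational score (here the Jaccard overlap of neighbouring clusters), and that clusters are merged greedily. The function names, the weight `alpha`, and the stopping `threshold` are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attribute_similarity(a, b):
    """String similarity between two reference names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(s, t):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def resolve(references, cooccurrences, alpha=0.5, threshold=0.5):
    """Greedy agglomerative clustering over references (illustrative only).

    references:    dict of reference id -> name string
    cooccurrences: list of sets of reference ids that appear together
    alpha:         assumed weight of relational vs. attribute similarity
    threshold:     assumed cut-off; merging stops below this score
    """
    cluster_of = {r: r for r in references}   # reference id -> cluster label
    members = {r: {r} for r in references}    # cluster label -> reference ids

    def neighbours(c):
        # Cluster labels of every reference co-occurring with a member of c.
        out = set()
        for group in cooccurrences:
            if group & members[c]:
                out |= {cluster_of[r] for r in group - members[c]}
        return out

    def similarity(c1, c2):
        # Attribute part: best name match between the two clusters.
        attr = max(attribute_similarity(references[a], references[b])
                   for a in members[c1] for b in members[c2])
        # Relational part: overlap of the clusters each one co-occurs with.
        rel = jaccard(neighbours(c1), neighbours(c2))
        return (1 - alpha) * attr + alpha * rel

    while len(members) > 1:
        best_score, c1, c2 = max(
            (similarity(x, y), x, y)
            for x, y in combinations(sorted(members), 2))
        if best_score < threshold:
            break
        # Merge c2 into c1. This changes the neighbour sets of other
        # clusters, so later decisions depend on earlier ones.
        for r in members.pop(c2):
            cluster_of[r] = c1
            members[c1].add(r)
    return sorted(sorted(m) for m in members.values())


# Two papers, each with an "A. Ansari" and a "C. Chen" reference.
names = {1: "A. Ansari", 2: "A. Ansari", 3: "C. Chen", 4: "C. Chen"}
papers = [{1, 3}, {2, 4}]
print(resolve(names, papers))  # [[1, 2], [3, 4]]
```

    In this toy run, one pair of identical names is merged on attribute evidence alone. That merge gives the other pair a shared neighbouring cluster, which raises its combined score from 0.5 to 1.0. This dependence of one decision on another is what the sketch means by resolving co-occurring references jointly rather than independently.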


    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 1, Issue 1
    March 2007
    161 pages
    ISSN: 1556-4681
    EISSN: 1556-472X
    DOI: 10.1145/1217299
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Entity resolution
    2. data cleaning
    3. graph clustering
    4. record linkage

    Qualifiers

    • Article
