DOI: 10.1145/2983323.2983714

Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams

Published: 24 October 2016

    Abstract

    The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains the records of a unique person. Most existing solutions operate in batch mode, where all records to be disambiguated are available to the algorithm at the outset. More realistic settings, however, require that name disambiguation be performed online, and that the method identify records of new ambiguous entities that have no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for the online name disambiguation task. The proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse Wishart data model, which enables identification of new ambiguous entities that have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is both efficient and effective. As a case study, we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method outperforms existing methods on the online name disambiguation task.
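
    The idea of a Dirichlet process prior combined with a single online sweep can be illustrated with a small sketch. It is not the paper's implementation: in place of the Normal × Normal × Inverse Wishart model it assumes one-dimensional records, a known noise variance `sigma2`, and a Normal prior (`mu0`, `tau2`) on each class mean. All function and parameter names are hypothetical. Each incoming record is assigned to an existing class with weight proportional to the class size times its predictive density, or to a brand-new class with weight proportional to the concentration parameter `alpha`.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Normal(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def assignment_probs(x, clusters, alpha, mu0=0.0, tau2=100.0, sigma2=1.0):
    """Posterior class probabilities for one record x.

    clusters is a list of (count, sum) pairs, one per known class.
    The returned list has one extra entry at the end: the probability
    that x belongs to a new, previously unseen class.
    Simplifying assumption: 1-D data, known variance sigma2,
    Normal(mu0, tau2) prior on every class mean.
    """
    weights = []
    for n, total in clusters:
        # Posterior over the class mean after seeing n records.
        post_var = 1.0 / (1.0 / tau2 + n / sigma2)
        post_mean = post_var * (mu0 / tau2 + total / sigma2)
        # Dirichlet process weight n times the posterior predictive density.
        weights.append(n * normal_pdf(x, post_mean, post_var + sigma2))
    # New class: weight alpha times the prior predictive density.
    weights.append(alpha * normal_pdf(x, mu0, tau2 + sigma2))
    z = sum(weights)
    return [w / z for w in weights]


def one_sweep(stream, clusters, alpha, rng):
    """Process the records once, in arrival order, sampling each label."""
    labels = []
    for x in stream:
        probs = assignment_probs(x, clusters, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(clusters):
            clusters.append((1, x))  # open a new class for this record
        else:
            n, total = clusters[k]
            clusters[k] = (n + 1, total + x)
        labels.append(k)
    return labels


# Two known classes with ten records each, centered near 0 and near 10.
known = [(10, 0.0), (10, 100.0)]
labels = one_sweep([0.2, 9.7, 50.0], known, alpha=1.0, rng=random.Random(0))
print(labels, known)
```

    A record far from every known class (such as 50.0 here) receives almost all of its probability mass on the new-class entry, which is how a model of this kind can flag entities absent from the training data. Larger values of `alpha` make new classes more likely.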

    Published In

    CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
    October 2016
    2566 pages
    ISBN:9781450340731
    DOI:10.1145/2983323
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bayesian non-exhaustive classification
    2. emerging class
    3. online name disambiguation
    4. temporal record stream

    Qualifiers

    • Research-article

    Conference

    CIKM'16: ACM Conference on Information and Knowledge Management
    October 24 - 28, 2016
    Indianapolis, Indiana, USA

    Acceptance Rates

    CIKM '16 Paper Acceptance Rate 160 of 701 submissions, 23%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Article Metrics

    • Downloads (last 12 months): 45
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 11 Aug 2024

    Cited By

    • (2024) High-degree penalty based global statistical network embedding for name disambiguation in anonymized graph. Concurrency and Computation: Practice and Experience. DOI: 10.1002/cpe.8195. Online publication date: 2-Jun-2024
    • (2023) Deep author name disambiguation using DBLP data. International Journal on Digital Libraries. DOI: 10.1007/s00799-023-00361-6. Online publication date: 4-May-2023
    • (2022) A Collective Approach to Scholar Name Disambiguation. IEEE Transactions on Knowledge and Data Engineering 34:5 (2020-2032). DOI: 10.1109/TKDE.2020.3011674. Online publication date: 1-May-2022
    • (2022) Web table data integration based on smart campus scenarios to resolve name disambiguation of scientific research personnel. 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC) (602-607). DOI: 10.1109/COMPSAC54236.2022.00106. Online publication date: Jun-2022
    • (2022) Online author name disambiguation in evolving digital library. Neurocomputing 493:C (1-14). DOI: 10.1016/j.neucom.2021.07.104. Online publication date: 7-Jul-2022
    • (2022) Whois? Deep Author Name Disambiguation Using Bibliographic Data. Linking Theory and Practice of Digital Libraries (201-215). DOI: 10.1007/978-3-031-16802-4_16. Online publication date: 15-Sep-2022
    • (2021) Distributed testing on mutual independence of massive multivariate data. Communications in Statistics - Theory and Methods 52:15 (5332-5348). DOI: 10.1080/03610926.2021.2006232. Online publication date: 25-Nov-2021
    • (2021) Exploiting similarities across multiple dimensions for author name disambiguation. Scientometrics. DOI: 10.1007/s11192-021-04101-y. Online publication date: 18-Jul-2021
    • (2021) A supervised and distributed framework for cold-start author disambiguation in large-scale publications. Neural Computing and Applications 35:18 (13093-13108). DOI: 10.1007/s00521-020-05684-y. Online publication date: 5-Mar-2021
    • (2021) Non-exhaustive Learning Using Gaussian Mixture Generative Adversarial Networks. Machine Learning and Knowledge Discovery in Databases. Research Track (3-18). DOI: 10.1007/978-3-030-86520-7_1. Online publication date: 10-Sep-2021
