Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams

Authors: Baichuan Zhang, Mohammad Al Hasan

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Pages 1341 - 1350

https://doi.org/10.1145/2983323.2983714

Published: 24 October 2016

Abstract

The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains records pertaining to a unique person. Most existing solutions for this task operate in batch mode, where all records to be disambiguated are available to the algorithm at the outset. More realistic settings, however, require that name disambiguation be performed online and that the method be able to identify records of new ambiguous entities that have no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for solving the online name disambiguation task. Our proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse Wishart data model, which enables identification of new ambiguous entities who have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is efficient and effective. As a case study, we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method outperforms existing methods on the online name disambiguation task.
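
The idea of non-exhaustive online classification described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified spherical Gaussian class model with known noise variance in place of the Normal × Normal × Inverse Wishart model, and the names `OnlineNonExhaustiveClassifier`, `log_predictive`, `alpha`, `sigma2`, and `tau2` are hypothetical. Each incoming record is assigned exactly once (a single sweep), either to an existing class with weight proportional to its size or to a new class with weight proportional to the concentration parameter `alpha`.

```python
import math
import random


def log_predictive(x, n, total, sigma2, mu0, tau2):
    """Log posterior-predictive density of x for a spherical Gaussian class.

    Assumption (not from the paper): the class mean has prior N(mu0, tau2*I)
    and records have known noise variance sigma2. `n` is the class size and
    `total` the coordinate-wise sum of its records.
    """
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)   # posterior variance of the mean
    var = post_var + sigma2                      # predictive variance
    logp = 0.0
    for xi, si, mi in zip(x, total, mu0):
        post_mean = post_var * (mi / tau2 + si / sigma2)
        logp += -0.5 * math.log(2 * math.pi * var) - 0.5 * (xi - post_mean) ** 2 / var
    return logp


class OnlineNonExhaustiveClassifier:
    """Single-pass classifier with a Chinese-restaurant-process style prior."""

    def __init__(self, mu0, alpha=1.0, sigma2=1.0, tau2=100.0, seed=0):
        self.mu0, self.alpha = list(mu0), alpha
        self.sigma2, self.tau2 = sigma2, tau2
        self.counts = []                 # number of records per class
        self.sums = []                   # coordinate-wise record sum per class
        self.rng = random.Random(seed)

    def add_to_class(self, k, x):
        """Add record x to class k; k == number of classes opens a new class."""
        if k == len(self.counts):
            self.counts.append(0)
            self.sums.append([0.0] * len(self.mu0))
        self.counts[k] += 1
        self.sums[k] = [s + xi for s, xi in zip(self.sums[k], x)]

    def posterior(self, x):
        """Probabilities over existing classes; the last entry is 'new class'."""
        logw = [math.log(n) + log_predictive(x, n, s, self.sigma2, self.mu0, self.tau2)
                for n, s in zip(self.counts, self.sums)]
        zero = [0.0] * len(self.mu0)
        logw.append(math.log(self.alpha)
                    + log_predictive(x, 0, zero, self.sigma2, self.mu0, self.tau2))
        top = max(logw)                  # subtract the max for numerical stability
        w = [math.exp(v - top) for v in logw]
        z = sum(w)
        return [v / z for v in w]

    def assign(self, x):
        """Sample one label for x from the posterior and update that class."""
        probs = self.posterior(x)
        u, cum, k = self.rng.random(), 0.0, len(probs) - 1
        for i, p in enumerate(probs):
            cum += p
            if u < cum:
                k = i
                break
        self.add_to_class(k, x)
        return k


# Two known classes from hypothetical training records, then a small stream.
clf = OnlineNonExhaustiveClassifier(mu0=[0.0, 0.0])
for rec in [[0.2, -0.1], [-0.3, 0.1], [0.1, 0.0]]:
    clf.add_to_class(0, rec)
for rec in [[10.1, 9.8], [9.9, 10.2]]:
    clf.add_to_class(1, rec)

near_probs = clf.posterior([0.1, -0.2])   # record close to class 0
far_label = clf.assign([-20.0, 20.0])     # record far from every known class
```

In this toy setup the record near the first class receives most of its probability mass from that class, while the distant record is assigned a label that did not exist in the training data. A full covariance model, as named in the abstract, would replace `log_predictive` with the corresponding predictive density.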

Published In

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

October 2016

2566 pages

ISBN: 9781450340731

DOI: 10.1145/2983323

General Chairs: Snehasis Mukhopadhyay (Indiana University Purdue University Indianapolis, USA), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA)

Program Chairs: Elisa Bertino (Purdue University), Fabio Crestani (University of Lugano), Javed Mostafa (University of North Carolina), Jie Tang (Tsinghua University), Luo Si (Alibaba Group Inc & Purdue University), Xiaofang Zhou (University of Queensland), Yi Chang (Yahoo Research), Yunyao Li (IBM Research - Almaden), Parikshit Sondhi (WalmartLabs)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Science Foundation

Conference

CIKM'16: ACM Conference on Information and Knowledge Management

October 24 - 28, 2016

Indianapolis, Indiana, USA

Acceptance Rates

CIKM '16 Paper Acceptance Rate 160 of 701 submissions, 23%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
