research-article

Entity linking at the tail: sparse signals, unknown entities, and phrase models

Authors:

Emre Kıcıman,

Ricky Loynd

WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining

Pages 453 - 462

https://doi.org/10.1145/2556195.2556230

Published: 24 February 2014

Abstract

Web search is seeing a paradigm shift from keyword-based search to an entity-centric organization of web data. To support web search with this deeper level of understanding, a web-scale entity linking system must have three key properties. First, its feature extraction must be robust to the diversity of web documents and their varied writing styles and content structures. Second, it must maintain high-precision linking for "tail" (unpopular) entities, remaining robust both to confounding entities outside the knowledge base and to entity profiles with minimal information. Finally, the system must represent large-scale knowledge bases with a scalable and powerful feature representation. We have built and deployed a web-scale unsupervised entity linking system for a commercial search engine that addresses these requirements by combining new developments in sparse signal recovery to identify the most discriminative features from noisy, free-text web documents; explicit modeling of out-of-knowledge-base entities to improve precision at the tail; and the development of a new phrase-unigram language model to efficiently capture high-order dependencies in lexical features. Using a knowledge base of 100M unique people from a popular social networking site, we present experimental results in the challenging domain of people-linking at the tail, where most entities have limited web presence. Our experimental results show that this system substantially improves on the precision-recall tradeoff over baseline methods, achieving precision over 95% with recall over 60%.
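The abstract names sparse signal recovery as the tool for picking discriminative features but does not give the formulation. As a generic illustration only, and not the authors' method, the sketch below solves a LASSO-style objective, 0.5*||Ax - y||^2 + lam*||x||_1, with iterative soft-thresholding (ISTA). The function names, the toy design matrix `A`, the observations `y`, and the values of `lam` and `step` are all assumptions made for this example.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; anything within [-t, t] becomes exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, lam, step, iters=200):
    """Approximately minimise 0.5*||A x - y||^2 + lam*||x||_1.

    A is a list of rows, y a list of observations. `step` should be at most
    1 / (largest eigenvalue of A^T A) for the iteration to converge.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(iters):
        # Residual r = A x - y.
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
             for i in range(n_rows)]
        # Gradient of the smooth term: g = A^T r.
        g = [sum(A[i][j] * r[i] for i in range(n_rows))
             for j in range(n_cols)]
        # Gradient step followed by the L1 proximal step.
        x = [soft_threshold(x[j] - step * g[j], step * lam)
             for j in range(n_cols)]
    return x


# Toy example (hypothetical data): three candidate features observed directly,
# so A is the identity and the weak middle signal is zeroed out.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [3.0, 0.2, -1.5]
weights = ista(A, y, lam=0.5, step=1.0)
print(weights)  # [2.5, 0.0, -1.0]
```

The L1 penalty drives weak coefficients to exactly zero, which is why this family of methods is commonly used for feature selection on noisy, high-dimensional text features.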

Index Terms

Entity linking at the tail: sparse signals, unknown entities, and phrase models
1. Information systems
  1. Information systems applications

Published In

WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining

February 2014

712 pages

ISBN:9781450323512

DOI:10.1145/2556195

General Chairs: Ben Carterette (University of Delaware, USA), Fernando Diaz (Microsoft Research, USA)

Program Chairs: Carlos Castillo (Qatar Computing Research Institute, Qatar), Donald Metzler (Google, USA)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM 2014

WSDM 2014: Seventh ACM International Conference on Web Search and Data Mining

February 24 - 28, 2014

New York, New York, USA

Acceptance Rates

WSDM '14 Paper Acceptance Rate 64 of 355 submissions, 18%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Article Metrics

Total Citations: 9
Total Downloads: 309
Downloads (last 12 months): 10
Downloads (last 6 weeks): 1

Reflects downloads up to 01 Mar 2025

Cited By

Zhang Z, Zhao Y, Gao H, Hu M, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty. Proceedings of the ACM Web Conference 2024, pages 4047-4058. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645414

Bi Z, Zhang T, Zhou P, Li Y (2020). Knowledge Transfer for Out-of-Knowledge-Base Entities: Improving Graph-Neural-Network-Based Embedding Using Convolutional Layers. IEEE Access 8, pages 159039-159049. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.3019592

Li P, Wang H, Li H, Wu X (2018). Employing Semantic Context for Sparse Information Extraction Assessment. ACM Transactions on Knowledge Discovery from Data 12:5, pages 1-36. Online publication date: 27-Jun-2018.
https://dl.acm.org/doi/10.1145/3201407

Tariq A, Karim A, Foroosh H (2017). NELasso: Group-Sparse Modeling for Characterizing Relations Among Named Entities in News Articles. IEEE Transactions on Pattern Analysis and Machine Intelligence 39:10, pages 2000-2014. Online publication date: 1-Oct-2017.
https://doi.org/10.1109/TPAMI.2016.2632117

Li Y, Tan S, Sun H, Han J, Roth D, Yan X, Bourdeau J, Hendler J, Nkambou R, Horrocks I, Zhao B (2016). Entity Disambiguation with Linkless Knowledge Bases. Proceedings of the 25th International Conference on World Wide Web, pages 1261-1270. Online publication date: 11-Apr-2016.
https://dl.acm.org/doi/10.1145/2872427.2883068

Lee T, Hwang S (2016). Linking, integrating, and translating entities via iterative graph matching. 2016 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 248-255. Online publication date: Nov-2016.
https://doi.org/10.1109/TAAI.2016.7880156

Kıcıman E, Richardson M, Cao L, Zhang C, Joachims T, Webb G, Margineantu D, Williams G (2015). Towards Decision Support and Goal Achievement. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547-556. Online publication date: 10-Aug-2015.
https://dl.acm.org/doi/10.1145/2783258.2783310

Jiang Y, Shao Z, Guo Y, Zhang H, Sun L (2015). Building XML Data Warehouse with Data Reconstruction by Knowledge Graph. Proceedings of the 2015 IEEE Fifth International Conference on Big Data and Cloud Computing, pages 314-320. Online publication date: 26-Aug-2015.
https://dl.acm.org/doi/10.1109/BDCloud.2015.48

Van D, Huynh H, Nguyen H, Vo V (2015). Entity Linking for Vietnamese Tweets. Knowledge and Systems Engineering, pages 603-615. Online publication date: 2015.
https://doi.org/10.1007/978-3-319-11680-8_48
