Entity linking at the tail: sparse signals, unknown entities, and phrase models

Published: 24 February 2014

Abstract

Web search is seeing a paradigm shift from keyword-based search to an entity-centric organization of web data. To support web search with this deeper level of understanding, a web-scale entity linking system must have three key properties. First, its feature extraction must be robust to the diversity of web documents and their varied writing styles and content structures. Second, it must maintain high-precision linking for "tail" (unpopular) entities, remaining robust both to confounding entities outside the knowledge base and to entity profiles with minimal information. Finally, the system must represent large-scale knowledge bases with a scalable and powerful feature representation. We have built and deployed a web-scale unsupervised entity linking system for a commercial search engine that addresses these requirements by combining three components: new developments in sparse signal recovery to identify the most discriminative features from noisy, free-text web documents; explicit modeling of out-of-knowledge-base entities to improve precision at the tail; and a new phrase-unigram language model to efficiently capture high-order dependencies in lexical features. Using a knowledge base of 100M unique people from a popular social networking site, we present experimental results in the challenging domain of people-linking at the tail, where most entities have limited web presence. Our experimental results show that this system substantially improves on the precision-recall tradeoff over baseline methods, achieving precision over 95% with recall over 60%.
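
To make the idea of explicitly modeling out-of-knowledge-base entities concrete, the following is a minimal sketch under stated assumptions, not the system described in the abstract. It scores each candidate profile with a plain unigram language model interpolated with a background model (a simplification; it is not the phrase-unigram model) and treats the background model alone as a hypothetical "unknown entity" that must be beaten before any link is made. The names `link`, `BACKGROUND`, `CANDIDATES`, the weight `lam`, and the probability floor are all illustrative assumptions.

```python
import math
from collections import Counter

FLOOR = 1e-6  # assumed floor probability for tokens missing from the background model

def profile_log_likelihood(tokens, profile_counts, background, lam=0.5):
    """Log-likelihood of context tokens under a profile model mixed with the background."""
    total = sum(profile_counts.values())
    score = 0.0
    for tok in tokens:
        p_profile = profile_counts.get(tok, 0) / total if total else 0.0
        p_bg = background.get(tok, FLOOR)
        score += math.log(lam * p_profile + (1.0 - lam) * p_bg)
    return score

def link(tokens, candidates, background, lam=0.5):
    """Return the best candidate id, or None when the unknown-entity model wins."""
    # The hypothetical unknown entity is explained by the background model alone.
    best_id = None
    best_score = sum(math.log(background.get(t, FLOOR)) for t in tokens)
    for entity_id, profile_counts in candidates.items():
        s = profile_log_likelihood(tokens, profile_counts, background, lam)
        if s > best_score:
            best_id, best_score = entity_id, s
    return best_id

# Toy data (made up for illustration): one sparse profile and a tiny background model.
BACKGROUND = {"seattle": 0.001, "guitar": 0.001, "lawyer": 0.001}
CANDIDATES = {"e1": Counter({"seattle": 2, "guitar": 2, "teacher": 2})}

if __name__ == "__main__":
    print(link(["seattle", "guitar"], CANDIDATES, BACKGROUND))
    print(link(["lawyer", "court"], CANDIDATES, BACKGROUND))
```

In this toy setup, context that overlaps the profile is linked, while context the profile cannot explain falls back to the unknown entity and no link is emitted. Abstaining in this way is one simple route to trading recall for precision on sparse profiles.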

Cited By

  • (2024) LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty. Proceedings of the ACM Web Conference 2024, pages 4047-4058. DOI: 10.1145/3589334.3645414. Online publication date: 13-May-2024
  • (2020) Knowledge Transfer for Out-of-Knowledge-Base Entities: Improving Graph-Neural-Network-Based Embedding Using Convolutional Layers. IEEE Access 8, pages 159039-159049. DOI: 10.1109/ACCESS.2020.3019592. Online publication date: 2020
  • (2018) Employing Semantic Context for Sparse Information Extraction Assessment. ACM Transactions on Knowledge Discovery from Data 12:5, pages 1-36. DOI: 10.1145/3201407. Online publication date: 27-Jun-2018

    Published In

    WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining
    February 2014
    712 pages
    ISBN:9781450323512
    DOI:10.1145/2556195

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. phrase language model
    2. sparsity
    3. tail entity linking
    4. unknown entity
    5. web-scale system

    Qualifiers

    • Research-article

    Conference

    WSDM 2014

    Acceptance Rates

    WSDM '14 Paper Acceptance Rate 64 of 355 submissions, 18%;
    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 10
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 01 Mar 2025

    Citations

    • (2017) NELasso: Group-Sparse Modeling for Characterizing Relations Among Named Entities in News Articles. IEEE Transactions on Pattern Analysis and Machine Intelligence 39:10, pages 2000-2014. DOI: 10.1109/TPAMI.2016.2632117. Online publication date: 1-Oct-2017
    • (2016) Entity Disambiguation with Linkless Knowledge Bases. Proceedings of the 25th International Conference on World Wide Web, pages 1261-1270. DOI: 10.1145/2872427.2883068. Online publication date: 11-Apr-2016
    • (2016) Linking, integrating, and translating entities via iterative graph matching. 2016 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 248-255. DOI: 10.1109/TAAI.2016.7880156. Online publication date: Nov-2016
    • (2015) Towards Decision Support and Goal Achievement. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547-556. DOI: 10.1145/2783258.2783310. Online publication date: 10-Aug-2015
    • (2015) Building XML Data Warehouse with Data Reconstruction by Knowledge Graph. Proceedings of the 2015 IEEE Fifth International Conference on Big Data and Cloud Computing, pages 314-320. DOI: 10.1109/BDCloud.2015.48. Online publication date: 26-Aug-2015
    • (2015) Entity Linking for Vietnamese Tweets. Knowledge and Systems Engineering, pages 603-615. DOI: 10.1007/978-3-319-11680-8_48. Online publication date: 2015
