Finding aliases on the web using latent semantic analysis

Published: 01 May 2004

Abstract

A common problem faced when gathering information from the web is the use of different names to refer to the same entity. For example, the city in India referred to as Bombay in some documents may be referred to as Mumbai in others because its name officially changed from the former to the latter in 1995. Multiplicity of names can cause relevant documents to be missed by search engines. Our goal is to develop an automated system that discovers additional names for an entity given just one of its names. Latent semantic analysis (LSA) is generally thought to be well-suited for this task [Numerical linear algebra with applications 3(4) (1996) 301]. We demonstrate empirically that under a broad range of circumstances LSA performs poorly, and describe a two-stage algorithm based on LSA that performs significantly better.
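
The idea behind using LSA for this task can be illustrated with a minimal sketch. It is not the two-stage algorithm the abstract mentions, and it does not reproduce the paper's experiments. It assumes a tiny hand-made corpus (`DOCS`, hypothetical) in which "bombay" and "mumbai" never co-occur but share context words, builds a term-document count matrix, and projects terms onto the top-k singular directions. To stay within the standard library, the singular vectors are obtained by power iteration on the term-term Gram matrix rather than by a library SVD routine. All function names here are illustrative.

```python
import math

# Hypothetical toy corpus: "bombay" and "mumbai" never co-occur,
# but they appear alongside the same context words.
DOCS = [
    "bombay india city port",
    "mumbai india city port",
    "bombay film industry india",
    "mumbai film industry india",
    "paris france city museum",
    "paris france river museum",
]


def term_document_matrix(docs):
    """Return (vocab, matrix): rows are terms, columns are documents."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for word in doc.split():
            matrix[row[word]][j] += 1.0
    return vocab, matrix


def top_eigenpairs(sym, k, iters=300):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation).

    Assumes k does not exceed the rank of the matrix.
    """
    n = len(sym)
    work = [r[:] for r in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed, deterministic start
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def lsa_term_vectors(docs, k):
    """Map each term to its k-dimensional LSA vector (a row of U_k * Sigma_k)."""
    vocab, a = term_document_matrix(docs)
    n = len(vocab)
    # Eigenpairs of A A^T give left singular vectors and squared singular values.
    gram = [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(n)]
            for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return {
        word: [math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
        for i, word in enumerate(vocab)
    }


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vectors = lsa_term_vectors(DOCS, k=2)
    for other in ("mumbai", "paris"):
        print(other, round(cosine(vectors["bombay"], vectors[other]), 3))
```

In the raw count matrix the rows for "bombay" and "mumbai" are orthogonal because the two names share no document. In the rank-2 space of this contrived corpus they end up nearly parallel, while "paris" does not. Real web data is far noisier than this example, which is consistent with the abstract's report that plain LSA often performs poorly in practice.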

References

[1] M.W. Berry, S.T. Dumais, T.A. Letsche, Computational methods for intelligent information access, in: Proceedings of Supercomputing'95, 1995.
[2] Michael W. Berry, Ricardo D. Fierro, Low-rank orthogonal decompositions for information retrieval applications, Numerical Linear Algebra with Applications 3 (4) (1996) 301-327.
[3] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai, Class-based n-gram models of natural language, Computational Linguistics 18 (4) (1992) 467-479.
[4] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, Richard A. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (6) (1990) 391-407.
[5] S.T. Dumais, J. Nielsen, Automating the assignment of submitted manuscripts to reviewers, in: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 233-244.
[6] P.W. Foltz, S.T. Dumais, Personalized information delivery: an analysis of information filtering methods, Communications of the ACM 35 (12) (1992) 51-60.
[7] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
[8] D. Hull, Improving text retrieval for the routing problem using latent semantic indexing, in: Proceedings of the 17th ACM-SIGIR Conference, 1994, pp. 282-291.
[9] T.K. Landauer, S.T. Dumais, A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction and representation of knowledge, Psychological Review 104 (1997) 211-240.
[10] T.K. Landauer, M.L. Littman, Fully automatic cross-language document retrieval using latent semantic indexing, in: Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, 1990, pp. 31-38.
[11] M. Rosenstein, C. Lochbaum, Recommending from content: preliminary results from an e-commerce experiment, in: Proceedings of CHI'00: Conference on Human Factors in Computing, 2000.
[12] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

Published In

Data & Knowledge Engineering  Volume 49, Issue 2
Special issue: WIDM 2002
May 2004
94 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. aliases
  2. latent semantic analysis
  3. search engines

Qualifiers

  • Article

Cited By

  • (2010) The vector space models for finding co-occurrence names as aliases in Thai sports news. Proceedings of the Second international conference on Intelligent information and database systems: Part I, pp. 122-130. DOI: 10.5555/1894753.1894769. Online publication date: 24-Mar-2010
  • (2008) The CONCUR framework for community maintenance of curated resources. Proceedings of the eighth ACM symposium on Document engineering, pp. 123-126. DOI: 10.1145/1410140.1410166. Online publication date: 16-Sep-2008
  • (2008) Using recursive ART network to construction domain ontology based on term frequency and inverse document frequency. Expert Systems with Applications: An International Journal 34:1, pp. 488-501. DOI: 10.1016/j.eswa.2006.09.019. Online publication date: 1-Jan-2008
  • (2007) Emerging consensus in-situ. Proceedings of the 2nd International Conference on Ontology Matching - Volume 304, pp. 72-83. DOI: 10.5555/2889662.2889669. Online publication date: 11-Nov-2007
  • (2005) Development of new techniques to improve web search. Proceedings of the 19th international joint conference on Artificial intelligence, pp. 1632-1633. DOI: 10.5555/1642293.1642587. Online publication date: 30-Jul-2005
  • (2005) Automatic discovery of synonyms and lexicalizations from the Web. Proceedings of the 2005 conference on Artificial Intelligence Research and Development, pp. 205-212. DOI: 10.5555/1565835.1565867. Online publication date: 12-May-2005
