DOI: 10.1145/2911451.2914687
short-paper

How Informative is a Term?: Dispersion as a measure of Term Specificity

Published: 07 July 2016

Abstract

    Similarity functions assign scores to documents in response to queries. These functions require as input statistics about the terms in the queries and documents, where the intention is that the statistics are estimates of the relative informativeness of the terms. Common measures of informativeness use the number of documents containing each term (the document frequency) as a key measure. We argue in this paper that the distribution of within-document frequencies across a collection, which has not been considered in prior work, is also pertinent to informativeness: the most informative words tend to be those whose frequency of occurrence has high variance. We propose the use of relative standard deviation (RSD) as a measure of variability that incorporates within-document frequencies, and show that RSD compares favourably with inverse document frequency (IDF), both in in-principle analysis and in retrieval practice, with small but consistent gains.
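
    To make the contrast between the two statistics concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formulation, so several choices here are assumptions: RSD is taken as the population standard deviation of a term's within-document frequencies divided by their mean, computed over every document in the collection (documents without the term count as zero); IDF is taken as the plain logarithmic form log(N / df); and the function name `term_stats` is hypothetical.

```python
import math
from collections import Counter


def term_stats(docs, term):
    """Return (idf, rsd) for `term` over a list of tokenised documents.

    Assumptions (not taken from the paper): population standard deviation,
    zero counts included for documents that lack the term, and
    idf = log(N / df) with the natural logarithm.
    """
    n = len(docs)
    # Within-document frequency of the term in every document, zeros included.
    freqs = [Counter(doc)[term] for doc in docs]
    # Document frequency: number of documents containing the term.
    df = sum(1 for f in freqs if f > 0)
    if df == 0:
        # Term is absent from the collection; both statistics are undefined,
        # so this sketch returns zeros by convention.
        return 0.0, 0.0
    idf = math.log(n / df)
    mean = sum(freqs) / n
    variance = sum((f - mean) ** 2 for f in freqs) / n
    # Relative standard deviation: standard deviation divided by the mean.
    rsd = math.sqrt(variance) / mean
    return idf, rsd


if __name__ == "__main__":
    collection = [
        ["a", "b", "a"],
        ["a", "c"],
        ["b", "b", "b", "b"],
        ["a", "d"],
    ]
    for t in ("a", "b"):
        idf, rsd = term_stats(collection, t)
        print(f"{t}: idf={idf:.3f} rsd={rsd:.3f}")
```

    In this toy collection the term "b" occurs in fewer documents than "a" and its occurrences are concentrated in one document, so both its IDF and its RSD come out higher. Unlike IDF, RSD also changes when the within-document counts change while the document frequency stays fixed.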

      Published In

      SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
      July 2016
      1296 pages
      ISBN: 9781450340694
      DOI: 10.1145/2911451
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dispersion
      2. relative standard deviation
      3. rsd
      4. term specificity

      Qualifiers

      • Short-paper

      Conference

      SIGIR '16
      Acceptance Rates

      SIGIR '16 Paper Acceptance Rate 62 of 341 submissions, 18%;
      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 224
      • Downloads (Last 12 months): 5
      • Downloads (Last 6 weeks): 0
