DOI: 10.1145/2911451.2914687
short-paper

How Informative is a Term?: Dispersion as a measure of Term Specificity

Published: 07 July 2016

Abstract

Similarity functions assign scores to documents in response to queries. These functions take as input statistics about the terms in the queries and documents, which are intended to estimate the relative informativeness of the terms. Common measures of informativeness rely chiefly on the number of documents containing each term (the document frequency). We argue in this paper that the distribution of within-document frequencies across a collection, which has not been considered in prior work, is also pertinent to informativeness: the most informative words tend to be those whose frequency of occurrence has high variance. We propose the use of relative standard deviation (RSD) as a measure of variability that incorporates within-document frequencies, and show that RSD compares favourably with inverse document frequency (IDF), both in in-principle analysis and in retrieval practice, with small but consistent gains.
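
To make the contrast between the two statistics concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the textbook definitions: IDF as log(N / document frequency), and RSD as the population standard deviation of a term's within-document frequency divided by its mean, taken over all N documents including those where the term is absent. The abstract does not give the exact formulation or say how RSD enters the similarity function, so the function name `term_statistics` and the toy collection are illustrative assumptions only.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (idf, rsd)} for a list of tokenised documents.

    Assumed definitions (not taken from the paper):
      idf = log(N / df)
      rsd = population std dev of within-document frequency / mean,
            computed over all N documents, zeros included.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    stats = {}
    for term in vocab:
        # Counter yields 0 for documents that do not contain the term.
        freqs = [c[term] for c in counts]
        df = sum(1 for f in freqs if f > 0)
        mean = sum(freqs) / n_docs
        variance = sum((f - mean) ** 2 for f in freqs) / n_docs
        stats[term] = (math.log(n_docs / df), math.sqrt(variance) / mean)
    return stats


# Toy collection: "a" and "b" both occur in two of four documents,
# but "b" is concentrated in the first document.
docs = [
    ["a", "b", "b", "b", "b", "b"],
    ["a", "b"],
    ["c"],
    ["c"],
]
stats = term_statistics(docs)
for term in sorted(stats):
    idf, rsd = stats[term]
    print(f"{term}: idf={idf:.3f} rsd={rsd:.3f}")
```

In this toy collection "a" and "b" have the same document frequency, so IDF cannot separate them, while the RSD of "b" is higher because its occurrences are unevenly spread. This is the kind of within-document variability the abstract argues is pertinent to informativeness.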


    Published In

    SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
    July 2016
    1296 pages
    ISBN:9781450340694
    DOI:10.1145/2911451
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. dispersion
    2. relative standard deviation
    3. rsd
    4. term specificity

    Acceptance Rates

    SIGIR '16 Paper Acceptance Rate 62 of 341 submissions, 18%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%
