short-paper

How Informative is a Term?: Dispersion as a measure of Term Specificity

Authors:

Rodney McDonell, Bodo Billerbeck

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

Pages 853 - 856

https://doi.org/10.1145/2911451.2914687

Published: 07 July 2016

Abstract

Similarity functions assign scores to documents in response to queries. These functions take as input statistics about the terms in the queries and documents, which are intended to estimate the relative informativeness of the terms. Common measures of informativeness rely on the number of documents containing each term (the document frequency). We argue in this paper that the distribution of within-document frequencies across a collection, which has not been considered in prior work, is also pertinent to informativeness: the most informative words tend to be those whose frequency of occurrence has high variance. We propose the use of relative standard deviation (RSD) as a measure of variability that incorporates within-document frequencies, and show that RSD compares favourably with inverse document frequency (IDF), both in in-principle analysis and in retrieval practice, with small but consistent gains.
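
The contrast between the two statistics can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes RSD is the population standard deviation of a term's within-document frequency divided by its mean, taken over all documents in the collection (including those where the term is absent), and that IDF is the plain log(N / df) variant. The function name `term_statistics` and the toy collection are hypothetical.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (idf, rsd)} for a list of tokenized documents.

    Assumptions (not taken from the paper):
      idf = log(N / df)
      rsd = population std. dev. / mean of within-document frequency,
            computed over all N documents, zeros included.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    stats = {}
    for term in vocab:
        # Counter returns 0 for terms that a document does not contain.
        freqs = [c[term] for c in counts]
        df = sum(1 for f in freqs if f > 0)
        mean = sum(freqs) / n_docs
        variance = sum((f - mean) ** 2 for f in freqs) / n_docs
        stats[term] = (math.log(n_docs / df), math.sqrt(variance) / mean)
    return stats


# Toy collection: "a" and "b" each occur in two of four documents,
# but "b" is concentrated in one of them.
docs = [["a", "b", "b", "b"], ["a", "b"], ["c"], ["c"]]
for term, (idf, rsd) in sorted(term_statistics(docs).items()):
    print(f"{term}: idf={idf:.3f} rsd={rsd:.3f}")
```

In this toy collection "a" and "b" have the same document frequency and therefore the same IDF, while "b" receives the higher RSD under the assumed definition because its occurrences are spread less evenly.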

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

July 2016

1296 pages

ISBN: 9781450340694

DOI: 10.1145/2911451

General Chairs: Raffaele Perego (ISTI-CNR, Italy), Fabrizio Sebastiani (Qatar Computing Research Institute, HBKU, Qatar)

Program Chairs: Javed Aslam (Northeastern University, US), Ian Ruthven (University of Strathclyde, UK), Justin Zobel (University of Melbourne, Australia)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '16: The 39th International ACM SIGIR conference on research and development in Information Retrieval

July 17 - 21, 2016

Pisa, Italy

Acceptance Rates

SIGIR '16 Paper Acceptance Rate: 62 of 341 submissions, 18%

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
