DOI: 10.1145/1008992.1009004
Article

A formal study of information retrieval heuristics

Published: 25 July 2004

Abstract

Empirical studies of information retrieval methods show that good retrieval performance is closely related to the use of various retrieval heuristics, such as TF-IDF weighting. One basic research question is thus what exactly are these "necessary" heuristics that seem to cause good retrieval performance. In this paper, we present a formal study of retrieval heuristics. We formally define a set of basic desirable constraints that any reasonable retrieval function should satisfy, and check these constraints on a variety of representative retrieval functions. We find that none of these retrieval functions satisfies all the constraints unconditionally. Empirical results show that when a constraint is not satisfied, it often indicates non-optimality of the method, and when a constraint is satisfied only for a certain range of parameter values, its performance tends to be poor when the parameter is out of the range. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies these constraints. Thus the proposed constraints provide a good explanation of many empirical observations and make it possible to evaluate any existing or new retrieval formula analytically.
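
The idea of checking a retrieval function against a formal constraint can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy scoring functions, the collection statistics, and the term-frequency monotonicity check are hypothetical stand-ins. The paper defines its own constraints, which are not reproduced on this page.

```python
import math
from functools import partial


def tf_idf_score(query, doc, doc_freq, num_docs):
    """Toy TF-IDF score (illustrative only): log-damped TF times IDF."""
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        if tf > 0:
            idf = math.log((num_docs + 1) / doc_freq[term])
            score += (1.0 + math.log(tf)) * idf
    return score


def binary_score(query, doc, doc_freq, num_docs):
    """Toy score that ignores term frequency: IDF if the term is present."""
    return sum(
        math.log((num_docs + 1) / doc_freq[term])
        for term in set(query)
        if term in doc
    )


def check_tf_monotonicity(score_fn, term, doc_more, doc_fewer):
    """Hypothetical constraint: for a one-term query and two documents of
    equal length, the document with more occurrences of the term must
    receive a strictly higher score."""
    if len(doc_more) != len(doc_fewer):
        raise ValueError("documents must have equal length")
    if doc_more.count(term) <= doc_fewer.count(term):
        raise ValueError("doc_more must contain the term more often")
    return score_fn([term], doc_more) > score_fn([term], doc_fewer)


# Made-up collection statistics and documents, for illustration only.
DOC_FREQ = {"retrieval": 10}
NUM_DOCS = 100
d1 = ["retrieval", "retrieval", "models", "text"]
d2 = ["retrieval", "ranking", "models", "text"]

tfidf = partial(tf_idf_score, doc_freq=DOC_FREQ, num_docs=NUM_DOCS)
binary = partial(binary_score, doc_freq=DOC_FREQ, num_docs=NUM_DOCS)

tfidf_ok = check_tf_monotonicity(tfidf, "retrieval", d1, d2)
binary_ok = check_tf_monotonicity(binary, "retrieval", d1, d2)
print("toy TF-IDF satisfies the check:", tfidf_ok)
print("binary score satisfies the check:", binary_ok)
```

In this sketch the toy TF-IDF function passes the check, while the binary function fails it because it assigns both documents the same score. A check on one pair of documents is only a spot test; an analytical evaluation of the kind the abstract describes would have to reason about the formula for all admissible inputs.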




    Published In

    SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
    July 2004
    624 pages
    ISBN:1581138814
    DOI:10.1145/1008992
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. TF-IDF weighting
    2. constraints
    3. formal models
    4. retrieval heuristics

    Qualifiers

    • Article

    Conference

    SIGIR04

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2024) Secure semantic search using deep learning in a blockchain-assisted multi-user setting. Journal of Cloud Computing 13:1. DOI: 10.1186/s13677-023-00578-5. Online publication date: 30-Jan-2024
    • (2024) Towards Effective and Efficient Sparse Neural Information Retrieval. ACM Transactions on Information Systems 42:5 (1-46). DOI: 10.1145/3634912. Online publication date: 29-Apr-2024
    • (2024) Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1420-1430). DOI: 10.1145/3626772.3657861. Online publication date: 10-Jul-2024
    • (2024) Course Recommender Systems Need to Consider the Job Market. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (522-532). DOI: 10.1145/3626772.3657847. Online publication date: 10-Jul-2024
    • (2024) Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1401-1410). DOI: 10.1145/3626772.3657841. Online publication date: 10-Jul-2024
    • (2024) Axiomatic Guidance for Efficient and Controlled Neural Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (3071-3071). DOI: 10.1145/3626772.3657651. Online publication date: 10-Jul-2024
    • (2024) An Intrinsic Framework of Information Retrieval Evaluation Measures. Intelligent Systems and Applications (692-713). DOI: 10.1007/978-3-031-47721-8_47. Online publication date: 10-Jan-2024
    • (2023) Dense Text Retrieval based on Pretrained Language Models: A Survey. ACM Transactions on Information Systems. DOI: 10.1145/3637870. Online publication date: 18-Dec-2023
    • (2023) A Systematic Review of Fairness, Accountability, Transparency and Ethics in Information Retrieval. ACM Computing Surveys. DOI: 10.1145/3637211. Online publication date: 15-Dec-2023
    • (2023) Explainability of Text Processing and Retrieval Methods. Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation (153-157). DOI: 10.1145/3632754.3632944. Online publication date: 15-Dec-2023
