DOI: 10.1145/1390334.1390409

TF-IDF uncovered: a study of theories and probabilities

Published: 20 July 2008

Abstract

Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper contributes a review of existing interpretations, and then, TF-IDF is systematically related to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent, and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy of P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral ∫ 1/x dx = log x. This study uncovers components such as divergence from randomness and pivoted document length to be inherent parts of a document-query independence (DQI) measure, and interestingly, an integral of the DQI over the term occurrence probability leads to TF-IDF.
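
The link between the integral of 1/x and the logarithm in IDF can be checked numerically. The sketch below is only an illustration and is not taken from the paper: it assumes the common textbook definition idf = -log(df / N), where df is the number of documents containing the term and N is the collection size, and it compares that value with a numerical integral of 1/x from df / N to 1. The function names and the example counts are hypothetical.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Textbook IDF (an assumption here): -log of the fraction of documents containing the term."""
    return -math.log(df / n_docs)


def integral_of_reciprocal(lower: float, upper: float = 1.0, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of 1/x from lower to upper."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Plain TF-IDF: raw within-document term frequency times IDF."""
    return tf * idf(df, n_docs)


if __name__ == "__main__":
    n_docs, df, tf = 1000, 10, 3       # hypothetical collection statistics
    p = df / n_docs                    # term occurrence probability estimate
    print("idf              :", idf(df, n_docs))
    print("integral of 1/x  :", integral_of_reciprocal(p))
    print("tf-idf           :", tf_idf(tf, df, n_docs))
```

Because the integral of 1/x from p to 1 equals -log p, the first two printed values agree up to the numerical error of the midpoint rule. How the paper itself sets up the integral over the DQI measure is not reproduced here.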

    Published In

    SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
    July 2008
    934 pages
    ISBN:9781605581644
    DOI:10.1145/1390334

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. TF-IDF interpretations
    2. derivative of logarithm
    3. document-query-independence
    4. integral
    5. probability theory

    Qualifiers

    • Research-article

    Conference

    SIGIR '08

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2024) Determining Whether a Turkish Text is Produced by Artificial Intelligence or Human. 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI), pp. 1-5. DOI: 10.1109/ICMI60790.2024.10585770. Online publication date: 13-Apr-2024
    • (2024) Sentiment Analysis of Israeli-Palestinian Conflict on Indonesian Tweets Using Machine Learning. 2024 International Conference on Electrical Engineering and Computer Science (ICECOS), pp. 59-64. DOI: 10.1109/ICECOS63900.2024.10791149. Online publication date: 25-Sep-2024
    • (2024) ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), pp. 324-330. DOI: 10.1109/ICDEW61823.2024.00048. Online publication date: 13-May-2024
    • (2024) Spam Comment Detection Using the Ensemble Technique. 2023 4th International Conference on Intelligent Technologies (CONIT), pp. 1-7. DOI: 10.1109/CONIT61985.2024.10626863. Online publication date: 21-Jun-2024
    • (2024) Exploring user reactions to luxury brand videos on YouTube: a comparative study of influencers and brand-official channels. International Journal of Advertising, pp. 1-23. DOI: 10.1080/02650487.2024.2367316. Online publication date: 19-Jun-2024
    • (2024) LCA and energy efficiency in buildings: Mapping more than twenty years of research. Energy and Buildings, article 114684. DOI: 10.1016/j.enbuild.2024.114684. Online publication date: Aug-2024
    • (2023) DDoS2Vec: Flow-Level Characterisation of Volumetric DDoS Attacks at Scale. Proceedings of the ACM on Networking, 1(CoNEXT3), pp. 1-25. DOI: 10.1145/3629135. Online publication date: 28-Nov-2023
    • (2023) You Are How You Use Apps: User Profiling Based on Spatiotemporal App Usage Behavior. ACM Transactions on Intelligent Systems and Technology, 14(4), pp. 1-21. DOI: 10.1145/3597212. Online publication date: 21-Jul-2023
    • (2023) ExpFinder: A hybrid model for expert finding from text-based expertise data. Expert Systems with Applications, 211, article 118691. DOI: 10.1016/j.eswa.2022.118691. Online publication date: Jan-2023
    • (2023) The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks. Multimedia Tools and Applications. DOI: 10.1007/s11042-023-16615-z. Online publication date: 8-Sep-2023
