
TopPRF: A Probabilistic Framework for Integrating Topic Space into Pseudo Relevance Feedback

Authors:

Jun Miao,

Jimmy Xiangji Huang,

Jiashu Zhao

ACM Transactions on Information Systems (TOIS), Volume 34, Issue 4

Article No.: 22, Pages 1 - 36

https://doi.org/10.1145/2956234

Published: 29 August 2016


Abstract

Traditional pseudo relevance feedback (PRF) models select the top k feedback documents for query expansion and treat those documents equally. Once k is fixed, feedback terms are selected without considering how reliable each document is as evidence of relevance. Because PRF performance is sensitive to the selection of feedback terms, noisy terms imported from irrelevant or partially relevant documents can severely harm the final results. Intuitively, terms in such documents should be considered less important for feedback term selection. Nonetheless, measuring the reliability of feedback documents is a difficult problem.

Recently, topic modeling has become increasingly popular in information retrieval (IR). To identify how likely a feedback document is to be relevant, we attempt to incorporate topical information into PRF. However, topics are hard to quantify, and the identification of topics is therefore usually fuzzy. Integrating the obtained topical information effectively into IR and other text-processing-related areas is very challenging. Current research mainly focuses on mining relevant information from particular topics, which is extremely difficult when the boundaries between topics are hard to define. In this article, we investigate a key factor of this problem: the number of topics used for topic modeling and how it makes topics “fuzzy.” To apply topical information effectively and efficiently, we propose a new probabilistic framework, “TopPRF,” and three models, TS-COS, TS-EU, and TS-Entropy, that integrate “Topic Space” (TS) information into pseudo relevance feedback. These methods estimate how likely a document is to be relevant using both term and topical information. When feedback terms are selected, candidate terms in more reliable feedback documents should receive extra weight. Experimental results on various public collections show that our proposed methods can significantly reduce the influence of “fuzzy topics” and obtain stable, good results over strong baseline models. The TopPRF framework and the three topic-space-based models can retrieve documents beyond traditional term matching alone and provide a promising avenue for constructing better topic-space-based IR systems. Moreover, in-depth discussions and conclusions are provided to help other researchers apply topical information effectively.
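The idea of giving extra weight to terms from more reliable feedback documents can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption for illustration and not the authors' model: reliability is taken to be the cosine similarity between a document's topic distribution and a hypothetical reference topic vector (for example, one derived from the query), and term scores are simple normalized term frequencies. The names `cosine`, `expansion_terms`, `doc_topics`, and `reference_topics` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expansion_terms(feedback_docs, doc_topics, reference_topics, n_terms,
                    use_reliability=True):
    """Rank candidate expansion terms from the top-k feedback documents.

    feedback_docs:    list of token lists, one per feedback document
    doc_topics:       list of topic distributions, one per feedback document
    reference_topics: assumed reference topic vector (hypothetical, e.g. from the query)
    use_reliability:  if False, every document gets weight 1.0 (uniform PRF)
    """
    scores = Counter()
    for tokens, topics in zip(feedback_docs, doc_topics):
        if not tokens:
            continue  # an empty document contributes no terms
        # Assumed reliability measure: topic-space cosine similarity.
        weight = cosine(topics, reference_topics) if use_reliability else 1.0
        for term, count in Counter(tokens).items():
            # Normalized term frequency, scaled by the document's reliability.
            scores[term] += weight * count / len(tokens)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]


# Toy data: two on-topic documents and one off-topic document.
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle", "cat"],
]
topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
reference = [1.0, 0.0]

uniform = expansion_terms(docs, topics, reference, 3, use_reliability=False)
weighted = expansion_terms(docs, topics, reference, 3, use_reliability=True)
print(uniform)
print(weighted)
```

In this toy example, uniform weighting lets the off-topic document push "cat" into the top three terms, while the reliability-weighted variant replaces it with "engine". Swapping `cosine` for a Euclidean-distance or entropy-based measure would give analogous variants, though whether that matches TS-EU or TS-Entropy cannot be confirmed from the abstract alone.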

Supplementary Material

a22-miao-apndx.pdf (miao.zip)

Supplemental movie, appendix, image, and software files for TopPRF: A Probabilistic Framework for Integrating Topic Space into Pseudo Relevance Feedback


