Resilient Retrieval Models for Large Collection

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Page 3492

https://doi.org/10.1145/3539618.3591793

Published: 18 July 2023

Abstract

Modern search engines employ multi-stage ranking pipelines to balance retrieval efficiency and effectiveness for large collections. These pipelines first retrieve an initial set of candidate documents from the large repository with a cost-effective retrieval model (such as BM25 or a language model, LM), then re-rank these candidates with neural retrieval models. Such pipelines perform well only if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker must address the problems described below within milliseconds.
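The retrieve-then-re-rank structure described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the BM25 variant, the parameter defaults `k1` and `b`, whitespace tokenization, and the names `bm25_scores`, `two_stage_search`, and `rerank_fn` are all assumptions made for the example. In a real system `rerank_fn` would be a neural model.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a basic BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def two_stage_search(query, docs, rerank_fn, k=2):
    """Stage 1: keep the top-k documents by BM25. Stage 2: re-rank them with rerank_fn."""
    first_stage = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: first_stage[i], reverse=True)[:k]
    # rerank_fn(query, doc_text) is a hypothetical stand-in for a neural re-ranker.
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)
```

Because the second stage only sees the `k` candidates, any relevant document missed by the first stage cannot be recovered later, which is why first-stage recall matters.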

One major problem for search engines is the presence of extraneous terms in the query. Since query-document term matching is the fundamental building block of any retrieval model, retrieval effectiveness drops when documents are matched against these extraneous query terms. Existing models [4, 5] address this issue by estimating term weights, either with supervised approaches or by using information from a set of initial top-ranked documents, and incorporating these weights into the final ranking function. Although the latter category of methods is unsupervised, it is inefficient because ranking the large collection to obtain the initial top-ranked documents is computationally expensive.

Besides, in real-world collections, some terms may appear many times in a document for several reasons: the term may appear in different contexts, the author may use it in bursts, or it may be an outlier. As a result, existing retrieval models overestimate the relevance scores of irrelevant documents that contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduces the contribution of such high-frequency occurrences to the relevance score. However, the truncation point selection does not leverage term-specific distribution information: it treats all the relevant documents for a set of queries as a single bag, which does not capture the distribution of individual terms well. Furthermore, this model does not capture term burstiness; it only reduces the effect of outliers. Cummins et al. [1] propose a language model based on the Dirichlet compound multinomial distribution that can capture term burstiness, but this approach is specific to the language modeling framework.

Considering the above research gaps, we focus on the following research questions in this doctoral work.

Research Question 1: How can we identify the central query terms of a verbose query without relying on an initial ranked list or relevance judgments, and modify the ranking function so that it focuses on the derived central query terms? To address RQ1, we generate contextual vectors for the entire query and for the individual query terms using the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, and then analyze their correlation to estimate a term centrality score, so that the ranking function can focus on the central terms during term matching.
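One way to picture the centrality idea in RQ1 is the sketch below. It is illustrative only: the abstract does not specify the correlation measure, so cosine similarity is an assumption here, as are the normalization step and the names `centrality_scores` and `weighted_match_score`. The small hand-written vectors stand in for BERT contextual embeddings, which are not computed in this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centrality_scores(query_vec, term_vecs):
    """Assumed scheme: centrality = non-negative cosine to the query vector, normalized to sum to 1."""
    raw = {term: max(cosine(query_vec, vec), 0.0) for term, vec in term_vecs.items()}
    total = sum(raw.values())
    return {term: (s / total if total else 0.0) for term, s in raw.items()}

def weighted_match_score(centrality, doc_term_scores):
    """Combine per-term matching scores (e.g. BM25 term scores), weighted by centrality."""
    return sum(centrality.get(term, 0.0) * s for term, s in doc_term_scores.items())

# Toy stand-ins for contextual embeddings of a verbose query and two of its terms.
query_vec = [1.0, 0.0]
term_vecs = {"retrieval": [0.9, 0.1], "please": [0.1, 0.9]}
weights = centrality_scores(query_vec, term_vecs)
```

In this toy setup "retrieval" points in nearly the same direction as the whole query and receives most of the weight, while the extraneous term "please" is down-weighted. No initial ranked list is needed, since the weights depend only on the query.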

Research Question 2: How can we identify the outlier terms in a large collection and penalize them in the ranking function? For RQ2, we model the distribution of the maximum normalized term frequency values in relevant documents for the terms of a set of queries. We then estimate the probability that the normalized frequency of a new term comes from the right extreme of that distribution, and use this probability to penalize the term in the ranking function.
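The outlier penalty in RQ2 might look roughly like the following sketch. The abstract does not state which distribution is fitted or how the penalty enters the ranking function, so this example assumes an empirical distribution and a simple linear damping; `outlier_probability`, `penalized_tf`, and the `strength` parameter are hypothetical names introduced for illustration.

```python
from bisect import bisect_left

def outlier_probability(x, observed_maxima):
    """Fraction of observed maximum normalized tf values that lie strictly below x.

    Values close to 1.0 mean x falls in the right extreme of the empirical distribution.
    """
    ordered = sorted(observed_maxima)
    return bisect_left(ordered, x) / len(ordered)

def penalized_tf(x, observed_maxima, strength=0.5):
    """Assumed penalty: shrink the normalized tf in proportion to its outlier probability."""
    p = outlier_probability(x, observed_maxima)
    return x * (1.0 - strength * p)

# Toy maxima, standing in for values collected from relevant documents of training queries.
maxima = [0.1, 0.2, 0.3, 0.4]
typical = penalized_tf(0.05, maxima)   # below every observed maximum: left unchanged
extreme = penalized_tf(0.9, maxima)    # above every observed maximum: damped the most
```

A parametric tail model could replace the empirical fraction; the point of the sketch is only that the penalty grows as the observed frequency moves further into the right tail.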

Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function? For RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and uses this score to penalize the bursty term in the ranking function. To estimate the information content of a term, we capture the contextual information of each occurrence of the term with the pre-trained BERT model and estimate the contextual divergence of each occurrence from its previous occurrences.
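The burstiness estimate in RQ3 can be illustrated with a small sketch. Everything concrete here is an assumption: the abstract does not define the divergence measure or the final score, so this example takes the novelty of an occurrence to be one minus its highest cosine similarity to any earlier occurrence, and burstiness to be one minus the mean novelty. The toy vectors stand in for BERT contextual embeddings of each occurrence, and `occurrence_novelties` and `burstiness_score` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def occurrence_novelties(occurrence_vecs):
    """Novelty of each occurrence: 1 minus its highest similarity to any earlier occurrence."""
    novelties = []
    for i, vec in enumerate(occurrence_vecs):
        if i == 0:
            novelties.append(1.0)  # the first occurrence is treated as fully informative
            continue
        closest = max(cosine(vec, prev) for prev in occurrence_vecs[:i])
        novelties.append(1.0 - max(closest, 0.0))
    return novelties

def burstiness_score(occurrence_vecs):
    """Assumed score: 1 minus the mean novelty, so repeats in the same context score high."""
    novelties = occurrence_novelties(occurrence_vecs)
    return 1.0 - sum(novelties) / len(novelties)

same_context = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # a term repeated in one context
varied_context = [[1.0, 0.0], [0.0, 1.0]]             # a term used in two distinct contexts
```

Under these assumptions a term repeated in the same context adds little information after its first occurrence and is scored as bursty, while a term that recurs in different contexts is not.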

References

[1] Ronan Cummins, Jiaul H. Paik, and Yuanhua Lv. 2015. A Pólya urn document language model for improved information retrieval. ACM Transactions on Information Systems (TOIS), Vol. 33, 4 (2015), 1--34.

[2] Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic models for the first-stage retrieval: A comprehensive review. ACM Transactions on Information Systems (TOIS), Vol. 40, 4 (2022), 1--42.

[3] Jiaul H. Paik, Yash Agrawal, Sahil Rishi, and Vaishal Shah. 2021. Truncated Models for Probabilistic Weighted Retrieval. ACM Transactions on Information Systems (TOIS), Vol. 40, 3 (2021), 1--24.

[4] Jiaul H. Paik and Douglas W. Oard. 2014. A Fixed-Point Method for Weighting Terms in Verbose Informational Queries. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14). 131--140.

[5] Guoqing Zheng and Jamie Callan. 2015. Learning to reweight terms with distributed representations. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 575--584.

Index Terms

Resilient Retrieval Models for Large Collection
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2023

3567 pages

ISBN:9781450394086

DOI:10.1145/3539618

General Chairs: Hsin-Hsi Chen (National Taiwan University), Wei-Jou (Edward) Duh (National Taiwan University), Hen-Hsen Huang (Academia Sinica)

Program Chairs: Makoto P. Kato (Spotify), Josiane Mothe (Universite de Toulouse), Barbara Poblete (University of Chile and Amazon Visiting Academic)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers: Abstract

Conference

SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Sponsor: SIGIR

July 23 - 27, 2023

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
