DOI: 10.1145/3539618.3591793

Resilient Retrieval Models for Large Collection

Published: 18 July 2023
  • Abstract

    Modern search engines employ multi-stage ranking pipelines to balance retrieval efficiency and effectiveness for large collections. These pipelines first retrieve an initial set of candidate documents from the large repository using a cost-effective retrieval model (such as BM25 or a language model), then re-rank these candidates with neural retrieval models. Such pipelines perform well only if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker must address the problems described below, and it must do so within milliseconds.
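The retrieve-then-re-rank pattern described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in this work: `bm25_scores` is a textbook BM25 variant with assumed parameters `k1` and `b`, and `toy_reranker` is a hypothetical stand-in for a neural re-ranker.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document against the query with a standard BM25 formula."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def two_stage_rank(query, docs, rerank_fn, k=2):
    """Stage 1: keep the top-k BM25 candidates. Stage 2: re-order them with rerank_fn."""
    first = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: first[i], reverse=True)[:k]
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)

docs = [
    "neural ranking models for search".split(),
    "cooking pasta at home".split(),
    "bm25 ranking for large search collections".split(),
]
query = "ranking search".split()

def toy_reranker(q, d):
    # Hypothetical placeholder for an expensive neural scorer: prefers shorter documents.
    return -len(d)

result = two_stage_rank(query, docs, toy_reranker, k=2)
print(result)
```

The expensive scorer only ever sees the `k` candidates, so a relevant document that the first stage misses cannot be recovered later. This is why first-stage recall matters.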
    One major problem for search engines is the presence of extraneous terms in the query. Since query-document term matching is the fundamental building block of any retrieval model, retrieval effectiveness drops when documents are matched against these extraneous query terms. Existing models [4, 5] address this issue by estimating term weights, either with supervised approaches or by using information from a set of initial top-ranked documents, and incorporating those weights into the final ranking function. Although the latter category of methods is unsupervised, it is inefficient because ranking a large collection to obtain the initial top-ranked documents is computationally expensive.
    Besides, in real-world collections, a term may appear many times in a document for several reasons: it may occur in different contexts, the author may use it in bursts, or it may be an outlier. As a result, existing retrieval models overestimate the relevance score of irrelevant documents that contain a query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduces the contribution of such high-frequency occurrences to the relevance score. However, the truncation point selection does not leverage term-specific distribution information: it treats all the relevant documents for a set of queries as a single bag, which does not capture the distribution of individual terms well. Furthermore, this model does not capture term burstiness; it only reduces the effect of outliers. Cummins et al. [1] propose a language model based on the Dirichlet compound multinomial distribution that can capture term burstiness, but this approach is specific to the language modeling framework.
    Considering the above research gaps, we focus on the following research questions in this doctoral work.
    Research Question 1: How can we identify the central query terms of a verbose query without relying on an initial ranked list or relevance judgments, and modify the ranking function so that it focuses on the derived central terms? To address RQ1, we generate contextual vectors for the entire query and for each individual query term using the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, and then analyze their correlation to estimate a term centrality score, so that the ranking function can focus on the central terms during term matching.
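One way to picture the centrality idea is to compare each term vector with the query vector and normalize the result into weights. The sketch below is an assumption-laden illustration: the small hand-written vectors stand in for BERT contextual embeddings, cosine similarity is assumed as the correlation measure, and `centrality_weights` is a hypothetical helper, not the scoring function proposed in this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centrality_weights(query_vec, term_vecs):
    """Map each term to a weight proportional to its similarity with the whole query."""
    sims = {t: max(cosine(query_vec, v), 0.0) for t, v in term_vecs.items()}
    total = sum(sims.values())
    if total == 0:
        # No term resembles the query: fall back to uniform weights.
        return {t: 1.0 / len(sims) for t in sims}
    return {t: s / total for t, s in sims.items()}

# Toy 2-d vectors standing in for contextual embeddings (assumed values).
term_vecs = {"please": [0.1, 0.9], "hotels": [0.9, 0.2], "paris": [0.8, 0.3]}
query_vec = [0.85, 0.3]

weights = centrality_weights(query_vec, term_vecs)
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {w:.3f}")
```

A ranking function could then multiply each term's matching score by its weight, so that an extraneous term such as "please" contributes less than "hotels" or "paris".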
    Research Question 2: How can we identify the outlier terms of a large collection and penalize them in the ranking function? For RQ2, we model the distribution of the maximum normalized term frequency values observed in relevant documents for the terms of a set of queries. We then estimate the probability that the normalized frequency of a new term comes from the right extreme of that distribution, and use this probability to penalize such terms in the ranking function.
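The abstract does not specify the distribution family or the form of the penalty, so the following is only a rough sketch under stated assumptions: an empirical CDF over a made-up sample of maxima approximates the modeled distribution, and the hypothetical `penalized_tf` scales the normalized frequency by one minus the estimated right-extreme probability.

```python
from bisect import bisect_left

def right_extreme_prob(x, observed_maxima):
    """Empirical CDF: share of observed maxima strictly below x.

    A value close to 1.0 means x lies beyond almost every maximum seen in
    relevant documents, i.e. in the right extreme of the sample.
    """
    ordered = sorted(observed_maxima)
    return bisect_left(ordered, x) / len(ordered)

def penalized_tf(norm_tf, observed_maxima):
    """Shrink a normalized term frequency by its estimated outlier probability."""
    return norm_tf * (1.0 - right_extreme_prob(norm_tf, observed_maxima))

# Made-up maximum normalized term frequencies from relevant documents.
maxima = [0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.05, 0.07, 0.04, 0.05]

for tf in (0.03, 0.05, 0.30):
    p = right_extreme_prob(tf, maxima)
    print(f"tf={tf:.2f}  outlier_prob={p:.2f}  penalized={penalized_tf(tf, maxima):.4f}")
```

With this crude estimator, a frequency larger than every observed maximum is suppressed completely. A fitted parametric tail would give a smoother penalty.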
    Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function?
    To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and uses this score to penalize bursty terms in the ranking function. To estimate the information content of a term, we capture the contextual information of each of its occurrences with the pre-trained BERT model and estimate how far the context of each occurrence diverges from the contexts of its previous occurrences.
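The intuition can be illustrated as follows: an occurrence whose context repeats an earlier one adds little new information. In this sketch the vectors are toy stand-ins for per-occurrence BERT embeddings, and both the divergence measure (one minus the highest cosine similarity to any earlier occurrence) and the `burstiness` aggregate are assumptions chosen for illustration, not the model proposed in this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def occurrence_information(occ_vecs):
    """Information of each occurrence: divergence from its most similar earlier occurrence."""
    info = []
    for i, v in enumerate(occ_vecs):
        if i == 0:
            info.append(1.0)  # the first occurrence is treated as fully informative
        else:
            nearest = max(cosine(v, p) for p in occ_vecs[:i])
            info.append(1.0 - max(nearest, 0.0))
    return info

def burstiness(occ_vecs):
    """Burstiness in [0, 1]: high when later occurrences repeat earlier contexts."""
    if not occ_vecs:
        return 0.0
    info = occurrence_information(occ_vecs)
    return 1.0 - sum(info) / len(info)

# Toy context vectors: one term repeated in the same context, one in distinct contexts.
repeated = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(burstiness(repeated))
print(burstiness(varied))
```

A ranking function could discount the term frequency of a term in proportion to this score, so that three identical mentions count for less than three mentions in different contexts.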

    References

    [1] Ronan Cummins, Jiaul H. Paik, and Yuanhua Lv. 2015. A Pólya urn document language model for improved information retrieval. ACM Transactions on Information Systems (TOIS), Vol. 33, 4 (2015), 1-34.
    [2] Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic models for the first-stage retrieval: A comprehensive review. ACM Transactions on Information Systems (TOIS), Vol. 40, 4 (2022), 1-42.
    [3] Jiaul H. Paik, Yash Agrawal, Sahil Rishi, and Vaishal Shah. 2021. Truncated Models for Probabilistic Weighted Retrieval. ACM Transactions on Information Systems (TOIS), Vol. 40, 3 (2021), 1-24.
    [4] Jiaul H. Paik and Douglas W. Oard. 2014. A Fixed-Point Method for Weighting Terms in Verbose Informational Queries. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). 131-140.
    [5] Guoqing Zheng and Jamie Callan. 2015. Learning to reweight terms with distributed representations. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 575-584.

    Information

    Published In

    SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2023
    3567 pages
    ISBN:9781450394086
    DOI:10.1145/3539618
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. large collection
    2. ranking
    3. verbose query

    Qualifiers

    • Abstract

    Conference

    SIGIR '23
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
