Improving zero-shot retrieval using dense external expansion

Published: 01 September 2022
    Abstract

    Pseudo-relevance feedback (PRF) is a classical technique for improving search engine retrieval effectiveness by closing the vocabulary gap between users’ query formulations and the relevant documents. While PRF is typically applied on the same target corpus as the final retrieval, external expansion techniques have in the past been applied to obtain a high-quality pseudo-relevant feedback set from an external corpus. However, such external expansion approaches have only been studied for sparse (BoW) retrieval methods, and their effectiveness for recent dense retrieval methods remains under-investigated. Indeed, dense retrieval approaches such as ANCE and ColBERT, which conduct similarity search over encoded contextualised query and document embeddings, are of increasing importance. Moreover, pseudo-relevance feedback mechanisms have been proposed to further enhance dense retrieval effectiveness. In this work, we examine the application of dense external expansion to improve zero-shot retrieval effectiveness, i.e. evaluation on corpora without further training. Zero-shot retrieval experiments on six datasets, including two TREC datasets and four BEIR datasets, using the MSMARCO passage collection as the external corpus, indicate that obtaining external feedback documents using ColBERT can significantly improve NDCG@10 for sparse retrieval (by up to 28%) and for dense retrieval (by up to 12%). In addition, using ANCE on the external corpus brings up to 30% NDCG@10 improvements for sparse retrieval and up to 29% for dense retrieval.
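
    To make the pipeline described above concrete, the sketch below illustrates one way dense external expansion can be realised for a single-representation dense retriever: the encoded query is used to retrieve pseudo-relevant feedback passages from an index of the external corpus (e.g. MSMARCO), the centroid of their embeddings is folded back into the query embedding in a Rocchio-style interpolation, and the refined embedding is then used to search the target corpus. The FAISS indexes, the random placeholder embeddings standing in for an encoder such as ANCE, and the interpolation weights are illustrative assumptions; this is a minimal sketch and does not reproduce the exact ColBERT-PRF or ANCE-PRF formulations evaluated in the article.

        # Minimal sketch of dense external expansion in embedding space (assumed
        # single-representation encoders; weights and placeholder data are illustrative).
        import numpy as np
        import faiss

        d = 768                      # embedding dimensionality (assumed)
        k_feedback = 3               # number of external feedback passages (assumed)
        alpha, beta = 1.0, 0.5       # query / feedback interpolation weights (assumed)

        # Pre-computed passage embeddings for the external corpus (e.g. MSMARCO)
        # and for the target corpus; random placeholders stand in for the output
        # of a dense encoder such as ANCE.
        external_emb = np.random.rand(10000, d).astype("float32")
        target_emb = np.random.rand(5000, d).astype("float32")

        external_index = faiss.IndexFlatIP(d)   # inner-product similarity search
        external_index.add(external_emb)
        target_index = faiss.IndexFlatIP(d)
        target_index.add(target_emb)

        def dense_external_expansion(query_emb: np.ndarray, k: int = 10):
            """Refine the query embedding with feedback from the external index,
            then retrieve from the target index with the refined embedding."""
            q = query_emb.reshape(1, d).astype("float32")
            # 1. Obtain pseudo-relevant passages from the external corpus.
            _, fb_ids = external_index.search(q, k_feedback)
            fb_centroid = external_emb[fb_ids[0]].mean(axis=0, keepdims=True)
            # 2. Rocchio-style interpolation in embedding space (assumed form).
            q_refined = alpha * q + beta * fb_centroid
            # 3. Final retrieval on the target corpus with the expanded query.
            scores, doc_ids = target_index.search(q_refined.astype("float32"), k)
            return doc_ids[0], scores[0]

        query_emb = np.random.rand(d)            # stand-in for an encoded query
        print(dense_external_expansion(query_emb, k=10))

    In the article's setting, the same idea is applied with ColBERT-PRF and ANCE-PRF on top of the external feedback passages, and the refined query representation is issued against the zero-shot target corpus.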

    Highlights

    Dense external expansion improves zero-shot retrieval performance.
    High-quality feedback documents can be retrieved using dense external expansion.
    Experimental results show significant improvements using ColBERT-PRF and ANCE-PRF external expansion.




          Published In

          Information Processing and Management: an International Journal, Volume 59, Issue 5 (September 2022), 730 pages

          Publisher

          Pergamon Press, Inc., United States


          Author Tags

          1. Query expansion
          2. Pseudo-relevance feedback
          3. Dense retrieval
          4. Information retrieval

          Qualifiers

          • Research-article
