Improving zero-shot retrieval using dense external expansion

Published: 01 September 2022
    Abstract

    Pseudo-relevance feedback (PRF) is a classical technique for improving search engine retrieval effectiveness by closing the vocabulary gap between users’ query formulations and the relevant documents. While PRF is typically applied on the same target corpus as the final retrieval, external expansion techniques have in the past been applied to obtain a high-quality pseudo-relevant feedback set from an external corpus. However, such external expansion approaches have only been studied for sparse (BoW) retrieval methods, and their effectiveness for recent dense retrieval methods remains under-investigated. Indeed, dense retrieval approaches such as ANCE and ColBERT, which conduct similarity search over encoded contextualised query and document embeddings, are of increasing importance. Moreover, pseudo-relevance feedback mechanisms have been proposed to further enhance dense retrieval effectiveness. In this work, we examine the application of dense external expansion to improve zero-shot retrieval effectiveness, i.e. evaluation on corpora without further training. Zero-shot retrieval experiments on six datasets, including two TREC datasets and four BEIR datasets, using the MSMARCO passage collection as the external corpus, indicate that obtaining external feedback documents using ColBERT can significantly improve NDCG@10 for sparse retrieval (by up to 28%) and for dense retrieval (by up to 12%). In addition, using ANCE on the external corpus brings up to 30% NDCG@10 improvements for sparse retrieval and up to 29% for dense retrieval.
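
    To make the pipeline described above concrete, the sketch below illustrates one way dense external expansion can be realised for a single-representation dense retriever: the encoded query is used to retrieve pseudo-relevant feedback passages from an index of the external corpus (e.g. MSMARCO), the centroid of their embeddings is folded back into the query embedding in a Rocchio-style interpolation, and the refined embedding is then used to search the target corpus. The FAISS indexes, the random placeholder embeddings standing in for an encoder such as ANCE, and the interpolation weights are illustrative assumptions; this is a minimal sketch and does not reproduce the exact ColBERT-PRF or ANCE-PRF formulations evaluated in the article.

        # Minimal sketch of dense external expansion in embedding space (assumed
        # single-representation encoders; weights and placeholder data are illustrative).
        import numpy as np
        import faiss

        d = 768                      # embedding dimensionality (assumed)
        k_feedback = 3               # number of external feedback passages (assumed)
        alpha, beta = 1.0, 0.5       # query / feedback interpolation weights (assumed)

        # Pre-computed passage embeddings for the external corpus (e.g. MSMARCO)
        # and for the target corpus; random placeholders stand in for the output
        # of a dense encoder such as ANCE.
        external_emb = np.random.rand(10000, d).astype("float32")
        target_emb = np.random.rand(5000, d).astype("float32")

        external_index = faiss.IndexFlatIP(d)   # inner-product similarity search
        external_index.add(external_emb)
        target_index = faiss.IndexFlatIP(d)
        target_index.add(target_emb)

        def dense_external_expansion(query_emb: np.ndarray, k: int = 10):
            """Refine the query embedding with feedback from the external index,
            then retrieve from the target index with the refined embedding."""
            q = query_emb.reshape(1, d).astype("float32")
            # 1. Obtain pseudo-relevant passages from the external corpus.
            _, fb_ids = external_index.search(q, k_feedback)
            fb_centroid = external_emb[fb_ids[0]].mean(axis=0, keepdims=True)
            # 2. Rocchio-style interpolation in embedding space (assumed form).
            q_refined = alpha * q + beta * fb_centroid
            # 3. Final retrieval on the target corpus with the expanded query.
            scores, doc_ids = target_index.search(q_refined.astype("float32"), k)
            return doc_ids[0], scores[0]

        query_emb = np.random.rand(d)            # stand-in for an encoded query
        print(dense_external_expansion(query_emb, k=10))

    In the article's setting, the same idea is applied with ColBERT-PRF and ANCE-PRF on top of the external feedback passages, and the refined query representation is issued against the zero-shot target corpus.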

    Highlights

    Dense external expansion improves zero-shot retrieval performance.
    High-quality feedback documents can be retrieved using dense external expansion.
    Experimental results show significant improvements using ColBERT-PRF and ANCE-PRF external expansion.




          Published In

          Information Processing and Management: an International Journal, Volume 59, Issue 5 (September 2022), 730 pages

          Publisher

          Pergamon Press, Inc., United States


          Author Tags

          1. Query expansion
          2. Pseudo-relevance feedback
          3. Dense retrieval
          4. Information retrieval

          Qualifiers

          • Research-article
