Abstract
Doc2Query—the process of expanding the content of a document before indexing using a sequence-to-sequence model—has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to “hallucinating” content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 30% and cutting the index size by 48%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration (https://github.com/terrierteam/pyterrier_doc2query).
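The filtering idea described in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the released pyterrier_doc2query implementation: the relevance scores below are hard-coded placeholders for the output of a neural scoring model (e.g., monoT5 or ELECTRA), and the names `filter_expansions` and `keep_fraction` are invented for this example, with `keep_fraction` playing the role of the tuned filtering percentage p.

```python
# Doc2Query--: drop the lowest-scoring generated queries before indexing.
# Scores would normally come from a relevance model; here they are supplied
# directly so the sketch stays self-contained.

def filter_expansions(doc_queries, keep_fraction=0.7):
    """doc_queries: list of (query_text, relevance_score) for one document.
    Returns the query texts whose scores fall in the top keep_fraction,
    i.e. the bottom (1 - keep_fraction) of generated queries are removed."""
    ranked = sorted(doc_queries, key=lambda qs: qs[1], reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return [q for q, _ in ranked[:n_keep]]

def expand_document(doc_text, doc_queries, keep_fraction=0.7):
    """Append the retained queries to the document text, as in Doc2Query."""
    kept = filter_expansions(doc_queries, keep_fraction)
    return doc_text + " " + " ".join(kept)

queries = [
    ("what is bm25", 0.92),            # on-topic: kept
    ("ranking function definition", 0.78),
    ("weather in paris today", 0.05),  # hallucinated: filtered out
    ("okapi bm25 parameters", 0.64),
]
print(expand_document("BM25 is a ranking function...", queries, keep_fraction=0.75))
```

Note that the paper selects a score threshold over the whole corpus rather than a fixed per-document fraction as shown here; the essential point is the same — filtered-out queries never reach the inverted index, which is what shrinks the index and speeds up retrieval.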
Notes
- 1.
- 2. ir-datasets [21] IDs: msmarco-passage/dev/small, msmarco-passage/dev/2, msmarco-passage/eval/small, msmarco-passage/trec-dl-2019/judged, msmarco-passage/trec-dl-2020/judged.
- 3. BM25’s k1, b, and whether to remove stopwords were tuned for all systems; the filtering percentage (p) was also tuned for filtered systems.
- 4. crystina-z/monoELECTRA_LCE_nneg31.
- 5. castorini/monot5-base-msmarco.
- 6. castorini/tct_colbert-v2-hnp-msmarco.
- 7. Significance cannot be determined due to the held-out nature of the dataset. Further, due to restrictions on the number of submissions to the leaderboard, we are only able to submit two runs. The first aims to be a fair comparison with the existing Doc2Query Eval result, using the same number of generated queries and the same base T5 model for scoring. The second is our overall best-performing setting, using the ELECTRA filter at n = 80 generated queries.
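Note 3 above refers to the two free parameters of BM25 that were tuned alongside stopword removal. As a reminder of what those parameters control, here is a minimal, illustrative implementation of the standard BM25 weighting formula (using a common Lucene-style IDF variant; not the exact Terrier or PISA implementation used in the paper):

```python
import math

def bm25_term(tf, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    """BM25 weight for one query term in one document.
    k1 controls term-frequency saturation; b controls document-length
    normalisation (b=0 disables it entirely)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * dl / avgdl)
    return idf * tf * (k1 + 1) / norm

def bm25(query_terms, doc_tfs, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    """Sum of per-term BM25 weights over the terms of a query.
    doc_tfs maps term -> frequency in the document; df maps term ->
    document frequency in the collection."""
    return sum(
        bm25_term(doc_tfs.get(t, 0), df.get(t, 1), n_docs, dl, avgdl, k1, b)
        for t in query_terms
    )
```

Because Doc2Query appends generated queries to documents before indexing, it changes both term frequencies and document lengths, which is why re-tuning k1 and b per system (as in note 3) matters for a fair comparison.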
References
Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)
Bonifacio, L., Abonizio, H., Fadaee, M., Nogueira, R.: InPars: unsupervised dataset generation for information retrieval. In: Proceedings of SIGIR (2022)
Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of SIGIR (2019)
Dai, Z., Callan, J.: Context-aware document term weighting for ad-hoc search. In: Proceedings of the Web Conference (2020)
Das, R., Dhuliawala, S., Zaheer, M., McCallum, A.: Multi-step retriever-reader interaction for scalable open-domain question answering. In: Proceedings of ICLR (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT (2019)
Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: Proceedings of SIGIR (2011)
Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: Proceedings of CHI (1988)
Efron, M., Organisciak, P., Fenlon, K.: Improving retrieval of short texts through document expansion. In: Proceedings of SIGIR (2012)
Formal, T., Piwowarski, B., Clinchant, S.: SPLADE: sparse lexical and expansion model for first stage ranking. In: Proceedings of SIGIR (2021)
He, B., Ounis, I.: Studying query expansion effectiveness. In: Proceedings of ECIR (2009)
Jaleel, N.A., et al.: UMass at TREC 2004: novelty and HARD. In: TREC (2004)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(03), 535–547 (2021)
Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of SIGIR (2020)
Lin, J., Ma, X., Mackenzie, J., Mallia, A.: On the separation of logical and physical ranking models for text retrieval applications. In: Proceedings of DESIRES (2021)
Lin, S.C., Yang, J.H., Lin, J.: In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In: Proceedings of RepL4NLP (2021)
MacAvaney, S., Macdonald, C.: A Python interface to PISA! In: Proceedings of SIGIR (2022)
MacAvaney, S., Macdonald, C., Ounis, I.: Streamlining evaluation with ir-measures. In: Proceedings of ECIR (2022)
MacAvaney, S., Nardini, F.M., Perego, R., Tonellotto, N., Goharian, N., Frieder, O.: Expansion via prediction of importance with contextualization. In: Proceedings of SIGIR (2020)
MacAvaney, S., Tonellotto, N., Macdonald, C.: Adaptive re-ranking with a corpus graph. In: Proceedings of CIKM (2022)
MacAvaney, S., Yates, A., Feldman, S., Downey, D., Cohan, A., Goharian, N.: Simplified data wrangling with ir_datasets. In: Proceedings of SIGIR (2021)
Macdonald, C., Tonellotto, N.: Declarative experimentation in information retrieval using PyTerrier. In: Proceedings of ICTIR (2020)
Mallia, A., Khattab, O., Suel, T., Tonellotto, N.: Learning passage impacts for inverted indexes. In: Proceedings of SIGIR (2021)
Mallia, A., Siedlaczek, M., Mackenzie, J., Suel, T.: PISA: performant indexes and search for academia. In: Proceedings of OSIRRC@SIGIR (2019)
Maynez, J., Narayan, S., Bohnet, B., McDonald, R.: On faithfulness and factuality in abstractive summarization. In: Proceedings of ACL (2020)
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: a human generated machine reading comprehension dataset. In: Proceedings of CoCo@NIPS (2016)
Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv abs/1901.04085 (2019)
Nogueira, R., Lin, J.: From doc2query to docTTTTTquery (2019)
Nogueira, R., Yang, W., Lin, J.J., Cho, K.: Document expansion by query prediction. arXiv abs/1904.08375 (2019)
Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and expansion. In: Proceedings of CIKM (2010)
Pradeep, R., Liu, Y., Zhang, X., Li, Y., Yates, A., Lin, J.: Squeezing water from a stone: a bag of tricks for further improving cross-encoder effectiveness for reranking. In: Proceedings of ECIR (2022)
Pradeep, R., Nogueira, R., Lin, J.: The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. arXiv abs/2101.05667 (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 5485–5551 (2020)
Scells, H., Zhuang, S., Zuccon, G.: Reduce, reuse, recycle: green information retrieval research. In: Proceedings of SIGIR (2022)
Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with document expansion. In: Proceedings of HLT-NAACL (2006)
Wang, X., MacAvaney, S., Macdonald, C., Ounis, I.: An inspection of the reproducibility and replicability of TCT-ColBERT. In: Proceedings of SIGIR (2022)
Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: Proceedings of ICLR (2021)
Yu, S.Y., Liu, J., Yang, J., Xiong, C., Bennett, P.N., Gao, J., Liu, Z.: Few-shot generative conversational query rewriting. In: Proceedings of SIGIR (2020)
Zhao, T., Lu, X., Lee, K.: SPARTA: efficient open-domain question answering via sparse transformer matching retrieval. arXiv abs/2009.13013 (2020)
Zhuang, S., Zuccon, G.: TILDE: term independent likelihood model for passage re-ranking. In: Proceedings of SIGIR (2021)
Acknowledgements
Sean MacAvaney and Craig Macdonald acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gospodinov, M., MacAvaney, S., Macdonald, C. (2023). Doc2Query–: When Less is More. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_31
DOI: https://doi.org/10.1007/978-3-031-28238-6_31
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28237-9
Online ISBN: 978-3-031-28238-6
eBook Packages: Computer Science, Computer Science (R0)