Abstract
Doc2Query—the process of expanding the content of a document before indexing using a sequence-to-sequence model—has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to “hallucinating” content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 30% and cutting the index size by 48%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration (https://github.com/terrierteam/pyterrier_doc2query).
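The filtering idea described in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the released pyterrier_doc2query implementation: the relevance scores below are hard-coded placeholders for the output of a neural scoring model (e.g., monoT5 or ELECTRA), and the names `filter_expansions` and `keep_fraction` are invented for this example, with `keep_fraction` playing the role of the tuned filtering percentage p.

```python
# Doc2Query--: drop the lowest-scoring generated queries before indexing.
# Scores would normally come from a relevance model; here they are supplied
# directly so the sketch stays self-contained.

def filter_expansions(doc_queries, keep_fraction=0.7):
    """doc_queries: list of (query_text, relevance_score) for one document.
    Returns the query texts whose scores fall in the top keep_fraction,
    i.e. the bottom (1 - keep_fraction) of generated queries are removed."""
    ranked = sorted(doc_queries, key=lambda qs: qs[1], reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return [q for q, _ in ranked[:n_keep]]

def expand_document(doc_text, doc_queries, keep_fraction=0.7):
    """Append the retained queries to the document text, as in Doc2Query."""
    kept = filter_expansions(doc_queries, keep_fraction)
    return doc_text + " " + " ".join(kept)

queries = [
    ("what is bm25", 0.92),            # on-topic: kept
    ("ranking function definition", 0.78),
    ("weather in paris today", 0.05),  # hallucinated: filtered out
    ("okapi bm25 parameters", 0.64),
]
print(expand_document("BM25 is a ranking function...", queries, keep_fraction=0.75))
```

Note that the paper selects a score threshold over the whole corpus rather than a fixed per-document fraction as shown here; the essential point is the same — filtered-out queries never reach the inverted index, which is what shrinks the index and speeds up retrieval.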
Notes
- 1.
- 2. ir-datasets [21] IDs: msmarco-passage/dev/small, msmarco-passage/dev/2, msmarco-passage/eval/small, msmarco-passage/trec-dl-2019/judged, msmarco-passage/trec-dl-2020/judged.
- 3. BM25’s k1, b, and whether to remove stopwords were tuned for all systems; the filtering percentage (p) was also tuned for filtered systems.
- 4. crystina-z/monoELECTRA_LCE_nneg31.
- 5. castorini/monot5-base-msmarco.
- 6. castorini/tct_colbert-v2-hnp-msmarco.
- 7. Significance cannot be determined due to the held-out nature of the dataset. Further, due to restrictions on the number of submissions to the leaderboard, we are only able to submit two runs. The first aims to be a fair comparison with the existing Doc2Query Eval result, using the same number of generated queries and the same base T5 model for scoring. The second is our overall best-performing setting, using the ELECTRA filter at n = 80 generated queries.
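Note 3 above refers to the two free parameters of BM25 that were tuned alongside stopword removal. As a reminder of what those parameters control, here is a minimal, illustrative implementation of the standard BM25 weighting formula (using a common Lucene-style IDF variant; not the exact Terrier or PISA implementation used in the paper):

```python
import math

def bm25_term(tf, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    """BM25 weight for one query term in one document.
    k1 controls term-frequency saturation; b controls document-length
    normalisation (b=0 disables it entirely)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * dl / avgdl)
    return idf * tf * (k1 + 1) / norm

def bm25(query_terms, doc_tfs, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    """Sum of per-term BM25 weights over the terms of a query.
    doc_tfs maps term -> frequency in the document; df maps term ->
    document frequency in the collection."""
    return sum(
        bm25_term(doc_tfs.get(t, 0), df.get(t, 1), n_docs, dl, avgdl, k1, b)
        for t in query_terms
    )
```

Because Doc2Query appends generated queries to documents before indexing, it changes both term frequencies and document lengths, which is why re-tuning k1 and b per system (as in note 3) matters for a fair comparison.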
References
Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)
Bonifacio, L., Abonizio, H., Fadaee, M., Nogueira, R.: InPars: unsupervised dataset generation for information retrieval. In: Proceedings of SIGIR (2022)
Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of SIGIR (2019)
Dai, Z., Callan, J.: Context-aware document term weighting for ad-hoc search. In: Proceedings of the Web Conference (2020)
Das, R., Dhuliawala, S., Zaheer, M., McCallum, A.: Multi-step retriever-reader interaction for scalable open-domain question answering. In: Proceedings of ICLR (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT (2019)
Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: Proceedings of SIGIR (2011)
Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: Proceedings of CHI (1988)
Efron, M., Organisciak, P., Fenlon, K.: Improving retrieval of short texts through document expansion. In: Proceedings of SIGIR (2012)
Formal, T., Piwowarski, B., Clinchant, S.: SPLADE: sparse lexical and expansion model for first stage ranking. In: Proceedings of SIGIR (2021)
He, B., Ounis, I.: Studying query expansion effectiveness. In: Proceedings of ECIR (2009)
Jaleel, N.A., et al.: UMass at TREC 2004: novelty and HARD. In: TREC (2004)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(03), 535–547 (2021)
Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of SIGIR (2020)
Lin, J., Ma, X., Mackenzie, J., Mallia, A.: On the separation of logical and physical ranking models for text retrieval applications. In: Proceedings of DESIRES (2021)
Lin, S.C., Yang, J.H., Lin, J.: In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In: Proceedings of RepL4NLP (2021)
MacAvaney, S., Macdonald, C.: A Python interface to PISA! In: Proceedings of SIGIR (2022)
MacAvaney, S., Macdonald, C., Ounis, I.: Streamlining evaluation with ir-measures. In: Proceedings of ECIR (2022)
MacAvaney, S., Nardini, F.M., Perego, R., Tonellotto, N., Goharian, N., Frieder, O.: Expansion via prediction of importance with contextualization. In: Proceedings of SIGIR (2020)
MacAvaney, S., Tonellotto, N., Macdonald, C.: Adaptive re-ranking with a corpus graph. In: Proceedings of CIKM (2022)
MacAvaney, S., Yates, A., Feldman, S., Downey, D., Cohan, A., Goharian, N.: Simplified data wrangling with ir_datasets. In: Proceedings of SIGIR (2021)
Macdonald, C., Tonellotto, N.: Declarative experimentation in information retrieval using PyTerrier. In: Proceedings of ICTIR (2020)
Mallia, A., Khattab, O., Suel, T., Tonellotto, N.: Learning passage impacts for inverted indexes. In: Proceedings of SIGIR (2021)
Mallia, A., Siedlaczek, M., Mackenzie, J., Suel, T.: PISA: performant indexes and search for academia. In: Proceedings of OSIRRC@SIGIR (2019)
Maynez, J., Narayan, S., Bohnet, B., McDonald, R.: On faithfulness and factuality in abstractive summarization. In: Proceedings of ACL (2020)
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: a human generated machine reading comprehension dataset. In: Proceedings of CoCo@NIPS (2016)
Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv abs/1901.04085 (2019)
Nogueira, R., Lin, J.: From doc2query to docTTTTTquery (2019)
Nogueira, R., Yang, W., Lin, J.J., Cho, K.: Document expansion by query prediction. arXiv abs/1904.08375 (2019)
Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and expansion. In: Proceedings of CIKM (2010)
Pradeep, R., Liu, Y., Zhang, X., Li, Y., Yates, A., Lin, J.: Squeezing water from a stone: a bag of tricks for further improving cross-encoder effectiveness for reranking. In: Proceedings of ECIR (2022)
Pradeep, R., Nogueira, R., Lin, J.: The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. arXiv abs/2101.05667 (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 5485–5551 (2020)
Scells, H., Zhuang, S., Zuccon, G.: Reduce, reuse, recycle: green information retrieval research. In: Proceedings of SIGIR (2022)
Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with document expansion. In: Proceedings of HLT-NAACL (2006)
Wang, X., MacAvaney, S., Macdonald, C., Ounis, I.: An inspection of the reproducibility and replicability of TCT-ColBERT. In: Proceedings of SIGIR (2022)
Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: Proceedings of ICLR (2021)
Yu, S.Y., Liu, J., Yang, J., Xiong, C., Bennett, P.N., Gao, J., Liu, Z.: Few-shot generative conversational query rewriting. In: Proceedings of SIGIR (2020)
Zhao, T., Lu, X., Lee, K.: SPARTA: efficient open-domain question answering via sparse transformer matching retrieval. arXiv abs/2009.13013 (2020)
Zhuang, S., Zuccon, G.: TILDE: term independent likelihood model for passage re-ranking. In: Proceedings of SIGIR (2021)
Acknowledgements
Sean MacAvaney and Craig Macdonald acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gospodinov, M., MacAvaney, S., Macdonald, C. (2023). Doc2Query–: When Less is More. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_31
DOI: https://doi.org/10.1007/978-3-031-28238-6_31
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28237-9
Online ISBN: 978-3-031-28238-6
eBook Packages: Computer Science, Computer Science (R0)