
Doc2Query–: When Less is More

  • Conference paper

Advances in Information Retrieval (ECIR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13981)

Abstract

Doc2Query—the process of expanding the content of a document before indexing using a sequence-to-sequence model—has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to “hallucinating” content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 30% and cutting the index size by 48%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration (https://github.com/terrierteam/pyterrier_doc2query).
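The filtering step summarised above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the toy term-overlap scorer stands in for the neural relevance models the paper uses (e.g., ELECTRA or monoT5), and the function and parameter names (`filter_generated_queries`, `p`) are illustrative.

```python
def term_overlap(doc_text, query):
    # Toy stand-in for a neural relevance model: the paper scores each
    # generated query against its source passage with models like ELECTRA.
    return len(set(doc_text.split()) & set(query.split()))

def filter_generated_queries(docs_to_queries, score, p=0.3):
    """Drop the lowest-scoring fraction p of generated queries corpus-wide;
    the survivors are appended to their source documents before indexing.

    docs_to_queries maps doc_id -> (doc_text, [generated queries]).
    """
    scored = [
        (score(doc_text, q), doc_id, q)
        for doc_id, (doc_text, queries) in docs_to_queries.items()
        for q in queries
    ]
    scored.sort(key=lambda t: t[0])        # ascending by relevance score
    kept = scored[int(len(scored) * p):]   # discard the bottom p fraction
    expanded = {doc_id: [] for doc_id in docs_to_queries}
    for _, doc_id, q in kept:
        expanded[doc_id].append(q)
    return expanded
```

Because low-scoring (likely hallucinated) queries never reach the index, the index is smaller and query processing touches fewer postings, which is the source of the speed and space savings reported above.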


Notes

  1. For instance, we find that SPLADE [10] generates the following seemingly-unrelated terms for the passage in Fig. 1 in the top 20 expansion terms: reed, herb, and troy.

  2. ir-datasets [21] IDs: msmarco-passage/dev/small, msmarco-passage/dev/2, msmarco-passage/eval/small, msmarco-passage/trec-dl-2019/judged, msmarco-passage/trec-dl-2020/judged.

  3. BM25’s k1, b, and whether to remove stopwords were tuned for all systems; the filtering percentage (p) was also tuned for filtered systems.

  4. crystina-z/monoELECTRA_LCE_nneg31.

  5. castorini/monot5-base-msmarco.

  6. castorini/tct_colbert-v2-hnp-msmarco.

  7. Significance cannot be determined due to the held-out nature of the dataset. Further, due to restrictions on the number of submissions to the leaderboard, we are only able to submit two runs. The first aims to be a fair comparison with the existing Doc2Query Eval result, using the same number of generated queries and the same base T5 model for scoring. The second is our overall best-performing setting, using the ELECTRA filter at \(n=80\) generated queries.
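The exhaustive tuning described in note 3 amounts to a grid sweep over BM25's parameters and the filtering percentage. A minimal sketch, assuming a caller-supplied `evaluate` function that indexes and retrieves under the given settings and returns a dev-set effectiveness score (all names here are illustrative, not from the released code):

```python
import itertools

def tune(evaluate, k1_grid, b_grid, stopword_opts, p_grid):
    # Exhaustively sweep BM25's k1 and b, whether to remove stopwords,
    # and the filtering percentage p; return the best-scoring setting.
    return max(
        itertools.product(k1_grid, b_grid, stopword_opts, p_grid),
        key=lambda cfg: evaluate(*cfg),
    )
```

In practice each `evaluate` call rebuilds or re-queries an index, so the grids are kept coarse; the filtering percentage p only applies to the filtered systems.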

References

  1. Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)

  2. Bonifacio, L., Abonizio, H., Fadaee, M., Nogueira, R.: InPars: unsupervised dataset generation for information retrieval. In: Proceedings of SIGIR (2022)

  3. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of SIGIR (2019)

  4. Dai, Z., Callan, J.: Context-aware document term weighting for ad-hoc search. In: Proceedings of the Web Conference (2020)

  5. Das, R., Dhuliawala, S., Zaheer, M., McCallum, A.: Multi-step retriever-reader interaction for scalable open-domain question answering. In: Proceedings of ICLR (2019)

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT (2019)

  7. Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: Proceedings of SIGIR (2011)

  8. Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: Proceedings of SIGCHI CHI (1988)

  9. Efron, M., Organisciak, P., Fenlon, K.: Improving retrieval of short texts through document expansion. In: Proceedings of SIGIR (2012)

  10. Formal, T., Piwowarski, B., Clinchant, S.: SPLADE: sparse lexical and expansion model for first stage ranking. In: Proceedings of SIGIR (2021)

  11. He, B., Ounis, I.: Studying query expansion effectiveness. In: Proceedings of ECIR (2009)

  12. Jaleel, N.A., et al.: UMass at TREC 2004: novelty and HARD. In: TREC (2004)

  13. Johnson, J., Douze, M., Jegou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(03), 535–547 (2021)

  14. Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of SIGIR (2020)

  15. Lin, J., Ma, X., Mackenzie, J., Mallia, A.: On the separation of logical and physical ranking models for text retrieval applications. In: Proceedings of DESIRES (2021)

  16. Lin, S.C., Yang, J.H., Lin, J.: In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In: Proceedings of RepL4NLP (2021)

  17. MacAvaney, S., Macdonald, C.: A python interface to PISA! In: Proceedings of SIGIR (2022)

  18. MacAvaney, S., Macdonald, C., Ounis, I.: Streamlining evaluation with ir-measures. In: Proceedings of ECIR (2022)

  19. MacAvaney, S., Nardini, F.M., Perego, R., Tonellotto, N., Goharian, N., Frieder, O.: Expansion via prediction of importance with contextualization. In: Proceedings of SIGIR (2020)

  20. MacAvaney, S., Tonellotto, N., Macdonald, C.: Adaptive re-ranking with a corpus graph. In: Proceedings of CIKM (2022)

  21. MacAvaney, S., Yates, A., Feldman, S., Downey, D., Cohan, A., Goharian, N.: Simplified data wrangling with ir_datasets. In: Proceedings of SIGIR (2021)

  22. Macdonald, C., Tonellotto, N.: Declarative experimentation in information retrieval using PyTerrier. In: Proceedings of ICTIR (2020)

  23. Mallia, A., Khattab, O., Suel, T., Tonellotto, N.: Learning passage impacts for inverted indexes. In: Proceedings of SIGIR (2021)

  24. Mallia, A., Siedlaczek, M., Mackenzie, J., Suel, T.: PISA: performant indexes and search for academia. In: Proceedings of OSIRRC@SIGIR (2019)

  25. Maynez, J., Narayan, S., Bohnet, B., McDonald, R.: On faithfulness and factuality in abstractive summarization. In: Proceedings of ACL (2020)

  26. Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: a human generated machine reading comprehension dataset. In: Proceedings of CoCo@NIPS (2016)

  27. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv abs/1901.04085 (2019)

  28. Nogueira, R., Lin, J.: From doc2query to docTTTTTquery (2019)

  29. Nogueira, R., Yang, W., Lin, J.J., Cho, K.: Document expansion by query prediction. arXiv abs/1904.08375 (2019)

  30. Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and expansion. In: Proceedings of CIKM (2010)

  31. Pradeep, R., Liu, Y., Zhang, X., Li, Y., Yates, A., Lin, J.: Squeezing water from a stone: a bag of tricks for further improving cross-encoder effectiveness for reranking. In: Proceedings of ECIR (2022)

  32. Pradeep, R., Nogueira, R., Lin, J.: The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. arXiv abs/2101.05667 (2021)

  33. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 5485–5551 (2020)

  34. Scells, H., Zhuang, S., Zuccon, G.: Reduce, reuse, recycle: green information retrieval research. In: Proceedings of SIGIR (2022)

  35. Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with document expansion. In: Proceedings of HLT-NAACL (2006)

  36. Wang, X., MacAvaney, S., Macdonald, C., Ounis, I.: An inspection of the reproducibility and replicability of TCT-ColBERT. In: Proceedings of SIGIR (2022)

  37. Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: Proceedings of ICLR (2021)

  38. Yu, S.Y., Liu, J., Yang, J., Xiong, C., Bennett, P.N., Gao, J., Liu, Z.: Few-shot generative conversational query rewriting. In: Proceedings of SIGIR (2020)

  39. Zhao, T., Lu, X., Lee, K.: SPARTA: efficient open-domain question answering via sparse transformer matching retrieval. arXiv abs/2009.13013 (2020)

  40. Zhuang, S., Zuccon, G.: TILDE: term independent likelihood model for passage re-ranking. In: Proceedings of SIGIR (2021)

Acknowledgements

Sean MacAvaney and Craig Macdonald acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.

Author information

Corresponding author

Correspondence to Sean MacAvaney.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gospodinov, M., MacAvaney, S., Macdonald, C. (2023). Doc2Query–: When Less is More. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_31

  • DOI: https://doi.org/10.1007/978-3-031-28238-6_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28237-9

  • Online ISBN: 978-3-031-28238-6

  • eBook Packages: Computer Science, Computer Science (R0)
