Research article
DOI: 10.1145/3626772.3657850

Revisiting Document Expansion and Filtering for Effective First-Stage Retrieval

Published: 11 July 2024

Abstract

Document expansion is a technique that aims to reduce the likelihood of term mismatch by augmenting documents with related terms or queries. Doc2Query minus minus (Doc2Query--) represents an extension to the expansion process that uses a neural model to identify and remove expansions that may not be relevant to the given document, thereby increasing the quality of the ranking while simultaneously reducing the amount of augmented data. In this work, we conduct a detailed reproducibility study of Doc2Query-- to better understand the trade-offs inherent to document expansion and filtering mechanisms. After successfully reproducing the best-performing method from the Doc2Query-- family, we show that filtering actually harms recall-based metrics on various test collections. Next, we explore whether the two-stage "generate-then-filter" process can be replaced with a single generation phase via reinforcement learning. Finally, we extend our experimentation to learned sparse retrieval models and demonstrate that filtering is not helpful when term weights can be learned. Overall, our work provides a deeper understanding of the behaviour and characteristics of common document expansion mechanisms, and paves the way for developing more efficient yet effective augmentation models.
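As a rough illustration of the "generate-then-filter" pipeline discussed above, the sketch below expands a single document with a doc2query-style T5 generator and then filters the generated queries with a relevance scorer before appending the survivors to the document. The checkpoint names, the scoring model, and the retention rule are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
# Minimal sketch of generate-then-filter document expansion.
# Model names and the keep_fraction cut-off are assumptions for illustration.
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import CrossEncoder

doc = ("Document expansion appends predicted queries to a document so that "
       "lexical retrieval can match terms users actually type.")

# Step 1: generate candidate expansion queries with a doc2query-style T5 model
# (assumed checkpoint).
gen_name = "doc2query/msmarco-t5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(gen_name)
generator = T5ForConditionalGeneration.from_pretrained(gen_name)
inputs = tokenizer(doc, return_tensors="pt", truncation=True)
outputs = generator.generate(
    **inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=20
)
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Step 2: filter. Score each (query, document) pair with a relevance model and
# keep only the highest-scoring fraction; Doc2Query-- style filtering discards
# the rest rather than indexing them.
scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed filter model
scores = scorer.predict([(q, doc) for q in queries])
keep_fraction = 0.3  # illustrative cut-off; the paper studies this trade-off
ranked = sorted(zip(queries, scores), key=lambda x: x[1], reverse=True)
kept = [q for q, _ in ranked[: max(1, int(keep_fraction * len(ranked)))]]

# Step 3: append the surviving queries to the document text before indexing.
expanded_doc = doc + " " + " ".join(kept)
print(expanded_doc)
```

The retained fraction is the key efficiency/effectiveness knob here: indexing fewer expansions shrinks the augmented data, but, as the abstract notes, filtering can harm recall-based metrics on several test collections.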

Published In

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2024
3164 pages
ISBN: 9798400704314
DOI: 10.1145/3626772
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document expansion
  2. query filtering
  3. reproducibility

Qualifiers

  • Research-article

Conference

SIGIR 2024

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)

Article Metrics

  • Total citations: 0
  • Total downloads: 275
  • Downloads (last 12 months): 275
  • Downloads (last 6 weeks): 52

Reflects downloads up to 16 Feb 2025
