DOI: 10.1145/3485447.3511955
Research article

Efficient Neural Ranking using Forward Indexes

Published: 25 April 2022
Abstract

    Neural document ranking approaches, specifically transformer models, have achieved impressive gains in ranking performance. However, query processing using such over-parameterized models is both resource and time intensive. In this paper, we propose the Fast-Forward index – a simple vector forward index that facilitates ranking documents using interpolation of lexical and semantic scores – as a replacement for contextual re-rankers and dense indexes based on nearest neighbor search. Fast-Forward indexes rely on efficient sparse models for retrieval and merely look up pre-computed dense transformer-based vector representations of documents and passages in constant time for fast CPU-based semantic similarity computation during query processing. We propose index pruning and theoretically grounded early stopping techniques to improve the query processing throughput. We conduct extensive large-scale experiments on TREC-DL datasets and show improvements over hybrid indexes in performance and query processing efficiency using only CPUs. Fast-Forward indexes can provide superior ranking performance using interpolation due to the complementary benefits of lexical and semantic similarities.
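
    The following is a minimal, illustrative sketch (not the authors' code) of the interpolation step the abstract describes: a sparse retriever supplies lexical scores for candidate documents, pre-computed dense document vectors are looked up in a forward index (a plain dictionary here), and the final score is the convex combination alpha * lexical + (1 - alpha) * semantic. The names forward_index, encode_query, fast_forward_rerank and the value of alpha are assumptions made for illustration only.

    ```python
    import numpy as np

    # Hypothetical forward index: document ID -> pre-computed dense vector.
    # In the paper these would be transformer-based document/passage embeddings
    # computed offline; random vectors stand in for them here.
    forward_index = {
        "d1": np.random.rand(768).astype(np.float32),
        "d2": np.random.rand(768).astype(np.float32),
        "d3": np.random.rand(768).astype(np.float32),
    }

    def encode_query(query: str) -> np.ndarray:
        """Placeholder for a dense query encoder (e.g. a BERT-style model)."""
        rng = np.random.default_rng(abs(hash(query)) % (2 ** 32))
        return rng.random(768, dtype=np.float32)

    def fast_forward_rerank(query, lexical_hits, alpha=0.5):
        """Interpolate lexical and semantic scores for sparse-retrieval candidates:
        score = alpha * lexical + (1 - alpha) * dot(query_vec, doc_vec).
        The document vector is a constant-time forward-index lookup, so no
        nearest-neighbor search is needed at query time."""
        q_vec = encode_query(query)
        scored = []
        for doc_id, lex_score in lexical_hits.items():
            d_vec = forward_index[doc_id]              # O(1) lookup
            sem_score = float(np.dot(q_vec, d_vec))    # CPU dot-product similarity
            scored.append((doc_id, alpha * lex_score + (1 - alpha) * sem_score))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    # Candidates and BM25-style scores from a sparse first-stage retriever (made up).
    print(fast_forward_rerank("neural ranking", {"d1": 12.3, "d2": 9.8, "d3": 7.1}))
    ```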





            Information

            Published In

            WWW '22: Proceedings of the ACM Web Conference 2022
            April 2022
            3764 pages
            ISBN:9781450390965
            DOI:10.1145/3485447
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States



            Author Tags

            1. dense
            2. interpolation
            3. ranking
            4. retrieval
            5. sparse

            Qualifiers

            • Research-article
            • Research
            • Refereed limited

            Funding Sources

            • BMBF, Germany
            • Horizon 2020, EU

            Conference

            WWW '22: The ACM Web Conference 2022
            April 25-29, 2022
            Virtual Event, Lyon, France

            Acceptance Rates

            Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


            Cited By

            • (2024) Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders. ACM Transactions on Information Systems 42(5), 1-34. DOI: 10.1145/3631939. Online publication date: 29-Apr-2024.
            • (2023) Extractive Explanations for Interpretable Text Ranking. ACM Transactions on Information Systems 41(4), 1-31. DOI: 10.1145/3576924. Online publication date: 23-Mar-2023.
            • (2023) Genetic Generative Information Retrieval. Proceedings of the ACM Symposium on Document Engineering 2023, 1-4. DOI: 10.1145/3573128.3609340. Online publication date: 22-Aug-2023.
            • (2023) Lexically-Accelerated Dense Retrieval. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 152-162. DOI: 10.1145/3539618.3591715. Online publication date: 19-Jul-2023.
            • (2023) Information Retrieval: Recent Advances and Beyond. IEEE Access 11, 76581-76604. DOI: 10.1109/ACCESS.2023.3295776. Online publication date: 2023.
            • (2023) An in-depth analysis of passage-level label transfer for contextual document ranking. Information Retrieval 26(1-2). DOI: 10.1007/s10791-023-09430-5. Online publication date: 8-Dec-2023.
            • (2023) Probing BERT for Ranking Abilities. Advances in Information Retrieval, 255-273. DOI: 10.1007/978-3-031-28238-6_17. Online publication date: 2-Apr-2023.
