short-paper

Open access

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

Authors:

Sheng-Chieh Lin,

Jheng-Hong Yang,

Rodrigo NogueiraAuthors Info & Claims

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 2356 - 2362

https://doi.org/10.1145/3404835.3463238

Published: 11 July 2021 Publication History

Abstract

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.

References

[1]

Mart'in Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16). 265--283.

[2]

Zeynep Akkalyoncu Yilmaz, Charles L. A. Clarke, and Jimmy Lin. 2020. A Lightweight Environment for Learning Experimental IR Research Practices. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020). 2113--2116.

[3]

Zeynep Akkalyoncu Yilmaz, Shengjin Wang, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Applying BERT to Document Retrieval with Birch. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. Hong Kong, China, 19--24.

[4]

Jaime Arguello, Matt Crane, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2015. Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR). SIGIR Forum, Vol. 49, 2 (2015), 107--116.

Digital Library

[5]

Nima Asadi and Jimmy Lin. 2013. Effectiveness/Efficiency Tradeoffs for Candidate Generation in Multi-Stage Retrieval Architectures. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013). Dublin, Ireland, 997--1000.

Digital Library

[6]

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018).

[7]

Michael Bendersky, Honglei Zhuang, Ji Ma, Shuguang Han, Keith Hall, and Ryan McDonald. 2020. RRF102: Meeting the TREC-COVID Challenge with a 100+ Runs Ensemble. arXiv:2010.00200 (2020).

[8]

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization. arXiv:2006.09595 (2020).

[9]

Adrien Grand, Robert Muir, Jim Ferenczi, and Jimmy Lin. 2020. From MaxScore to Block-Max WAND: The Story of How Lucene Significantly Improved Query Evaluation Performance. In Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 2020). 20--27.

[10]

Sebastian Hofstatter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. 2020. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv:2010.02666 (2020).

[11]

Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. 2021. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv:2010.02666 (2021).

[12]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv:1702.08734 (2017).

[13]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769--6781.

[14]

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020). 39--48.

Digital Library

[15]

Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016). Padua, Italy, 408--420.

[16]

Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020 a. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467 (2020).

[17]

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2020 b. Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. arXiv:2010.11386 (2020).

[18]

Xueguang Ma, Kai Sun, Ronak Pradeep, and Jimmy Lin. 2021. A Replication Study of Dense Passage Retriever. arXiv:2104.05740 (2021).

[19]

Sean MacAvaney. 2020. OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020). Houston, Texas, 845--848.

Digital Library

[20]

Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman Cohan, and Nazli Goharian. 2021. Simplified Data Wrangling with textttir_datasets. arXiv:2103.02280 (2021).

[21]

Craig Macdonald, Richard McCreadie, Rodrygo L.T. Santos, and Iadh Ounis. 2012. From Puppy to Maturity: Experiences in Developing Terrier. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon.

[22]

Craig Macdonald and Nicola Tonellotto. 2020. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of the 2020 International Conference on the Theory of Information Retrieval (ICTIR 2020). 161--168.

Digital Library

[23]

Yu A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. Transactions on Pattern Analysis and Machine Intelligence, Vol. 42, 4 (2020), 824--836.

Digital Library

[24]

Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. In Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019): CEUR Workshop Proceedings Vol-2409. Paris, France, 50--56.

[25]

Irina Matveeva, Chris Burges, Timo Burkard, Andy Laucius, and Leon Wong. 2006. High Accuracy Retrieval with Multiple Nested Ranker. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006). Seattle, Washington, 437--444.

Digital Library

[26]

Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery .

[27]

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019 a. Multi-Stage Document Ranking with BERT. arXiv:1910.14424 (2019).

[28]

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019 b. Document Expansion by Query Prediction. arXiv:1904.08375 (2019).

[29]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems. 8024--8035.

Digital Library

[30]

Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, and Jimmy Lin. 2021 a. Scientific Claim Verification with VerT5erini. In Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis. 94--103.

[31]

Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, and Jimmy Lin. 2021 b. Vera: Prediction Techniques for Reducing Harmful Misinformation in Consumer Health Search. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021) .

Digital Library

[32]

Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021 c. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arXiv:2101.05667 (2021).

[33]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, Vol. 21, 140 (2020), 1--67.

[34]

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A Cascade Ranking Model for Efficient Ranked Retrieval. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011). Beijing, China, 105--114.

Digital Library

[35]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38--45.

[36]

Chenyan Xiong, Zhenghao Liu, Si Sun, Zhuyun Dai, Kaitao Zhang, Shi Yu, Zhiyuan Liu, Hoifung Poon, Jianfeng Gao, and Paul Bennett. 2020 a. CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search. arXiv:2011.01580 (2020).

[37]

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020 b. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. arXiv:2007.00808 (2020).

[38]

Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). Tokyo, Japan, 1253--1256.

Digital Library

[39]

Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible Ranking Baselines Using Lucene. Journal of Data and Information Quality, Vol. 10, 4 (2018), Article 16.

Digital Library

[40]

Andrew Yates, Siddhant Arora, Xinyu Zhang, Wei Yang, Kevin Martin Jose, and Jimmy Lin. 2020 a. Capreolus: A Toolkit for End-to-End Neural Ad Hoc Retrieval. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020). Houston, Texas, 861--864.

Digital Library

[41]

Andrew Yates, Kevin Martin Jose, Xinyu Zhang, and Jimmy Lin. 2020 b. Flexible IR Pipelines with Capreolus. In Proceedings of the 29th International Conference on Information and Knowledge Management (CIKM 2020). 3181--3188.

Digital Library

[42]

Yongze Yu, Jussi Karlgren, Hamed Bonab, Ann Clifton, Md Iftekhar Tanveer, and Rosie Jones. 2020. Spotify at the TREC 2020 Podcasts Track: Segment Retrieval. In Proceedings of the Twenty-Ninth Text REtrieval Conference (TREC 2020).

[43]

Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, and Jimmy Lin. 2020. Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset. In Proceedings of the First Workshop on Scholarly Document Processing. 31--41.

Cited By

Hu HFang MLiu J(2025)An intent-enhanced feedback extension model for code searchInformation and Software Technology10.1016/j.infsof.2024.107589177(107589)Online publication date: Jan-2025
https://doi.org/10.1016/j.infsof.2024.107589
Yang HLi SGonçalves T(2024)Enhancing Biomedical Question Answering with Large Language ModelsInformation10.3390/info1508049415:8(494)Online publication date: 19-Aug-2024
https://doi.org/10.3390/info15080494
Tang YZhang RGuo Jde Rijke MChen WCheng X(2024)Listwise Generative Retrieval Models via a Sequential Learning ProcessACM Transactions on Information Systems10.1145/365371242:5(1-31)Online publication date: 29-Apr-2024
https://dl.acm.org/doi/10.1145/3653712
Show More Cited By

Index Terms

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations
1. Information systems
  1. Information retrieval

Recommendations

Pseudo-Relevance Feedback with Dense Retrievers in Pyserini
ADCS '22: Proceedings of the 26th Australasian Document Computing Symposium

Transformer-based Dense Retrievers (DRs) are attracting extensive attention because of their effectiveness paired with high efficiency. In this context, few Pseudo-Relevance Feedback (PRF) methods applied to DRs have emerged. However, the absence of a ...
Neural Pseudo-Relevance Feedback Models for Sparse and Dense Retrieval
SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pseudo-relevance feedback mechanisms have long served as an effective technique to improve the retrieval effectiveness in information retrieval. Recently, large pre-trained language models, such as T5 and BERT, have shown a strong capacity to capture ...
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Its effectiveness depends on encoded embeddings to capture the semantics of queries and documents, a ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2021

2998 pages

ISBN:9781450380379

DOI:10.1145/3404835

General Chairs:
Fernando Diaz
(Google)
,
Chirag Shah
University of Washington
,
Torsten Suel
New York University
,
Program Chairs:
Pablo Castells
Universidad Autónoma de Madrid, Amazon
,
Rosie Jones
Spotify
,
Tetsuya Sakai
Waseda University

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 July 2021

Check for updates

Author Tags

Qualifiers

Short-paper

Funding Sources

Natural Sciences and Engineering Research Council of Canada
Waterloo-Huawei Joint Innovation Laboratory
Canada First Research Excellence Fund

Conference

SIGIR '21

Sponsor:

SIGIR

SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2021

Virtual Event, Canada

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

113
Total Citations
View Citations
3,504
Total Downloads

Downloads (Last 12 months)1,477
Downloads (Last 6 weeks)210

Reflects downloads up to 10 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Hu HFang MLiu J(2025)An intent-enhanced feedback extension model for code searchInformation and Software Technology10.1016/j.infsof.2024.107589177(107589)Online publication date: Jan-2025
https://doi.org/10.1016/j.infsof.2024.107589
Yang HLi SGonçalves T(2024)Enhancing Biomedical Question Answering with Large Language ModelsInformation10.3390/info1508049415:8(494)Online publication date: 19-Aug-2024
https://doi.org/10.3390/info15080494
Tang YZhang RGuo Jde Rijke MChen WCheng X(2024)Listwise Generative Retrieval Models via a Sequential Learning ProcessACM Transactions on Information Systems10.1145/365371242:5(1-31)Online publication date: 29-Apr-2024
https://dl.acm.org/doi/10.1145/3653712
Kolb T(2024)Enhancing Cross-Domain Recommender Systems with LLMs: Evaluating Bias and Beyond-Accuracy MeasuresProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3688027(1388-1394)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3640457.3688027
Liu QGuo GMao JDou ZWen JJiang HZhang XCao Z(2024)An Analysis on Matching Mechanisms and Token Pruning for Late-interaction ModelsACM Transactions on Information Systems10.1145/363981842:5(1-28)Online publication date: 29-Apr-2024
https://dl.acm.org/doi/10.1145/3639818
Zhao WLiu JRen RWen J(2024)Dense Text Retrieval Based on Pretrained Language Models: A SurveyACM Transactions on Information Systems10.1145/363787042:4(1-60)Online publication date: 9-Feb-2024
https://dl.acm.org/doi/10.1145/3637870
Leonhardt JMüller HRudra KKhosla MAnand AAnand A(2024)Efficient Neural Ranking Using Forward Indexes and Lightweight EncodersACM Transactions on Information Systems10.1145/363193942:5(1-34)Online publication date: 29-Apr-2024
https://dl.acm.org/doi/10.1145/3631939
Pathiyan Cherumanal STian LAbushaqra FMagnossão de Paula AJi KAli HHettiachchi DTrippas JScholer FSpina D(2024)Walert: Putting Conversational Information Seeking Knowledge into Action by Building and Evaluating a Large Language Model-Powered ChatbotProceedings of the 2024 Conference on Human Information Interaction and Retrieval10.1145/3627508.3638309(401-405)Online publication date: 10-Mar-2024
https://dl.acm.org/doi/10.1145/3627508.3638309
Li MZhuang HHui KQin ZLin JJagerman RWang XBendersky MHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers?Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657979(2321-2326)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657979
Salemi AZamani HHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Evaluating Retrieval Quality in Retrieval-Augmented GenerationProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657957(2395-2400)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657957
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents