DOI: 10.1145/3404835.3463238
short-paper
Open access

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

Published: 11 July 2021

Abstract

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.
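The three retrieval modes named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not Pyserini's API: the function names, the toy corpus, the hand-written vectors standing in for transformer encodings, and the weighted-sum fusion with weight `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus; ids and texts are made up for illustration.
DOCS = {
    "d1": "bm25 ranks documents with bag of words statistics",
    "d2": "dense retrieval encodes text with a transformer",
    "d3": "hybrid retrieval combines sparse and dense scores",
}

# Hand-written stand-ins for transformer embeddings (hypothetical values).
DOC_VECS = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.2],
    "d3": [0.5, 0.5, 0.6],
}


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Sparse retrieval: BM25 over whitespace-tokenized bags of words."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def dense_scores(query_vec, doc_vecs):
    """Dense retrieval: brute-force inner product against every document vector."""
    return {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }


def hybrid_scores(sparse, dense, alpha=0.5):
    """Hybrid retrieval (assumed fusion): dense score plus alpha times sparse score."""
    return {doc_id: dense[doc_id] + alpha * sparse.get(doc_id, 0.0) for doc_id in dense}


def rank(scores):
    """Return document ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    sparse = bm25_scores("sparse and dense retrieval", DOCS)
    dense = dense_scores([0.2, 0.8, 0.3], DOC_VECS)  # hypothetical query vector
    print("sparse:", rank(sparse))
    print("dense: ", rank(dense))
    print("hybrid:", rank(hybrid_scores(sparse, dense, alpha=0.5)))
```

In a real system the brute-force inner product would be replaced by a nearest-neighbor index and the vectors by encoder outputs; the sketch only shows how the two score types can be computed and combined.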


    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN: 9781450380379
    DOI: 10.1145/3404835
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. first-stage retrieval
    2. open-source search engine

    Qualifiers

    • Short-paper

    Conference

    SIGIR '21
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
