Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

Published: 25 July 2020
DOI: 10.1145/3397271.3401404

Abstract

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures, building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems span a wide range of implementations, features, and research goals. Overall, we recommend CIFF as a low-effort approach to supporting independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
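To make the second approach concrete, the sketch below reads a CIFF export as a stream of length-delimited protocol buffer messages: a single header describing the collection, followed by one postings list per term. This is a minimal sketch under stated assumptions, not the reference implementation: the class and accessor names (CommonIndexFileFormat.Header, getNumPostingsLists(), and so on) are assumed to match code generated by protoc from the CIFF .proto definition, and the authoritative schema lives in the CIFF repository.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Minimal sketch of walking a CIFF export. CommonIndexFileFormat is assumed
// to be the class protoc generates from the CIFF .proto definition; the
// message and accessor names below are illustrative assumptions.
public class CiffDump {
  public static void main(String[] args) throws Exception {
    try (InputStream in =
             new BufferedInputStream(new FileInputStream(args[0]))) {
      // A CIFF export begins with one length-delimited Header message
      // recording collection statistics and how many postings lists follow.
      CommonIndexFileFormat.Header header =
          CommonIndexFileFormat.Header.parseDelimitedFrom(in);
      System.out.printf("docs: %d, postings lists: %d%n",
          header.getNumDocs(), header.getNumPostingsLists());

      // The header is followed by one PostingsList message per term:
      // the term string, its document frequency, and its postings
      // (docids stored as gaps from the previous docid).
      for (int i = 0; i < header.getNumPostingsLists(); i++) {
        CommonIndexFileFormat.PostingsList pl =
            CommonIndexFileFormat.PostingsList.parseDelimitedFrom(in);
        System.out.printf("%s\tdf=%d%n", pl.getTerm(), pl.getDf());
      }
    }
  }
}
```

Because the payload is plain protocol buffers, any of the systems named above can in principle emit or ingest the same stream without sharing index code, which is what makes CIFF a low-effort interoperability point.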



Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. effectiveness and efficiency evaluation
  2. inverted indexes
  3. protocol buffers

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%

