Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

Published: 25 July 2020
DOI: 10.1145/3397271.3401404

Abstract

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures, building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems span a wide range of implementations, features, and research goals. Overall, we recommend CIFF as a low-effort approach to supporting independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
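To make the second approach concrete, the sketch below reads a CIFF export as a stream of length-delimited protocol buffer messages: a single header describing the collection, followed by one postings list per term. This is a minimal sketch under stated assumptions, not the reference implementation: the class and accessor names (CommonIndexFileFormat.Header, getNumPostingsLists(), and so on) are assumed to match code generated by protoc from the CIFF .proto definition, and the authoritative schema lives in the CIFF repository.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Minimal sketch of walking a CIFF export. CommonIndexFileFormat is assumed
// to be the class protoc generates from the CIFF .proto definition; the
// message and accessor names below are illustrative assumptions.
public class CiffDump {
  public static void main(String[] args) throws Exception {
    try (InputStream in =
             new BufferedInputStream(new FileInputStream(args[0]))) {
      // A CIFF export begins with one length-delimited Header message
      // recording collection statistics and how many postings lists follow.
      CommonIndexFileFormat.Header header =
          CommonIndexFileFormat.Header.parseDelimitedFrom(in);
      System.out.printf("docs: %d, postings lists: %d%n",
          header.getNumDocs(), header.getNumPostingsLists());

      // The header is followed by one PostingsList message per term:
      // the term string, its document frequency, and its postings
      // (docids stored as gaps from the previous docid).
      for (int i = 0; i < header.getNumPostingsLists(); i++) {
        CommonIndexFileFormat.PostingsList pl =
            CommonIndexFileFormat.PostingsList.parseDelimitedFrom(in);
        System.out.printf("%s\tdf=%d%n", pl.getTerm(), pl.getDf());
      }
    }
  }
}
```

Because the payload is plain protocol buffers, any of the systems named above can in principle emit or ingest the same stream without sharing index code, which is what makes CIFF a low-effort interoperability point.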



Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. effectiveness and efficiency evaluation
  2. inverted indexes
  3. protocol buffers

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%

