research-article

Fast Disjunctive Candidate Generation Using Live Block Filtering

Authors: Antonio Mallia, Michał Siedlaczek, Torsten Suel

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining

Pages 671 - 679

https://doi.org/10.1145/3437963.3441813

Published: 08 March 2021

Abstract

Much research has focused on the efficiency of search engine query processing, and in particular on disjunctive top-k queries, which return the k highest-scoring results that contain at least one of the query terms. Disjunctive top-k queries over simple ranking functions are commonly used to retrieve an initial set of candidate results that are then reranked by more complex, often machine-learned rankers. Many optimized top-k algorithms have been proposed, including MaxScore, WAND, BMW, and JASS. While the fastest methods achieve impressive results on top-10 and top-100 queries, they tend to become much slower for the larger k commonly used for candidate generation. In this paper, we focus on disjunctive top-k queries for larger k. We propose new algorithms that achieve much faster query processing for values of k up to thousands or tens of thousands. Our algorithms build on top of the live-block filtering approach of Dimopoulos et al. and exploit the SIMD capabilities of modern CPUs. We also perform a detailed experimental comparison of our methods with the fastest known approaches, and release a full model implementation of our methods and of the underlying live-block mechanism, which will allow others to design and experiment with additional methods under the live-block approach.
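
To make the terminology concrete, the following is a minimal sketch of a disjunctive top-k query and of a block-level filter in the general spirit of live-block filtering. It is not the paper's algorithm and uses no SIMD. The toy `INDEX`, the integer scores, `BLOCK_SIZE`, and all function names are hypothetical, and the threshold is assumed to be supplied by the caller.

```python
import heapq
from collections import defaultdict

# Hypothetical toy index: term -> list of (doc_id, precomputed integer score).
INDEX = {
    "fast": [(0, 10), (3, 25), (9, 5)],
    "query": [(3, 15), (4, 7), (10, 30)],
    "search": [(1, 2), (9, 20), (10, 4)],
}
BLOCK_SIZE = 4  # assumed number of consecutive doc IDs per block


def top_k_exhaustive(terms, k):
    """Score every document that contains at least one query term."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id, score in INDEX.get(term, []):
            scores[doc_id] += score
    # Highest score first; ties are broken by the smaller doc ID.
    return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))


def live_blocks(terms, threshold):
    """Return the blocks whose summed per-term block maxima reach the threshold."""
    upper = defaultdict(int)
    for term in terms:
        block_max = defaultdict(int)
        for doc_id, score in INDEX.get(term, []):
            block = doc_id // BLOCK_SIZE
            block_max[block] = max(block_max[block], score)
        for block, best in block_max.items():
            upper[block] += best
    return {block for block, bound in upper.items() if bound >= threshold}


def top_k_filtered(terms, k, threshold):
    """Like top_k_exhaustive, but skip postings that fall in non-live blocks."""
    live = live_blocks(terms, threshold)
    scores = defaultdict(int)
    for term in terms:
        for doc_id, score in INDEX.get(term, []):
            if doc_id // BLOCK_SIZE in live:
                scores[doc_id] += score
    return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    terms = ["fast", "query", "search"]
    print(top_k_exhaustive(terms, 2))
    print(top_k_filtered(terms, 2, threshold=34))
```

In this sketch the filtered result matches the exhaustive one as long as the threshold does not exceed the true k-th highest score, because a document's score can never exceed the upper bound of the block that contains it. A real system would precompute and store the block maxima instead of rebuilding them per query.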

References

[1]

Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 357--389.

[2]

Vo Ngoc Anh, Owen de Kretser, and Alistair Moffat. 2001. Vector-space ranking with effective early termination. In Proceedings of the 24th International ACM SIGIR Conference on Research and Development in Information Retrieval. 35--42.

[3]

Nima Asadi and Jimmy Lin. 2013. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 997--1000.

[4]

Andrei Z Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the twelfth International Conference on Information and Knowledge Management. 426--434.

[5]

Kaushik Chakrabarti, Surajit Chaudhuri, and Venkatesh Ganti. 2011. Interval-based pruning for top-k processing over compressed lists. In 2011 IEEE 27th International Conference on Data Engineering. 709--720.

[6]

Ruey-Cheng Chen, Luke Gallagher, Roi Blanco, and J Shane Culpepper. 2017. Efficient cost-aware cascade ranking in multi-stage retrieval. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 445--454.

[7]

Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 234--241.

[8]

Matt Crane, J Shane Culpepper, Jimmy Lin, Joel Mackenzie, and Andrew Trotman. 2017. A comparison of document-at-a-time and score-at-a-time query evaluation. In Proceedings of the tenth ACM International Conference on Web Search and Data Mining. 201--210.

[9]

Lídia Lizziane Serejo de Carvalho, Edleno Silva de Moura, Caio Moura Daoud, and Altigran Soares da Silva. 2015. Heuristics to improve the BMW method and its variants. Journal of Information and Data Management 6, 3 (2015), 178--178.

[10]

Jeffrey Dean. 2009. Challenges in Building Large-Scale Information Retrieval Systems: Invited Talk. In Proceedings of the second ACM International Conference on Web Search and Data Mining.

[11]

Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1535--1544.

[12]

Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. A candidate filtering mechanism for fast top-k query processing on modern cpus. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 723--732.

[13]

Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the sixth ACM International Conference on Web Search and Data Mining. 113--122.

[14]

Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 993--1002.

[15]

Hui Fang and ChengXiang Zhai. 2005. An exploration of axiomatic approaches to information retrieval. In Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval. 480--487.

[16]

Omar Khattab, Mohammad Hammoud, and Tamer Elsayed. 2020. Finding the Best of Both Worlds: Faster and More Robust Top-k Document Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1031--1040.

[17]

Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1--29.

[18]

Jimmy Lin and Andrew Trotman. 2015. Anytime ranking for impact-ordered indexes. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval. 301--304.

[19]

T-Y. Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225--331.

[20]

C. Macdonald, R. L. Santos, and I. Ounis. 2013. The Whens and Hows of Learning to Rank for Web Search. Information Retrieval 16, 5 (2013), 584--628.

[21]

Joel Mackenzie, Antonio Mallia, Matthias Petri, J Shane Culpepper, and Torsten Suel. 2019. Compressing inverted indexes with recursive graph bisection: A reproducibility study. In European Conference on Information Retrieval. 339--352.

[22]

Joel Mackenzie and Alistair Moffat. 2020. Examining the Additivity of Top-k Query Processing Innovations. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.

[23]

Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax WAND with variable-sized blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 625--634.

[24]

Antonio Mallia and Elia Porciani. 2019. Faster BlockMax WAND with longer skipping. In European Conference on Information Retrieval. 771--778.

[25]

Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. In CEUR Workshop Proceedings Vol-2409. 50--56.

[26]

Antonio Mallia, Michal Siedlaczek, and Torsten Suel. 2019. An experimental study of index compression and DAAT query processing methods. In European Conference on Information Retrieval. 353--368.

[27]

Antonio Mallia, Michal Siedlaczek, Mengyang Sun, and Torsten Suel. 2020. A Comparison of Top-k Threshold Estimation Techniques for Disjunctive Query Processing. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.

[28]

Matthias Petri, Alistair Moffat, Joel Mackenzie, J Shane Culpepper, and Daniel Beck. 2019. Accelerated query processing via similarity score prediction. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 485--494.

[29]

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP 109 (1995), 109.

[30]

Cristian Rossi, Edleno S de Moura, Andre L Carvalho, and Altigran S da Silva. 2013. Fast document-at-a-time query processing using two-tier indexes. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 183--192.

[31]

Dongdong Shan, Shuai Ding, Jing He, Hongfei Yan, and Xiaoming Li. 2012. Optimized top-k processing with global page scores on block-max indexes. In Proceedings of the fifth ACM International Conference on Web Search and Data Mining. 423--432.

[32]

Fabrizio Silvestri. 2007. Sorting out the document identifier assignment problem. In European Conference on Information Retrieval. 101--112.

[33]

Andrew Trotman and Kat Lilly. 2018. Elias Revisited: Group Elias SIMD Coding. In Proceedings of the 23rd Australasian Document Computing Symposium. 1--8.

[34]

Howard Turtle and James Flood. 1995. Query Evaluation: Strategies and Optimizations. Inf. Process. Manage. 31, 6 (1995), 831--850.

[35]

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 105--114.

[36]

Erman Yafay and Ismail Sengor Altingovde. 2019. Caching Scores for Faster Query Processing with Dynamic Pruning in Search Engines. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2457--2460.

[37]

Dawei Yin, Yuening Hu, Jiliang Tang, Tim Daly, Mianwei Zhou, Hua Ouyang, Jianhui Chen, Changsung Kang, Hongbo Deng, Chikashi Nobata, et al. 2016. Ranking relevance in yahoo search. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 323--332.

[38]

Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (2004), 179--214.

[39]

Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM computing surveys (CSUR) 38, 2 (2006), 6--es.

Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining

March 2021

1192 pages

ISBN:9781450382977

DOI:10.1145/3437963

General Chairs: Liane Lewin-Eytan (Amazon, Israel), David Carmel (Amazon, Israel), Elad Yom-Tov (Microsoft, Israel)

Program Chairs: Eugene Agichtein (Emory University and Amazon, USA), Evgeniy Gabrilovich (Google Health, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Amazon
NSF

Conference

WSDM '21

WSDM '21: The Fourteenth ACM International Conference on Web Search and Data Mining

March 8 - 12, 2021

Virtual Event, Israel

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Article Metrics

Total Citations: 7
Total Downloads: 256
Downloads (Last 12 months): 122
Downloads (Last 6 weeks): 31

Reflects downloads up to 12 Feb 2025

Cited By

Mallia A, Suel T, Tonellotto N, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Faster Learned Sparse Retrieval with Block-Max Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2411-2415. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657906

Gou J, Liu Y, Shao M, Suel T (2024). Beyond Quantile Methods: Improved Top-K Threshold Estimation for Traditional and Learned Sparse Indexes. 2024 IEEE International Conference on Big Data (BigData), 709-716. Online publication date: 15-Dec-2024.
https://doi.org/10.1109/BigData62323.2024.10825349

Yafay E, Altingovde I, Chen H, Duh W, Huang H, Kato M, Mothe J, Poblete B (2023). Faster Dynamic Pruning via Reordering of Documents in Inverted Indexes. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001-2005. Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3539618.3591987

Mackenzie J, Trotman A, Lin J (2022). Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse Representations. ACM Transactions on Information Systems 41:4, 1-28. Online publication date: 15-Dec-2022.
https://dl.acm.org/doi/10.1145/3576922

Siedlaczek M, Mallia A, Suel T, Selcuk Candan K, Liu H, Akoglu L, Luna Dong X, Tang J (2022). Using Conjunctions for Faster Disjunctive Top-k Queries. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 917-927. Online publication date: 11-Feb-2022.
https://dl.acm.org/doi/10.1145/3488560.3498489

Lassance C, Clinchant S, Amigo E, Castells P, Gonzalo J, Carterette B, Culpepper J, Kazai G (2022). An Efficiency Study for SPLADE Models. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2220-2226. Online publication date: 6-Jul-2022.
https://dl.acm.org/doi/10.1145/3477495.3531833

Shao J, Qiao Y, Ji S, Yang T, Diaz F, Shah C, Suel T, Castells P, Jones R, Sakai T (2021). Window Navigation with Adaptive Probing for Executing BlockMax WAND. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2323-2327. Online publication date: 11-Jul-2021.
https://dl.acm.org/doi/10.1145/3404835.3463109
