DOI: 10.1145/3437963.3441813

Fast Disjunctive Candidate Generation Using Live Block Filtering

Published: 08 March 2021

Abstract

Much research has focused on the efficiency of search engine query processing, and in particular on disjunctive top-k queries, which return the k highest-scoring results that contain at least one of the query terms. Disjunctive top-k queries over simple ranking functions are commonly used to retrieve an initial set of candidate results that are then reranked by more complex, often machine-learned rankers. Many optimized top-k algorithms have been proposed, including MaxScore, WAND, BMW, and JASS. While the fastest methods achieve impressive results on top-10 and top-100 queries, they tend to become much slower for the larger values of k commonly used for candidate generation. In this paper, we focus on disjunctive top-k queries for larger k. We propose new algorithms that achieve much faster query processing for values of k up to thousands or tens of thousands. Our algorithms build on the live-block filtering approach of Dimopoulos et al. and exploit the SIMD capabilities of modern CPUs. We also perform a detailed experimental comparison of our methods with the fastest known approaches, and release a full model implementation of our methods and of the underlying live-block mechanism, which will allow others to design and experiment with additional methods under the live-block approach.
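
As an illustration of the query type the abstract defines, the following minimal Python sketch evaluates a disjunctive top-k query exhaustively. It is not the paper's live-block algorithm, nor MaxScore, WAND, BMW, or JASS; it is only the unoptimized baseline whose cost such methods aim to reduce. The toy index, its precomputed per-term scores, and the function name `disjunctive_top_k` are assumptions made for this example.

```python
import heapq
from collections import defaultdict


def disjunctive_top_k(index, query_terms, k):
    """Exhaustive OR query: score every document that contains at least
    one query term, then return the k highest-scoring (doc_id, score) pairs."""
    scores = defaultdict(float)
    for term in query_terms:
        # A term that is absent from the index contributes nothing.
        for doc_id, term_score in index.get(term, []):
            scores[doc_id] += term_score
    # Order by descending score; break ties by ascending doc_id.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy index: term -> postings of (doc_id, precomputed term score).
toy_index = {
    "fast": [(1, 1.0), (2, 0.5), (4, 2.0)],
    "search": [(2, 1.5), (3, 0.75)],
}

top2 = disjunctive_top_k(toy_index, ["fast", "search"], k=2)
print(top2)
```

Every posting of every query term is visited here regardless of k. That full traversal is the cost that pruning methods try to avoid.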

References

[1]
Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 357--389.
[2]
Vo Ngoc Anh, Owen de Kretser, and Alistair Moffat. 2001. Vector-space ranking with effective early termination. In Proceedings of the 24th International ACM SIGIR Conference on Research and Development in Information Retrieval. 35--42.
[3]
Nima Asadi and Jimmy Lin. 2013. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 997--1000.
[4]
Andrei Z Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the twelfth International Conference on Information and Knowledge Management. 426--434.
[5]
Kaushik Chakrabarti, Surajit Chaudhuri, and Venkatesh Ganti. 2011. Interval-based pruning for top-k processing over compressed lists. In 2011 IEEE 27th International Conference on Data Engineering. 709--720.
[6]
Ruey-Cheng Chen, Luke Gallagher, Roi Blanco, and J Shane Culpepper. 2017. Efficient cost-aware cascade ranking in multi-stage retrieval. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 445--454.
[7]
Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 234--241.
[8]
Matt Crane, J Shane Culpepper, Jimmy Lin, Joel Mackenzie, and Andrew Trotman. 2017. A comparison of document-at-a-time and score-at-a-time query evaluation. In Proceedings of the tenth ACM International Conference on Web Search and Data Mining. 201--210.
[9]
Lídia Lizziane Serejo de Carvalho, Edleno Silva de Moura, Caio Moura Daoud, and Altigran Soares da Silva. 2015. Heuristics to improve the BMW method and its variants. Journal of Information and Data Management 6, 3 (2015), 178--178.
[10]
Jeffrey Dean. 2009. Challenges in Building Large-Scale Information Retrieval Systems: Invited Talk. In Proceedings of the second ACM International Conference on Web Search and Data Mining.
[11]
Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1535--1544.
[12]
Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. A candidate filtering mechanism for fast top-k query processing on modern CPUs. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 723--732.
[13]
Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the sixth ACM International Conference on Web Search and Data Mining. 113--122.
[14]
Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 993--1002.
[15]
Hui Fang and ChengXiang Zhai. 2005. An exploration of axiomatic approaches to information retrieval. In Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval. 480--487.
[16]
Omar Khattab, Mohammad Hammoud, and Tamer Elsayed. 2020. Finding the Best of Both Worlds: Faster and More Robust Top-k Document Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1031--1040.
[17]
Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1--29.
[18]
Jimmy Lin and Andrew Trotman. 2015. Anytime ranking for impact-ordered indexes. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval. 301--304.
[19]
T-Y. Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225--331.
[20]
C. Macdonald, R. L. Santos, and I. Ounis. 2013. The Whens and Hows of Learning to Rank for Web Search. Information Retrieval 16, 5 (2013), 584--628.
[21]
Joel Mackenzie, Antonio Mallia, Matthias Petri, J Shane Culpepper, and Torsten Suel. 2019. Compressing inverted indexes with recursive graph bisection: A reproducibility study. In European Conference on Information Retrieval. 339--352.
[22]
Joel Mackenzie and Alistair Moffat. 2020. Examining the Additivity of Top-k Query Processing Innovations. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.
[23]
Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax WAND with variable-sized blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 625--634.
[24]
Antonio Mallia and Elia Porciani. 2019. Faster BlockMax WAND with longer skipping. In European Conference on Information Retrieval. 771--778.
[25]
Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. In CEUR Workshop Proceedings Vol-2409. 50--56.
[26]
Antonio Mallia, Michal Siedlaczek, and Torsten Suel. 2019. An experimental study of index compression and DAAT query processing methods. In European Conference on Information Retrieval. 353--368.
[27]
Antonio Mallia, Michal Siedlaczek, Mengyang Sun, and Torsten Suel. 2020. A Comparison of Top-k Threshold Estimation Techniques for Disjunctive Query Processing. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.
[28]
Matthias Petri, Alistair Moffat, Joel Mackenzie, J Shane Culpepper, and Daniel Beck. 2019. Accelerated query processing via similarity score prediction. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 485--494.
[29]
Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP 109 (1995), 109.
[30]
Cristian Rossi, Edleno S de Moura, Andre L Carvalho, and Altigran S da Silva. 2013. Fast document-at-a-time query processing using two-tier indexes. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 183--192.
[31]
Dongdong Shan, Shuai Ding, Jing He, Hongfei Yan, and Xiaoming Li. 2012. Optimized top-k processing with global page scores on block-max indexes. In Proceedings of the fifth ACM International Conference on Web Search and Data Mining. 423--432.
[32]
Fabrizio Silvestri. 2007. Sorting out the document identifier assignment problem. In European Conference on Information Retrieval. 101--112.
[33]
Andrew Trotman and Kat Lilly. 2018. Elias Revisited: Group Elias SIMD Coding. In Proceedings of the 23rd Australasian Document Computing Symposium. 1--8.
[34]
Howard Turtle and James Flood. 1995. Query Evaluation: Strategies and Optimizations. Inf. Process. Manage. 31, 6 (1995), 831--850.
[35]
Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 105--114.
[36]
Erman Yafay and Ismail Sengor Altingovde. 2019. Caching Scores for Faster Query Processing with Dynamic Pruning in Search Engines. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2457--2460.
[37]
Dawei Yin, Yuening Hu, Jiliang Tang, Tim Daly, Mianwei Zhou, Hua Ouyang, Jianhui Chen, Changsung Kang, Hongbo Deng, Chikashi Nobata, et al. 2016. Ranking relevance in Yahoo search. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 323--332.
[38]
Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (2004), 179--214.
[39]
Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys (CSUR) 38, 2 (2006), 6--es.

    Published In

    WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
    March 2021
    1192 pages
    ISBN: 9781450382977
    DOI: 10.1145/3437963
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. early termination
    2. inverted index
    3. top-k query processing

    Qualifiers

    • Research-article

    Funding Sources

    • Amazon
    • NSF

    Conference

    WSDM '21

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Article Metrics

    • Downloads (Last 12 months): 122
    • Downloads (Last 6 weeks): 31
    Reflects downloads up to 12 Feb 2025

    Cited By

    • (2024) Faster Learned Sparse Retrieval with Block-Max Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2411-2415. DOI: 10.1145/3626772.3657906. Online publication date: 10-Jul-2024
    • (2024) Beyond Quantile Methods: Improved Top-K Threshold Estimation for Traditional and Learned Sparse Indexes. 2024 IEEE International Conference on Big Data (BigData), 709-716. DOI: 10.1109/BigData62323.2024.10825349. Online publication date: 15-Dec-2024
    • (2023) Faster Dynamic Pruning via Reordering of Documents in Inverted Indexes. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001-2005. DOI: 10.1145/3539618.3591987. Online publication date: 19-Jul-2023
    • (2022) Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse Representations. ACM Transactions on Information Systems 41:4, 1-28. DOI: 10.1145/3576922. Online publication date: 15-Dec-2022
    • (2022) Using Conjunctions for Faster Disjunctive Top-k Queries. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 917-927. DOI: 10.1145/3488560.3498489. Online publication date: 11-Feb-2022
    • (2022) An Efficiency Study for SPLADE Models. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2220-2226. DOI: 10.1145/3477495.3531833. Online publication date: 6-Jul-2022
    • (2021) Window Navigation with Adaptive Probing for Executing BlockMax WAND. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2323-2327. DOI: 10.1145/3404835.3463109. Online publication date: 11-Jul-2021