short-paper

A Comparison of Top-k Threshold Estimation Techniques for Disjunctive Query Processing

Authors:

Antonio Mallia,

Michal Siedlaczek,

Torsten Suel

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

Pages 2141 - 2144

https://doi.org/10.1145/3340531.3412080

Published: 19 October 2020

Abstract

In the top-k threshold estimation problem, given a query q, the goal is to estimate the score of the result at rank k. A good estimate of this score can result in significant performance improvements for several query processing scenarios, including selective search, index tiering, and widely used disjunctive query processing algorithms such as MaxScore, WAND, and BMW. Several approaches have been proposed, including parametric approaches, methods using random sampling, and a recent approach based on machine learning. However, previous work fails to perform any experimental comparison between these approaches. In this paper, we address this issue by reimplementing four major approaches and comparing them in terms of estimation error, running time, likelihood of an overestimate, and end-to-end performance when applied to common classes of disjunctive top-k query processing algorithms.
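To make the problem statement concrete, the following minimal sketch computes the exact top-k threshold of a disjunctive query over a toy index and compares it with one simple underestimate. It is an illustration only, not one of the four approaches evaluated in the paper. The index contents and the function names `exact_threshold` and `per_term_lower_bound` are hypothetical, and the bound assumes that a document's score is the sum of non-negative per-term scores.

```python
import heapq
from collections import defaultdict

# Hypothetical toy index: term -> {doc_id: non-negative score contribution}.
INDEX = {
    "top": {1: 1.5, 2: 0.5, 3: 1.0, 5: 0.25},
    "k": {2: 1.0, 3: 0.25, 4: 0.75},
    "query": {1: 0.5, 4: 0.5, 5: 1.0, 6: 0.75},
}


def exact_threshold(index, query, k):
    """Score every matching document exhaustively and return the score at rank k.

    Returns 0.0 when fewer than k documents match any query term.
    """
    totals = defaultdict(float)
    for term in query:
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    top = heapq.nlargest(k, totals.values())
    return top[-1] if len(top) == k else 0.0


def per_term_lower_bound(index, query, k):
    """Return a threshold estimate that never exceeds the exact threshold.

    If a single term has at least k postings, its k-th highest score is
    reached by at least k documents, so (with non-negative additive scores)
    the true rank-k score cannot be lower. The maximum over terms is used.
    """
    best = 0.0
    for term in query:
        top = heapq.nlargest(k, index.get(term, {}).values())
        if len(top) == k:
            best = max(best, top[-1])
    return best


if __name__ == "__main__":
    q = ["top", "k", "query"]
    print(exact_threshold(INDEX, q, 3))       # exact score at rank 3
    print(per_term_lower_bound(INDEX, q, 3))  # safe underestimate
```

An estimate at or below the exact threshold is safe for algorithms such as MaxScore, WAND, and BMW, since it can only prune documents that could not enter the top k. An overestimate can discard true results, which is why the likelihood of an overestimate is one of the criteria compared in the paper.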

Index Terms

Information systems → Information retrieval → Information retrieval query processing

Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

October 2020

3619 pages

ISBN:9781450368599

DOI:10.1145/3340531

General Chairs:

Mathieu d'Aquin (DSI, Insight, NUI Galway, Ireland)

Stefan Dietze (GESIS, Cologne, Germany; Heinrich-Heine-University Düsseldorf, Germany; L3S Research Center, Germany)

Program Chairs:

Claudia Hauff (TU Delft, The Netherlands)

Edward Curry (DSI, Insight, NUI Galway, Ireland)

Philippe Cudre Mauroux (eXascale, University of Fribourg, Switzerland)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Amazon
NSF

Conference

CIKM '20

CIKM '20: The 29th ACM International Conference on Information and Knowledge Management

October 19 - 23, 2020

Virtual Event, Ireland

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
