research-article

Faster top-k document retrieval using block-max indexes

Authors:

Torsten Suel

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Pages 993 - 1002

https://doi.org/10.1145/2009916.2010048

Published: 24 July 2011

Abstract

Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.
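The idea of bounding scores with per-block maxima can be illustrated with a small sketch. This is not the algorithm from the paper: the WAND-based method described above also selects pivots and skips within lists, while the code below only shows how block maxima let a safe top-k loop reject a candidate without scoring it. The class `BlockMaxList`, the function `top_k`, the block size of 2, and the toy postings are all hypothetical, and a document's score is assumed to be the sum of its per-term impact scores.

```python
import heapq
from bisect import bisect_left


class BlockMaxList:
    """Posting list with per-block (last docID, max score) metadata (hypothetical layout)."""

    def __init__(self, postings, block_size=2):
        # postings: list of (doc_id, score) pairs sorted by doc_id
        self.doc_ids = [d for d, _ in postings]
        self.scores = [s for _, s in postings]
        self.block_last = []  # last docID of each block
        self.block_max = []   # maximum impact score of each block
        for start in range(0, len(postings), block_size):
            block = postings[start:start + block_size]
            self.block_last.append(block[-1][0])
            self.block_max.append(max(s for _, s in block))

    def upper_bound(self, doc_id):
        # Only block metadata is read here; no posting is touched.
        b = bisect_left(self.block_last, doc_id)
        return self.block_max[b] if b < len(self.block_max) else 0.0

    def score(self, doc_id):
        # Stands in for decompressing a block and reading the posting.
        i = bisect_left(self.doc_ids, doc_id)
        if i < len(self.doc_ids) and self.doc_ids[i] == doc_id:
            return self.scores[i]
        return 0.0


def top_k(lists, k, use_block_max=True):
    """Safe disjunctive top-k; returns (ranked results, number of documents scored)."""
    heap = []  # min-heap of (score, -doc_id); heap[0][0] is the current threshold
    scored = 0
    candidates = sorted({d for lst in lists for d in lst.doc_ids})
    for doc_id in candidates:
        if use_block_max and len(heap) == k:
            bound = sum(lst.upper_bound(doc_id) for lst in lists)
            if bound <= heap[0][0]:
                continue  # cannot enter the top k, so scoring is skipped
        score = sum(lst.score(doc_id) for lst in lists)
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, -doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, -doc_id))
    ranked = sorted(heap, key=lambda e: (-e[0], -e[1]))  # score desc, docID asc
    return [(-neg_id, s) for s, neg_id in ranked], scored


# Toy postings for two query terms (made-up values).
term_a = BlockMaxList([(1, 3.0), (2, 1.0), (5, 0.5), (7, 0.5), (8, 0.5), (9, 0.5)])
term_b = BlockMaxList([(2, 2.5), (3, 0.5), (5, 0.5), (6, 0.5), (9, 0.5), (10, 0.5)])

pruned, pruned_scored = top_k([term_a, term_b], k=2)
exhaustive, exhaustive_scored = top_k([term_a, term_b], k=2, use_block_max=False)
```

On this toy input both calls return the same ranking, but the block-max variant scores only the first two documents because every later block's bound is at or below the threshold. Skipping only when the bound does not exceed the threshold is what keeps the sketch safe in the sense used in the abstract.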

References

[1] I. S. Altingovde, E. Demir, F. Can, and O. Ulusoy. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Transactions on Information Systems, 26(3):1--36, 2008.

[2] V. Anh and A. Moffat. Simplified similarity scoring using term ranks. In Proceedings of the 28th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, pages 226--233, 2005.

[3] V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, pages 372--379, 2006.

[4] V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001.

[5] C. Badue, R. Baeza-Yates, B. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proceedings of the 9th String Processing and Information Retrieval Symposium, 2002.

[6] R. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, and F. Silvestri. The impact of caching on search engines. In Proceedings of the 30th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2007.

[7] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

[8] H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-K: Index-access optimized top-k query processing. In Proceedings of the 32nd International Conference on Very Large Data Bases, 2006.

[9] R. Blanco and A. Barreiro. TSP and cluster-based solutions to the reassignment of document identifiers. Inf. Retr., 9(4):499--517, 2006.

[10] R. Blanco and A. Barreiro. Probabilistic static pruning of inverted files. ACM Transactions on Information Systems, 28(1), Jan. 2010.

[11] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM Conf. on Inf. and Knowledge Management, 2003.

[12] N. Bruno, L. Gravano, and A. Marian. Evaluating top-k queries over web-accessible databases. In Proceedings of the 18th Annual Int. Conf. on Data Engineering, 2002.

[13] C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In Proceedings of the 8th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 1985.

[14] S. Buttcher and C. L. A. Clarke. Index compression is good, especially for random access. In Proceedings of the 16th ACM Conf. on Inf. and Knowledge Management, 2007.

[15] K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE), 2011.

[16] J. Cho and A. Ntoulas. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2007.

[17] J. Dean. Challenges in building large-scale information retrieval systems. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009.

[18] S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th Int. Conf. on World Wide Web, 2010.

[19] R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31(2), 2002.

[20] R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001.

[21] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proceedings of the ACM Symp. on Principles of Database Systems, 2001.

[22] R. Lempel and S. Moran. Optimizing result prefetching in web search engines with segmented indices. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.

[23] X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.

[24] S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of the 10th Int. Conf. on World Wide Web, 2000.

[25] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47:749--764, 1996.

[26] W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file compression through document identifier reassignment. Inf. Processing and Management, 39(1):117--131, 2003.

[27] F. Silvestri. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research, pages 101--112, 2007.

[28] F. Silvestri and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the First International Conference on Scalable Information Systems, 2006.

[29] F. Silvestri and R. Venturini. VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM Conf. on Inf. and Knowledge Management, 2010.

[30] T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proceedings of the 30th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2007.

[31] T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proceedings of the 28th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2005.

[32] H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Information Processing and Management, 31(6):831--850, Nov. 1995.

[33] H. Wong and D. Lee. Implementations of partial document ranking using inverted files. Information Processing and Management, 29(5):647--669, 1993.

[34] H. Yan, S. Ding, and T. Suel. Compressing term positions in web indexes. In Proceedings of the 32nd Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2009.

[35] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th Int. Conf. on World Wide Web, 2009.

[36] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th Int. Conf. on World Wide Web, 2008.

[37] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006.

Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

July 2011

1374 pages

ISBN: 9781450307574

DOI: 10.1145/2009916

General Chairs: Wei-Ying Ma (Microsoft Research Asia, China), Jian-Yun Nie (University of Montreal, Canada)

Program Chairs: Ricardo Baeza-Yates (Yahoo! Research, Spain), Tat-Seng Chua (National University of Singapore), W. Bruce Croft (University of Massachusetts, Amherst, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '11: The 34th International ACM SIGIR conference on research and development in Information Retrieval

Sponsor: SIGIR

July 24 - 28, 2011

Beijing, China

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 152
Total Downloads: 947
Downloads (Last 12 months): 89
Downloads (Last 6 weeks): 12

Reflects downloads up to 15 Jan 2025

Cited By

Yates A, Lassance C, MacAvaney S, Nguyen T, Lei Y, Sakai T, Ishita E, Ohshima H, Hasibi F, Mao J, Jose J (2024). Neural Lexical Search with Learned Sparse Retrieval. Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 303-306. Online publication date: 8-Dec-2024.
https://dl.acm.org/doi/10.1145/3673791.3698441

Bruch S, Nardini F, Ingber A, Liberty E (2024). Bridging Dense and Sparse Maximum Inner Product Search. ACM Transactions on Information Systems, 42:6, 1-38. Online publication date: 19-Aug-2024.
https://dl.acm.org/doi/10.1145/3665324

Formal T, Lassance C, Piwowarski B, Clinchant S (2024). Towards Effective and Efficient Sparse Neural Information Retrieval. ACM Transactions on Information Systems, 42:5, 1-46. Online publication date: 29-Apr-2024.
https://dl.acm.org/doi/10.1145/3634912

Mallia A, Suel T, Tonellotto N, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Faster Learned Sparse Retrieval with Block-Max Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2411-2415. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657906

MacAvaney S, Tonellotto N, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). A Reproducibility Study of PLAID. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1411-1419. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657856

Bruch S, Nardini F, Rulli C, Venturini R, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 152-162. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657769

Chang X, Mishra D, Macdonald C, MacAvaney S, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Neural Passage Quality Estimation for Static Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 174-185. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657765

Zhao X, Chen Z, Huang K, Zhang R, Zheng B, Zhou X (2024). Efficient Approximate Maximum Inner Product Search Over Sparse Vectors. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3961-3974. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00303

Gou J, Liu Y, Shao M, Suel T (2024). Beyond Quantile Methods: Improved Top-K Threshold Estimation for Traditional and Learned Sparse Indexes. 2024 IEEE International Conference on Big Data (BigData), 709-716. Online publication date: 15-Dec-2024.
https://doi.org/10.1109/BigData62323.2024.10825349

Yu P, Mallia A, Petri M (2024). Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. Advances in Information Retrieval, 181-194. Online publication date: 23-Mar-2024.
https://doi.org/10.1007/978-3-031-56063-7_12
