DOI: 10.1145/2009916.2010048
research-article

Faster top-k document retrieval using block-max indexes

Published: 24 July 2011

Abstract

Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.
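
To make the idea of a block-max index concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes non-negative per-term scores, keeps postings uncompressed in memory, and uses the block maxima only as a filter in a plain document-at-a-time loop rather than integrating them with WAND pivot selection. The names `BlockMaxList`, `block_bound`, `top_k`, and `BLOCK_SIZE` are hypothetical and chosen for illustration.

```python
import bisect
import heapq

BLOCK_SIZE = 2  # hypothetical; chosen small so the toy example has several blocks


class BlockMaxList:
    """A docID-sorted posting list with uncompressed per-block metadata."""

    def __init__(self, postings, block_size=BLOCK_SIZE):
        # postings: list of (doc_id, score) pairs sorted by doc_id.
        self.doc_ids = [d for d, _ in postings]
        self.scores = [s for _, s in postings]
        n = len(postings)
        starts = range(0, n, block_size)
        # For every block, remember its last docID and its maximum score.
        self.block_last = [self.doc_ids[min(i + block_size, n) - 1] for i in starts]
        self.block_max = [max(self.scores[i:i + block_size]) for i in starts]

    def block_bound(self, doc_id):
        """Upper bound on this term's score for doc_id, from block metadata only."""
        b = bisect.bisect_left(self.block_last, doc_id)
        return self.block_max[b] if b < len(self.block_max) else 0.0

    def score(self, doc_id):
        """Exact score for doc_id (0.0 if absent); stands in for decoding a block."""
        i = bisect.bisect_left(self.doc_ids, doc_id)
        if i < len(self.doc_ids) and self.doc_ids[i] == doc_id:
            return self.scores[i]
        return 0.0


def top_k(lists, k):
    """Disjunctive top-k that skips documents whose block-max bound is too low."""
    candidates = sorted(set().union(*(pl.doc_ids for pl in lists)))
    heap = []    # min-heap of (score, doc_id) holding the current top k
    scored = 0   # number of documents that were fully scored
    for doc_id in candidates:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        bound = sum(pl.block_bound(doc_id) for pl in lists)
        if bound <= threshold:
            continue  # cannot enter the top k, so the exact score is never computed
        scored += 1
        total = sum(pl.score(doc_id) for pl in lists)
        if len(heap) < k:
            heapq.heappush(heap, (total, doc_id))
        elif total > threshold:
            heapq.heapreplace(heap, (total, doc_id))
    return sorted(heap, reverse=True), scored


# Toy data: two query terms, six postings each.
lists = [
    BlockMaxList([(1, 0.5), (3, 2.0), (7, 0.2), (9, 0.1), (12, 0.3), (15, 0.1)]),
    BlockMaxList([(2, 0.4), (3, 1.5), (8, 0.3), (9, 0.2), (14, 0.1), (15, 0.2)]),
]
result, scored = top_k(lists, 1)
print(result, scored)
```

In this toy run only the three documents in the first block of each list are fully scored. Once document 3 sets the threshold, the sum of block maxima for every later block falls below it, so those documents are skipped without touching their postings. A real implementation would skip whole compressed blocks instead of testing candidates one by one.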

References

[1]
I. S. Altingovde, E. Demir, F. Can, and O. Ulusoy. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Transactions on Information Systems, 26(3):1--36, 2008.
[2]
V. Anh and A. Moffat. Simplified similarity scoring using term ranks. In Proceedings of the 28th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, pages 226--233, 2005.
[3]
V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, pages 372--379, 2006.
[4]
V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001.
[5]
C. Badue, R. Baeza-Yates, B. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proceedings of the 9th String Processing and Information Retrieval Symposium, 2002.
[6]
R. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, and F. Silvestri. The impact of caching on search engines. In Proceedings of the 30th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2007.
[7]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[8]
H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-K: Index-access optimized top-k query processing. In Proceedings of the 32nd International Conference on Very Large Data Bases, 2006.
[9]
R. Blanco and A. Barreiro. TSP and cluster-based solutions to the reassignment of document identifiers. Inf. Retr., 9(4):499--517, 2006.
[10]
R. Blanco and A. Barreiro. Probabilistic static pruning of inverted files. ACM Transactions on Information Systems, 28(1), Jan. 2010.
[11]
A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM Conf. on Inf. and Knowledge Management, 2003.
[12]
N. Bruno, L. Gravano, and A. Marian. Evaluating top-k queries over web-accessible databases. In Proceedings of the 18th Annual Int. Conf. on Data Engineering, 2002.
[13]
C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In Proceedings of the 8th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 1985.
[14]
S. Buttcher and C. L. A. Clarke. Index compression is good, especially for random access. In Proceedings of the 16th ACM Conf. on Inf. and Knowledge Management, 2007.
[15]
K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE), 2011.
[16]
J. Cho and A. Ntoulas. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2007.
[17]
J. Dean. Challenges in building large-scale information retrieval systems. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009.
[18]
S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th Int. Conf. on World Wide Web, 2010.
[19]
R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31(2), 2002.
[20]
R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001.
[21]
R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proceedings of the ACM Symp. on Principles of Database Systems, 2001.
[22]
R. Lempel and S. Moran. Optimizing result prefetching in web search engines with segmented indices. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
[23]
X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[24]
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of the 10th Int. Conf. on World Wide Web, 2000.
[25]
M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47:749--764, 1996.
[26]
W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file compression through document identifier reassignment. Inf. Processing and Management, 39(1):117--131, 2003.
[27]
F. Silvestri. Sorting out the document identifier assignment problem. In Proceedings of 29th European Conference on IR Research, pages 101--112, 2007.
[28]
F. Silvestri and D. Laforenza. Query-driven document partitioning and collection selection. In Proceedings of the First International Conference on Scalable Information Systems, 2006.
[29]
F. Silvestri and R. Venturini. VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM Conf. on Inf. and Knowledge Management, 2010.
[30]
T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proceedings of the 30th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2007.
[31]
T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proceedings of the 28th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2005.
[32]
H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Information Processing and Management, 31(6):831--850, Nov. 1995.
[33]
H. Wong and D. Lee. Implementations of partial document ranking using inverted files. Information Processing and Management, 29(5):647--669, 1993.
[34]
H. Yan, S. Ding, and T. Suel. Compressing term positions in web indexes. In Proceedings of the 32nd Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2009.
[35]
H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th Int. Conf. on World Wide Web, 2009.
[36]
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th Int. Conf. on World Wide Web, 2008.
[37]
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006.

    Published In

    SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
    July 2011
    1374 pages
    ISBN:9781450307574
    DOI:10.1145/2009916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. block-max index
    2. early termination
    3. inverted index
    4. ir query processing
    5. top-k query processing

    Qualifiers

    • Research-article

    Conference

    SIGIR '11

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • Neural Lexical Search with Learned Sparse Retrieval. Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 303--306, 2024. DOI: 10.1145/3673791.3698441. Online publication date: 8-Dec-2024.
    • Bridging Dense and Sparse Maximum Inner Product Search. ACM Transactions on Information Systems, 42(6):1--38, 2024. DOI: 10.1145/3665324. Online publication date: 19-Aug-2024.
    • Towards Effective and Efficient Sparse Neural Information Retrieval. ACM Transactions on Information Systems, 42(5):1--46, 2024. DOI: 10.1145/3634912. Online publication date: 29-Apr-2024.
    • Faster Learned Sparse Retrieval with Block-Max Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2411--2415, 2024. DOI: 10.1145/3626772.3657906. Online publication date: 10-Jul-2024.
    • A Reproducibility Study of PLAID. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1411--1419, 2024. DOI: 10.1145/3626772.3657856. Online publication date: 10-Jul-2024.
    • Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 152--162, 2024. DOI: 10.1145/3626772.3657769. Online publication date: 10-Jul-2024.
    • Neural Passage Quality Estimation for Static Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 174--185, 2024. DOI: 10.1145/3626772.3657765. Online publication date: 10-Jul-2024.
    • Efficient Approximate Maximum Inner Product Search Over Sparse Vectors. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 3961--3974, 2024. DOI: 10.1109/ICDE60146.2024.00303. Online publication date: 13-May-2024.
    • Beyond Quantile Methods: Improved Top-K Threshold Estimation for Traditional and Learned Sparse Indexes. 2024 IEEE International Conference on Big Data (BigData), pages 709--716, 2024. DOI: 10.1109/BigData62323.2024.10825349. Online publication date: 15-Dec-2024.
    • Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. Advances in Information Retrieval, pages 181--194, 2024. DOI: 10.1007/978-3-031-56063-7_12. Online publication date: 23-Mar-2024.
