research-article

Scalable top-k retrieval with Sparta

Authors:

Edward Bortnikov,

Idit Keidar

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 62 - 73

https://doi.org/10.1145/3332466.3374522

Published: 19 February 2020

Abstract

Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck.
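The scoring model described above (a document's score is the sum of its per-term scores, and only the k best documents are kept) can be illustrated with a minimal sequential sketch. The postings layout, the sample data, and the name `top_k` are assumptions made for illustration; they are not taken from the paper, and this is an exhaustive baseline rather than an optimized retrieval algorithm.

```python
import heapq
from collections import defaultdict

# Hypothetical toy index: term -> {doc_id: score of that term in the document}.
EXAMPLE_POSTINGS = {
    "scalable": {"d1": 2.0, "d2": 1.0, "d3": 0.5},
    "retrieval": {"d2": 3.0, "d3": 0.5},
    "sparta": {"d1": 1.5, "d4": 1.0},
}


def top_k(postings, query_terms, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    totals = defaultdict(float)
    # Aggregate: a document's score is the sum of its scores for all query terms.
    for term in query_terms:
        for doc_id, term_score in postings.get(term, {}).items():
            totals[doc_id] += term_score
    # Select the k largest totals; ties are broken by doc_id for determinism.
    return heapq.nlargest(k, totals.items(), key=lambda item: (item[1], item[0]))


print(top_k(EXAMPLE_POSTINGS, ["scalable", "retrieval", "sparta"], 2))
```

This sketch touches every posting of every query term, which is exactly the cost that grows with both index size and query length.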

Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, yet state-of-the-art algorithms fail to process long queries in real time.

We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state of the art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to an order of magnitude above Sparta's.
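The abstract does not spell out how Sparta coordinates its threads, so the following is not Sparta. It is a generic shard-and-merge baseline, given only to make concrete what multi-threaded top-k retrieval can look like: each worker scores one disjoint shard of the documents, and the per-shard results are merged. The shard layout and the names `shard_top_k` and `parallel_top_k` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_top_k(shard, query_terms, k):
    """Score the documents of one shard; shard maps doc_id -> {term: score}."""
    scored = (
        (sum(term_scores.get(term, 0.0) for term in query_terms), doc_id)
        for doc_id, term_scores in shard.items()
    )
    return heapq.nlargest(k, scored)


def parallel_top_k(shards, query_terms, k, workers=4):
    """Return the global top-k as (score, doc_id) pairs, best first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda s: shard_top_k(s, query_terms, k), shards)
        # Shards are disjoint, so every global top-k document is in its
        # own shard's top-k; merging the partial lists is therefore exact.
        candidates = [pair for partial in partials for pair in partial]
    return heapq.nlargest(k, candidates)


# Hypothetical toy data: two shards holding two documents each.
EXAMPLE_SHARDS = [
    {"d1": {"a": 2.0, "b": 1.5}, "d2": {"a": 1.0}},
    {"d3": {"a": 1.0, "b": 3.0}, "d4": {"b": 0.5}},
]
print(parallel_top_k(EXAMPLE_SHARDS, ["a", "b"], 2))
```

In CPython the global interpreter lock prevents threads from speeding up pure-Python scoring, so this shows the structure of the computation rather than a real speedup. The workers here share nothing until the final merge, in contrast to the context sharing among threads that the abstract credits for Sparta's scalability.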


Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2020

454 pages

ISBN:9781450368186

DOI:10.1145/3332466

General Chair: Rajiv Gupta (UC Riverside)

Program Chair: Xipeng Shen (NCSU)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Conference

PPoPP '20

PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 22 - 26, 2020

San Diego, California

Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions (23%)

Overall acceptance rate: 230 of 1,014 submissions (23%)
