DOI: 10.1145/3332466.3374522
research-article

Scalable top-k retrieval with Sparta

Published: 19 February 2020

Abstract

Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck.
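As a minimal sequential sketch of the building block described above (not the paper's algorithm), the following sums each document's per-term scores and keeps the k highest totals. The `postings` layout (term to a `{doc_id: score}` map), the function name `top_k`, and the sample scores are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def top_k(postings, query_terms, k):
    """Return the k (doc_id, total_score) pairs with the highest totals.

    `postings` is a hypothetical in-memory index: term -> {doc_id: score}.
    A document's total is the sum of its scores over all query terms.
    """
    totals = defaultdict(float)
    for term in query_terms:
        # Terms missing from the index contribute nothing.
        for doc_id, score in postings.get(term, {}).items():
            totals[doc_id] += score
    # nlargest returns the pairs sorted by descending total score.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Illustrative data only: three terms, three documents.
postings = {
    "scalable": {"d1": 3, "d2": 1},
    "retrieval": {"d2": 4, "d3": 2},
    "top-k": {"d1": 1, "d3": 5},
}
print(top_k(postings, ["scalable", "retrieval", "top-k"], 2))
```

This exhaustive version touches every posting of every query term, which is why the cost grows with both the index size and the number of terms.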
Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, while state-of-the-art algorithms fail to process long queries in real time.
We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state of the art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to an order of magnitude larger than Sparta's.
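The abstract does not spell out Sparta's coordination scheme, so the sketch below is not Sparta. It is a naive term-partitioned baseline, given only to illustrate what "threads sharing context" can mean: one thread per query term adds its term's scores into a shared accumulator guarded by a single lock. The names `parallel_top_k` and `score_term`, the `postings` layout (term to a `{doc_id: score}` map), and the sample data are all hypothetical.

```python
import heapq
import threading
from collections import defaultdict


def parallel_top_k(postings, query_terms, k):
    """Naive baseline (NOT Sparta): one thread per query term.

    Each thread adds its term's scores into a shared accumulator under
    one coarse lock, and the top k are selected once all threads finish.
    """
    totals = defaultdict(float)   # shared state: doc_id -> running total
    lock = threading.Lock()       # single lock guarding `totals`

    def score_term(term):
        term_scores = postings.get(term, {})
        with lock:
            for doc_id, score in term_scores.items():
                totals[doc_id] += score

    threads = [threading.Thread(target=score_term, args=(term,))
               for term in query_terms]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Illustrative data only.
postings = {
    "scalable": {"d1": 3, "d2": 1},
    "retrieval": {"d2": 4, "d3": 2},
    "top-k": {"d1": 1, "d3": 5},
}
print(parallel_top_k(postings, ["scalable", "retrieval", "top-k"], 2))
```

The coarse lock serializes all updates, and in CPython the global interpreter lock also prevents any real speedup, so this shows only the structure of the problem. A scalable design would need much lighter coordination than this, which is the gap the abstract says Sparta addresses.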


Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN:9781450368186
DOI:10.1145/3332466
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. information retrieval
  2. multi-threading
  3. parallel computing
  4. performance
  5. top-k search
  6. web search

Qualifiers

  • Research-article

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 Paper Acceptance Rate 28 of 121 submissions, 23%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 232
  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 04 Oct 2024
