research-article

Open access

External Merge Sort for Top-K Queries: Eager input filtering guided by histograms

Authors:

Yannis Chronis,

Keith Peters

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 2423 - 2437

https://doi.org/10.1145/3318464.3389729

Published: 31 May 2020

Abstract

Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large, and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory, existing top-k algorithms are efficient, as they can eliminate almost all but the top k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input. To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where it significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11 times faster.
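The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: summarize each sorted run with a small histogram, derive a cutoff key from those histograms, and drop later input rows that cannot belong to the top k before they are sorted or written. It is not the paper's algorithm or the F1 Query implementation. The function name `topk_smallest_filtered`, its parameters, the equal-sized bucketing, and the use of in-memory lists in place of runs on secondary storage are all assumptions made for illustration, and "top k" is taken here to mean the k smallest keys.

```python
import heapq
import itertools
from typing import Iterable, List, Optional, Tuple


def topk_smallest_filtered(rows: Iterable[int], k: int, run_size: int,
                           buckets_per_run: int = 4) -> Tuple[List[int], int]:
    """Return (k smallest rows, number of rows written to runs).

    Hypothetical illustration only: lists stand in for runs on disk.
    """
    runs: List[List[int]] = []             # stand-in for sorted runs on secondary storage
    histogram: List[Tuple[int, int]] = []  # (bucket upper bound, rows in bucket)
    cutoff: Optional[int] = None           # rows above the cutoff cannot be in the top k
    buffer: List[int] = []
    written = 0

    def flush() -> None:
        nonlocal cutoff, written
        if not buffer:
            return
        buffer.sort()
        runs.append(list(buffer))
        written += len(buffer)
        # Summarize the run as equal-sized buckets of its sorted contents.
        step = max(1, len(buffer) // buckets_per_run)
        for start in range(0, len(buffer), step):
            bucket = buffer[start:start + step]
            histogram.append((bucket[-1], len(bucket)))
        buffer.clear()
        # Find the smallest bound with at least k stored rows at or below it.
        histogram.sort()
        seen = 0
        for i, (upper, count) in enumerate(histogram):
            seen += count
            if seen >= k:
                cutoff = upper
                del histogram[i + 1:]  # buckets above the cutoff are no longer needed
                break

    for row in rows:
        if cutoff is not None and row > cutoff:
            continue  # eliminated before sorting or writing
        buffer.append(row)
        if len(buffer) >= run_size:
            flush()
    flush()
    # Merge the surviving runs and stop after k rows.
    merged = heapq.merge(*runs)
    return list(itertools.islice(merged, k)), written
```

The cutoff is safe because every bucket counted toward it holds rows that were actually stored and that are no larger than the cutoff, so at least k stored rows already rank at or before it. How much input is eliminated depends on the input order: in this sketch an ascending input is filtered heavily, while a descending input is barely filtered at all.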



Cited By

Wang Q, Luo Q, Wang Y (2024). Relational Algorithms for Top-k Query Evaluation. Proceedings of the ACM on Management of Data 2:3 (1-27). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654971
Li Y, Zhou B, Zhang J, Wei X, Li Y, Chen Y (2024). RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection. Proceedings of the 38th ACM International Conference on Supercomputing (537-548). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3650200.3656596
Siddiqui T, Narasayya V, Dumitru M, Chaudhuri S (2023). Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets. Proceedings of the VLDB Endowment 17:4 (644-656). Online publication date: 1-Dec-2023.
https://dl.acm.org/doi/10.14778/3636218.3636222

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query operators

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)
Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)
Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '20: International Conference on Management of Data
Sponsor: SIGMOD
June 14 - 19, 2020
Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 4
Total Downloads: 1,555
Downloads (Last 12 months): 322
Downloads (Last 6 weeks): 30

Cited By

Cachel K, Rundensteiner E, Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (2023). Fair&Share: Fast and Fair Multi-Criteria Selections. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (152-162). Online publication date: 21-Oct-2023.
https://dl.acm.org/doi/10.1145/3583780.3614874
