DOI: 10.1145/3318464.3389729
research-article
Open access

External Merge Sort for Top-K Queries: Eager input filtering guided by histograms

Published: 31 May 2020

    Abstract

    Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large, and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory, existing top-k algorithms are efficient, as they can eliminate almost all but the top k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input. To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where it significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11 times faster.
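
The abstract only states that a histogram-based model of the input is used to eliminate rows before they are sorted or written out. The Python sketch below illustrates one way such a filter could work; it is not the F1 Query implementation. All names (`top_k_with_histogram_filter`, `memory_rows`, `bucket_size`) are hypothetical, the query is assumed to ask for the k smallest values, and plain in-memory lists stand in for sorted runs on secondary storage.

```python
import bisect
import heapq
from itertools import islice


def top_k_with_histogram_filter(rows, k, memory_rows, bucket_size):
    """Return (k smallest rows in ascending order, number of rows spilled)."""
    runs = []      # assumption: lists stand in for sorted runs on secondary storage
    buckets = []   # histogram: one (upper_bound, row_count) pair per bucket
    buffer = []    # in-memory workspace, holds at most memory_rows rows
    cutoff = None  # rows above this value cannot be part of the top k
    spilled = 0

    def tighten_cutoff():
        # Walk the buckets in ascending order of upper bound. Once their
        # counts add up to k, at least k retained rows are <= that bound.
        buckets.sort()
        seen = 0
        for i, (upper, count) in enumerate(buckets):
            seen += count
            if seen >= k:
                del buckets[i + 1:]  # buckets past this point never matter again
                return upper
        return None  # fewer than k rows described so far

    def flush():
        nonlocal cutoff, spilled
        if not buffer:
            return
        buffer.sort()
        # Summarize the sorted run: one bucket per bucket_size rows.
        for start in range(0, len(buffer), bucket_size):
            chunk = buffer[start:start + bucket_size]
            buckets.append((chunk[-1], len(chunk)))
        new_cutoff = tighten_cutoff()
        if new_cutoff is not None:
            cutoff = new_cutoff
        # Truncate the run at the cutoff before "writing" it.
        keep = len(buffer) if cutoff is None else bisect.bisect_right(buffer, cutoff)
        runs.append(buffer[:keep])
        spilled += keep
        buffer.clear()

    for row in rows:
        if cutoff is not None and row > cutoff:
            continue  # eliminated before being sorted or spilled
        buffer.append(row)
        if len(buffer) >= memory_rows:
            flush()
    flush()
    # Merge the surviving runs and stop after k rows.
    return list(islice(heapq.merge(*runs), k)), spilled
```

In this sketch the cutoff only ever decreases, so a row that was counted in a bucket at or below the current cutoff is never dropped later, and the merge step is guaranteed to find k rows whenever the input has at least k. Calling `top_k_with_histogram_filter(data, k=500, memory_rows=200, bucket_size=20)` exercises the case the abstract highlights, where the requested output is larger than the memory budget.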

    Supplementary Material

    MP4 File (3318464.3389729.mp4)
    Presentation Video


    Cited By

    • (2024) Relational Algorithms for Top-k Query Evaluation. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654971. Online publication date: 30-May-2024
    • (2024) RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection. Proceedings of the 38th ACM International Conference on Supercomputing (537-548). DOI: 10.1145/3650200.3656596. Online publication date: 30-May-2024
    • (2023) Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets. Proceedings of the VLDB Endowment 17:4 (644-656). DOI: 10.14778/3636218.3636222. Online publication date: 1-Dec-2023
    • (2023) Fair&Share: Fast and Fair Multi-Criteria Selections. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (152-162). DOI: 10.1145/3583780.3614874. Online publication date: 21-Oct-2023

      Published In

      SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
      June 2020
      2925 pages
      ISBN: 9781450367356
      DOI: 10.1145/3318464
      This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. out-of-core
      2. query operators
      3. top-k

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '20

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 322
      • Downloads (last 6 weeks): 30

