Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets

Published: 05 March 2024

    Abstract

    Top-k aggregation queries are widely used in data analytics to summarize large amounts of data and identify the important groups within it. These queries are usually processed by first computing exact aggregates for all groups and then selecting the groups with the top-k aggregate values. However, this approach can be inefficient for large, high-cardinality datasets, where intermediate results may not fit within the local caches of multi-core processors, leading to excessive data movement. To address this problem, we have developed Zippy, a new cache-conscious aggregation framework that leverages the skew in the data distribution to minimize data movement. This is achieved by designing cache-resident data structures and an adaptive multi-pass algorithm that quickly identifies candidate groups during processing and performs exact aggregations for these groups. The non-candidate groups are pruned cheaply using efficient hashing and partitioning techniques, without performing exact aggregations. We develop techniques to improve robustness over adversarial data distributions and have optimized the framework to reuse computations incrementally for rolling (or paginated) top-k aggregate queries. Our extensive evaluation using both real-world and synthetic datasets demonstrates that Zippy can achieve a median speed-up of more than 3× for monotonic aggregation functions across typical ranges of k values (e.g., 1 to 100) and 1.4× for non-monotonic functions when compared with state-of-the-art cache-conscious aggregation techniques.
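
The contrast drawn in the abstract, between aggregating every group exactly and pruning non-candidate groups cheaply, can be illustrated with a small sketch. This is not Zippy's algorithm, which the abstract does not specify in detail. It is a simplified two-pass illustration under stated assumptions: the aggregate is SUM over non-negative values, the candidate set is simply the first few distinct keys seen, and all names (`topk_baseline`, `topk_pruned`, `num_buckets`, `num_candidates`) are hypothetical. Each hash bucket's total is an upper bound on the sum of any group hashed into it, so a group whose bucket total falls below the k-th largest exact candidate sum cannot be in the top-k and is skipped in the second pass.

```python
import heapq
from collections import defaultdict


def topk_baseline(rows, k):
    """Conventional plan: exact SUM for every group, then pick the k largest."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return heapq.nlargest(k, sums.items(), key=lambda kv: (kv[1], kv[0]))


def topk_pruned(rows, k, num_buckets=1024, num_candidates=64):
    """Illustrative two-pass top-k SUM with hash-based pruning.

    Assumes non-negative values, so a bucket total bounds every group in it.
    """
    if k <= 0:
        return []
    rows = list(rows)  # the input is scanned twice

    # Pass 1: small bucket totals (upper bounds) plus exact sums for a
    # small candidate set (assumption: the first distinct keys seen).
    bucket_sum = [0] * num_buckets
    cand = {}
    for key, value in rows:
        bucket_sum[hash(key) % num_buckets] += value
        if key in cand:
            cand[key] += value
        elif len(cand) < num_candidates:
            cand[key] = value

    # The k-th largest exact candidate sum is a lower bound on the true
    # k-th largest group sum. With fewer than k candidates, prune nothing.
    if len(cand) >= k:
        threshold = heapq.nlargest(k, cand.values())[-1]
    else:
        threshold = 0

    # Pass 2: exact aggregation only for groups that survive the bound.
    exact = dict(cand)
    for key, value in rows:
        if key in cand:
            continue  # already aggregated exactly in pass 1
        if bucket_sum[hash(key) % num_buckets] < threshold:
            continue  # pruned: this group cannot reach the top-k
        exact[key] = exact.get(key, 0) + value

    return heapq.nlargest(k, exact.items(), key=lambda kv: (kv[1], kv[0]))
```

Both functions return the same top-k list for non-negative inputs. The pruned variant keeps only the bucket array and the surviving groups in memory during the second pass, which is the kind of saving that matters when the full per-group table would not fit in cache. On skewed data the early candidates tend to include heavy groups, which raises the threshold and prunes more of the long tail.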

    Published In

    Proceedings of the VLDB Endowment, Volume 17, Issue 4
    December 2023
    309 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment
