research-article

Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets

Authors:

Tarique Siddiqui,

Vivek Narasayya,

Marius Dumitru,

Surajit Chaudhuri

Proceedings of the VLDB Endowment, Volume 17, Issue 4

Pages 644 - 656

https://doi.org/10.14778/3636218.3636222

Published: 05 March 2024

Abstract

Top-k aggregation queries are widely used in data analytics for summarizing and identifying important groups from large amounts of data. These queries are usually processed by first computing exact aggregates for all groups and then selecting the groups with the top-k aggregate values. However, such an approach can be inefficient for large, high-cardinality datasets, where intermediate results may not fit within the local cache of multi-core processors, leading to excessive data movement. To address this problem, we have developed Zippy, a new cache-conscious aggregation framework that leverages the skew in the data distribution to minimize data movement. This is achieved by designing cache-resident data structures and an adaptive multi-pass algorithm that quickly identifies candidate groups during processing and performs exact aggregations only for these groups. The non-candidate groups are pruned cheaply using efficient hashing and partitioning techniques, without performing exact aggregations. We develop techniques to improve robustness over adversarial data distributions and have optimized the framework to reuse computations incrementally for rolling (or paginated) top-k aggregate queries. Our extensive evaluation using both real-world and synthetic datasets demonstrates that Zippy can achieve a median speed-up of more than 3× for monotonic aggregation functions across typical ranges of k values (e.g., 1 to 100) and 1.4× for non-monotonic functions when compared with state-of-the-art cache-conscious aggregation techniques.
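To make the contrast in the abstract concrete, the following Python sketch compares the conventional approach (aggregate every group exactly, then select the top k) with one simple way of pruning groups through hashing and partitioning. It is an illustration only and not Zippy's algorithm: the function names, the `num_partitions` parameter, and the restriction to SUM over non-negative values are assumptions made for this example. Under that restriction, the sum of a hash partition is an upper bound on the sum of every group inside it, so a partition whose bound cannot beat the current k-th best group can be skipped without exact aggregation.

```python
import heapq
from collections import defaultdict


def topk_baseline(rows, k):
    """Aggregate every group exactly, then select the k largest sums."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return heapq.nlargest(k, sums.items(), key=lambda kv: kv[1])


def topk_pruned(rows, k, num_partitions=64):
    """Illustrative pruning for SUM, assuming all values are non-negative."""
    if k <= 0:
        return []
    # Pass 1: hash-partition the rows and keep one running sum per partition.
    # With non-negative values, that sum bounds every group in the partition.
    partitions = [[] for _ in range(num_partitions)]
    bounds = [0] * num_partitions
    for key, value in rows:
        p = hash(key) % num_partitions
        partitions[p].append((key, value))
        bounds[p] += value
    # Pass 2: visit partitions from the largest bound to the smallest.
    order = sorted(range(num_partitions), key=lambda p: bounds[p], reverse=True)
    top = []  # min-heap of (sum, key) holding the best k groups seen so far
    for p in order:
        if len(top) == k and top[0][0] >= bounds[p]:
            break  # no group in this or any later partition can do better
        sums = defaultdict(int)
        for key, value in partitions[p]:
            sums[key] += value
        for key, total in sums.items():
            if len(top) < k:
                heapq.heappush(top, (total, key))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, key))
    result = [(key, total) for total, key in top]
    return sorted(result, key=lambda kv: kv[1], reverse=True)
```

How much work the early exit saves depends on the skew of the data and on how many groups share a partition; when group sums are nearly uniform, few partitions can be skipped. With tied sums, the two functions may return different but equally valid groups for the last positions.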


Index Terms

Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory




Information

Published In


Proceedings of the VLDB Endowment Volume 17, Issue 4

December 2023

309 pages

ISSN:2150-8097

Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)


Publisher

VLDB Endowment
