research-article

Open access

Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries

Authors:

Alex Nicolau

Proceedings of the ACM on Management of Data, Volume 2, Issue 3

Article No.: 129, Pages 1 - 26

https://doi.org/10.1145/3654932

Published: 30 May 2024

Abstract

With the increasing rate of data generated by critical systems, estimating functions on streaming data has become essential. This demand has driven numerous advancements in algorithms designed to efficiently query and analyze one or more data streams while operating under memory constraints. The primary challenge arises from the rapid influx of new items, which requires algorithms that process streams incrementally and efficiently enough to keep up. A prominent algorithm in this domain is the AMS sketch. Originally developed to estimate the second frequency moment of a data stream, it can also estimate the cardinality of the equi-join between two relations. Two important advancements have followed: the Count sketch, which significantly improves the sketch update time, and an extension of the AMS sketch that accommodates multi-join queries. However, combining the strengths of these methods to maintain sketches for multi-join queries while ensuring fast update times is a non-trivial task, and it has remained an open problem for decades, as highlighted in the existing literature. In this work, we address this problem by introducing a novel sketching method that has fast updates, even for sketches capable of accurately estimating the cardinality of complex multi-join queries. We prove that our estimator is unbiased and has the same error guarantees as the AMS-based method. Our experimental results confirm the significant improvement in update time complexity, resulting in orders of magnitude faster estimates, with equal or better estimation accuracy.
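
To make the two-relation case mentioned above concrete, the following is a minimal sketch of equi-join cardinality estimation with Count sketches. It is an illustration under stated assumptions, not the method introduced in this work: the class `CountSketch` and the functions `estimate_join_size` and `median_join_estimate` are hypothetical names, and a seeded pseudo-random generator stands in for the bucket and sign hash functions that a real implementation would draw from a suitable hash family. Each update touches a single counter, which is the source of the fast update time, and the inner product of two sketches built with the same hash functions serves as the join-size estimate.

```python
import random
import statistics


class CountSketch:
    """A single-row Count sketch: `width` signed counters."""

    def __init__(self, width, seed):
        self.width = width
        self.seed = seed
        self.counters = [0] * width

    def _bucket_and_sign(self, key):
        # Assumption: a PRNG seeded per (seed, key) stands in for the
        # bucket hash and the +/-1 sign hash of a real implementation.
        rng = random.Random(f"{self.seed}:{key}")
        return rng.randrange(self.width), rng.choice((-1, 1))

    def update(self, key, count=1):
        # Constant-time update: exactly one counter changes per item.
        bucket, sign = self._bucket_and_sign(key)
        self.counters[bucket] += sign * count


def estimate_join_size(left, right):
    """Inner product of two sketches that share width and seed."""
    if (left.width, left.seed) != (right.width, right.seed):
        raise ValueError("sketches must share the same hash functions")
    return sum(x * y for x, y in zip(left.counters, right.counters))


def median_join_estimate(left_keys, right_keys, width=256, repetitions=5):
    """Median over independent repetitions to reduce the impact of outliers."""
    estimates = []
    for seed in range(repetitions):
        left = CountSketch(width, seed)
        right = CountSketch(width, seed)
        for key in left_keys:
            left.update(key)
        for key in right_keys:
            right.update(key)
        estimates.append(estimate_join_size(left, right))
    return statistics.median(estimates)
```

For example, `median_join_estimate(r_keys, s_keys)` takes the join-attribute values of two relations as iterables and returns an estimate of the number of matching pairs. Keys that collide in a bucket contribute error terms whose signs are random, so wider sketches and more repetitions tend to give tighter estimates.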

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        1. Query optimization
2. Theory of computation
  1. Design and analysis of algorithms
    1. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling
  2. Models of computation
    1. Streaming models

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 3

SIGMOD

June 2024

1953 pages

EISSN:2836-6573

DOI:10.1145/3670010

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States
