Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries

Published: 30 May 2024

Abstract

With the increasing rate of data generated by critical systems, estimating functions on streaming data has become essential. This demand has driven numerous advances in algorithms that efficiently query and analyze one or more data streams under memory constraints. The primary challenge is the rapid influx of new items, which requires algorithms that process streams incrementally and efficiently enough to keep up. A prominent algorithm in this domain is the AMS sketch. Originally developed to estimate the second frequency moment of a data stream, it can also estimate the cardinality of the equi-join between two relations. Two important advances have followed: the Count sketch, which significantly improves the sketch update time, and an extension of the AMS sketch that accommodates multi-join queries. However, combining the strengths of these methods to maintain sketches for multi-join queries while ensuring fast update times is a non-trivial task and, as highlighted in the existing literature, has remained an open problem for decades. In this work, we address this problem by introducing a novel sketching method that has fast updates, even for sketches capable of accurately estimating the cardinality of complex multi-join queries. We prove that our estimator is unbiased and has the same error guarantees as the AMS-based method. Our experimental results confirm the significant improvement in update time complexity, resulting in orders-of-magnitude faster estimates with equal or better estimation accuracy.
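
As background for the two-relation case mentioned in the abstract, the following is a minimal sketch of how a Count sketch can estimate an equi-join size while touching only one counter per row on each update. It is not the multi-join method introduced in the article. The class name `CountSketch`, the function `estimate_join_size`, the parameters `width`, `depth`, `domain`, and `seed`, and the use of fully random lookup tables in place of limited-independence hash families are all assumptions made for illustration.

```python
import random
from statistics import median


class CountSketch:
    """Illustrative Count sketch over integer keys in range(domain)."""

    def __init__(self, width, depth, domain, seed):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # Assumption: random lookup tables stand in for the bucket and sign
        # hash functions. This is only practical for small key domains.
        self.bucket = [[rng.randrange(width) for _ in range(domain)]
                       for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(domain)]
                     for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, key, count=1):
        # One counter per row is modified, so the cost is O(depth) and does
        # not grow with the row width. A negative count models a deletion.
        for r in range(self.depth):
            self.table[r][self.bucket[r][key]] += self.sign[r][key] * count


def estimate_join_size(a, b):
    """Estimate the equi-join size of the two sketched relations.

    Both sketches must be built with the same seed so that they share
    hash functions. Each row yields one inner-product estimate, and the
    median over rows is returned.
    """
    per_row = [sum(x * y for x, y in zip(row_a, row_b))
               for row_a, row_b in zip(a.table, b.table)]
    return median(per_row)


if __name__ == "__main__":
    r = CountSketch(width=64, depth=5, domain=100, seed=42)
    s = CountSketch(width=64, depth=5, domain=100, seed=42)
    for key in [3, 3, 7, 9]:
        r.update(key)
    for key in [3, 7, 7, 11]:
        s.update(key)
    # The true join size is 2*1 + 1*2 = 4; the printed value is an estimate.
    print(estimate_join_size(r, s))
```

Sharing the seed is the important design point: the row-wise inner product only approximates the join size when both relations map a given key to the same bucket with the same sign.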

Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 3
SIGMOD
June 2024
1953 pages
EISSN: 2836-6573
DOI: 10.1145/3670010
This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2024
Published in PACMMOD Volume 2, Issue 3

Author Tags

  1. cardinality estimation
  2. sketching
  3. synopsis data structures

Qualifiers

  • Research-article

