Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Cardinality estimation in DBMS: a comprehensive benchmark evaluation

Published: 01 December 2021 Publication History

Abstract

Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans for a query optimizer in DBMS. In the last decade, an increasing number of advanced CardEst methods (especially ML-based) have been proposed with outstanding estimation accuracy and inference latency. However, there exists no study that systematically evaluates the quality of these methods and answer the fundamental problem: to what extent can these methods improve the performance of query optimizer in real-world settings, which is the ultimate goal of a CardEst method.
In this paper, we comprehensively and systematically compare the effectiveness of CardEst methods in a real DBMS. We establish a new benchmark for CardEst, which contains a new complex real-world dataset STATS and a diverse query workload STATS-CEB. We integrate multiple most representative CardEst methods into an open-source DBMS PostgreSQL, and comprehensively evaluate their true effectiveness in improving query plan quality, and other important aspects affecting their applicability. We obtain a number of key findings under different data and query settings. Furthermore, we find that the widely used estimation accuracy metric (Q-Error) cannot distinguish the importance of different sub-plan queries during query optimization and thus cannot truly reflect the generated query plan quality. Therefore, we propose a new metric P-Error to evaluate the performance of CardEst methods, which overcomes the limitation of Q-Error and is able to reflect the overall end-to-end performance of CardEst methods. It could serve as a better optimization objective for future CardEst methods.

References

[1]
Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. 2001. STHoles: a multidimensional workload-aware histogram. In SIGMOD. 211--222.
[2]
Walter Cai. 2021. Github repository: pqo-open source. https://github.com/waltercai (2021).
[3]
Walter Cai, Magdalena Balazinska, and Dan Suciu. 2019. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In SIGMOD. 18--35.
[4]
Surajit Chaudhuri, Vivek Narasayya, and Ravi Ramamurthy. 2009. Exact cardinality query optimization for optimizer testing. Proceedings of the VLDB Endowment 2, 1 (2009), 994--1005.
[5]
Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785--794.
[6]
C. Chow and Cong Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory 14, 3 (1968), 462--467.
[7]
Transaction Processing Performance Council(TPC). 2021. TPC-DS Vesion 2 and Version 3. http://www.tpc.org/tpcds/ (2021).
[8]
Transaction Processing Performance Council(TPC). 2021. TPC-H Vesion 2 and Version 3. http://www.tpc.org/tpch/ (2021).
[9]
Mattia Desana and Christoph Schnörr. 2020. Sum-product graphical models. Machine Learning 109, 1 (2020), 135--173.
[10]
Amol Deshpande, Minos Garofalakis, and Rajeev Rastogi. 2001. Independence is good: Dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Record 30, 2 (2001), 199--210.
[11]
Postgresql Documentation 12. 2020. Chapter 70.1. Row Estimation Examples. https://www.postgresql.org/docs/current/row-estimation-examples.html (2020).
[12]
Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. 2019. Selectivity estimation for range predicates using lightweight models. PVLDB 12, 9 (2019), 1044--1057.
[13]
Dennis Fuchs, Zhen He, and Byung Suk Lee. 2007. Compressed histograms with arbitrary bucket layouts for selectivity estimation. Information Sciences 177, 3 (2007), 680--702.
[14]
Hector Garcia-Molina and Gio Wiederhold. 1982. Read-only transactions in a distributed database. ACM Transactions on Database Systems (TODS) 7, 2 (1982), 209--234.
[15]
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. MADE: Masked autoencoder for distribution estimation. International Conference on Machine Learning (2015), 881--889.
[16]
Lise Getoor, Benjamin Taskar, and Daphne Koller. 2001. Selectivity estimation using probabilistic models. In SIGMOD. 461--472.
[17]
Dimitrios Gunopulos, George Kollios, Vassilis J Tsotras, and Carlotta Domeniconi. 2000. Approximating multi-dimensional aggregate range queries over real attributes. In SIGMOD. 463--474.
[18]
Dimitrios Gunopulos, George Kollios, Vassilis J Tsotras, and Carlotta Domeniconi. 2005. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14, 2 (2005), 137--154.
[19]
Max Halford, Philippe Saint-Pierre, and Franck Morvan. 2019. An approach based on bayesian networks for query selectivity estimation. DASFAA 2 (2019).
[20]
Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2019. Multi-attribute selectivity estimation using deep learning. In SIGMOD.
[21]
Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2020. Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1035--1050.
[22]
Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD. 1477--1492.
[23]
Benjamin Hilprecht. 2019. Github repository: deepdb public. https://github.com/DataManagementLab/deepdb-public (2019).
[24]
Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: learn from data, not from queries!. In PVLDB.
[25]
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017), 3146--3154.
[26]
Andranik Khachatryan, Emmanuel Müller, Christian Stier, and Klemens Böhm. 2015. Improving accuracy and robustness of self-tuning histograms by subspace clustering. IEEE TKDE 27, 9 (2015), 2377--2389.
[27]
Martin Kiefer, Max Heimel, Sebastian Breß, and Volker Markl. 2017. Estimating join selectivities using bandwidth-optimized kernel density models. PVLDB 10, 13 (2017), 2085--2096.
[28]
Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2019. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR.
[29]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? PVLDB 9, 3 (2015), 204--215.
[30]
Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. 2017. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR.
[31]
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In SIGMOD. 615--629.
[32]
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2019. Wander join and XDB: online aggregation via random walks. ACM Transactions on Database Systems 44, 1 (2019), 1--41.
[33]
Eric Liang, Zongheng Yang, Ion Stoica, Pieter Abbeel, Yan Duan, and Peter Chen. 2020. Variable Skipping for Autoregressive Range Density Estimation. In ICML. 6040--6049.
[34]
Pedro Lopes, Craig Guyer, and Milener Gene. 2019. Sql docs: cardinality estimation (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15 (2019).
[35]
Frank Luan, Amog Kamsetty, Eric Liang, and Zongheng Yang. 2020. Github repository: neurocard project. https://github.com/neurocard/neurocard (2020).
[36]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. NIPS (2013).
[37]
Guido Moerkotte, Thomas Neumann, and Gabriele Steidl. 2009. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment 2, 1 (2009), 982--993.
[38]
M Muralikrishna and David J DeWitt. 1988. Equi-depth multidimensional histograms. In Proceedings of the 1988 ACM SIGMOD international conference on Management of data. 28--36.
[39]
Yoon-Min Nam Nam, Donghyoung Han Han, and Min-Soo Kim Kim. 2020. SPRINTER: a fast n-ary join query processing method for complex OLAP queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2055--2070.
[40]
Parimarjan Negi. 2021. Github repository: pg_hint_plan. https://github.com/parimarjan/pg_hint_plan (2021).
[41]
Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2021. Flow-Loss: Learning Cardinality Estimates That Matter. arXiv preprint arXiv:2101.04964 (2021).
[42]
Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2020. Cost-guided cardinality estimation: Focus where it matters. In 2020 IEEE 36th International Conference on Data Engineering Workshops (ICDEW). IEEE, 154--157.
[43]
Patrick O'Neil, Elizabeth O'Neil, Xuedong Chen, and Stephen Revilak. 2009. The star schema benchmark and augmented fact table indexing. In Technology Conference on Performance Evaluation and Benchmarking. Springer, 237--252.
[44]
Yeonsu Park, Seongyun Ko, Sourav S Bhowmick, Kyoungmin Kim, Kijae Hong, and Wook-Shin Han. 2020. G-CARE: A framework for performance benchmarking of cardinality estimation techniques for subgraph matching. In SIGMOD. 1099--1114.
[45]
Pedro Pedreira, Yinghai Lu, Sergey Pershin, Amit Dutta, and Chris Croswhite. 2018. Rethinking concurrency control for in-memory OLAP dbmss. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 1453--1464.
[46]
Matthew Perron, Zeyuan Shang, Tim Kraska, and Michael Stonebraker. 2019. How I learned to stop worrying and love re-optimization. In ICDE. 1758--1761.
[47]
Hoifung Poon and Pedro Domingos. 2011. Sum-product networks: A new deep architecture. In ICCV Workshops. 689--690.
[48]
Viswanath Poosala and Yannis E Ioannidis. 1997. Selectivity estimation without the attribute value independence assumption. In VLDB, Vol. 97. 486--495.
[49]
MySQL 8.0 Reference Manual. 2020. Chapter 15.8.10.2 Configuring Non-Persistent Optimizer Statistics Parameters. https://dev.mysql.com/doc/refman/8.0/en/innodb-statistics-estimation.html (2020).
[50]
P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. In SIGMOD. 23--34.
[51]
MariaDB Server Documentation. 2020. Statistics for optimizing queries: InnoDB persistent statistics. https://mariadb.com/kb/en/innodb-persistent-statistics/ (2020).
[52]
Suraj Shetiya, Saravanan Thirumuruganathan, Nick Koudas, and Gautam Das. 2020. Astrid: accurate selectivity estimation for string predicates using deep learning. Proceedings of the VLDB Endowment 14, 4 (2020), 471--484.
[53]
Utkarsh Srivastava, Peter J Haas, Volker Markl, Marcel Kutsch, and Tam Minh Tran. 2006. Isomer: Consistent histogram construction using query feedback. In ICDE. 39--39.
[54]
Michael Stillger, Guy M Lohman, Volker Markl, and Mokhtar Kandil. 2001. LEO-DB2's learning optimizer. In PVLDB, Vol. 1. 19--28.
[55]
Ji Sun and Guoliang Li. 2019. An end-to-end learning-based cost estimator. VLDB (2019).
[56]
Vladimir Svetnik, Andy Liaw, Christopher Tong, J Christopher Culberson, Robert P Sheridan, and Bradley P Feuston. 2003. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences 43, 6 (2003), 1947--1958.
[57]
Immanuel Trummer. 2019. Exact cardinality query optimization with bounded execution cost. In Proceedings of the 2019 International Conference on Management of Data. 2--17.
[58]
Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. 2011. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4, 11 (2011), 852--863.
[59]
Hai Wang and Kenneth C Sevcik. 2003. A multi-dimensional histogram for selectivity estimation and fast approximate query answering. In Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research. 328--342.
[60]
Xiaoying Wang, Changbo Qu, Weiyuan Wu, Jiannan Wang, and Qingqing Zhou. 2021. Are We Ready For Learned Cardinality Estimation? VLDB 14, 9 (2021), 1640--1654.
[61]
Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. PVLDB 12, 3 (2018), 210--222.
[62]
Peizhi Wu. 2021. Github repository: UAE/UEA-Q. https://github.com/pagegitss/UAE (2021).
[63]
Peizhi Wu and Gao Cong. 2021. A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data.
[64]
Ziniu Wu. 2021. Github repository: BayesCard. https://github.com/wuziniu/BayesCard (2021).
[65]
Ziniu Wu. 2021. Github repository: FSPN. https://github.com/wuziniu/FSPN (2021).
[66]
Ziniu Wu and Amir Shaikhha. 2020. BayesCard: A Unified Bayesian Framework for Cardinality Estimation. arXiv preprint arXiv:2012.14743 (2020).
[67]
Ziniu Wu, Peilun Yang, Pei Yu, Rong Zhu, Yuxing Han, Yaliang Li, Defu Lian, Kai Zeng, and Jingren Zhou. 2021. A Unified Transferable Model for ML-Enhanced DBMS. arXiv preprint arXiv:2105.02418 (2021).
[68]
Ziniu Wu, Rong Zhu, Andreas Pfadler, Yuxing Han, Jiangneng Li, Zhengping Qian, Kai Zeng, and Jingren Zhou. 2020. FSPN: A New Class of Probabilistic Graphical Model. arXiv preprint arXiv:2011.09020 (2020).
[69]
Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2021. NeuroCard: One Cardinality Estimator for All Tables. PVLDB 14, 1 (2021), 61--73.
[70]
Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. PVLDB (2019).
[71]
Zongheng Yang and Chenggang Wu. 2019. Github repository: naru project. https://github.com/naru-project/naru (2019).
[72]
Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In SIGMOD. 1525--1539.
[73]
Zhuoyue Zhao, Feifei Li, and Yuxi Liu. 2020. Efficient join synopsis maintenance for data warehouse. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2027--2042.
[74]
Zhuoyue Zhao, Bin Wu, Feifei Li, and Ke Yi. 2021. Github repository: InitialDL-Lab/XDB. https://github.com/InitialDLab/XDB (2021).
[75]
Rong Zhu, Ziniu Wu, Yuxing Han, Kai Zeng, Andreas Pfadler, Zhengping Qian, Jingren Zhou, and Bin Cui. 2021. FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation. VLDB 14, 9 (2021), 1489--1502.

Cited By

View all
  • (2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
  • (2024)PilotScope: Steering Databases with Machine Learning DriversProceedings of the VLDB Endowment10.14778/3641204.364120917:5(980-993)Online publication date: 1-Jan-2024
  • (2024)Eraser: Eliminating Performance Regression on Learned Query OptimizerProceedings of the VLDB Endowment10.14778/3641204.364120517:5(926-938)Online publication date: 1-Jan-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 15, Issue 4
December 2021
246 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2021
Published in PVLDB Volume 15, Issue 4

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)111
  • Downloads (Last 6 weeks)10
Reflects downloads up to 09 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
  • (2024)PilotScope: Steering Databases with Machine Learning DriversProceedings of the VLDB Endowment10.14778/3641204.364120917:5(980-993)Online publication date: 1-Jan-2024
  • (2024)Eraser: Eliminating Performance Regression on Learned Query OptimizerProceedings of the VLDB Endowment10.14778/3641204.364120517:5(926-938)Online publication date: 1-Jan-2024
  • (2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
  • (2024)QardEst: Using Quantum Machine Learning for Cardinality Estimation of Join QueriesProceedings of the 1st Workshop on Quantum Computing and Quantum-Inspired Technology for Data-Intensive Systems and Applications10.1145/3665225.3665444(2-13)Online publication date: 9-Jun-2024
  • (2024)Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join QueriesProceedings of the ACM on Management of Data10.1145/36549322:3(1-26)Online publication date: 30-May-2024
  • (2024)Join Size Bounds using lp-Norms on Degree SequencesProceedings of the ACM on Management of Data10.1145/36515972:2(1-24)Online publication date: 14-May-2024
  • (2024)Approximate SketchesProceedings of the ACM on Management of Data10.1145/36393212:1(1-24)Online publication date: 26-Mar-2024
  • (2024)Wred: Workload Reduction for Scalable Index TuningProceedings of the ACM on Management of Data10.1145/36393052:1(1-26)Online publication date: 26-Mar-2024
  • (2024)Learned Query Optimizer: What is New and What is NextCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654692(561-569)Online publication date: 9-Jun-2024
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media