research-article

Cardinality estimation in DBMS: a comprehensive benchmark evaluation

Authors:

Andreas Pfadler,

Zhengping Qian,

Bin CuiAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 15, Issue 4

Pages 752 - 765

https://doi.org/10.14778/3503585.3503586

Published: 01 December 2021 Publication History

Abstract

Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans for a query optimizer in DBMS. In the last decade, an increasing number of advanced CardEst methods (especially ML-based) have been proposed with outstanding estimation accuracy and inference latency. However, there exists no study that systematically evaluates the quality of these methods and answer the fundamental problem: to what extent can these methods improve the performance of query optimizer in real-world settings, which is the ultimate goal of a CardEst method.

In this paper, we comprehensively and systematically compare the effectiveness of CardEst methods in a real DBMS. We establish a new benchmark for CardEst, which contains a new complex real-world dataset STATS and a diverse query workload STATS-CEB. We integrate multiple most representative CardEst methods into an open-source DBMS PostgreSQL, and comprehensively evaluate their true effectiveness in improving query plan quality, and other important aspects affecting their applicability. We obtain a number of key findings under different data and query settings. Furthermore, we find that the widely used estimation accuracy metric (Q-Error) cannot distinguish the importance of different sub-plan queries during query optimization and thus cannot truly reflect the generated query plan quality. Therefore, we propose a new metric P-Error to evaluate the performance of CardEst methods, which overcomes the limitation of Q-Error and is able to reflect the overall end-to-end performance of CardEst methods. It could serve as a better optimization objective for future CardEst methods.

References

[1]

Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. 2001. STHoles: a multidimensional workload-aware histogram. In SIGMOD. 211--222.

[2]

Walter Cai. 2021. Github repository: pqo-open source. https://github.com/waltercai (2021).

[3]

Walter Cai, Magdalena Balazinska, and Dan Suciu. 2019. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In SIGMOD. 18--35.

[4]

Surajit Chaudhuri, Vivek Narasayya, and Ravi Ramamurthy. 2009. Exact cardinality query optimization for optimizer testing. Proceedings of the VLDB Endowment 2, 1 (2009), 994--1005.

Digital Library

[5]

Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785--794.

Digital Library

[6]

C. Chow and Cong Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory 14, 3 (1968), 462--467.

Digital Library

[7]

Transaction Processing Performance Council(TPC). 2021. TPC-DS Vesion 2 and Version 3. http://www.tpc.org/tpcds/ (2021).

[8]

Transaction Processing Performance Council(TPC). 2021. TPC-H Vesion 2 and Version 3. http://www.tpc.org/tpch/ (2021).

[9]

Mattia Desana and Christoph Schnörr. 2020. Sum-product graphical models. Machine Learning 109, 1 (2020), 135--173.

Digital Library

[10]

Amol Deshpande, Minos Garofalakis, and Rajeev Rastogi. 2001. Independence is good: Dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Record 30, 2 (2001), 199--210.

Digital Library

[11]

Postgresql Documentation 12. 2020. Chapter 70.1. Row Estimation Examples. https://www.postgresql.org/docs/current/row-estimation-examples.html (2020).

[12]

Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. 2019. Selectivity estimation for range predicates using lightweight models. PVLDB 12, 9 (2019), 1044--1057.

Digital Library

[13]

Dennis Fuchs, Zhen He, and Byung Suk Lee. 2007. Compressed histograms with arbitrary bucket layouts for selectivity estimation. Information Sciences 177, 3 (2007), 680--702.

Digital Library

[14]

Hector Garcia-Molina and Gio Wiederhold. 1982. Read-only transactions in a distributed database. ACM Transactions on Database Systems (TODS) 7, 2 (1982), 209--234.

Digital Library

[15]

Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. MADE: Masked autoencoder for distribution estimation. International Conference on Machine Learning (2015), 881--889.

[16]

Lise Getoor, Benjamin Taskar, and Daphne Koller. 2001. Selectivity estimation using probabilistic models. In SIGMOD. 461--472.

[17]

Dimitrios Gunopulos, George Kollios, Vassilis J Tsotras, and Carlotta Domeniconi. 2000. Approximating multi-dimensional aggregate range queries over real attributes. In SIGMOD. 463--474.

[18]

Dimitrios Gunopulos, George Kollios, Vassilis J Tsotras, and Carlotta Domeniconi. 2005. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14, 2 (2005), 137--154.

Digital Library

[19]

Max Halford, Philippe Saint-Pierre, and Franck Morvan. 2019. An approach based on bayesian networks for query selectivity estimation. DASFAA 2 (2019).

[20]

Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2019. Multi-attribute selectivity estimation using deep learning. In SIGMOD.

[21]

Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2020. Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1035--1050.

Digital Library

[22]

Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD. 1477--1492.

[23]

Benjamin Hilprecht. 2019. Github repository: deepdb public. https://github.com/DataManagementLab/deepdb-public (2019).

[24]

Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: learn from data, not from queries!. In PVLDB.

[25]

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017), 3146--3154.

Digital Library

[26]

Andranik Khachatryan, Emmanuel Müller, Christian Stier, and Klemens Böhm. 2015. Improving accuracy and robustness of self-tuning histograms by subspace clustering. IEEE TKDE 27, 9 (2015), 2377--2389.

[27]

Martin Kiefer, Max Heimel, Sebastian Breß, and Volker Markl. 2017. Estimating join selectivities using bandwidth-optimized kernel density models. PVLDB 10, 13 (2017), 2085--2096.

Digital Library

[28]

Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2019. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR.

[29]

Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? PVLDB 9, 3 (2015), 204--215.

Digital Library

[30]

Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. 2017. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR.

[31]

Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In SIGMOD. 615--629.

[32]

Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2019. Wander join and XDB: online aggregation via random walks. ACM Transactions on Database Systems 44, 1 (2019), 1--41.

Digital Library

[33]

Eric Liang, Zongheng Yang, Ion Stoica, Pieter Abbeel, Yan Duan, and Peter Chen. 2020. Variable Skipping for Autoregressive Range Density Estimation. In ICML. 6040--6049.

[34]

Pedro Lopes, Craig Guyer, and Milener Gene. 2019. Sql docs: cardinality estimation (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15 (2019).

[35]

Frank Luan, Amog Kamsetty, Eric Liang, and Zongheng Yang. 2020. Github repository: neurocard project. https://github.com/neurocard/neurocard (2020).

[36]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. NIPS (2013).

Digital Library

[37]

Guido Moerkotte, Thomas Neumann, and Gabriele Steidl. 2009. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment 2, 1 (2009), 982--993.

Digital Library

[38]

M Muralikrishna and David J DeWitt. 1988. Equi-depth multidimensional histograms. In Proceedings of the 1988 ACM SIGMOD international conference on Management of data. 28--36.

Digital Library

[39]

Yoon-Min Nam Nam, Donghyoung Han Han, and Min-Soo Kim Kim. 2020. SPRINTER: a fast n-ary join query processing method for complex OLAP queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2055--2070.

Digital Library

[40]

Parimarjan Negi. 2021. Github repository: pg_hint_plan. https://github.com/parimarjan/pg_hint_plan (2021).

[41]

Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2021. Flow-Loss: Learning Cardinality Estimates That Matter. arXiv preprint arXiv:2101.04964 (2021).

[42]

Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2020. Cost-guided cardinality estimation: Focus where it matters. In 2020 IEEE 36th International Conference on Data Engineering Workshops (ICDEW). IEEE, 154--157.

[43]

Patrick O'Neil, Elizabeth O'Neil, Xuedong Chen, and Stephen Revilak. 2009. The star schema benchmark and augmented fact table indexing. In Technology Conference on Performance Evaluation and Benchmarking. Springer, 237--252.

Digital Library

[44]

Yeonsu Park, Seongyun Ko, Sourav S Bhowmick, Kyoungmin Kim, Kijae Hong, and Wook-Shin Han. 2020. G-CARE: A framework for performance benchmarking of cardinality estimation techniques for subgraph matching. In SIGMOD. 1099--1114.

[45]

Pedro Pedreira, Yinghai Lu, Sergey Pershin, Amit Dutta, and Chris Croswhite. 2018. Rethinking concurrency control for in-memory OLAP dbmss. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 1453--1464.

[46]

Matthew Perron, Zeyuan Shang, Tim Kraska, and Michael Stonebraker. 2019. How I learned to stop worrying and love re-optimization. In ICDE. 1758--1761.

[47]

Hoifung Poon and Pedro Domingos. 2011. Sum-product networks: A new deep architecture. In ICCV Workshops. 689--690.

[48]

Viswanath Poosala and Yannis E Ioannidis. 1997. Selectivity estimation without the attribute value independence assumption. In VLDB, Vol. 97. 486--495.

Digital Library

[49]

MySQL 8.0 Reference Manual. 2020. Chapter 15.8.10.2 Configuring Non-Persistent Optimizer Statistics Parameters. https://dev.mysql.com/doc/refman/8.0/en/innodb-statistics-estimation.html (2020).

[50]

P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. In SIGMOD. 23--34.

[51]

MariaDB Server Documentation. 2020. Statistics for optimizing queries: InnoDB persistent statistics. https://mariadb.com/kb/en/innodb-persistent-statistics/ (2020).

[52]

Suraj Shetiya, Saravanan Thirumuruganathan, Nick Koudas, and Gautam Das. 2020. Astrid: accurate selectivity estimation for string predicates using deep learning. Proceedings of the VLDB Endowment 14, 4 (2020), 471--484.

Digital Library

[53]

Utkarsh Srivastava, Peter J Haas, Volker Markl, Marcel Kutsch, and Tam Minh Tran. 2006. Isomer: Consistent histogram construction using query feedback. In ICDE. 39--39.

[54]

Michael Stillger, Guy M Lohman, Volker Markl, and Mokhtar Kandil. 2001. LEO-DB2's learning optimizer. In PVLDB, Vol. 1. 19--28.

[55]

Ji Sun and Guoliang Li. 2019. An end-to-end learning-based cost estimator. VLDB (2019).

Digital Library

[56]

Vladimir Svetnik, Andy Liaw, Christopher Tong, J Christopher Culberson, Robert P Sheridan, and Bradley P Feuston. 2003. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences 43, 6 (2003), 1947--1958.

[57]

Immanuel Trummer. 2019. Exact cardinality query optimization with bounded execution cost. In Proceedings of the 2019 International Conference on Management of Data. 2--17.

Digital Library

[58]

Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. 2011. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4, 11 (2011), 852--863.

Digital Library

[59]

Hai Wang and Kenneth C Sevcik. 2003. A multi-dimensional histogram for selectivity estimation and fast approximate query answering. In Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research. 328--342.

Digital Library

[60]

Xiaoying Wang, Changbo Qu, Weiyuan Wu, Jiannan Wang, and Qingqing Zhou. 2021. Are We Ready For Learned Cardinality Estimation? VLDB 14, 9 (2021), 1640--1654.

Digital Library

[61]

Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. PVLDB 12, 3 (2018), 210--222.

Digital Library

[62]

Peizhi Wu. 2021. Github repository: UAE/UEA-Q. https://github.com/pagegitss/UAE (2021).

[63]

Peizhi Wu and Gao Cong. 2021. A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data.

Digital Library

[64]

Ziniu Wu. 2021. Github repository: BayesCard. https://github.com/wuziniu/BayesCard (2021).

[65]

Ziniu Wu. 2021. Github repository: FSPN. https://github.com/wuziniu/FSPN (2021).

[66]

Ziniu Wu and Amir Shaikhha. 2020. BayesCard: A Unified Bayesian Framework for Cardinality Estimation. arXiv preprint arXiv:2012.14743 (2020).

[67]

Ziniu Wu, Peilun Yang, Pei Yu, Rong Zhu, Yuxing Han, Yaliang Li, Defu Lian, Kai Zeng, and Jingren Zhou. 2021. A Unified Transferable Model for ML-Enhanced DBMS. arXiv preprint arXiv:2105.02418 (2021).

[68]

Ziniu Wu, Rong Zhu, Andreas Pfadler, Yuxing Han, Jiangneng Li, Zhengping Qian, Kai Zeng, and Jingren Zhou. 2020. FSPN: A New Class of Probabilistic Graphical Model. arXiv preprint arXiv:2011.09020 (2020).

[69]

Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2021. NeuroCard: One Cardinality Estimator for All Tables. PVLDB 14, 1 (2021), 61--73.

Digital Library

[70]

Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. PVLDB (2019).

[71]

Zongheng Yang and Chenggang Wu. 2019. Github repository: naru project. https://github.com/naru-project/naru (2019).

[72]

Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In SIGMOD. 1525--1539.

[73]

Zhuoyue Zhao, Feifei Li, and Yuxi Liu. 2020. Efficient join synopsis maintenance for data warehouse. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2027--2042.

Digital Library

[74]

Zhuoyue Zhao, Bin Wu, Feifei Li, and Ke Yi. 2021. Github repository: InitialDL-Lab/XDB. https://github.com/InitialDLab/XDB (2021).

[75]

Rong Zhu, Ziniu Wu, Yuxing Han, Kai Zeng, Andreas Pfadler, Zhengping Qian, Jingren Zhou, and Bin Cui. 2021. FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation. VLDB 14, 9 (2021), 1489--1502.

Digital Library

Cited By

Yu GWu ZKossmann FLi TMarkakis MNgom AMadden SKraska T(2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682026
Zhu RWeng LWei WWu DPeng JWang YDing BLian DZheng BZhou J(2024)PilotScope: Steering Databases with Machine Learning DriversProceedings of the VLDB Endowment10.14778/3641204.364120917:5(980-993)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.14778/3641204.3641209
Weng LZhu RWu DDing BZheng BZhou J(2024)Eraser: Eliminating Performance Regression on Learned Query OptimizerProceedings of the VLDB Endowment10.14778/3641204.364120517:5(926-938)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.14778/3641204.3641205
Show More Cited By

Cardinality estimation in DBMS: a comprehensive benchmark evaluation
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

FactorJoin: A New Cardinality Estimation Framework for Join Queries
PACMMOD

Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of the join queries. They either rely on ...
Workload-Driven Antijoin Cardinality Estimation

Antijoin cardinality estimation is among a handful of problems that has eluded accurate efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads ...
Weighted Distinct Sampling: Cardinality Estimation for SPJ Queries
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

SPJ (select-project-join) queries form the backbone of many SQL queries used in practice. Accurate cardinality estimation of these queries is thus an important problem, with applications in query optimization, approximate query processing, and data ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 15, Issue 4

December 2021

246 pages

ISSN:2150-8097

Editors:
Juliana Freire
New York University
,
Xuemin Lin
University of New South Wales

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2021

Published in PVLDB Volume 15, Issue 4

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

24
Total Citations
View Citations
267
Total Downloads

Downloads (Last 12 months)111
Downloads (Last 6 weeks)10

Reflects downloads up to 09 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yu GWu ZKossmann FLi TMarkakis MNgom AMadden SKraska T(2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682026
Zhu RWeng LWei WWu DPeng JWang YDing BLian DZheng BZhou J(2024)PilotScope: Steering Databases with Machine Learning DriversProceedings of the VLDB Endowment10.14778/3641204.364120917:5(980-993)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.14778/3641204.3641209
Weng LZhu RWu DDing BZheng BZhou J(2024)Eraser: Eliminating Performance Regression on Learned Query OptimizerProceedings of the VLDB Endowment10.14778/3641204.364120517:5(926-938)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.14778/3641204.3641205
Tang XZhang FZhang SLiu YHe BHe BDu XDu X(2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3677134
Kittelmann FSulimov PStockinger K(2024)QardEst: Using Quantum Machine Learning for Cardinality Estimation of Join QueriesProceedings of the 1st Workshop on Quantum Computing and Quantum-Inspired Technology for Data-Intensive Systems and Applications10.1145/3665225.3665444(2-13)Online publication date: 9-Jun-2024
https://dl.acm.org/doi/10.1145/3665225.3665444
Heddes MNunes IGivargis TNicolau A(2024)Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join QueriesProceedings of the ACM on Management of Data10.1145/36549322:3(1-26)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654932
Abo Khamis MNakos VOlteanu DSuciu D(2024)Join Size Bounds using lp-Norms on Degree SequencesProceedings of the ACM on Management of Data10.1145/36515972:2(1-24)Online publication date: 14-May-2024
https://dl.acm.org/doi/10.1145/3651597
Tsan BDatta AIzenov YRusu F(2024)Approximate SketchesProceedings of the ACM on Management of Data10.1145/36393212:1(1-24)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639321
Brucato MSiddiqui TWu WNarasayya VChaudhuri S(2024)Wred: Workload Reduction for Scalable Index TuningProceedings of the ACM on Management of Data10.1145/36393052:1(1-26)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639305
Zhu RWeng LDing BZhou JBarcelo PSanchez-Pi NMeliou ASudarshan S(2024)Learned Query Optimizer: What is New and What is NextCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654692(561-569)Online publication date: 9-Jun-2024
https://dl.acm.org/doi/10.1145/3626246.3654692
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents