NeuroCard: one cardinality estimator for all tables

Authors:

Ion Stoica

Proceedings of the VLDB Endowment, Volume 14, Issue 1

Pages 61 - 73

https://doi.org/10.14778/3421424.3421432

Published: 01 September 2020

Abstract

Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries because they make lossy modeling assumptions and fail to capture inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders of magnitude higher accuracy than the best prior methods (a new state-of-the-art result of 8.5x maximum error on JOB-light), scales to dozens of tables, and remains compact in space (several MBs) and efficient to construct or update (seconds to minutes).
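The point about independence assumptions can be illustrated with a minimal sketch. This is not NeuroCard's implementation: it uses exact counts on a hypothetical two-column toy table instead of a neural model, and all names (`rows`, `independence_estimate`, `chain_rule_estimate`, `q_error`) are made up for illustration. It contrasts an estimate that multiplies per-column selectivities with one that follows the chain rule P(x, y) = P(x) * P(y | x), which is the factorization autoregressive models rely on. It also assumes that "error" means the multiplicative factor between estimate and truth, often called Q-error.

```python
# Hypothetical toy table with two perfectly correlated columns (not from the paper).
rows = [("a", 1)] * 50 + [("b", 2)] * 50


def independence_estimate(rows, x, y):
    """Estimate the row count for (col0 = x AND col1 = y), assuming independent columns."""
    n = len(rows)
    px = sum(1 for r in rows if r[0] == x) / n
    py = sum(1 for r in rows if r[1] == y) / n
    return n * px * py


def chain_rule_estimate(rows, x, y):
    """Estimate the same count as n * P(x) * P(y | x), with no independence assumption."""
    n = len(rows)
    matching_x = [r for r in rows if r[0] == x]
    if not matching_x:
        return 0.0
    px = len(matching_x) / n
    py_given_x = sum(1 for r in matching_x if r[1] == y) / len(matching_x)
    return n * px * py_given_x


def q_error(estimate, truth):
    """Multiplicative error factor; both values are clamped to at least 1."""
    estimate, truth = max(estimate, 1.0), max(truth, 1.0)
    return max(estimate / truth, truth / estimate)


truth = sum(1 for r in rows if r == ("a", 1))
ind = independence_estimate(rows, "a", 1)
ar = chain_rule_estimate(rows, "a", 1)
print(truth, ind, ar, q_error(ind, truth), q_error(ar, truth))
```

On this toy data the independence-based estimate is off by a factor of two, while the chain-rule estimate matches the true count. A learned model would approximate the conditional probabilities rather than compute them exactly.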

Information

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 1

September 2020

73 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment

Qualifiers

Research-article

Bibliometrics

Article Metrics

Total Citations: 66
Total Downloads: 220
Downloads (Last 12 months): 46
Downloads (Last 6 weeks): 8

Reflects downloads up to 22 Sep 2024

Cited By

Lim W, Ma L, Zhang W, Butrovich M, Arch S, Pavlo A (2024). Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems. Proceedings of the VLDB Endowment 17:11 (3680-3693). Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3682030
Lehmann C, Sulimov P, Stockinger K (2024). Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective. Proceedings of the VLDB Endowment 17:7 (1565-1577). Online publication date: 1-Mar-2024.
https://dl.acm.org/doi/10.14778/3654621.3654625
Yu T, Zou Z, Sun W, Yan Y (2024). Refactoring Index Tuning Process with Benefit Estimation. Proceedings of the VLDB Endowment 17:7 (1528-1541). Online publication date: 1-Mar-2024.
https://dl.acm.org/doi/10.14778/3654621.3654622
Zhu R, Weng L, Wei W, Wu D, Peng J, Wang Y, Ding B, Lian D, Zheng B, Zhou J (2024). PilotScope: Steering Databases with Machine Learning Drivers. Proceedings of the VLDB Endowment 17:5 (980-993). Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.14778/3641204.3641209
Hu P, Motik B (2024). Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries. ACM Transactions on Database Systems 49:3 (1-46). Online publication date: 17-Sep-2024.
https://dl.acm.org/doi/10.1145/3689209
Kittelmann F, Sulimov P, Stockinger K (2024). QardEst: Using Quantum Machine Learning for Cardinality Estimation of Join Queries. Proceedings of the 1st Workshop on Quantum Computing and Quantum-Inspired Technology for Data-Intensive Systems and Applications (2-13). Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3665225.3665444
Heddes M, Nunes I, Givargis T, Nicolau A (2024). Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries. Proceedings of the ACM on Management of Data 2:3 (1-26). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654932
Tsan B, Datta A, Izenov Y, Rusu F (2024). Approximate Sketches. Proceedings of the ACM on Management of Data 2:1 (1-24). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639321
Kim K, Lee S, Kim I, Han W (2024). ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639300
Zhang J, Zhang C, Li G, Chai C (2024). PACE: Poisoning Attacks on Learned Cardinality Estimation. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639292
