FACE: a normalizing flow based cardinality estimator

Published: 01 September 2021

Abstract

Cardinality estimation is one of the most important problems in query optimization. Recently, machine-learning-based techniques have been proposed to estimate cardinality effectively; they can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model that maps a query to its cardinality, while data-driven approaches learn a distribution of tuples, select samples that satisfy a SQL query, and use the data distribution of these selected tuples to estimate the cardinality of the query. Because query-driven methods rely on training queries, their estimation quality is unreliable when no high-quality training queries are available; data-driven methods have no such limitation and are highly adaptive.
In this work, we focus on data-driven methods. A good data-driven model should achieve three optimization goals. First, the model needs to capture data dependencies between columns and support large domain sizes (high accuracy). Second, the model should achieve high inference efficiency, because many data samples are needed to estimate the cardinality (low inference latency). Third, the model should not be too large (small model size). However, existing data-driven methods cannot optimize all three goals simultaneously. To address these limitations, we propose a novel cardinality estimator, FACE, which leverages a normalizing-flow-based model to learn a continuous joint distribution for relational data. FACE can transform a complex distribution over continuous random variables into a simple distribution (e.g., a multivariate normal distribution) and use the probability density to estimate the cardinality. First, we design a dequantization method to make data more "continuous". Second, we propose encoding and indexing techniques to handle LIKE predicates for string data. Third, we propose a Monte Carlo method to efficiently estimate the cardinality. Experimental results show that our method significantly outperforms existing approaches in terms of estimation accuracy while keeping similar latency and model size.
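
The idea of estimating cardinality from a probability density can be illustrated with a minimal sketch. This is not the authors' implementation: the density below is a fixed independent Gaussian standing in for a trained normalizing flow, the dequantization is plain uniform noise, and the Monte Carlo step uses simple uniform sampling inside the query box. The abstract does not specify any of these details, so all three choices, along with the function names `dequantize`, `gaussian_density`, and `estimate_cardinality` and the example table, are assumptions made for illustration.

```python
import numpy as np


def dequantize(codes, rng):
    """Uniform dequantization (an assumed scheme): integer code k -> real in [k, k + 1)."""
    return codes + rng.random(codes.shape)


def gaussian_density(points, mean, std):
    """Stand-in for a learned flow density: independent normal per column."""
    z = (points - mean) / std
    per_column = np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))
    return per_column.prod(axis=1)


def estimate_cardinality(density, low, high, n_rows, n_samples, rng):
    """Monte Carlo estimate of n_rows * (integral of density over the box [low, high))."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    volume = np.prod(high - low)
    # Draw points uniformly inside the query box and average the density there.
    points = rng.uniform(low, high, size=(n_samples, low.size))
    selectivity = volume * density(points).mean()
    return n_rows * selectivity


rng = np.random.default_rng(0)

# Dequantizing a discrete column: an equality predicate "col = 3" becomes
# the range predicate 3 <= col < 4 on the continuous column.
codes = np.array([0, 3, 3, 7])
continuous = dequantize(codes, rng)

# Hypothetical two-column table with 1000 rows; each column is assumed N(5, 2^2).
mean = np.array([5.0, 5.0])
std = np.array([2.0, 2.0])
estimate = estimate_cardinality(
    lambda p: gaussian_density(p, mean, std),
    low=[3.0, 3.0], high=[7.0, 7.0],
    n_rows=1000, n_samples=20000, rng=rng,
)
print(f"estimated cardinality: {estimate:.1f}")
```

In this sketch the query `3 <= a < 7 AND 3 <= b < 7` is a box in the continuous space, and the estimate is the table size times the integral of the density over that box. Uniform sampling is the simplest possible integrator; its variance grows when the box is large relative to where the density is concentrated, which is one reason a more careful sampling strategy may be preferable in practice.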

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 1
September 2021
140 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article
