research-article

Learning to be a statistician: learned estimator for number of distinct values

Authors:

Jingren Zhou

Proceedings of the VLDB Endowment, Volume 15, Issue 2

Pages 272 - 284

https://doi.org/10.14778/3489496.3489508

Published: 01 October 2021

Abstract

Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimates from random (online/offline) samples. Such efficient estimation is critical for tasks where scanning the data even once is prohibitively expensive. Existing sample-based estimators typically rely on heuristics or assumptions and do not perform robustly across different datasets, because the assumptions about the data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload-agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload-agnostic, in the sense that the model/estimator can be trained with synthetically generated training data and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimates for unseen tables and workloads. We compare the learned estimator with state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.
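To make the sample-based setting concrete, the following minimal Python sketch computes the frequency profile of a sample (how many distinct values appear exactly j times) and applies two classical heuristic estimators of the kind the abstract contrasts with: Chao's 1984 estimator and the GEE estimator of Charikar et al. This is not the learned estimator proposed in the paper; the function names are hypothetical, and the fallback used when no value appears exactly twice is an assumption (the commonly used bias-corrected form).

```python
from collections import Counter
import math


def frequency_profile(sample):
    """Return a mapping {j: f_j}, where f_j is the number of distinct
    values that occur exactly j times in the sample."""
    value_counts = Counter(sample)
    return Counter(value_counts.values())


def chao_estimate(sample):
    """Chao (1984) estimator: d + f_1^2 / (2 * f_2), with d the number
    of distinct values observed in the sample."""
    profile = frequency_profile(sample)
    d = sum(profile.values())
    f1, f2 = profile.get(1, 0), profile.get(2, 0)
    if f2 == 0:
        # Assumption: bias-corrected fallback to avoid division by zero.
        return d + f1 * (f1 - 1) / 2
    return d + f1 * f1 / (2 * f2)


def gee_estimate(sample, population_size):
    """GEE estimator: sqrt(N / n) * f_1 + sum of f_j for j >= 2,
    with N the column size and n the sample size."""
    profile = frequency_profile(sample)
    n = len(sample)
    f1 = profile.get(1, 0)
    seen_more_than_once = sum(c for j, c in profile.items() if j >= 2)
    return math.sqrt(population_size / n) * f1 + seen_more_than_once


# Toy sample of 6 rows, assumed drawn from a column of 24 rows.
sample = ["a", "a", "b", "c", "c", "d"]
print(dict(frequency_profile(sample)))
print(chao_estimate(sample))
print(gee_estimate(sample, population_size=24))
```

Both estimators depend on the sample only through its frequency profile (plus the sample and column sizes), which is why a fixed formula can break when the data violates the assumption behind it.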

Index Terms

Learning to be a statistician: learned estimator for number of distinct values
1. Computing methodologies
  1. Machine learning

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 2

October 2021

247 pages

ISSN: 2150-8097

Editors: Juliana Freire (New York University), Xuemin Lin (University of New South Wales)

Publisher

VLDB Endowment
