Learning to be a statistician: learned estimator for number of distinct values

Published: 01 October 2021

Abstract

    Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.
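To make the sample-based setting concrete, the sketch below computes the frequency profile of a sample (how many distinct values appear exactly j times) and feeds it to two classical heuristic estimators from the literature: the Guaranteed-Error Estimator (GEE) of Charikar et al. and the lower-bound estimator of Chao. This is not the learned estimator proposed in the paper; it only illustrates the kind of input such estimators consume and the kind of baseline they are compared against. The function names and the synthetic column are hypothetical choices made for this illustration.

```python
import math
import random
from collections import Counter


def frequency_profile(sample):
    """Return {j: f_j}, where f_j is the number of distinct values seen exactly j times."""
    value_counts = Counter(sample)            # value -> occurrences in the sample
    return dict(Counter(value_counts.values()))  # occurrences -> number of such values


def gee_estimate(profile, population_size, sample_size):
    """GEE: sqrt(N / n) * f_1 + sum of f_j for j >= 2."""
    f1 = profile.get(1, 0)
    seen_more_than_once = sum(f for j, f in profile.items() if j >= 2)
    return math.sqrt(population_size / sample_size) * f1 + seen_more_than_once


def chao_estimate(profile):
    """Chao estimator: d + f_1^2 / (2 * f_2), where d is the sample NDV."""
    d = sum(profile.values())
    f1, f2 = profile.get(1, 0), profile.get(2, 0)
    if f2 == 0:
        # Bias-corrected form, used here to avoid dividing by zero.
        return d + f1 * (f1 - 1) / 2
    return d + f1 * f1 / (2 * f2)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical column: 100,000 rows drawn uniformly from 5,000 possible values.
    column = [random.randint(1, 5000) for _ in range(100_000)]
    sample = random.sample(column, 1_000)   # 1% uniform sample without replacement
    profile = frequency_profile(sample)
    print("true NDV:", len(set(column)))
    print("GEE     :", gee_estimate(profile, len(column), len(sample)))
    print("Chao    :", chao_estimate(profile))
```

Both estimators depend only on the frequency profile and, for GEE, the population and sample sizes, so they never scan the full column. Their accuracy varies with the underlying data distribution, which is the lack of robustness the abstract attributes to heuristic estimators.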


      Published In

      Proceedings of the VLDB Endowment, Volume 15, Issue 2
      October 2021
      247 pages
      ISSN: 2150-8097

      Publisher

      VLDB Endowment

      Qualifiers

      • Research-article
