Learning-based Property Estimation with Polynomials

Published: 30 May 2024
    Abstract

    Estimating data properties from sampling frequency histograms has attracted extensive interest in the database community. Such properties include the number of distinct values (NDV), entropy, and power sum, among others. Property estimation underpins more complex database applications: for example, NDV estimation is a foundation of query optimization, and entropy estimation is a foundation of data compression. Estimators originating from statistics offer desirable theoretical guarantees but rely on specific assumptions about the data distribution, which leads to poor performance in real-world applications. Learning-based methods, which use information from training data, adapt well to real-world data but often lack theoretical guarantees or explainability. In addition, a unified machine-learning framework for estimating these frequency-based properties is lacking. Given these challenges, it is natural to ask whether a unified framework with theoretical guarantees can be established for property estimation. Recent theoretical studies propose estimation frameworks based on polynomials and prove bounds on the estimation error with respect to the sample size. Motivated by this polynomial estimation framework, we propose a learning-based estimation framework with polynomial approximation that learns the coefficients of the polynomial, which provides theoretical guarantees for the learning framework. Comprehensive experiments on both synthetic and real-world datasets, covering properties such as NDV, entropy, and power sum, show the superiority of our algorithms over previous estimators.
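To make the setting concrete, the following minimal Python sketch shows what a sample frequency histogram (profile) is and how a property can be estimated from it. It does not reproduce the paper's method. The function names are hypothetical, and the two estimators are standard textbook choices picked for illustration: the plug-in entropy and the GEE distinct-value estimator of Charikar et al. (2000). The last function only illustrates the general idea of an estimator that is linear in the profile, whose coefficients a learning-based approach could in principle fit from training data.

```python
from collections import Counter
import math


def frequency_profile(sample):
    """Return {j: f_j}, where f_j is the number of distinct values
    that occur exactly j times in the sample."""
    value_counts = Counter(sample)
    return dict(Counter(value_counts.values()))


def plugin_entropy(profile):
    """Plug-in (empirical) Shannon entropy in nats, computed from the profile."""
    n = sum(j * f for j, f in profile.items())  # sample size
    return -sum(f * (j / n) * math.log(j / n) for j, f in profile.items())


def gee_ndv(profile, population_size):
    """GEE estimator of NDV: sqrt(N / n) * f_1 + sum over j >= 2 of f_j."""
    n = sum(j * f for j, f in profile.items())
    f1 = profile.get(1, 0)
    seen_more_than_once = sum(f for j, f in profile.items() if j >= 2)
    return math.sqrt(population_size / n) * f1 + seen_more_than_once


def linear_estimate(profile, coefficients, default=1.0):
    """Illustrative linear-in-profile estimator: sum over j of c_j * f_j.
    Any frequency j without an explicit coefficient uses `default`."""
    return sum(coefficients.get(j, default) * f for j, f in profile.items())


sample = ["a", "a", "b", "c", "c", "c", "d"]
profile = frequency_profile(sample)      # f_1 = 2, f_2 = 1, f_3 = 1
print(plugin_entropy(profile))
print(gee_ndv(profile, 28))              # sqrt(28 / 7) * 2 + 2 = 6.0
print(linear_estimate(profile, {1: 2.0}))  # same value, with c_1 = sqrt(N / n)
```

In this sketch GEE is just the linear estimator with a fixed, hand-derived coefficient on f_1; replacing such fixed coefficients with ones fitted to data is the kind of step a learning-based estimator takes, though the paper's actual construction may differ.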

      Published In

      Proceedings of the ACM on Management of Data  Volume 2, Issue 3
      SIGMOD
      June 2024
      1953 pages
      EISSN:2836-6573
      DOI:10.1145/3670010
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. entropy
      2. learning
      3. number of distinct values
      4. polynomial
      5. property estimation

      Qualifiers

      • Research-article

      Funding Sources

      • Hong Kong ITC ITF
      • National Natural Science Foundation of China
      • Beijing Natural Science Foundation
      • Hong Kong RGC GRF
      • Beijing Outstanding Young Scientist Program
      • Alibaba Innovative Research Program
