research-article

Learning-based Property Estimation with Polynomials

Authors:

Bolin Ding

Proceedings of the ACM on Management of Data, Volume 2, Issue 3

Article No.: 148, Pages 1 - 27

https://doi.org/10.1145/3654994

Published: 30 May 2024

Abstract

Estimating data properties from sampling frequency histograms has attracted extensive interest in the database community. Such properties include the number of distinct values (NDV), entropy, and so on, and their estimation is fundamental to complex applications: for example, NDV estimation underpins query optimization, and entropy estimation underpins data compression. Among existing estimators, methods originating from statistics offer desirable theoretical guarantees but rely on specific assumptions about the data distribution, which leads to poor performance in real-world applications. Learning-based methods, which exploit information from training data, adapt well to real-world data but often lack theoretical guarantees or explainability. In addition, a unified machine-learning framework for estimating these frequency-based properties is lacking. Given these challenges, it is natural to ask whether a unified framework with theoretical guarantees can be established for property estimation. Recent theoretical studies propose estimation frameworks based on polynomials and prove bounds on the estimation error with respect to the sample size. Motivated by this polynomial estimation framework, we propose a learning-based estimation framework with polynomial approximation, which learns the coefficients of the polynomial and thereby provides theoretical guarantees for the learning framework. Comprehensive experiments on both synthetic and real-world datasets, estimating data properties such as NDV, entropy, and power sum, show the superiority of our algorithms over previous estimators.
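
As a rough illustration of the objects the abstract refers to (and not the estimator proposed in the paper), the Python sketch below builds a sampling frequency histogram and evaluates estimators that are linear in it. The function names and the toy sample are hypothetical. The two coefficient choices shown are simple baselines, the distinct count of the sample and the plug-in entropy; a learned approach would presumably fit the coefficients from training data instead.

```python
from collections import Counter
import math


def frequency_histogram(sample):
    """Map j -> f_j, the number of distinct values seen exactly j times in the sample."""
    occurrences = Counter(sample)                # value -> times it appears
    return dict(Counter(occurrences.values()))   # j -> f_j


def linear_estimate(hist, coeff):
    """Evaluate sum_j coeff(j) * f_j, an estimator that is linear in the histogram."""
    return sum(coeff(j) * f_j for j, f_j in hist.items())


def sample_ndv(hist):
    """Distinct values observed in the sample: every coefficient equals 1."""
    return linear_estimate(hist, lambda j: 1.0)


def plugin_entropy(hist):
    """Plug-in (empirical) entropy in nats: coefficient -(j/n) * ln(j/n)."""
    n = sum(j * f_j for j, f_j in hist.items())  # sample size
    return linear_estimate(hist, lambda j: -(j / n) * math.log(j / n))


# Hypothetical toy sample: "a" appears 3 times, "b" twice, "c" once.
sample = ["a", "b", "a", "c", "a", "b"]
hist = frequency_histogram(sample)
print(hist, sample_ndv(hist), plugin_entropy(hist))
```

Both baselines are known to be biased on small samples (the sample distinct count cannot exceed the true NDV, and the plug-in entropy tends to underestimate), which is the gap that better coefficient choices aim to close.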

Index Terms

Computing methodologies
  Machine learning
    Machine learning approaches
      Learning linear models

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 3

SIGMOD

June 2024

1953 pages

EISSN: 2836-6573

DOI: 10.1145/3670010

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Hong Kong ITC ITF
National Natural Science Foundation of China
Beijing Natural Science Foundation
Hong Kong RGC GRF
Beijing Outstanding Young Scientist Program
Alibaba Innovative Research Program
