Effective use of block-level sampling in statistics estimation

Authors:

Surajit Chaudhuri,

Utkarsh Srivastava

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Pages 287 - 298

https://doi.org/10.1145/1007568.1007602

Published: 13 June 2004

Abstract

Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as distinct-value estimations. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. This, to the best of our knowledge, is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods.
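
To make the contrast in the abstract concrete, the following is a minimal sketch of the two sampling schemes, not the authors' method. It assumes a hypothetical table stored as a list of blocks in which each block holds one repeated value (a fully clustered layout); the function names and sizes are illustrative only. It shows why a naive distinct-value count taken from a block-level sample can differ sharply from one taken from a uniform row sample of the same size.

```python
import random


def uniform_row_sample(blocks, n, rng):
    """Draw n rows uniformly at random, without replacement, from all rows."""
    rows = [value for block in blocks for value in block]
    return rng.sample(rows, n)


def block_level_sample(blocks, num_blocks, rng):
    """Pick num_blocks whole blocks at random and return every row in them."""
    chosen = rng.sample(blocks, num_blocks)
    return [value for block in chosen for value in block]


def make_clustered_table(num_blocks, rows_per_block):
    """Hypothetical layout (an assumption): block b stores only the value b."""
    return [[b] * rows_per_block for b in range(num_blocks)]


rng = random.Random(0)
table = make_clustered_table(num_blocks=100, rows_per_block=50)

# Both samples contain 500 rows, but the block-level one reads only 10 blocks.
row_sample = uniform_row_sample(table, 500, rng)
blk_sample = block_level_sample(table, 10, rng)

print("distinct values in uniform row sample:", len(set(row_sample)))
print("distinct values in block-level sample:", len(set(blk_sample)))
```

With this layout the block-level sample can contain only as many distinct values as blocks read, while the row-level sample touches many more blocks. Real tables are rarely this clustered, so the gap is usually smaller than in this sketch.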

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

June 2004

988 pages

ISBN:1581138598

DOI:10.1145/1007568

Conference Chairs: Arnd Christian König (Microsoft Research), Stefan Dessloch (University of Kaiserslautern, Germany)

General Chair: Patrick Valduriez (INRIA, France)

Program Chair: Gerhard Weikum (University of the Saarland)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS04

Sponsor:

SIGMOD

SIGMOD/PODS04: International Conference on Management of Data and Symposium on Principles of Database Systems

June 13 - 18, 2004

Paris, France

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
