DOI: 10.1145/1007568.1007602

Effective use of block-level sampling in statistics estimation

Published: 13 June 2004

Abstract

Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as distinct-value estimations. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. This, to the best of our knowledge, is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods.
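
The contrast the abstract draws between block-level and uniform-random sampling can be illustrated with a small toy sketch. This is not the paper's method: it only shows, under assumed conditions, why a block-level sample can mislead a distinct-value count when equal values are stored together. The in-memory "table" (a list of blocks), the function names, and all parameters below are hypothetical and chosen for illustration only.

```python
import random


def make_table(n_blocks, rows_per_block, clustered, seed=0):
    """Return a toy table as a list of blocks (each block is a list of values).

    The table holds n_blocks distinct values, each repeated rows_per_block
    times. If clustered, equal values sit together in one block; otherwise
    the rows are shuffled across blocks.
    """
    values = [i // rows_per_block for i in range(n_blocks * rows_per_block)]
    if not clustered:
        random.Random(seed).shuffle(values)
    return [
        values[b * rows_per_block:(b + 1) * rows_per_block]
        for b in range(n_blocks)
    ]


def block_level_sample(blocks, n_sample_blocks, rng):
    """Pick whole blocks at random and keep every row in them."""
    chosen = rng.sample(blocks, n_sample_blocks)
    return [value for block in chosen for value in block]


def uniform_row_sample(blocks, n_sample_rows, rng):
    """Pick individual rows uniformly at random, without replacement."""
    rows = [value for block in blocks for value in block]
    return rng.sample(rows, n_sample_rows)


def distinct_seen(blocks, n_sample_blocks, seed=1):
    """Distinct values seen by each scheme at the same sample size (in rows)."""
    rng = random.Random(seed)
    by_block = block_level_sample(blocks, n_sample_blocks, rng)
    by_row = uniform_row_sample(blocks, len(by_block), rng)
    return len(set(by_block)), len(set(by_row))
```

Calling `distinct_seen(make_table(1000, 100, clustered=True), 10)` returns the distinct counts for a block-level sample and for a row-level sample of the same size. In this fully clustered toy layout each block holds a single value, so the block-level sample sees exactly as many distinct values as blocks sampled, while the row-level sample typically sees far more. With `clustered=False` the two counts come out much closer, which suggests that the layout of values within blocks is what drives the gap.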


Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004
988 pages
ISBN: 1581138598
DOI: 10.1145/1007568

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS04
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
