Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/276304.276343acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free access

Random sampling for histogram construction: how much is enough?

Published: 01 June 1998 Publication History

Abstract

Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.

References

[1]
J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association 88(1993): 364-373.
[2]
K.P. Burnham and W.S. Overton. Estimation of the size of a closed population when capture possibilities vary among animals. Biometrika 65(1978): 625-633.
[3]
K.P. Burnham and W.S. Overton. Robust estimation of population size when capture possibilities vary among animals. Ecology 60(1979): 927-936.
[4]
A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal o/Statistical Theory and Applications 11(1984): 265-270.
[5]
S. Chaudhuri, R. Motwani, and V. Narasayya. Using Random Sampling for Histogram Construction. Microsoft Research Report, In preparation, 1997.
[6]
S. Chaudhuri and V. Narasayya. An Efficient, Cost- Driven Index Selection Tool for Microsoft SQL Server. In Proc. 23rd VLDB, 1997.
[7]
S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical Database Design for Relational Databases. A CM TODS, 13(1988): 91-128.
[8]
P.B. Gibbons, Y. Matias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. In Proc. 23rd VLDB, pages 466-475, 1997.
[9]
L.A. Goodman. On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20( 1949): 572-579.
[10]
P.J. Haas, J.F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. 21st VLDB, pages 311- 322, 1995.
[11]
P.J. Haas and A.N. Swami. Sequential Sampling Procedures for Query Size Estimation. In Proc. A CM SIG- MOD Conference, pages 341-350, 1992.
[12]
W. Hou, G. Ozsoyoglu, and E. Dogdu. Error- Constrained COUNT Query Evaluation in Relational Databases. In Proc. A CM SIGMOD Conference, pages 278-287, 1991.
[13]
W. Hou, G. Ozsoyoglu, and B. Taneja. Statistical estimators for relational algebra expressions. In Proc. 7th A CM Symposium on Principles of Database Systems, pages 276-287, 1988.
[14]
W. Hou, G. Ozsoyoglu, and B. Taneja. Processing aggregate relational queries with hard time constraints. In Proc. A CM SIGMOD Conference, pages 68-77, 1989.
[15]
Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In Proc. A CM SIGMOD Conference, pages 233-244, 1995.
[16]
Y. Ioannidis and V. Poosala. Histogram-Based Solutions to Diverse Database Estimation Problems. IEEE Data Engineering Bulletin 18(1995): 10-18.
[17]
Y. Ling and W. Sun. An Evaluation of Sampling-Based Size Estimation Methods for Selections in Database Systems. In Proc. IEEE Conference on Data Engineering, pages 532-539, 1995.
[18]
R.J. Lipton and J.F. Naughton. Query Size Estimation by Adaptive Sampling. In Proc. A CM PODS, pages 40-46, 1990.
[19]
R.J. Lipton, J.F. Naughton, and D.A. Schneider. Practical Selectivity Estimation through Adaptive Sampiing. In Proc. A CM SIGMOD Conference, pages 1-11, 1990.
[20]
R.J. Lipton, j.F. Naughton, D.A. Schneider, and S. Seshadri. Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science 116(1993): 195-226.
[21]
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[22]
J.F. Naughton and S. Seshadri. On Estimating the Size of Projections. In Proc. Third international Conference on Database Theory, pages 499-513, 1990.
[23]
F. Olken. Random Sampling .from Databases. PhD Thesis, Computer Science, U.C. Berkeley, 1993.
[24]
F. Olken and D. Rotem. Random Sampling from Databases - A Survey. Manuscript, 1995.
[25]
G. Ozsoyoglu, K. Du, A. Tjahjana, W. Hou, and D.Y. Rowland. On estimating COUNT, SUM, and AV- ERAGE relational algebra queries. In Proc. Conference on Database and Expert Systems Applications, pages 406-412, 1991.
[26]
V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. in Proc. A CM SIGMOD Conference, pages 294-305, 1996.
[27]
G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proc. A CM SIGMOD Conference, pages 256-276, 1984.
[28]
P.G. Selinger, D.D. Astrahan, R.A. Chamberlain, R.A. Lorie, and T.G. Price. Access path selection in a relational database management system. In Proc. A CM SIGMOD Conference, pages 23-34, 1979.
[29]
G.E. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949.

Cited By

View all
  • (2023)PolarDB-IMCI: A Cloud-Native HTAP Database System at AlibabaProceedings of the ACM on Management of Data10.1145/35897851:2(1-25)Online publication date: 20-Jun-2023
  • (2022)Near-optimal bounds for testing histogram distributionsProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3602561(31599-31611)Online publication date: 28-Nov-2022
  • (2022)Authorship Attribution via Occupancy-problem-type IndicesJournal of Quantitative Linguistics10.1080/09296174.2022.203727630:1(27-41)Online publication date: 14-Feb-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1998

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS98
SIGMOD/PODS98: Special Interest Group on Management of Data
June 1 - 4, 1998
Washington, Seattle, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)517
  • Downloads (Last 6 weeks)38
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2023)PolarDB-IMCI: A Cloud-Native HTAP Database System at AlibabaProceedings of the ACM on Management of Data10.1145/35897851:2(1-25)Online publication date: 20-Jun-2023
  • (2022)Near-optimal bounds for testing histogram distributionsProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3602561(31599-31611)Online publication date: 28-Nov-2022
  • (2022)Authorship Attribution via Occupancy-problem-type IndicesJournal of Quantitative Linguistics10.1080/09296174.2022.203727630:1(27-41)Online publication date: 14-Feb-2022
  • (2021)Weighted Distinct Sampling: Cardinality Estimation for SPJ QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452821(1465-1477)Online publication date: 9-Jun-2021
  • (2021)MOSE: A Monotonic Selectivity Estimator Using Learned CDFIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2021.3112753(1-1)Online publication date: 2021
  • (2021)Non‐asymptotic moment bounds for random variables rounded to non‐uniformly spaced setsStat10.1002/sta4.39510:1Online publication date: 20-Dec-2021
  • (2020)Efficient Join Synopsis Maintenance for Data WarehouseProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389717(2027-2042)Online publication date: 11-Jun-2020
  • (2020)Estimating Predicate Selectivities in a NoSQL Database Service2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON)10.1109/UEMCON51285.2020.9298056(0414-0420)Online publication date: 28-Oct-2020
  • (2020)Practical Range Counting over Data Streams2020 IEEE International Conference on Big Data (Big Data)10.1109/BigData50022.2020.9378146(659-668)Online publication date: 10-Dec-2020
  • (2020)Sampling for Big Data Profiling: A SurveyIEEE Access10.1109/ACCESS.2020.29881208(72713-72726)Online publication date: 2020
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media