
On random sampling over joins

Published: 01 June 1999

Abstract

A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. It is not even known whether it is possible to generate a sample of a join tree without first evaluating the join tree completely. We undertake a detailed study of this problem and attempt to analyze it in a variety of settings. We present theoretical results explaining the difficulty of this problem and setting limits on the efficiency that can be achieved. Based on new insights into the interaction between join and sampling, we develop join sampling techniques for the settings where our negative results do not apply. Our new sampling algorithms are significantly more efficient than those known earlier. We present experimental evaluation of our techniques on Microsoft's SQL Server 7.0.
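To make the problem concrete, the sketch below draws a uniform sample (with replacement) from a two-table equi-join without materializing the join. It is only an illustration under stated assumptions, not the algorithm from the paper: it assumes the second relation can be fully indexed on the join attribute, so each tuple of the first relation can be weighted by its number of join partners. The names `sample_join`, `key1`, and `key2` are hypothetical.

```python
import random
from collections import defaultdict

def sample_join(r1, r2, key1, key2, n, rng=None):
    """Draw n tuples uniformly, with replacement, from the equi-join of r1 and r2.

    Assumption (not from the source): r2 can be indexed in memory on its join key.
    """
    rng = rng or random.Random()

    # Index r2 by join key so the join partners of any r1 tuple are known.
    index = defaultdict(list)
    for t2 in r2:
        index[key2(t2)].append(t2)

    # Weight each r1 tuple by how many r2 tuples it joins with.
    weights = [len(index.get(key1(t1), ())) for t1 in r1]
    if sum(weights) == 0:
        return []  # the join is empty

    # Pick r1 tuples proportionally to their weight, then a uniform partner.
    # Each join pair (t1, t2) is chosen with probability w/W * 1/w = 1/W.
    picks = rng.choices(r1, weights=weights, k=n)
    return [(t1, rng.choice(index[key1(t1)])) for t1 in picks]
```

Sampling each relation independently and joining the two samples would not give the same guarantee, because a pair appears only if both of its halves happen to be chosen; the weighting step is what restores uniformity here.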


Information

Published In

ACM SIGMOD Record, Volume 28, Issue 2
June 1999
599 pages
ISSN: 0163-5808
DOI: 10.1145/304181
  • SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
    June 1999
    604 pages
    ISBN: 1581130848
    DOI: 10.1145/304182
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Article Metrics

  • Downloads (Last 12 months): 284
  • Downloads (Last 6 weeks): 47
Reflects downloads up to 15 Oct 2024

Cited By

  • (2024) Learning manifolds from non-stationary streams. Journal of Big Data 11:1. DOI: 10.1186/s40537-023-00872-8. Online publication date: 23-Mar-2024
  • (2024) Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality. Proceedings of the ACM on Management of Data 2:4 (1-31). DOI: 10.1145/3677134. Online publication date: 30-Sep-2024
  • (2024) Reservoir Sampling over Joins. Proceedings of the ACM on Management of Data 2:3 (1-26). DOI: 10.1145/3654921. Online publication date: 30-May-2024
  • (2023) Plexus. Proceedings of the 2023 ACM Symposium on Cloud Computing (1-16). DOI: 10.1145/3620678.3624643. Online publication date: 30-Oct-2023
  • (2023) Sampling over Union of Joins. Companion of the 2023 International Conference on Management of Data (273-275). DOI: 10.1145/3555041.3589400. Online publication date: 4-Jun-2023
  • (2022) Towards Observability for Production Machine Learning Pipelines. Proceedings of the VLDB Endowment 15:13 (4015-4022). DOI: 10.14778/3565838.3565853. Online publication date: 1-Sep-2022
  • (2022) Threshold queries in theory and in the wild. Proceedings of the VLDB Endowment 15:5 (1105-1118). DOI: 10.14778/3510397.3510407. Online publication date: 1-Jan-2022
  • (2022) A random walk sampling on knowledge graphs for semantic-oriented statistical tasks. Data & Knowledge Engineering 140:C. DOI: 10.1016/j.datak.2022.102024. Online publication date: 1-Jul-2022
  • (2021) PGMJoins: Random Join Sampling with Graphical Models. Proceedings of the 2021 International Conference on Management of Data (1610-1622). DOI: 10.1145/3448016.3457302. Online publication date: 9-Jun-2021
  • (2021) XLJoins. Proceedings of the 2021 International Conference on Management of Data (2902-2904). DOI: 10.1145/3448016.3450582. Online publication date: 9-Jun-2021