Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/304182.304207acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free access

Join synopses for approximate query answering

Published: 01 June 1999 Publication History

Abstract

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query work load is known and identify heuristics for the common case when the work load is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. Our extensive set of experiments on the TPC-D benchmark database show the effectiveness of join synopses and various other techniques proposed in this paper.

References

[1]
S. Acharya, E B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In Proc. ACM SIGMOD International Conf. on Management of Data, June 1999. Demonstration paper.
[2]
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. Technical report, Bell Laboratories, Murray Hilt, New Jersey, 1999. Full version of the paper appearing in SIGMOD'99.
[3]
N. Alon, Y. IV.Iatias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. 28th ACM Syrup. on the Theory of Computing, pages 20-29, May 1996. Full version to appear in JCSS special issue for STOC'96.
[4]
S. Acharya, V. Poosala, and S. Ramaswamy. Selectivity estimation in spatial databases. In Proc. ACM SIGMOD International Conf. on Management of Data, June 1999.
[5]
l). Barbarfi, W. DuMouchel, C. Faloutsos, E J. Haas, J. M. HeUerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997.
[6]
C.M. Chen .and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 16 l- 172, May 1994.
[7]
S. Ganguly, E B. Gibbons, Y. Matins, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 271-281, June 1996.
[8]
P.B. Gibbons and Y. Matins. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 331-342, June 1998.
[9]
E B. Gibbons and Y. Matins. Selecting estimation proce.- dures and bounds for approximate answering of aggregation queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, 1999.
[10]
E B. Gibbons and Y. Matins. Synopsis data structures for mas. sive data sets. DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, 1999. To appear. Available,. as Bell Labs tech. rep., Sept. 1998, and at http://www.belllabs. corn/- pbgibbons/.
[11]
E B. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, New Jersey, December 1997.
[12]
P.B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. 23rd International Conf. on Very Large Data Bases, pages 466-. 475, August 1997.
[13]
P.J. Haas. Hoeffding inequalities for join-selectivity estimation and online aggregation. Technical Report RJ 10040, IBM Almaden Research Center, San Jose, CA, 1996.
[14]
P.J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. 9th International Conf. on Scientific and Statistical Database Management, August 1997.
[15]
J.M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proc. ACM SIGMOD international Co1~ on Management of Data, pages 171-182, May 1997.
[16]
P.j. Haas, J. F. Naughton, and A. N. Swami. On the relative cost of sampling for join selectivity estimation. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 14-24, May 1994.
[17]
P.J. Haas, J. E Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values o1 an attribute. In Proc. 21st International Conf. on Very l~rge Data Bases, pages 311-322, September 1995.
[18]
W.-C. Hou, G. Ozsoyo~lu, and B. K. Taneja. Statistical estimators for relational algebra expressions. In Proc. 7th ACM Syrup. on Principles of Database Systems, pages 276- 287, March 1988.
[19]
R.P. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, September 1980.
[20]
R.J. Lipton and J. E Naughton. Query size estimation by adaptive sampling. J. Computer and System Sciences, 51(1):18-25, 1995.
[21]
R.J. Lipton, j. E Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. In Proc. ACM SIGMOD International Conf. on Management of Data, pages l- 12, May 1990.
[22]
E Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Proc. 8th IEEE International Cot, f on Data Engineering, pages 632-641, February 1992.
[23]
V. Poosala, Y. E. loannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 294-305, June 1996.
[24]
V. Poosala. Histogram-based estimation techniques in databases. PhD thesis, Univ. of Wisconsin-Madison, 1997.
[25]
P.G. Selinger, M. M. Astrahan, D. D. Chambedin, R. A. Lode, and T. T. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 23-34, June 1979.
[26]
D. Schneider. The ins & outs (and everything in between) of data warehousing. Tutorial in the 23rd International Conf. on Very Large Data Bases, August 1997.
[27]
S.V. Vrbsky and J. W. S. Liu. Approximate~a query processor that produces monotonically improving approximate: answers. IEEE Trans. on Knowledge and Data Engineering, 5(6):1056-1068, 1993.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN:1581130848
DOI:10.1145/304182
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1999

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS99

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)137
  • Downloads (Last 6 weeks)34
Reflects downloads up to 15 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • (2024)Reservoir Sampling over JoinsProceedings of the ACM on Management of Data10.1145/36549212:3(1-26)Online publication date: 30-May-2024
  • (2024)Modeling Shifting Workloads for Learned Database SystemsProceedings of the ACM on Management of Data10.1145/36392932:1(1-27)Online publication date: 26-Mar-2024
  • (2023)ShadowAQP: Efficient Approximate Group-by and Join Query via Attribute-Oriented Sample Size Allocation and Data GenerationProceedings of the VLDB Endowment10.14778/3625054.362505916:13(4216-4229)Online publication date: 1-Sep-2023
  • (2023)JoinBoost: Grow Trees over Normalized Data Using Only SQLProceedings of the VLDB Endowment10.14778/3611479.361150916:11(3071-3084)Online publication date: 1-Jul-2023
  • (2023)Foreign Keys Open the Door for Faster Incremental View MaintenanceProceedings of the ACM on Management of Data10.1145/35887201:1(1-25)Online publication date: 30-May-2023
  • (2023)On Join Sampling and the Hardness of Combinatorial Output-Sensitive Join AlgorithmsProceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3584372.3588666(99-111)Online publication date: 18-Jun-2023
  • (2022)Coresets for relational data and the applicationsProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3600302(434-448)Online publication date: 28-Nov-2022
  • (2022)AB-treeProceedings of the VLDB Endowment10.14778/3538598.353860615:9(1835-1847)Online publication date: 1-May-2022
  • (2022)Answering (Unions of) Conjunctive Queries using Random Access and Random-Order EnumerationACM Transactions on Database Systems10.1145/353105547:3(1-49)Online publication date: 18-Aug-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media