
The Sort-Merge-Shrink join

Published: 01 December 2006

Abstract

One of the most common operations in analytic query processing is the application of an aggregate function to the result of a relational join. We describe an algorithm called the Sort-Merge-Shrink (SMS) Join for computing the answer to such a query over large, disk-based input tables. The key innovation of the SMS join is that if the input data are clustered in a statistically random fashion on disk, then at all times, the join provides an online, statistical estimator for the eventual answer to the query as well as probabilistic confidence bounds. Thus, a user can monitor the progress of the join throughout its execution and stop the join when satisfied with the estimate's accuracy or run the algorithm to completion with a total time requirement that is not much longer than that of other common join algorithms. This contrasts with other online join algorithms, which either do not offer such statistical guarantees or can only offer guarantees so long as the input data can fit into main memory.
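To make the idea of an online estimator with confidence bounds concrete, the following is a minimal sketch of the simplest single-table analogue. It is not the SMS join itself. It assumes the rows of one table are stored in random order, scans them once, and after every `report_every` rows reports a running estimate of SUM together with a normal-approximation confidence half-width that uses the finite population correction for sampling without replacement. The function name `online_sum_estimates` and its parameters are hypothetical, chosen only for illustration.

```python
import math
import random
from statistics import NormalDist


def online_sum_estimates(values, confidence=0.95, report_every=1000):
    """Yield (rows_seen, estimate, half_width) while scanning `values` once.

    Assumption: `values` is in random order, so every prefix of the scan
    is a simple random sample (without replacement) of the whole table.
    """
    total_rows = len(values)
    # Two-sided normal quantile, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    seen = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's update)
    for x in values:
        seen += 1
        delta = x - mean
        mean += delta / seen
        m2 += delta * (x - mean)
        if seen % report_every == 0 or seen == total_rows:
            estimate = total_rows * mean
            if seen == total_rows:
                half_width = 0.0  # whole table scanned: the answer is exact
            elif seen < 2:
                half_width = math.inf  # one row gives no variance estimate
            else:
                sample_var = m2 / (seen - 1)
                fpc = 1.0 - seen / total_rows  # finite population correction
                std_err = total_rows * math.sqrt(sample_var / seen * fpc)
                half_width = z * std_err
            yield seen, estimate, half_width


if __name__ == "__main__":
    rng = random.Random(0)
    table = [rng.uniform(0.0, 100.0) for _ in range(20_000)]
    rng.shuffle(table)  # stands in for "randomly clustered on disk"
    for seen, est, half in online_sum_estimates(table, report_every=5_000):
        print(f"{seen:>6} rows: SUM ~ {est:,.0f} +/- {half:,.0f}")
```

A user could stop consuming the generator as soon as the half-width is small enough, which mirrors the stop-when-satisfied behavior the abstract describes. Extending this to an aggregate over a join requires a different variance analysis, which this sketch does not attempt.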

Published In

ACM Transactions on Database Systems, Volume 31, Issue 4
December 2006
357 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1189769

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OLAP
  2. Online algorithms
  3. nonparametric statistics

Qualifiers

  • Article
