Index-Based Join Size Estimation Using Adaptive Sampling

Author: Sergiu Pocol

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 2932 - 2933

https://doi.org/10.1145/3448016.3450572

Published: 18 June 2021

Abstract

Cost-based query optimizers rely on cardinality estimates of intermediate results to avoid suboptimal query execution plans. However, when confronted with ad-hoc queries on big data, these optimizers can produce large estimation errors that drastically degrade overall performance. Such errors occur because many join estimation algorithms rely on strong independence and uniformity assumptions, and equi-joins on skewed data with filter predicates tend to violate those assumptions [2]. Because the cardinality estimate of a result with many joins depends on the estimates of the underlying joins, improving the accuracy of join size estimates in a "bottom-up" order has been shown to significantly improve performance [2]. Our research therefore aims to improve the estimation of two-table join sizes.
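
To illustrate how a uniformity assumption can fail on skewed data, the following minimal sketch compares the exact size of an equi-join with the textbook estimate |R| * |S| / max(ndv_R, ndv_S). The data and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter

def true_join_size(r_keys, s_keys):
    """Exact equi-join size: sum over keys of freq_R(key) * freq_S(key)."""
    r_freq, s_freq = Counter(r_keys), Counter(s_keys)
    return sum(count * s_freq[key] for key, count in r_freq.items())

def uniform_estimate(r_keys, s_keys):
    """Textbook estimate |R| * |S| / max(ndv_R, ndv_S).

    Assumes every join-key value is equally frequent in each table.
    """
    ndv = max(len(set(r_keys)), len(set(s_keys)))
    return len(r_keys) * len(s_keys) / ndv

# Hypothetical skewed tables: key 0 accounts for 90 of the 100 rows in each.
r = [0] * 90 + list(range(1, 11))
s = [0] * 90 + list(range(1, 11))

true_size = true_join_size(r, s)  # 90 * 90 + 10 * 1 = 8110
est = uniform_estimate(r, s)      # 100 * 100 / 11, roughly 909
```

In this contrived case the uniform estimate is almost nine times too small, which is the kind of error that can push an optimizer toward a poor join order.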

Certain join size estimation approaches that use offline samples fare poorly with filtering and can suffer from insufficient sample size [2]. Alternatively, query optimizers may use persisted histograms. However, the associated storage cost is a large deterrent, as it is for persisted offline samples [4]. In response, algorithms have been presented that conduct adaptive, block-level sampling during query optimization [5]. To the best of our knowledge, there is no algorithm for two-table join size estimation that 1) samples filtered base tables, 2) relies exclusively on persisted counts in B+-tree indexes, and 3) attempts to provide statistical confidence on estimates.

In this paper, we present and evaluate a novel join size estimation algorithm prototyped on SAP IQ. Our algorithm can be easily incorporated into query optimizers that use bottom-up enumeration and evaluate filter predicates prior to optimization. In contrast with existing machine-learning-based approaches [1], the algorithm is simpler to implement or to derive from existing support for sampling. Join size estimates are produced by combining the variance-based calculations of Oracle 12c [5] with the persisted index counts used in index-based sampling [2]. Sampling is thereby kept to a minimum: it stops once either a parameterized budget is surpassed or there is sufficient statistical confidence in an estimate. On a subset of industry-standard benchmark queries involving joins on skewed data, our method improved the overall execution time by 16%. Among queries whose execution plans were altered by our estimates, the mean percentage improvement in individual execution time was 34%.
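
The abstract does not give the algorithm's details, so the following is only a rough sketch of the general idea of budgeted adaptive sampling with a confidence-based stopping rule, not the author's method. It assumes a plain dictionary as a stand-in for counts persisted in a B+-tree index, row-level sampling with replacement from the filtered table, and a normal-approximation confidence interval. All names and parameters (`estimate_join_size`, `budget`, `batch`, `rel_error`, `z`) are hypothetical.

```python
import math
import random

def estimate_join_size(filtered_keys, index_counts, budget=1000,
                       batch=50, rel_error=0.1, z=1.96, seed=0):
    """Return (estimate, samples_used, confident).

    filtered_keys: join-key values of the rows of R that pass the filter.
    index_counts:  mapping key -> number of matching rows in S (assumed
                   stand-in for counts persisted in an index on S).
    batch must be at least 2 so the sample variance is defined.
    """
    rng = random.Random(seed)
    n_rows = len(filtered_keys)
    if n_rows == 0:
        return 0.0, 0, True
    n, total, total_sq = 0, 0.0, 0.0
    while n < budget:
        for _ in range(batch):
            key = rng.choice(filtered_keys)   # sample with replacement
            matches = index_counts.get(key, 0)
            n += 1
            total += matches
            total_sq += matches * matches
        mean = total / n
        # Unbiased sample variance, clamped against rounding error.
        var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
        half_width = z * math.sqrt(var / n)
        # Stop early once the interval is tight relative to the mean.
        if mean > 0 and half_width <= rel_error * mean:
            return n_rows * mean, n, True
    # Budget reached without sufficient confidence.
    return n_rows * (total / n), n, False

# Hypothetical usage: every key of R matches exactly 3 rows of S.
keys = list(range(100))
counts = {k: 3 for k in keys}
estimate, used, confident = estimate_join_size(keys, counts)
```

With zero variance in the match counts, the loop stops after the first batch; on skewed counts it keeps sampling until the interval narrows or the budget runs out, in which case the returned flag reports that the estimate lacks the requested confidence.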

References

[1] A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CoRR, abs/1809.00677, 2018.

[2] V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. www.cidrdb.org, 2017.

[3] G. Moerkotte, D. DeHaan, N. May, A. Nica, and A. Böhm. Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA. In C. E. Dyreson, F. Li, and M. T. Özsu, editors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 361-372. ACM, 2014.

[4] T. Wang and C. Chan. Improved Correlated Sampling for Join Size Estimation. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pages 325-336. IEEE, 2020.

[5] M. Zait, S. Chakkappen, S. Budalakoti, S. R. Valluri, R. Krishnamachari, and A. Wood. Adaptive Statistics in Oracle 12c. Proc. VLDB Endow., 10(12):1813-1824, 2017.

Index Terms

Index-Based Join Size Estimation Using Adaptive Sampling
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query optimization
      2. Online analytical processing engines

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

June 2021

2969 pages

ISBN:9781450383431

DOI:10.1145/3448016

General Chairs: Guoliang Li, Tsinghua University (China); Zhanhuai Li, Northwestern Polytechnical University (China)

Program Chairs: Stratos Idreos, Harvard University (USA); Divesh Srivastava, AT&T (USA)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Conference

SIGMOD/PODS '21

Sponsor:

SIGMOD

SIGMOD/PODS '21: International Conference on Management of Data

June 20 - 25, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
