DOI: 10.1145/1007568.1007641

CORDS: automatic discovery of correlations and soft functional dependencies

Published: 13 June 2004

Abstract

The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers, which usually assume that columns are statistically independent, to underestimate the selectivities of conjunctive predicates by orders of magnitude. We introduce CORDS, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between columns. CORDS searches for column pairs that might have interesting and useful dependency relations by systematically enumerating candidate pairs and simultaneously pruning unpromising candidates using a flexible set of heuristics. A robust chi-squared analysis is applied to a sample of column values in order to identify correlations, and the number of distinct values in the sampled columns is analyzed to detect soft functional dependencies. CORDS can be used as a data mining tool, producing dependency graphs that are of intrinsic interest. We focus primarily on the use of CORDS in query optimization. Specifically, CORDS recommends groups of columns on which to maintain certain simple joint statistics. These "column-group" statistics are then used by the optimizer to avoid naive selectivity estimates based on inappropriate independence assumptions. This approach, because of its simplicity and judicious use of sampling, is relatively easy to implement in existing commercial systems, has very low overhead, and scales well to the large numbers of columns and large table sizes found in real-world databases. Experiments with a prototype implementation show that the use of CORDS in query optimization can speed up query execution by an order of magnitude. CORDS can be used in tandem with query feedback systems such as the LEO learning optimizer, leveraging the infrastructure of such systems to correct bad selectivity estimates and ameliorating the poor performance of feedback systems during slow learning phases.
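
The two tests named in the abstract can be illustrated with a minimal Python sketch. It is not the CORDS implementation. It assumes a plain Pearson chi-squared statistic rather than the paper's robust variant, and it assumes that a soft functional dependency C1 => C2 can be scored by the ratio of distinct C1 values to distinct (C1, C2) pairs in the sample. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def sample_rows(rows, k, seed=0):
    """Draw a simple random sample of at most k rows (hypothetical helper)."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


def chi_squared(pairs):
    """Plain Pearson chi-squared statistic for independence of two columns."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    stat = 0.0
    for a, count_a in left.items():
        for b, count_b in right.items():
            expected = count_a * count_b / n  # cell count under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def fd_strength(pairs):
    """Assumed score: distinct C1 values / distinct (C1, C2) pairs.

    A value of 1.0 means C1 determines C2 within the given rows.
    """
    return len({a for a, _ in pairs}) / len(set(pairs))


def selectivities(pairs, a, b):
    """Return (independence estimate, actual) selectivity of C1 = a AND C2 = b."""
    n = len(pairs)
    sel_a = sum(1 for x, _ in pairs if x == a) / n
    sel_b = sum(1 for _, y in pairs if y == b) / n
    actual = sum(1 for p in pairs if p == (a, b)) / n
    return sel_a * sel_b, actual


# Toy data: in `dependent`, C2 = C1 // 2, so C1 determines C2.
dependent = [(i % 4, (i % 4) // 2) for i in range(200)]
# In `independent`, every (C1, C2) combination occurs equally often.
independent = [(i % 4, (i // 4) % 2) for i in range(200)]

sample = sample_rows(dependent, 50)
print(chi_squared(dependent), fd_strength(sample))
print(chi_squared(independent), fd_strength(independent))
print(selectivities(dependent, 0, 0))
```

On the toy data, the dependent pair yields a chi-squared statistic of 200.0 and a strength of 1.0, while the independent pair yields 0.0 and 0.5. For the predicate C1 = 0 AND C2 = 0 on the dependent data, the independence assumption gives a selectivity of 0.125 against an actual value of 0.25, which is the kind of underestimate that joint column-group statistics are meant to avoid.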

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004
988 pages
ISBN: 1581138598
DOI: 10.1145/1007568

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS04

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2024) Learned Query Optimization by Constraint-Based Query Plan Augmentation. Mathematics 12:19 (3102). DOI: 10.3390/math12193102. Online publication date: 3-Oct-2024
  • (2024) DAFDiscover: Robust Mining Algorithm for Dynamic Approximate Functional Dependencies on Dirty Data. Proceedings of the VLDB Endowment 17:11 (3484-3496). DOI: 10.14778/3681954.3682015. Online publication date: 1-Jul-2024
  • (2024) SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis. Proceedings of the VLDB Endowment 17:9 (2175-2184). DOI: 10.14778/3665844.3665849. Online publication date: 1-May-2024
  • (2024) POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance. Proceedings of the VLDB Endowment 17:6 (1350-1363). DOI: 10.14778/3648160.3648175. Online publication date: 1-Feb-2024
  • (2024) PLAQUE: Automated Predicate Learning at Query Time. Proceedings of the ACM on Management of Data 2:1 (1-25). DOI: 10.1145/3639301. Online publication date: 26-Mar-2024
  • (2024) Decentralized and Incremental Discovery of Relaxed Functional Dependencies Using Bitwise Similarity. IEEE Transactions on Knowledge and Data Engineering 36:12 (7380-7398). DOI: 10.1109/TKDE.2024.3403928. Online publication date: Dec-2024
  • (2024) Efficient Relaxed Functional Dependency Discovery with Minimal Set Cover. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3519-3531). DOI: 10.1109/ICDE60146.2024.00271. Online publication date: 13-May-2024
  • (2024) Measuring Approximate Functional Dependencies: A Comparative Study. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3505-3518). DOI: 10.1109/ICDE60146.2024.00270. Online publication date: 13-May-2024
  • (2024) Non-Invasive Fairness in Learning Through the Lens of Data Drift. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2164-2178). DOI: 10.1109/ICDE60146.2024.00172. Online publication date: 13-May-2024
  • (2024) Boosting Meaningful Dependency Mining with Clustering and Covariance Analysis. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (639-652). DOI: 10.1109/ICDE60146.2024.00055. Online publication date: 13-May-2024
