Estimating alphanumeric selectivity in the presence of wildcards

Published: 01 June 1996

Abstract

The success of commercial query optimizers and database management systems (object-oriented or relational) depends on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the growing popularity of storing textual data in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL LIKE predicate [Dat]. Techniques used for estimating numeric selectivity are not suited to estimating alphanumeric selectivity.

In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate.

We evaluate our methods empirically in the context of the TPC-D benchmark. We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise.
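As a rough illustration of the idea of counting substrings during a statistics pass and consulting those counts at costing time, the following is a minimal sketch and not the paper's method. It builds a naive, unpruned suffix trie over a column and reads off the selectivity of a predicate of the form LIKE '%s%'. The names `TrieNode`, `build_count_trie`, and `estimate_selectivity` are hypothetical. Because nothing is pruned here the answer is exact; the compact structure described in the abstract would instead require one of the estimation methods the paper studies.

```python
class TrieNode:
    """One node of a naive suffix trie; the path from the root spells a substring."""

    def __init__(self):
        self.children = {}
        self.row_count = 0   # number of rows whose value contains this substring
        self.last_row = -1   # last row id that touched this node (avoids double counting)


def build_count_trie(values):
    """Statistics pass: insert every suffix of every column value (assumed strings)."""
    root = TrieNode()
    for row_id, value in enumerate(values):
        for start in range(len(value)):
            node = root
            for ch in value[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = TrieNode()
                    node.children[ch] = child
                node = child
                # Count each row at most once per substring.
                if node.last_row != row_id:
                    node.last_row = row_id
                    node.row_count += 1
    return root


def estimate_selectivity(root, substring, num_rows):
    """Fraction of rows matching LIKE '%substring%' according to the trie."""
    if num_rows == 0:
        return 0.0
    if not substring:
        return 1.0  # '%%' matches every row
    node = root
    for ch in substring:
        node = node.children.get(ch)
        if node is None:
            return 0.0  # substring never occurs in the column
    return node.row_count / num_rows


column = ["banana", "bandana", "cabana", "apple"]
trie = build_count_trie(column)
print(estimate_selectivity(trie, "ana", len(column)))  # 0.75
print(estimate_selectivity(trie, "xyz", len(column)))  # 0.0
```

The unpruned trie grows quadratically with value length, which is why a practical statistics structure has to be compact and therefore approximate.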

References

[1] M. M. Astrahan, M. Schkolnick, and K. Y. Whang, "Approximating the Number of Unique Values of an Attribute Without Sorting," Inf. Sys. 12 (1987), 11-15.
[2] G. Bhargava, P. Goel, and B. R. Iyer, "Hypergraph Based Reorderings of Outer Join Queries with Complex Predicates," Proc. of the 1995 ACM SIGMOD Conference, 304-315.
[3] S. Brin, J. Davis, and H. Garcia-Molina, "Copy Detection Mechanisms for Digital Documents," Proc. of the 1995 ACM SIGMOD Conference.
[4] S. Bunton and G. Borriello, "Practical Dictionary Management for Hardware Data Compression," Dept. of Comp. Sci., Univ. of Washington, FR-35, 1991.
[5] K. Curewitz, P. Krishnan, and J. S. Vitter, "Practical Prefetching via Data Compression," Proc. of the 1993 ACM SIGMOD Conference, 257-266.
[6] C. Date, An Introduction to Database Systems, Addison-Wesley, 1981.
[7] E. R. Fiala and D. H. Greene, "Data Compression with Finite Windows," Comm. of the ACM 32 (April 1989), 490-505.
[8] P. Haas and A. Swami, "Sequential Sampling Procedures for Query Size Estimation," Proc. of the 1992 ACM SIGMOD Conference, 341-350.
[9] P. J. Haas and A. N. Swami, "Sampling-Based Selectivity for Joins Using Augmented Frequent Value Statistics," Proc. of the 11th Intl. Conf. on Data Engg. (March 1995).
[10] M. A. Hernandez and S. J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. of the 1995 ACM SIGMOD Conference, 127-138.
[11] Y. E. Ioannidis and V. Poosala, "Balancing Histogram Optimality and Practicality for Query Result Size Estimation," Proc. of the 1995 ACM SIGMOD Conference.
[12] D. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973.
[13] P. Krishnan, "Online Prediction Algorithms for Databases and Operating Systems," Brown Univ. Ph.D. thesis, May 1995, also available as Brown Univ. technical report CS-95-24.
[14] R. Lipton, J. Naughton, and D. Schneider, "Practical Selectivity Estimation through Adaptive Sampling," Proc. of the 1990 ACM SIGMOD Conference, 1-11.
[15] R. J. Lipton and J. F. Naughton, "Query Size Estimation by Adaptive Sampling," J. Comput. Sys. Sci. 51 (August 1995), 18-25.
[16] E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm," J. of the ACM 23 (1976), 262-272.
[17] D. R. Morrison, "PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric," J. of the ACM 15 (1968), 514-534.
[18] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, "Access Path Selection in a Relational Database Management System," Proc. of the 1979 ACM SIGMOD Conf., 23-34.
[19] Transaction Processing Performance Council, TPC, "TPC Benchmark™ D (Decision Support), Working Draft 6.0," 1994, F. Raab (ed.).
[20] V. N. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer Verlag, 1982.
[21] J. T. Wang, G. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang, "Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results," Proc. of the 1994 ACM SIGMOD Conference (May 1994).
[22] M. Wang, Duke Univ., personal communication.
[23] P. Weiner, "Linear Pattern Matching Algorithms," Proc. of the IEEE 14th Annual Symp. on Switching and Automata Theory (October 1973), 1-11.
[24] K. Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, "A Linear-Time Probabilistic Counting Algorithm for Database Applications," ACM Trans. on Database Sys. 15 (June 1990), 208-229.
[25] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. on Inf. Theory 23 (May 1977).
[26] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Trans. on Inf. Theory 24 (September 1978), 530-536.

Published In

ACM SIGMOD Record, Volume 25, Issue 2
June 1996
557 pages
ISSN: 0163-5808
DOI: 10.1145/235968

SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data
June 1996
560 pages
ISBN: 0897917944
DOI: 10.1145/233269

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
