Estimating alphanumeric selectivity in the presence of wildcards

Authors:

Jeffrey Scott Vitter,

Bala Iyer

ACM SIGMOD Record, Volume 25, Issue 2

Pages 282 - 293

https://doi.org/10.1145/235968.233341

Published: 01 June 1996

Abstract

The success of commercial query optimizers and database management systems (object-oriented or relational) depends on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the growing popularity of textual data stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL LIKE predicate [Dat]. Techniques used for estimating numeric selectivity are not suited for estimating alphanumeric selectivity.

In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate.

We evaluate our methods empirically in the context of the TPC-D benchmark. We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise.
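To make the idea in the abstract concrete, the following is a minimal sketch, not the paper's actual structure or any of its estimation methods. It assumes a flat dictionary of substring counts (a toy stand-in for a pruned, count-annotated suffix tree), a hypothetical `max_len` cutoff standing in for pruning, and a naive independence assumption when a pattern has several wildcard-separated segments. The class and method names are illustrative only.

```python
from collections import defaultdict


class SubstringCounts:
    """Toy stand-in for a pruned count suffix tree.

    Maps every substring of at most max_len characters to the number of
    rows that contain it at least once. (Hypothetical helper, not the
    structure described in the paper.)
    """

    def __init__(self, rows, max_len=4):
        self.num_rows = len(rows)
        self.max_len = max_len
        counts = defaultdict(int)
        # Statistics generation pass: one scan over the column values.
        for row in rows:
            seen = set()  # count each substring at most once per row
            for start in range(len(row)):
                stop = min(len(row), start + max_len)
                for end in range(start + 1, stop + 1):
                    seen.add(row[start:end])
            for sub in seen:
                counts[sub] += 1
        self.counts = dict(counts)

    def substring_selectivity(self, sub):
        """Estimated fraction of rows that contain sub."""
        if self.num_rows == 0:
            return 0.0
        if not sub:
            return 1.0
        if len(sub) <= self.max_len:
            return self.counts.get(sub, 0) / self.num_rows
        # Longer than anything stored: every window of max_len characters
        # must also occur, so the rarest window gives an upper bound.
        windows = (sub[i:i + self.max_len]
                   for i in range(len(sub) - self.max_len + 1))
        return min(self.counts.get(w, 0) for w in windows) / self.num_rows

    def like_selectivity(self, pattern):
        """Estimate for patterns of the form %s1%s2%...%.

        Assumption: segments occur independently of each other.
        Anchoring, segment order, and the '_' wildcard are ignored.
        """
        estimate = 1.0
        for segment in pattern.split("%"):
            if segment:
                estimate *= self.substring_selectivity(segment)
        return estimate


rows = ["apple", "maple", "grape", "lemon"]
stats = SubstringCounts(rows, max_len=4)
print(stats.substring_selectivity("ap"))
print(stats.like_selectivity("%ap%le%"))
```

The independence assumption is the simplest way to combine per-segment estimates and can be off when segments are correlated; in the toy data above, the true fraction of rows matching `%ap%le%` is 0.5, which differs from the product of the two segment estimates.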

References

[1]

M. M. Astrahan, M. Schkolnick, and K. Y. Whang, "Approximating the Number of Unique Values of an Attribute Without Sorting," Inf. Sys. 12 (1987), 11-15.

[2]

G. Bhargava, P. Goel, and B. R. Iyer, "Hypergraph Based Reorderings of Outer Join Queries with Complex Predicates," Proc. of the 1995 ACM SIGMOD Conference, 304-315.

[3]

S. Brin, J. Davis, and H. Garcia-Molina, "Copy Detection Mechanisms for Digital Documents," Proc. of the 1995 ACM SIGMOD Conference.

[4]

S. Bunton and G. Borriello, "Practical Dictionary Management for Hardware Data Compression," Dept. of Comp. Sci., Univ. of Washington, FR-35, 1991.

[5]

K. Curewitz, P. Krishnan, and J. S. Vitter, "Practical Prefetching via Data Compression," Proc. of the 1993 ACM SIGMOD Conference, 257-266.

[6]

C. Date, An Introduction to Database Systems, Addison-Wesley, 1981.

[7]

E. R. Fiala and D. H. Greene, "Data Compression with Finite Windows," Comm. of the ACM 32 (April 1989), 490-505.

[8]

P. Haas and A. Swami, "Sequential Sampling Procedures for Query Size Estimation," Proc. of the 1992 ACM SIGMOD Conference, 341-350.

[9]

P. J. Haas and A. N. Swami, "Sampling-Based Selectivity for Joins Using Augmented Frequent Value Statistics," Proc. of the 11th Intl. Conf. on Data Engg. (March 1995).

[10]

M. A. Hernandez and S. J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. of the 1995 ACM SIGMOD Conference, 127-138.

[11]

Y. E. Ioannidis and V. Poosala, "Balancing Histogram Optimality and Practicality for Query Result Size Estimation," Proc. of the 1995 ACM SIGMOD Conference.

[12]

D. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973.

[13]

P. Krishnan, "Online Prediction Algorithms for Databases and Operating Systems," Brown Univ. Ph.D. thesis, May 1995, also available as Brown Univ. technical report CS-95-24.

[14]

R. Lipton, J. Naughton, and D. Schneider, "Practical Selectivity Estimation through Adaptive Sampling," Proc. of the 1990 ACM SIGMOD Conference, 1-11.

[15]

R. J. Lipton and J. F. Naughton, "Query Size Estimation by Adaptive Sampling," J. Comput. Sys. Sci. 51 (August 1995), 18-25.

[16]

E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm," J. of the ACM 23 (1976), 262-272.

[17]

D. R. Morrison, "PATRICIA-A Practical Algorithm to Retrieve Information Coded in Alphanumeric," J. of the ACM 15 (1968), 514-534.

[18]

P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, "Access Path Selection in a Relational Database Management System," Proc. of the 1979 ACM SIGMOD Conf., 23-34.

[19]

Transaction Processing Performance Council, TPC, "TPC Benchmark™ D (Decision Support), Working Draft 6.0," 1994, F. Raab (ed.).

[20]

V. N. Vapnik, Estimation of Dependencies based on Empirical Data, Springer Verlag, 1982.

[21]

J. T. Wang, G. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang, "Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results," Proc. of the 1994 ACM SIGMOD Conference (May 1994).

[22]

M. Wang, Duke Univ., personal communication.

[23]

P. Weiner, "Linear Pattern Matching Algorithms," Proc. of the IEEE 14th Annual Symp. on Switching and Automata Theory (October 1973), 1-11.

[24]

K. Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, "A Linear-Time Probabilistic Counting Algorithm for Database Applications," ACM Trans. on Database Sys. 15 (June 1990), 208-229.

[25]

J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. on Inf. Theory 23 (May 1977).

[26]

J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Trans. on Inf. Theory 24 (September 1978), 530-536.

Cited By

He Z, Wang J, Jiang M, Hu L, Zou Q (2024). Random Subsequence Forests. Information Sciences (120478). Online publication date: Mar-2024.
https://doi.org/10.1016/j.ins.2024.120478
Shetiya S, Thirumuruganathan S, Koudas N, Das G (2021). Astrid. Proceedings of the VLDB Endowment 14:4 (471-484). Online publication date: 22-Feb-2021.
https://dl.acm.org/doi/10.14778/3436905.3436907
Willkomm J, Schäler M, Böhm K (2021). Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees. Database Systems for Advanced Applications (721-737). Online publication date: 6-Apr-2021.
https://doi.org/10.1007/978-3-030-73197-7_50

Index Terms

Estimating alphanumeric selectivity in the presence of wildcards
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

ACM SIGMOD Record Volume 25, Issue 2

June 1996

557 pages

ISSN:0163-5808

DOI:10.1145/235968

Chairman: T. H. Merrett (McGill Univ.)
Editors: H. V. Jagadish, Inderpal Singh Mumick

SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data
June 1996
560 pages
ISBN:0897917944
DOI:10.1145/233269
Editor:
Jennifer Widom
Stanford Univ., Stanford, CA

Copyright © 1996 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1996

Published in SIGMOD Volume 25, Issue 2

Qualifiers

Article

Bibliometrics

Total Citations: 47
Total Downloads: 579
Downloads (Last 12 months): 71
Downloads (Last 6 weeks): 21

Reflects downloads up to 15 Oct 2024

Cited By

Aytimur M, Cakmak A (2020). Using Positional Sequence Patterns to Estimate the Selectivity of SQL LIKE Queries. Expert Systems with Applications (113762). Online publication date: Jul-2020.
https://doi.org/10.1016/j.eswa.2020.113762
Chen L, Dobra A (2013). Histograms as statistical estimators for aggregate queries. Information Systems 38:2 (213-230). Online publication date: 1-Apr-2013.
https://dl.acm.org/doi/10.1016/j.is.2012.08.003
Kim Y, Park H, Shim K, Woo K (2013). Efficient processing of substring match queries with inverted variable-length gram indexes. Information Sciences 244 (119-141). Online publication date: Sep-2013.
https://doi.org/10.1016/j.ins.2013.04.037
Kim Y, Woo K, Park H, Shim K (2010). Efficient processing of substring match queries with inverted q-gram indexes. 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010) (721-732). Online publication date: Mar-2010.
https://doi.org/10.1109/ICDE.2010.5447866
Behm A, Ji S, Li C, Lu J (2009). Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Proceedings of the 2009 IEEE International Conference on Data Engineering (604-615). Online publication date: 29-Mar-2009.
https://dl.acm.org/doi/10.1109/ICDE.2009.32
Uemura T, Ikeda D, Arimura H (2008). Unsupervised Spam Detection by Document Complexity Estimation. Discovery Science (319-331). Online publication date: 13-Oct-2008.
https://dl.acm.org/doi/10.1007/978-3-540-88411-8_30
Tata S, Friedman J, Swaroop A (2006). Declarative Querying for Biological Sequences. Proceedings of the 22nd International Conference on Data Engineering. Online publication date: 3-Apr-2006.
https://dl.acm.org/doi/10.1109/ICDE.2006.47