research-article
DOI: 10.1145/1807085.1807095

Understanding cardinality estimation using entropy maximization

Published: 06 June 2010

Abstract

Cardinality estimation is the problem of estimating the number of tuples returned by a query; it is a fundamentally important task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space, over possible worlds, that satisfies all provided statistical assertions and maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this paper we develop the mathematical techniques needed to use the MaxEnt model for predicting the cardinality of conjunctive queries.
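
As a rough illustration of the MaxEnt model described above, the following sketch enumerates every possible world of a tiny relation, fits the maximum-entropy distribution to two statistical assertions, and then predicts the cardinality of a new query. Everything here is an assumption made for illustration: the four-tuple universe `TUPLES`, the two assertions in `stats`, the reading of each assertion as an expected count, and the brute-force gradient ascent on the dual. It is not the analytical technique the paper develops, only a minimal numerical instance of the same model.

```python
import itertools
import math

# Hypothetical toy universe: the four possible tuples of a relation R(A, B).
TUPLES = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]

# A possible world is any subset of TUPLES (16 worlds in total).
WORLDS = [frozenset(c) for r in range(len(TUPLES) + 1)
          for c in itertools.combinations(TUPLES, r)]


def count(pred, world):
    """Number of tuples in `world` that satisfy the selection predicate."""
    return sum(1 for t in world if pred(t))


def world_probs(theta, feats):
    """Exponential-family distribution: P(world) is proportional to exp(theta . counts)."""
    scores = [math.exp(sum(th * fj for th, fj in zip(theta, f))) for f in feats]
    z = sum(scores)
    return [s / z for s in scores]


def maxent_fit(stats, steps=2000, lr=0.5):
    """Fit the MaxEnt distribution for assertions E[|q_j|] = d_j.

    `stats` is a list of (predicate, expected_count) pairs. The dual problem
    is concave, so plain gradient ascent on theta converges for this toy case.
    """
    feats = [[count(p, w) for p, _ in stats] for w in WORLDS]
    theta = [0.0] * len(stats)
    for _ in range(steps):
        probs = world_probs(theta, feats)
        for j, (_, target) in enumerate(stats):
            expected = sum(pr * f[j] for pr, f in zip(probs, feats))
            theta[j] += lr * (target - expected)
    return world_probs(theta, feats)


def estimate(pred, probs):
    """Expected cardinality of a new selection query under the fitted model."""
    return sum(pr * count(pred, w) for pr, w in zip(probs, WORLDS))


# Assumed assertions: E[|R|] = 2 and E[|sigma_{A=a1}(R)|] = 1.5.
stats = [(lambda t: True, 2.0), (lambda t: t[0] == "a1", 1.5)]
probs = maxent_fit(stats)

# New query: sigma_{A=a1 AND B=b1}(R).
est = estimate(lambda t: t == ("a1", "b1"), probs)
print(round(est, 4))  # approximately 0.75
```

In this toy case the two `a1` tuples are interchangeable under the given assertions, so the model splits the expected count of 1.5 evenly between them, which is why the estimate comes out near 0.75. Enumerating worlds is exponential in the number of tuples, so the approach is only usable for very small examples.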

    Published In

    PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
    June 2010
    350 pages
    ISBN:9781450300339
    DOI:10.1145/1807085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cardinality estimation
    2. database theory
    3. distinct value estimation
    4. maximum entropy

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '10: International Conference on Management of Data
    June 6 - 11, 2010
    Indianapolis, Indiana, USA

    Acceptance Rates

    PODS '10 Paper Acceptance Rate: 27 of 113 submissions, 24%
    Overall Acceptance Rate: 642 of 2,707 submissions, 24%

    Article Metrics

    • Downloads (Last 12 months): 8
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 25 Jan 2025

    Cited By

    • (2018) Generating Realistic Synthetic Population Datasets. ACM Transactions on Knowledge Discovery from Data 12:4 (1-22). DOI: 10.1145/3182383. Online publication date: 16-Apr-2018
    • (2013) Online ordering of overlapping data sources. Proceedings of the VLDB Endowment 7:3 (133-144). DOI: 10.14778/2732232.2732233. Online publication date: 1-Nov-2013
    • (2013) Efficiently adapting graphical models for selectivity estimation. The VLDB Journal, The International Journal on Very Large Data Bases 22:1 (3-27). DOI: 10.1007/s00778-012-0293-7. Online publication date: 1-Feb-2013
    • (2012) Understanding cardinality estimation using entropy maximization. ACM Transactions on Database Systems 37:1 (1-31). DOI: 10.1145/2109196.2109202. Online publication date: 6-Mar-2012
    • (2011) The VC-dimension of SQL queries and selectivity estimation through sampling. Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II (661-676). DOI: 10.5555/2034117.2034160. Online publication date: 5-Sep-2011
    • (2011) Data generation using declarative constraints. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (685-696). DOI: 10.1145/1989323.1989395. Online publication date: 12-Jun-2011
    • (2011) The VC-dimension of SQL queries and selectivity estimation through sampling. Proceedings of the 2011th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II (661-676). DOI: 10.1007/978-3-642-23783-6_42. Online publication date: 5-Sep-2011
