Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

FactorJoin: A New Cardinality Estimation Framework for Join Queries

Published: 30 May 2023 Publication History

Abstract

Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of the join queries. They either rely on simplified assumptions leading to ineffective cardinality estimates or build large models to understand the complicated data distributions, leading to long planning times and a lack of generalizability across queries.
In this paper, we propose a new framework FactorJoin for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method to efficiently handle joins with the learning-based methods to accurately capture attribute correlation Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query comes, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality.
Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has a small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin can produce more effective estimates than the previous state-of-the-art learning-based methods, with 40x less estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMS.

Supplemental Material

MP4 File
Presentation video - short version of "FactorJoin: A New Cardinality Estimation Framework for Join Queries"
PDF File
Read me
ZIP File
Source Code

References

[1]
Mahmoud Abo Khamis, Hung Q Ngo, and Dan Suciu. 2017. What do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog have to do with one another?. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 429--444.
[2]
Mert Akdere, Ugur Cetintemel, Matteo Riondato, Eli Upfal, and Stanley B Zdonik. 2012. Learning-based query performance modeling and prediction. In 2012 IEEE 28th International Conference on Data Engineering. IEEE, 390--401.
[3]
Albert Atserias, Martin Grohe, and Dániel Marx. 2008. Size bounds and query plans for relational joins. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 739--748.
[4]
Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. 2001. STHoles: a multidimensional workload-aware histogram. In SIGMOD. 211--222.
[5]
Walter Cai, Magdalena Balazinska, and Dan Suciu. 2019. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In SIGMOD. 18--35.
[6]
C. Chow and Cong Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory 14, 3 (1968), 462--467.
[7]
Alberto Dell'Era. 2007. Join Over Histograms. Available on www. adellera. it/investigations/join_over_histograms (2007).
[8]
Alberto Dell'Era. 2017. Oracle Database Online Documentation 12c Release 1 (12.1). Available at(2017).
[9]
Amol Deshpande, Minos Garofalakis, and Rajeev Rastogi. 2001. Independence is good: Dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Record 30, 2 (2001), 199--210.
[10]
Postgresql Documentation 12. 2020. Chapter 70.1. Row Estimation Examples. https://www.postgresql.org/docs/current/row-estimation-examples.html (2020).
[11]
Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. 2019. Selectivity estimation for range predicates using lightweight models. PVLDB 12, 9 (2019), 1044--1057.
[12]
Dennis Fuchs, Zhen He, and Byung Suk Lee. 2007. Compressed histograms with arbitrary bucket layouts for selectivity estimation. Information Sciences 177, 3 (2007), 680--702.
[13]
Lise Getoor, Benjamin Taskar, and Daphne Koller. 2001. Selectivity estimation using probabilistic models. In SIGMOD. 461--472.
[14]
Dimitrios Gunopulos, George Kollios, Vassilis J Tsotras, and Carlotta Domeniconi. 2000. Approximating multi- dimensional aggregate range queries over real attributes. In SIGMOD. 463--474.
[15]
Dimitrios Gunopulos, George Kollios, Vassilis J Tsotras, and Carlotta Domeniconi. 2005. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14, 2 (2005), 137--154.
[16]
Max Halford, Philippe Saint-Pierre, and Franck Morvan. 2019. An approach based on bayesian networks for query selectivity estimation. DASFAA 2 (2019).
[17]
Yuxing Han. 2021. Github repository: E2E benchmark. https://github.com/Nathaniel-Han/End-to-End-CardEst- Benchmark (2021).
[18]
Yuxing Han, Ziniu Wu, Peizhi Wu, Rong Zhu, Jingyi Yang, Liang Wei Tan, Kai Zeng, Gao Cong, Yanzhao Qin, Andreas Pfadler, et al . 2021. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation. VLDB Endowment (2021).
[19]
Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2019. Multi-attribute selectivity estimation using deep learning. In SIGMOD.
[20]
Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2020. Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1035--1050.
[21]
Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-tuning, gpu-accelerated kernel density models for multidi- mensional selectivity estimation. In SIGMOD. 1477--1492.
[22]
Axel Hertzschuch, Claudio Hartmann, Dirk Habich, and Wolfgang Lehner. 2021. Simplicity Done Right for Join Ordering. CIDR (2021).
[23]
Benjamin Hilprecht. 2019. Github repository: deepdb public. https://github.com/DataManagementLab/deepdb-public (2019).
[24]
Benjamin Hilprecht and Carsten Binnig. 2022. One Model to Rule them All: Towards Zero-Shot Learning for Databases. CIDR (2022).
[25]
Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: learn from data, not from queries!. In PVLDB.
[26]
Yannis Ioannidis. 2003. The history of histograms (abridged). In Proceedings 2003 VLDB Conference. Elsevier, 19--30.
[27]
Yannis E Ioannidis. 1993. Universality of serial histograms. In VLDB, Vol. 93. Citeseer, 256--267.
[28]
Yannis E Ioannidis and Stavros Christodoulakis. 1991. On the propagation of errors in the size of join results. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of data. 268--277.
[29]
Yannis E Ioannidis and Stavros Christodoulakis. 1993. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems (TODS) 18, 4 (1993), 709--748.
[30]
Yannis E Ioannidis and Viswanath Poosala. 1995. Balancing histogram optimality and practicality for query result size estimation. Acm Sigmod Record 24, 2 (1995), 233--244.
[31]
Andranik Khachatryan, Emmanuel Müller, Christian Stier, and Klemens Böhm. 2015. Improving accuracy and robustness of self-tuning histograms by subspace clustering. IEEE TKDE 27, 9 (2015), 2377--2389.
[32]
Martin Kiefer, Max Heimel, Sebastian Breß, and Volker Markl. 2017. Estimating join selectivities using bandwidth- optimized kernel density models. PVLDB 10, 13 (2017), 2085--2096.
[33]
Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2019. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR.
[34]
Daphne Koller and Nir Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT press.
[35]
Frank R Kschischang, Brendan J Frey, and H-A Loeliger. 2001. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory 47, 2 (2001), 498--519.
[36]
Seetha Lakshmi and Shaoyu Zhou. 1998. Selectivity estimation in extensible databases-A neural network approach. In VLDB, Vol. 98. 24--27.
[37]
Steffen L Lauritzen. 1996. Graphical models. Vol. 17. Clarendon Press.
[38]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? PVLDB 9, 3 (2015), 204--215.
[39]
Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. 2017. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR.
[40]
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In SIGMOD. 615--629.
[41]
Richard J Lipton, Jeffrey F Naughton, and Donovan A Schneider. 1990. Practical selectivity estimation through adaptive sampling. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data. 1--11.
[42]
Jie Liu, Wenqian Dong, Qingqing Zhou, and Dong Li. 2021. Fauce: fast and accurate deep ensembles with uncertainty for cardinality estimation. Proceedings of the VLDB Endowment 14, 11 (2021), 1950--1963.
[43]
Qiyu Liu, Yanyan Shen, and Lei Chen. 2021. LHist: Towards Learning Multi-dimensional Histogram for Massive Spatial Data. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 1188--1199.
[44]
H-A Loeliger. 2004. An introduction to factor graphs. IEEE Signal Processing Magazine 21, 1 (2004), 28--41.
[45]
Lin Ma, Bailu Ding, Sudipto Das, and Adith Swaminathan. 2020. Active learning for ML enhanced database systems. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 175--191.
[46]
David JC MacKay, David JC Mac Kay, et al. 2003. Information theory, inference and learning algorithms. Cambridge university press.
[47]
Tanu Malik, Randal C Burns, and Nitesh V Chawla. 2007. A Black-Box Approach to Query Cardinality Estimation. In CIDR. Citeseer, 56--67.
[48]
Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, and Tim Kraska. 2021. Bao: Learning to steer query optimizers. SIGMOD (2021).
[49]
Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A learned query optimizer. arXiv preprint arXiv:1904.03711 (2019).
[50]
Ryan Marcus and Olga Papaemmanouil. 2018. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 1--4.
[51]
Ryan Marcus and Olga Papaemmanouil. 2019. Plan-structured deep neural network models for query performance prediction. arXiv preprint arXiv:1902.00132 (2019).
[52]
M Muralikrishna and David J DeWitt. 1988. Equi-depth multidimensional histograms. In Proceedings of the 1988 ACM SIGMOD international conference on Management of data. 28--36.
[53]
Parimarjan Negi, Matteo Interlandi, Ryan Marcus, Mohammad Alizadeh, Tim Kraska, Marc Friedman, and Alekh Jindal. 2021. Steering query optimizers: A practical take on big data workloads. In Proceedings of the 2021 International Conference on Management of Data. 2557--2569.
[54]
Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2021. Flow-Loss: Learning Cardinality Estimates That Matter. arXiv preprint arXiv:2101.04964 (2021).
[55]
Hung Q Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2018. Worst-case optimal join algorithms. Journal of the ACM (JACM) 65, 3 (2018), 1--40.
[56]
Yeonsu Park, Seongyun Ko, Sourav S Bhowmick, Kyoungmin Kim, Kijae Hong, and Wook-Shin Han. 2020. G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1099--1114.
[57]
Hoifung Poon and Pedro Domingos. 2011. Sum-product networks: A new deep architecture. In ICCV Workshops. 689--690.
[58]
Viswanath Poosala and Yannis E Ioannidis. 1997. Selectivity estimation without the attribute value independence assumption. In VLDB, Vol. 97. 486--495.
[59]
MySQL 8.0 Reference Manual. 2020. Chapter 15.8.10.2 Configuring Non-Persistent Optimizer Statistics Parameters. https://dev.mysql.com/doc/refman/8.0/en/innodb-statistics-estimation.html (2020).
[60]
P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. In SIGMOD. 23--34.
[61]
Utkarsh Srivastava, Peter J Haas, Volker Markl, Marcel Kutsch, and Tam Minh Tran. 2006. Isomer: Consistent histogram construction using query feedback. In ICDE. 39--39.
[62]
Michael Stillger, Guy M Lohman, Volker Markl, and Mokhtar Kandil. 2001. LEO-DB2's learning optimizer. In PVLDB, Vol. 1. 19--28.
[63]
Ji Sun and Guoliang Li. 2019. An end-to-end learning-based cost estimator. VLDB (2019).
[64]
Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. 2011. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4, 11 (2011), 852--863.
[65]
Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. 2013. Efficiently adapting graphical models for selectivity estimation. Proceedings of the VLDB Endowment 1, 22 (2013).
[66]
Hai Wang and Kenneth C Sevcik. 2003. A multi-dimensional histogram for selectivity estimation and fast approximate query answering. In Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research. 328--342.
[67]
Jiayi Wang, Chengliang Chai, Jiabin Liu, and Guoliang Li. 2021. FACE: a normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment 15, 1 (2021), 72--84.
[68]
Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a learning optimizer for shared clouds. PVLDB 12, 3 (2018), 210--222.
[69]
Peizhi Wu and Gao Cong. 2021. A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data.
[70]
Ziniu Wu, Parimarjan Negi, Mohammad Alizadeh, Tim Kraska, and Samuel Madden. 2022. FactorJoin: A New Cardinality Estimation Framework for Join Queries. arXiv preprint arXiv:2212.05526 (2022).
[71]
Ziniu Wu, Amir Shaikhha, Rong Zhu, Kai Zeng, Yuxing Han, and Jingren Zhou. 2020. BayesCard: Revitilizing Bayesian Frameworks for Cardinality Estimation. arXiv preprint arXiv:2012.14743 (2020).
[72]
Ziniu Wu, Peilun Yang, Pei Yu, Rong Zhu, Yuxing Han, Yaliang Li, Defu Lian, Kai Zeng, and Jingren Zhou. 2022. A Unified Transferable Model for ML-Enhanced DBMS. CIDR (2022).
[73]
Ziniu Wu, Rong Zhu, Andreas Pfadler, Yuxing Han, Jiangneng Li, Zhengping Qian, Kai Zeng, and Jingren Zhou. 2020. FSPN: A New Class of Probabilistic Graphical Model. arXiv preprint arXiv:2011.09020 (2020).
[74]
Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2021. NeuroCard: One Cardinality Estimator for All Tables. PVLDB 14, 1 (2021), 61--73.
[75]
Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. PVLDB (2019).
[76]
Xiang Yu, Guoliang Li, Chengliang Chai, and Nan Tang. 2020. Reinforcement learning with tree-lstm for join order selection. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1297--1308.
[77]
Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In SIGMOD. 1525--1539.
[78]
Rong Zhu, Ziniu Wu, Yuxing Han, Kai Zeng, Andreas Pfadler, Zhengping Qian, Jingren Zhou, and Bin Cui. 2021. FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation. VLDB 14, 9 (2021), 1489--1502.

Cited By

View all
  • (2024)ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQLBig Data and Cognitive Computing10.3390/bdcc80700718:7(71)Online publication date: 27-Jun-2024
  • (2024)Presto's History-Based Query OptimizerProceedings of the VLDB Endowment10.14778/3685800.368582817:12(4077-4089)Online publication date: 8-Nov-2024
  • (2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 30-Aug-2024
  • Show More Cited By

Index Terms

  1. FactorJoin: A New Cardinality Estimation Framework for Join Queries

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Management of Data
    Proceedings of the ACM on Management of Data  Volume 1, Issue 1
    PACMMOD
    May 2023
    2807 pages
    EISSN:2836-6573
    DOI:10.1145/3603164
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2023
    Published in PACMMOD Volume 1, Issue 1

    Permissions

    Request permissions for this article.

    Badges

    Author Tags

    1. cardinality estimation
    2. query optimization

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. National Science Foundation

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1,144
    • Downloads (Last 6 weeks)114
    Reflects downloads up to 08 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQLBig Data and Cognitive Computing10.3390/bdcc80700718:7(71)Online publication date: 27-Jun-2024
    • (2024)Presto's History-Based Query OptimizerProceedings of the VLDB Endowment10.14778/3685800.368582817:12(4077-4089)Online publication date: 8-Nov-2024
    • (2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 30-Aug-2024
    • (2024)TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting MethodsProceedings of the VLDB Endowment10.14778/3665844.366586317:9(2363-2377)Online publication date: 6-Aug-2024
    • (2024)Disclosure-Compliant Query AnsweringProceedings of the ACM on Management of Data10.1145/36988082:6(1-28)Online publication date: 20-Dec-2024
    • (2024)A Universal Sketch for Estimating Heavy Hitters and Per-Element Frequency Moments in Data Streams with Bounded DeletionsProceedings of the ACM on Management of Data10.1145/36987992:6(1-28)Online publication date: 20-Dec-2024
    • (2024)Memory Management in Complex Join Queries: A Re-evaluation StudyProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698565(933-942)Online publication date: 20-Nov-2024
    • (2024)QardEst: Using Quantum Machine Learning for Cardinality Estimation of Join QueriesProceedings of the 1st Workshop on Quantum Computing and Quantum-Inspired Technology for Data-Intensive Systems and Applications10.1145/3665225.3665444(2-13)Online publication date: 9-Jun-2024
    • (2024)Precision Meets Resilience: Cross-Database Generalization with Uncertainty Quantification for Robust Cost EstimationProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679632(581-590)Online publication date: 21-Oct-2024
    • (2024)ASM in Action: Fast and Practical Learned Cardinality EstimationCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654728(460-463)Online publication date: 9-Jun-2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media