research-article

Detecting Logic Bugs of Join Optimizations in DBMS

Authors:

Dongxiang Zhang,

Gang ChenAuthors Info & Claims

Proceedings of the ACM on Management of Data, Volume 1, Issue 1

Article No.: 55, Pages 1 - 26

https://doi.org/10.1145/3588909

Published: 30 May 2023 Publication History

Abstract

Generation-based testing techniques have shown their effectiveness in detecting logic bugs of DBMS, which are often caused by improper implementation of query optimizers. Nonetheless, existing generation-based debug tools are limited to single-table queries and there is a substantial research gap regarding multi-table queries with join operators. In this paper, we propose TQS, a novel testing framework targeted at detecting logic bugs derived by queries involving multi-table joins. Given a target DBMS, TQS achieves the goal with two key components: Data-guided Schema and Query Generation (DSG) and Knowledge-guided Query Space Exploration (KQE). DSG addresses the key challenge of multi-table query debugging: how to generate ground-truth (query, result) pairs for verification. It adopts the database normalization technique to generate a testing schema and maintains a bitmap index for result tracking. To improve debug efficiency, DSG also artificially inserts some noises into the generated data. To avoid repetitive query space search, KQE forms the problem as isomorphic graph set discovery and combines the graph embedding and weighted random walk for query generation. We evaluated TQS on four popular DBMSs: MySQL, MariaDB, TiDB and PolarDB. Experimental results show that TQS is effective in finding logic bugs of join optimization in database management systems. It successfully detected 115 bugs within 24 hours, including 31 bugs in MySQL, 30 in MariaDB, 31 in TiDB, and 23 in PolarDB respectively.

Supplemental Material

MP4 File

Presentation video for SIGMOD 2023.

Download
134.39 MB

References

[1]

Gü nes Alucc, Olaf Hartig, M. Tamer Ö zsu, and Khuzaima Daudjee. 2014a. Diversified Stress Testing of RDF Data Management Systems. In ISWC. Springer, 197--212.

[2]

Günecs Alucc, Olaf Hartig, M Tamer Özsu, and Khuzaima Daudjee. 2014b. Diversified stress testing of RDF data management systems. In International Semantic Web Conference. 197--212.

[3]

Akhil Arora, Sakshi Sinha, Piyush Kumar, and Arnab Bhattacharya. 2018. HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces. Proc. VLDB Endow., Vol. 11, 8 (2018), 906--919.

Digital Library

[4]

Guillaume Bagan, Angela Bonifati, Radu Ciucanu, George H. L. Fletcher, Auré lien Lemay, and Nicky Advokaat. 2017. gMark: Schema-Driven Generation of Graphs and Queries. In ICDE. 63--64.

[5]

Hardik Bati, Leo Giakoumakis, Steve Herbert, and Aleksandras Surna. 2007. A genetic approach for random testing of database systems. In VLDB. ACM, 1243--1251.

[6]

Carsten Binnig, Donald Kossmann, Eric Lo, and M. Tamer Ö zsu. 2007. QAGen: generating query-aware test databases. In SIGMOD. ACM, 341--352.

[7]

Joachim Biskup. 1995. Achievements of Relational Database Schema Design Theory Revisited. In Semantics in Databases. Springer, 29--54.

[8]

Joachim Biskup, Umeshwar Dayal, and Philip A. Bernstein. 1979. Synthesizing Independent Database Schemas. In SIGMOD, Philip A. Bernstein (Ed.). ACM, 143--151.

[9]

Christian Bizer and Andreas Schultz. 2009. The berlin sparql benchmark. IJSWIS, Vol. 5, 2 (2009), 1--24.

[10]

Daniel Blum and Sara Cohen. 2011. Grr: generating random RDF. In Extended Semantic Web Conference. Springer, 16--30.

[11]

Manuel Bodirsky. 2015. Graph homomorphisms and universal algebra course notes. TU Dresden (2015).

[12]

Robert Brummayer and Armin Biere. 2009. Fuzzing and delta-debugging SMT solvers. In Proceedings of the 7th International Workshop on Satisfiability Modulo Theories. 1--5.

Digital Library

[13]

Nicolas Bruno and Surajit Chaudhuri. 2005. Flexible Database Generators. In VLDB. ACM, 1097--1107.

[14]

Nicolas Bruno, Surajit Chaudhuri, and Dilys Thomas. 2006. Generating Queries with Cardinality Constraints for DBMS Testing. IEEE Trans. Knowl. Data Eng., Vol. 18, 12 (2006), 1721--1725.

Digital Library

[15]

Wei Cao, Yingqiang Zhang, Xinjun Yang, Feifei Li, Sheng Wang, Qingda Hu, Xuntao Cheng, Zongzhi Chen, Zhenjun Liu, Jing Fang, Bo Wang, Yuhui Wang, Haiqing Sun, Ze Yang, Zhushi Cheng, Sen Chen, Jian Wu, Wei Hu, Jianwei Zhao, Yusong Gao, Songlu Cai, Yunyang Zhang, and Jiawang Tong. 2021. PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers. In SIGMOD. ACM, 2477--2489.

[16]

Pascal Cuoq, Benjamin Monate, Anne Pacalet, Virgile Prevosto, John Regehr, Boris Yakobowski, and Xuejun Yang. 2012. Testing Static Analyzers with Randomly Generated Programs. In NASA Formal Methods - 4th International Symposium, NFM. Springer, 120--125.

[17]

Angjela Davitkova, Damjan Gjurovski, and Sebastian Michel. 2022. LMKG: Learned Models for Cardinality Estimation in Knowledge Graphs. In EDBT. OpenProceedings.org, 2:169--2:182.

[18]

DB-Engines. 2018. DB-Engines Ranking. [EB/OL]. https://db-engines.com/en/ranking.

[19]

Jim Diederich and Jack Milton. 1988. New Methods and Fast Algorithms for Database Normalization. ACM Trans. Database Syst., Vol. 13, 3 (1988), 339--365.

Digital Library

[20]

Chi Thang Duong, Dung Hoang, Hongzhi Yin, Matthias Weidlich, Quoc Viet Hung Nguyen, and Karl Aberer. 2021. Efficient Streaming Subgraph Isomorphism with Graph Neural Networks. VLDB, Vol. 14, 5 (2021), 730--742.

Digital Library

[21]

Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. 2015. The LDBC social network benchmark: Interactive workload. In SIGMOD. 619--630.

[22]

M. R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.

Digital Library

[23]

Jim Gray, Prakash Sundaresan, Susanne Englert, Kenneth Baclawski, and Peter J. Weinberger. 1994. Quickly Generating Billion-Record Synthetic Databases. In SIGMOD. ACM Press, 243--252.

[24]

Zhongxian Gu, Mohamed A Soliman, and Florian M Waas. 2012. Testing the accuracy of query optimizers. In Proceedings of the Fifth International Workshop on Testing Database Systems. 1--6.

Digital Library

[25]

Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, Vol. 3, 2--3 (2005), 158--182.

Digital Library

[26]

Kenneth Houkjær, Kristian Torp, and Rico Wind. 2006. Simple and Realistic Data Generation. In VLDB. ACM, 1243--1246.

[27]

Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, et al. 2020. TiDB: a Raft-based HTAP database. VLDB, Vol. 13, 12 (2020), 3072--3084.

Digital Library

[28]

Yka Huhtala, Juha K"arkk"ainen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. The computer journal, Vol. 42, 2 (1999), 100--111.

[29]

Jinho Jung, Hong Hu, Joy Arulraj, Taesoo Kim, and Woon-Hak Kang. 2019. APOLLO: Automatic Detection and Diagnosis of Performance Regressions in Database Systems. VLDB, Vol. 13, 1 (2019), 57--70.

Digital Library

[30]

Timotej Kapus and Cristian Cadar. 2017. Automatic testing of symbolic execution engines via program generation and differential testing. In ASE. IEEE Computer Society, 590--600.

Digital Library

[31]

Shadi Abdul Khalek, Bassem Elkarablieh, Yai O. Laleye, and Sarfraz Khurshid. 2008. Query-Aware Test Generation Using a Relational Constraint Solver. In ASE. IEEE Computer Society, 238--247.

[32]

Shadi Abdul Khalek and Sarfraz Khurshid. 2010. Automated SQL query generation for systematic testing of database engines. In ASE. ACM, 329--332.

[33]

Vu Le, Mehrdad Afshari, and Zhendong Su. 2014. Compiler validation via equivalence modulo inputs. In PLDI. ACM, 216--226.

[34]

Eric Lo, Carsten Binnig, Donald Kossmann, M. Tamer Ö zsu, and Wing-Kai Hon. 2010. A framework for testing DBMS features. VLDB J., Vol. 19, 2 (2010), 203--230.

Digital Library

[35]

David Maier. 1983. The Theory of Relational Databases. Computer Science Press.

Digital Library

[36]

MariaDB. 2022. MariaDB hints. [EB/OL]. https://mariadb.com/kb/en/optimizer-switch/.

[37]

Mariadb. 2022. Mariadb Homepage. [EB/OL]. https://mariadb.org/.

[38]

William M. McKeeman. 1998. Differential Testing for Software. Digit. Tech. J., Vol. 10, 1 (1998), 100--107.

[39]

Chaitanya Mishra, Nick Koudas, and Calisto Zuzarte. 2008. Generating targeted queries for database testing. In SIGMOD. ACM, 499--510.

[40]

MySQL. 2022a. MySQL hints. [EB/OL]. https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html.

[41]

MySQL. 2022b. MySQL Homepage. [EB/OL]. https://www.mysql.com.

[42]

Stack Overflow. 2021. Developer Survey Results. [EB/OL]. https://insights.stackoverflow.com/survey/2021.

[43]

Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. In SIGMOD, FatmaÖzcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 821--833.

[44]

Thorsten Papenbrock and Felix Naumann. 2017. Data-driven Schema Normalization. In EDBT. OpenProceedings.org, 342--353.

[45]

Meikel Poess and John M. Stephens. 2004. Generating Thousand Benchmark Queries in Seconds. In VLDB. Morgan Kaufmann, 1045--1053.

[46]

Kim-Thomas Rehmann, Changyun Seo, Dongwon Hwang, Binh Than Truong, Alexander Boehm, and Dong Hun Lee. 2016. Performance Monitoring in SAP HANA's Continuous Integration Process. SIGMETRICS Perform. Evaluation Rev., Vol. 43, 4 (2016), 43--52.

Digital Library

[47]

Manuel Rigger and Zhendong Su. 2020a. Detecting optimization bugs in database engines via non-optimizing reference engine construction. In ACM Joint Meeting on ESEC and FSE. 1140--1152.

Digital Library

[48]

Manuel Rigger and Zhendong Su. 2020b. Finding bugs in database systems via query partitioning. Proceedings of the ACM on Programming Languages, Vol. 4, OOPSLA (2020), 1--30.

Digital Library

[49]

Manuel Rigger and Zhendong Su. 2020c. SQLancer. [EB/OL]. https://github.com/sqlancer/sqlancer.

[50]

Manuel Rigger and Zhendong Su. 2020d. Testing database engines via pivoted query synthesis. In OSDI 20. 667--682.

[51]

Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. 2009. SP^ 2Bench: a SPARQL performance benchmark. In ICDE. IEEE, 222--233.

[52]

Andreas Seltenreich. 2020. SQLSmith. [EB/OL]. https://github.com/anse1/sqlsmith.

[53]

Donald R. Slutz. 1998. Massive Stochastic Testing of SQL. In VLDB, Ashish Gupta, Oded Shmueli, and Jennifer Widom (Eds.). Morgan Kaufmann, 618--622.

[54]

TiDB. 2022. TiDB hints. [EB/OL]. https://docs.pingcap.com/tidb/v5.3/optimizer-hints.

[55]

Manasi Vartak, Venkatesh Raghavan, and Elke A. Rundensteiner. 2010. QRelX: generating meaningful queries that provide cardinality assurance. In SIGMOD. ACM, 1215--1218.

[56]

Ziheng Wei and Sebastian Link. 2019. Embedded Functional Dependencies and Data-completeness Tailored Database Design. VLDB, Vol. 12, 11 (2019), 1458--1470.

Digital Library

[57]

Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. 2002. Compressing Bitmap Indexes for Faster Search Operations. In SSDBM. IEEE Computer Society, 99--108.

[58]

Jiaqi Yan, Qiuye Jin, Shrainik Jain, Stratis D. Viglas, and Allison W. Lee. 2018. Snowtrail: Testing with Production Queries on a Cloud Database. In SIGMOD. ACM, 4:1--4:6.

[59]

Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In SIGPLAN, PLDI. ACM, 283--294.

[60]

Yixing Yang, Yixiang Fang, Maria E. Orlowska, Wenjie Zhang, and Xuemin Lin. 2021. Efficient Bi-triangle Counting for Large Bipartite Networks. VLDB, Vol. 14, 6 (2021), 984--996.

Digital Library

[61]

Rex Ying, Zhaoyu Lou, Jiaxuan You, Chengtao Wen, Arquimedes Canedo, and Jure Leskovec. 2020. Neural Subgraph Matching. CoRR, Vol. abs/2007.03092 (2020).

[62]

Rui Zhong, Yongheng Chen, Hong Hu, Hangfan Zhang, Wenke Lee, and Dinghao Wu. 2020. SQUIRREL: Testing Database Management Systems with Language Validity and Coverage Feedback. In CCS. ACM, 955--970.

Cited By

Song JDou WGao YCui ZZheng YWang DWang WWei JHuang T(2024)Detecting Metadata-Related Logic Bugs in Database Systems via Raw Database ConstructionProceedings of the VLDB Endowment10.14778/3659437.365944517:8(1884-1897)Online publication date: 31-May-2024
https://dl.acm.org/doi/10.14778/3659437.3659445
Zhang CSridharan M(2024)Step-wise Execution of Data-Centric SystemsCompanion Proceedings of the 2024 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity10.1145/3689491.3691821(19-21)Online publication date: 20-Oct-2024
https://dl.acm.org/doi/10.1145/3689491.3691821
Ba JRigger M(2024)Keep It Simple: Testing Databases via Differential Query PlansProceedings of the ACM on Management of Data10.1145/36549912:3(1-26)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654991
Show More Cited By

Index Terms

Detecting Logic Bugs of Join Optimizations in DBMS
1. Information systems
  1. Data management systems
    1. Query languages
      1. Relational database query languages
        Structured Query Language
2. Security and privacy
  1. Database and storage security

Recommendations

Testing Database Systems via Differential Query Execution
ICSE '23: Proceedings of the 45th International Conference on Software Engineering

Database Management Systems (DBMSs) provide efficient data retrieval and manipulation for many applications through Structured Query Language (SQL). Incorrect implementations of DBMSs can result in logic bugs, which cause SELECT queries to fetch ...
Skew-Aware Join Optimization for Array Databases
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query ...
Testing Gremlin-Based Graph Database Systems via Query Disassembling
ISSTA 2024: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis

Graph Database Systems (GDBs) support efficiently storing and retrieving graph data, and have become a critical component in many important applications. Many widely-used GDBs utilize the Gremlin query language to create, modify, and retrieve data in ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data

Proceedings of the ACM on Management of Data Volume 1, Issue 1

PACMMOD

May 2023

2807 pages

EISSN:2836-6573

DOI:10.1145/3603164

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Issue’s Table of Contents

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2023

Published in PACMMOD Volume 1, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Badges

Best Paper

Author Tags

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
2,273
Total Downloads

Downloads (Last 12 months)869
Downloads (Last 6 weeks)86

Reflects downloads up to 16 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Song JDou WGao YCui ZZheng YWang DWang WWei JHuang T(2024)Detecting Metadata-Related Logic Bugs in Database Systems via Raw Database ConstructionProceedings of the VLDB Endowment10.14778/3659437.365944517:8(1884-1897)Online publication date: 31-May-2024
https://dl.acm.org/doi/10.14778/3659437.3659445
Zhang CSridharan M(2024)Step-wise Execution of Data-Centric SystemsCompanion Proceedings of the 2024 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity10.1145/3689491.3691821(19-21)Online publication date: 20-Oct-2024
https://dl.acm.org/doi/10.1145/3689491.3691821
Ba JRigger M(2024)Keep It Simple: Testing Databases via Differential Query PlansProceedings of the ACM on Management of Data10.1145/36549912:3(1-26)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654991
Bergami G(2024)Towards automating microservices orchestration through data-driven evolutionary architecturesService Oriented Computing and Applications10.1007/s11761-024-00387-x18:1(1-12)Online publication date: 27-Feb-2024
https://dl.acm.org/doi/10.1007/s11761-024-00387-x
Tang XWu SZhang DWang ZYuan GChen G(2023)A Demonstration of DLBD: Database Logic Bug Detection SystemProceedings of the VLDB Endowment10.14778/3611540.361158416:12(3914-3917)Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.14778/3611540.3611584

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents