research-article

Facilitating SQL Query Composition and Analysis

Authors:

Zainab Zolaktaf,

Mostafa Milani,

Rachel PottingerAuthors Info & Claims

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 209 - 224

https://doi.org/10.1145/3318464.3380602

Published: 31 May 2020 Publication History

Abstract

Formulating efficient SQL queries requires several cycles of tuning and execution. We examine methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution. We achieve this by predicting properties such as the query answer size, its run-time, and error class. Unlike existing approaches, our approach does not rely on any statistics from the database instance or query execution plans. Our approach is based on using data-driven machine learning techniques that rely on large query workloads to model SQL queries and their properties. Empirical results show that the neural network models are more accurate in predicting several query properties.

Supplementary Material

MP4 File (3318464.3380602.mp4)

Presentation Video

Download
160.26 MB

References

[1]

Mert Akdere, Ugur Cetintemel, Matteo Riondato, Eli Upfal, and Stanley B Zdonik. 2012. Learning-Based Query Performance Modeling and Prediction. In ICDE. 390--401.

[2]

Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys, Vol. 51, 4 (2018), 81.

Digital Library

[3]

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. ArXiv, Vol. 1803.01271 (2018).

[4]

Yoshua Bengio. 2011. Deep Learning of Representations for Unsupervised and Transfer Learning. In UTLW. 17--37.

[5]

Sonia Bergamaschi, Elton Domnori, Francesco Guerra, Raquel Trillo Lado, and Yannis Velegrakis. 2011. Keyword Search over Relational Databases: A Metadata Approach. In SIGMOD. 565--576.

[6]

Gloria Chatzopoulou, Magdalini Eirinaki, and Neoklis Polyzotis. 2009. Query Recommendations for Interactive Database Exploration. In SSDBM. 3--18.

[7]

Surajit Chaudhuri, Ashish Kumar Gupta, and Vivek Narasayya. 2002. Compressing SQL Workloads. In SIGMOD. 488--499.

[8]

Alexis Conneau, Holger Schwenk, Lo"ic Barrault, and Yann Lecun. 2016. Very Deep Convolutional Networks for Text Classification. ArXiv, Vol. 1606.01781 (2016).

[9]

Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. 2014. Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration. In SIGMOD. 517--528.

[10]

Bailu Ding, Sudipto Das, Ryan Marcus, Wentao Wu, Surajit Chaudhuri, and Vivek Narasayya. 2019. AI Meets AI: Leveraging Query Executions to Improve Index Recommendations. In SIGMOD. 1241--1258.

[11]

Magdalini Eirinaki, Suju Abraham, Neoklis Polyzotis, and Naushin Shaikh. 2014. QueRIE: Collaborative Database Exploration. TKDE, Vol. 26, 7 (2014), 1778--1790.

[12]

Ju Fan, Guoliang Li, and Lizhu Zhou. 2011. Interactive SQL Query Suggestion: Making Databases User-Friendly. In ICDE. 351--362.

[13]

Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L Wiener, Armando Fox, Michael Jordan, and David Patterson. 2009. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE. 592--603.

[14]

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning .MIT Press.

Digital Library

[15]

Google. 2010. Big Query. https://cloud.google.com/bigquery/.

[16]

Hakan Hacigümücs, Bala Iyer, Chen Li, and Sharad Mehrotra. 2002. Executing SQL over Encrypted Data in the Database-Service-Provider Model. In SIGMOD. 216--227.

[17]

Peter J Huber et al. 1964. Robust estimation of a location parameter. The Annals of Mathematical Statistics, Vol. 35, 1 (1964), 73--101.

[18]

Amazon Inc. 2012a. Amazon Redshift. https://aws.amazon.com/redshift/.

[19]

Snowflake Inc. 2012b. Snowflake. https://www.snowflake.com/.

[20]

HV Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi, and Cong Yu. 2007. Making Database Systems Usable. In SIGMOD. 13--24.

[21]

Shrainik Jain and Bill Howe. 2018. Query2Vec: NLP Meets Databases for Generalized Workload Analytics. ArXiv, Vol. 1801.05613 (2018).

[22]

Shrainik Jain, Dominik Moritz, Daniel Halperin, Bill Howe, and Ed Lazowska. 2016. SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment. In SIGMOD. 281--293.

Digital Library

[23]

Shrainik Jain, Jiaqi Yan, Thierry Cruane, and Bill Howe. 2018. Database-Agnostic Workload Management. ArXiv, Vol. 1808.08355 (2018).

[24]

Magesh Jayapandian and HV Jagadish. 2009. Automating the Design and Construction of Query Forms. TKDE, Vol. 21, 10 (2009), 1389--1402.

Digital Library

[25]

Rie Johnson and Tong Zhang. 2014. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. ArXiv, Vol. 1412.1058 (2014).

[26]

Rie Johnson and Tong Zhang. 2015. Semi-Supervised Convolutional Neural Networks for Text Categorization via Region Embedding. In Advances in Neural Information Processing Systems. 919--927.

[27]

Rie Johnson and Tong Zhang. 2016a. Convolutional neural networks for text categorization: Shallow Word-Level vs. Deep Character-Level. ArXiv, Vol. 1609.00718 (2016).

[28]

Rie Johnson and Tong Zhang. 2016b. Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings. In ICML. 526--534.

[29]

Rie Johnson and Tong Zhang. 2017. Deep Pyramid Convolutional Neural Networks for Text Categorization. In ACL. 562--570.

[30]

Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. 2010. SnipSuggest: Context-Aware Autocompletion for SQL. PVLDB, Vol. 4, 1 (2010), 22--33.

[31]

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. ArXiv, Vol. 1408.5882 (2014).

[32]

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-Aware Neural Language Models. In AAAI. 2741--2749.

[33]

Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. ArXiv, Vol. 1412.6980 (2014).

[34]

Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, and Shambhu Upadhyaya. 2016. Ettu: Analyzing Query Intents in Corporate Databases. In WWW. 463--466.

Digital Library

[35]

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural computation, Vol. 1, 4 (1989), 541--551.

[36]

Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? PVLDB, Vol. 9, 3 (2015), 204--215.

Digital Library

[37]

Fei Li and HV Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. PVLDB, Vol. 8, 1 (2014), 73--84.

Digital Library

[38]

Jiexing Li, Arnd Christian König, Vivek Narasayya, and Surajit Chaudhuri. 2012. Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques. PVLDB, Vol. 5, 11 (2012), 1555--1566.

Digital Library

[39]

Henry Liu, Mingbin Xu, Ziting Yu, Vincent Corvinelli, and Calisto Zuzarte. 2015. Cardinality Estimation using Neural Networks. In CASCON. 53--59.

[40]

Hoang Vu Nguyen, Klemens Böhm, Florian Becker, Bertrand Goldman, Georg Hinkel, and Emmanuel Müller. 2015. Identifying User Interests within the Data Space-a Case Study with SkyServer. In EDBT. 641--652.

[41]

William O'Mullane, Nolan Li, Maria Nieto-Santisteban, Alex Szalay, Ani Thakar, and Jim Gray. 2005. Batch is Back: CasJobs, Serving Multi-TB Data on the Web. In ICWS. 33--40.

[42]

Terence Parr. 2013. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf.

[43]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-Learn: Machine Learning in Python. JMLR, Vol. 12 (2011), 2825--2830.

Digital Library

[44]

M Jordan Raddick, Ani R Thakar, Alexander S Szalay, and Rafael DC Santos. 2014a. Ten Years of SkyServer I: Tracking Web and SQL e-Science Usage. Computing in Science & Engineering, Vol. 16, 4 (2014), 22--31.

[45]

M Jordan Raddick, Ani R Thakar, Alexander S Szalay, and Rafael DC Santos. 2014b. Ten Years of SkyServer II: How Astronomers and the Public Have Embraced e-Science. Computing in Science & Engineering, Vol. 16, 4 (2014), 32--40.

[46]

Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, and Lev Novik. 2014. Discovering Queries Based on Example Tuples. In SIGMOD. 493--504.

[47]

Vik Singh, Jim Gray, Ani Thakar, Alexander S Szalay, Jordan Raddick, Bill Boroski, Svetlana Lebedeva, and Brian Yanny. 2007. Skyserver Traffic Report - The First Five Years. ArXiv, Vol. 0701173 (2007).

[48]

Kostas Stefanidis, Marina Drosou, and Evaggelia Pitoura. 2009. “You May Also Like” Results in Relational Databases. In PersDB. 37--42.

[49]

Alexander S Szalay. 2018. From SkyServer to SciServer. AAPSS, Vol. 675, 1 (2018), 202--220.

[50]

Alexander S Szalay, Jim Gray, Ani R Thakar, Peter Z Kunszt, Tanu Malik, Jordan Raddick, Christopher Stoughton, and Jan vandenBerg. 2002. The SDSS Skyserver: Public Access to the Sloan Digital Sky Server Data. In SIGMOD. 570--581.

[51]

Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. ArXiv, Vol. 1503.00075 (2015).

[52]

TPC. 2018. TPC Benchmarks. http://www.tpc.org.

[53]

Wei Wang, Meihui Zhang, Gang Chen, H. V. Jagadish, Beng Chin Ooi, and Kian-Lee Tan. 2016. Database Meets Deep Learning: Challenges and Opportunities. SIGMOD Rec., Vol. 45, 2 (2016), 17--22.

Digital Library

[54]

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative Study of CNN and RNN for Natural Language Processing. ArXiv, Vol. 1702.01923 (2017).

[55]

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How Transferable Are Features in Deep Neural Networks?. In NIPS. 3320--3328.

[56]

Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent Trends in Deep Learning Based Natural Language Processing. IEEE Computational Intelligence Magazine, Vol. 13, 3 (2018), 55--75.

[57]

Wojciech Zaremba and Ilya Sutskever. 2014. Learning to Execute. ArXiv, Vol. 1410.4615 (2014).

[58]

Zhong Zeng, Zhifeng Bao, Thuy Ngoc Le, Mong Li Lee, and Wang Tok Ling. 2014. ExpressQ: Identifying Keyword Context and Search Target in Relational Keyword Queries. In CIKM. 31--40.

[59]

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-Level Convolutional Networks for Text Classification. In NIPS. 649--657.

[60]

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries From Natural Language using Reinforcement Learning. ArXiv, Vol. 1709.00103 (2017).

[61]

Zainab Zolaktaf, Mostafa Milani, and Rachel Pottinger. 2020. Facilitating SQL Query Composition and Analysis. ArXiv, Vol. 2002.09091 (2020).

Digital Library

Cited By

Warnke BMartens KWinker TGroppe SGroppe JAdhiyaman PSrinivasan SKrishnakumar S(2024)ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQLBig Data and Cognitive Computing10.3390/bdcc80700718:7(71)Online publication date: 27-Jun-2024
https://doi.org/10.3390/bdcc8070071
Park JFang YTa CZhang GIdnay BChen FFeng DShyu RGordon ESpotnitz MWeng C(2024)Criteria2Query 3.0: Leveraging generative large language models for clinical trial eligibility query generationJournal of Biomedical Informatics10.1016/j.jbi.2024.104649154(104649)Online publication date: Jun-2024
https://doi.org/10.1016/j.jbi.2024.104649
Shi LLi TWei LTao YLi CGao Y(2023)FASTune: Towards Fast and Stable Database Tuning System with Reinforcement LearningElectronics10.3390/electronics1210216812:10(2168)Online publication date: 10-May-2023
https://doi.org/10.3390/electronics12102168
Show More Cited By

Index Terms

Facilitating SQL Query Composition and Analysis
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

SQL query optimization: reordering for a general class of queries

The strength of commercial query optimizers like DB2 comes from their ability to select an optimal order by generating all equivalent reorderings of binary operators. However, there are no known methods to generate all equivalent reorderings for a SQL ...
SQL query optimization: reordering for a general class of queries
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data

The strength of commercial query optimizers like DB2 comes from their ability to select an optimal order by generating all equivalent reorderings of binary operators. However, there are no known methods to generate all equivalent reorderings for a SQL ...
SQL-based Email Composition and Query Synthesis in RMX
iiWAS2018: Proceedings of the 20th International Conference on Information Integration and Web-based Applications & Services

The Rule-based e-Mail eXchange system, also known as RMX, is an email transfer agent that transfers email based on user-defined delivery rules which are expressed as parameterized SQL queries. In this paper, we introduce insertion rules, which are also ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs:
David Maier
Portland State University, USA
,
Rachel Pottinger
University of British Columbia, Canada
,
Program Chairs:
AnHai Doan
University of Wisconsin, USA
,
Wang-Chiew Tan
Megagon Labs, USA
,
Publications Chairs:
Abdussalam Alawini
University of Illinois at Urbana-Champaign, USA
,
Hung Q. Ngo
RelationalAI, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2020

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

OR, Portland, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

10
Total Citations
View Citations
656
Total Downloads

Downloads (Last 12 months)63
Downloads (Last 6 weeks)4

Reflects downloads up to 30 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Warnke BMartens KWinker TGroppe SGroppe JAdhiyaman PSrinivasan SKrishnakumar S(2024)ReJOOSp: Reinforcement Learning for Join Order Optimization in SPARQLBig Data and Cognitive Computing10.3390/bdcc80700718:7(71)Online publication date: 27-Jun-2024
https://doi.org/10.3390/bdcc8070071
Park JFang YTa CZhang GIdnay BChen FFeng DShyu RGordon ESpotnitz MWeng C(2024)Criteria2Query 3.0: Leveraging generative large language models for clinical trial eligibility query generationJournal of Biomedical Informatics10.1016/j.jbi.2024.104649154(104649)Online publication date: Jun-2024
https://doi.org/10.1016/j.jbi.2024.104649
Shi LLi TWei LTao YLi CGao Y(2023)FASTune: Towards Fast and Stable Database Tuning System with Reinforcement LearningElectronics10.3390/electronics1210216812:10(2168)Online publication date: 10-May-2023
https://doi.org/10.3390/electronics12102168
Warnke BFischer SGroppe S(2023)Using Machine Learning and Routing Protocols for Optimizing Distributed SPARQL Queries in CollaborationComputers10.3390/computers1210021012:10(210)Online publication date: 17-Oct-2023
https://doi.org/10.3390/computers12100210
Wang JLi TWang ALiu XChen LChen JLiu JWu JLi FGao Y(2023)Real-Time Workload Pattern Analysis for Large-Scale Cloud DatabasesProceedings of the VLDB Endowment10.14778/3611540.361155716:12(3689-3701)Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.14778/3611540.3611557
de Carvalho TMartins Dde Lima Neto F(2023)Human-in-the-Loop Evolution of Database Views for Data Exploration2023 IEEE Latin American Conference on Computational Intelligence (LA-CCI)10.1109/LA-CCI58595.2023.10409423(1-6)Online publication date: 29-Oct-2023
https://doi.org/10.1109/LA-CCI58595.2023.10409423
Aleyasen AMorcos MAntova LSugiyama MKorablev DPatvarczki JMutreja RDuller MWaas FWinslett MIves ZBonifati AEl Abbadi A(2022)Intelligent Automated Workload Analysis for Database ReplatformingProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526050(2273-2285)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3526050
Laptev NTao WKomurlu CXu JSun DLux TMi L(2022)Smarter Warehouse2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)10.1109/ICDEW55742.2022.00005(1-8)Online publication date: May-2022
https://doi.org/10.1109/ICDEW55742.2022.00005
Zhi Kang JGaurav Tan SCheng FSun SHe BLi GLi ZIdreos SSrivastava D(2021)Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query WorkloadProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457546(1014-1022)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457546
Remil YBendimerad AMathonat RChaleat PKaytoue MGrundy J(2021)"What makes my queries slow?"Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE51524.2021.9678915(642-652)Online publication date: 15-Nov-2021
https://dl.acm.org/doi/10.1109/ASE51524.2021.9678915

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents