Evaluation of Machine Learning Algorithms in Predicting the Next SQL Query from the Future

Authors:

Venkata Vamsikrishna Meduri,

Kanchan Chowdhury,

Mohamed Sarwat

ACM Transactions on Database Systems (TODS), Volume 46, Issue 1

Article No.: 4, Pages 1 - 46

https://doi.org/10.1145/3442338

Published: 18 March 2021


Abstract

Predicting a user's next SQL query during an ongoing interaction session with the database, given her sequence of queries up to the current timestep, can enable speculative query processing and increase interactivity. While existing machine learning (ML)-based approaches use recommender systems to suggest relevant queries to a user, there has been no exhaustive study on applying temporal predictors to predict the next user-issued query.

In this work, we experimentally compare ML algorithms in predicting the immediate next future query in an interaction workload, given the current user query or the sequence of queries in a user session thus far. As a part of this, we propose the adaptation of two powerful temporal predictors: (a) Recurrent Neural Networks (RNNs) and (b) a Reinforcement Learning approach called Q-Learning that uses Markov Decision Processes. We represent each query as a comprehensive set of fragment embeddings that not only captures the SQL operators, attributes, and relations but also the arithmetic comparison operators and constants that occur in the query. Our experiments on two real-world datasets show the effectiveness of temporal predictors against the baseline recommender systems in predicting the structural fragments in a query w.r.t. both quality and time. Besides showing that RNNs can be used to synthesize novel queries, we find that exact Q-Learning outperforms RNNs despite predicting the next query entirely from the historical query logs.
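To make the Q-Learning idea concrete, the following is a minimal sketch of tabular Q-Learning over logged query sessions. It is not the authors' implementation. It assumes that each query is reduced to a set of fragments, that both states and actions are queries seen in the log, and that the reward is 1.0 for a logged transition. The function names, fragment strings, and hyperparameters are all hypothetical.

```python
from collections import defaultdict


def train_q_table(sessions, epochs=20, alpha=0.5, gamma=0.9):
    """Tabular Q-learning over logged sessions (illustrative sketch).

    Each query is a frozenset of fragment strings. The state is the current
    query and the action is the next query. The reward of 1.0 per logged
    transition is an assumption made for this example.
    """
    q = defaultdict(float)      # (state, action) -> estimated value
    actions = defaultdict(set)  # state -> next queries observed in the log

    # Collect the transitions that actually occur in the historical log.
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            actions[cur].add(nxt)

    # Replay the log repeatedly, applying the standard Q-learning update.
    for _ in range(epochs):
        for session in sessions:
            for cur, nxt in zip(session, session[1:]):
                best_next = max((q[(nxt, a)] for a in actions[nxt]), default=0.0)
                target = 1.0 + gamma * best_next
                q[(cur, nxt)] += alpha * (target - q[(cur, nxt)])
    return q, actions


def predict_next(q, actions, state):
    """Return the logged successor with the highest Q-value, or None."""
    candidates = actions.get(state)
    if not candidates:
        return None
    # Ties are broken by the sorted fragment list to keep results deterministic.
    return max(candidates, key=lambda a: (q[(state, a)], sorted(a)))


# Hypothetical fragment sets standing in for three parsed SQL queries.
q1 = frozenset({"SELECT:price", "FROM:orders"})
q2 = frozenset({"SELECT:price", "FROM:orders", "WHERE:price>"})
q3 = frozenset({"SELECT:qty", "FROM:orders"})

sessions = [[q1, q2, q3], [q1, q2], [q1, q3]]
q_table, seen_actions = train_q_table(sessions)
print(sorted(predict_next(q_table, seen_actions, q1)))
```

Because the table only holds transitions that occur in the log, this sketch can only return queries it has already seen, which mirrors the abstract's remark that Q-Learning predicts entirely from historical query logs.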

Supplementary Material

a4-meduri-apndx.pdf (meduri.zip)

Supplemental movie, appendix, image, and software files for "Evaluation of Machine Learning Algorithms in Predicting the Next SQL Query from the Future"


Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        1. Temporal data
      2. Relational database model

Information & Contributors

Information

Published In


ACM Transactions on Database Systems Volume 46, Issue 1

March 2021

143 pages

ISSN: 0362-5915

EISSN: 1557-4644

DOI: 10.1145/3457891

Editor:
Christopher Jermaine
Rice University, USA


Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 March 2021

Accepted: 01 December 2020

Revised: 01 September 2020

Received: 01 February 2020

Published in TODS Volume 46, Issue 1


Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF (National Science Foundation)
