research-article

Open access

Sibyl: Forecasting Time-Evolving Query Workloads

Authors: Tarique Siddiqui, Jesús Camacho-Rodríguez, Yuanyuan Tian

Proceedings of the ACM on Management of Data, Volume 2, Issue 1

Article No.: 53, Pages 1 - 27

https://doi.org/10.1145/3639308

Published: 26 March 2024

Abstract

Database systems often rely on historical query traces to perform workload-based performance tuning. However, real production workloads evolve over time, making historical queries ineffective for optimizing future workloads. To address this challenge, we propose SIBYL, an end-to-end machine learning-based framework that accurately forecasts a sequence of future queries, including the entire query statements, over various prediction windows. Drawing insights from real workloads, we propose template-based featurization techniques and develop a stacked LSTM with an encoder-decoder architecture for accurate forecasting of query workloads. We also develop techniques to improve forecasting accuracy over large prediction windows and to achieve high scalability over large workloads with high variability in query arrival rates. Finally, we propose techniques to handle workload drifts. Our evaluation on four real workloads demonstrates that SIBYL can forecast workloads with an 87.3% median F1 score, and can yield 1.7× and 1.3× performance improvements when applied to materialized view selection and index selection, respectively.
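To make the notion of a query template concrete, the following is a minimal sketch, assuming that a template is obtained by replacing literal constants in a SQL statement with `?` parameter markers. The helper `to_template` and its regular expressions are hypothetical illustrations and are not taken from the paper; the featurization used by SIBYL may differ.

```python
import re
from collections import Counter

# Hypothetical patterns (assumption, not from the paper):
# single-quoted SQL strings (with '' as an escaped quote) and plain numbers.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def to_template(query: str) -> str:
    """Replace literal constants with '?' markers and collapse whitespace."""
    templated = _STRING_LITERAL.sub("?", query)
    templated = _NUMBER_LITERAL.sub("?", templated)
    return " ".join(templated.split())


# Toy trace: the first two queries differ only in their parameter value.
queries = [
    "SELECT name FROM users WHERE id = 42",
    "SELECT name FROM users WHERE id = 7",
    "SELECT * FROM orders WHERE status = 'open' AND total > 99.5",
]

# Count how often each template occurs in the trace.
template_counts = Counter(to_template(q) for q in queries)

for template, count in template_counts.items():
    print(count, template)
```

Under this assumption, queries that differ only in their constants collapse into a single template, so a forecaster could reason about a small set of templates and their parameter values rather than about raw query strings.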



Index Terms

Information systems → Data management systems

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 1

SIGMOD

February 2024

1874 pages

EISSN: 2836-6573

DOI: 10.1145/3654807

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 March 2024

Published in PACMMOD Volume 2, Issue 1
