research-article
Open access

Sibyl: Forecasting Time-Evolving Query Workloads

Published: 26 March 2024

Abstract

Database systems often rely on historical query traces to perform workload-based performance tuning. However, real production workloads evolve over time, making historical queries ineffective for optimizing future workloads. To address this challenge, we propose SIBYL, an end-to-end machine learning-based framework that accurately forecasts a sequence of future queries, including the entire query statements, over various prediction windows. Drawing insights from real workloads, we propose template-based featurization techniques and develop a stacked LSTM with an encoder-decoder architecture for accurate forecasting of query workloads. We also develop techniques to improve forecasting accuracy over large prediction windows and to achieve high scalability over large workloads with high variability in query arrival rates. Finally, we propose techniques to handle workload drifts. Our evaluation on four real workloads demonstrates that SIBYL can forecast workloads with an 87.3% median F1 score, and that it yields 1.7× and 1.3× performance improvements when applied to materialized view selection and index selection applications, respectively.
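The abstract does not spell out how query templates are derived or how the F1 score is computed, so the following Python sketch is only an illustration under stated assumptions. It assumes a template is the query text with its literals replaced by `?` parameter markers, and that F1 is measured between the bag of forecasted queries and the bag of queries that actually arrived. The function names `to_template` and `multiset_f1` and the literal-matching regex are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical literal pattern (an assumption, not the paper's parser):
# single-quoted strings first, then standalone integers or decimals.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def to_template(query):
    """Split a query into (template, parameters) by masking literals with '?'."""
    params = []

    def mask(match):
        params.append(match.group(0))
        return "?"

    template = _LITERAL.sub(mask, query)
    # Collapse whitespace so formatting differences do not create new templates.
    template = " ".join(template.split())
    return template, params


def multiset_f1(forecast, actual):
    """F1 between two bags of queries; repeated queries count separately."""
    f, a = Counter(forecast), Counter(actual)
    matched = sum((f & a).values())  # multiset intersection size
    if matched == 0:
        return 0.0
    precision = matched / sum(f.values())
    recall = matched / sum(a.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(to_template("SELECT * FROM orders WHERE id = 42 AND status = 'open'"))
    print(multiset_f1(["q1", "q2", "q2"], ["q2", "q2", "q3"]))
```

Under this reading, two queries that differ only in their constants share one template, and a forecast is scored by how many of its full query statements also appear in the actual future window.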

    Published In

    Proceedings of the ACM on Management of Data  Volume 2, Issue 1
    SIGMOD
    February 2024
    1874 pages
    EISSN:2836-6573
    DOI:10.1145/3654807
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. query workload forecasting
    2. time-series forecasting
