Real-Time Workload Pattern Analysis for Large-Scale Cloud Databases

Authors:

Yunjun Gao

Proceedings of the VLDB Endowment, Volume 16, Issue 12

Pages 3689 - 3701

https://doi.org/10.14778/3611540.3611557

Published: 01 August 2023

Abstract

Hosting database services on cloud systems has become common practice. This has led to an increasing volume of database workloads, which provides an opportunity for pattern analysis. Discovering workload patterns from a business logic perspective helps in understanding the trends and characteristics of a database system. However, existing workload pattern discovery systems are not suitable for the large-scale cloud databases commonly used in industry, because the workload patterns of such databases are generally far more complicated than those of ordinary databases.

In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes and discovers the SQL query patterns logged from user requests and optimizes query processing based on the discovered patterns. First, the Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings with rich semantic contexts and execution features. Next, the Online Workload Mining Module separates the encoded queries by business group and discovers the workload patterns for each group. Meanwhile, the Offline Training Module collects labels and uses them to train the classification model. Finally, the Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting the discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduces the latency of online inference by 22%, compared with the state of the art.
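The flow described above (collect logged queries, separate them by business group, discover patterns per group) can be illustrated with a deliberately simplified sketch. It is not AWM's implementation. It assumes each log record already carries a business tag, whereas the abstract says a trained classification model separates the queries. It also stands in literal-stripped SQL templates and frequency counts for the learned embeddings and the pattern discovery step. The names `to_template`, `mine_patterns`, and the sample log are hypothetical.

```python
import re
from collections import Counter, defaultdict


def to_template(sql: str) -> str:
    """Replace literals with '?' so queries that differ only in constants match."""
    sql = re.sub(r"'[^']*'", "?", sql)           # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def mine_patterns(log, top_k=2):
    """Group templates by business tag and keep the top_k most frequent per group.

    Assumption: the business tag is given with each record; a real system
    would have to infer it, for example with a classifier.
    """
    groups = defaultdict(Counter)
    for business, sql in log:
        groups[business][to_template(sql)] += 1
    return {b: [t for t, _ in c.most_common(top_k)] for b, c in groups.items()}


# Hypothetical query log: (business tag, raw SQL text)
log = [
    ("orders", "SELECT * FROM orders WHERE id = 17"),
    ("orders", "SELECT * FROM orders WHERE id = 42"),
    ("orders", "UPDATE orders SET status = 'paid' WHERE id = 42"),
    ("users", "SELECT name FROM users WHERE email = 'a@b.c'"),
]
patterns = mine_patterns(log)
```

In this toy log, the two `SELECT` queries on `orders` collapse into one template, which becomes the most frequent pattern of that group. Frequency counting over templates is only a placeholder here for whatever discovery method the full paper uses.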

Published In

Proceedings of the VLDB Endowment Volume 16, Issue 12

August 2023

685 pages

ISSN:2150-8097

Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)

Publisher

VLDB Endowment
