
ADSTS: Automatic Distributed Storage Tuning System Using Deep Reinforcement Learning

Published: 13 January 2023
DOI: 10.1145/3545008.3545012

Abstract

Modern distributed storage systems, with their immense configuration spaces, unpredictable workloads, and costly performance evaluation, place high demands on parameter tuning, and an automatic tuning solution for such systems is in demand. Many studies have attempted to build automatic tuning systems based on deep reinforcement learning (RL), but they face several limitations: they do not preprocess the parameter space, they rely on less advanced RL models, and their training is time-consuming and unstable. In this paper, we present and evaluate ADSTS, an automatic distributed storage tuning system based on deep RL. We first propose a general preprocessing guideline that generates a standardized tunable-parameter domain: Recursive Stratified Sampling, designed to avoid the non-incremental nature of conventional stratified sampling, samples the huge parameter space, and Lasso regression identifies the important parameters. The twin-delayed deep deterministic policy gradient (TD3) method is then used to find optimal values for the tunable parameters. Finally, Multi-processing Training and Workload-directed Model Fine-tuning accelerate model convergence. ADSTS is implemented on Park and deployed on the real-world system Ceph. Evaluation results show that ADSTS recommends near-optimal configurations and improves system performance by 1.5×∼2.5× with acceptable overheads.
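
The parameter-identification step the abstract describes, fitting Lasso regression over a sampled parameter space to find the parameters that matter, follows a standard recipe. Below is a minimal sketch of that step only, assuming synthetic configuration/throughput data and illustrative Ceph-style parameter names (neither is taken from the paper); it uses scikit-learn's cross-validated LassoCV and ranks parameters by the magnitude of their standardized coefficients.

    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)

    # Hypothetical tunable parameters (illustrative Ceph-style names).
    param_names = ["osd_op_threads", "journal_max_write_bytes",
                   "filestore_queue_max_ops", "rbd_cache_size"]

    # 200 configurations sampled from the normalized parameter space.
    X = rng.uniform(0.0, 1.0, size=(200, len(param_names)))

    # Stand-in for benchmarking each configuration (e.g., measured
    # throughput); here only parameters 0 and 2 actually matter.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.standard_normal(200)

    # Standardize so coefficient magnitudes are comparable across parameters.
    X_std = StandardScaler().fit_transform(X)

    # Cross-validated Lasso shrinks unimportant coefficients toward zero.
    lasso = LassoCV(cv=5).fit(X_std, y)

    # Rank parameters by absolute coefficient; near-zero ones can be pruned
    # from the tuning agent's action space.
    for name, coef in sorted(zip(param_names, lasso.coef_),
                             key=lambda p: abs(p[1]), reverse=True):
        print(f"{name:28s} {coef:+.3f}")

In this sketch the Lasso should assign near-zero weights to journal_max_write_bytes and rbd_cache_size, mirroring how such a pipeline would drop unimportant parameters before the TD3 agent searches the remaining space.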


Cited By

  • TIE: Fast Experiment-Driven ML-Based Configuration Tuning for In-Memory Data Analytics. IEEE Transactions on Computers 73, 5 (2024), 1233–1247. https://doi.org/10.1109/TC.2024.3365937 (online 14 February 2024)


Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN: 9781450397339
DOI: 10.1145/3545008
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Auto-tuning
  2. Distributed Storage System
  3. Parameter Identification
  4. Reinforcement Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the Creative Research Group Project of NSFC
  • the Key Research and Development Program of Guangdong Province
  • the National Natural Science Foundation of China

Conference

ICPP '22
ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
