DOI: 10.1145/3545008.3545027

Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud

Published: 13 January 2023
Abstract

    Hyperparameter tuning (HPT), which chooses a set of optimal hyperparameters for a learning algorithm, is critical to machine learning training. Unfortunately, current resource provisioning approaches for HPT are unable to adjust resources adaptively according to the upward trends of HPT accuracy at runtime, resulting in low GPU utilization or low HPT accuracy. On the other hand, dynamic resource provisioning approaches based on checkpointing are inefficient for HPT because of the high overhead of context switching and job restarting.
    This paper presents DISC, an adaptive and efficient HPT service with GPU time sharing for the cloud, which aims to improve GPU utilization and HPT accuracy. DISC provides potential-aware adaptive GPU scaling that adjusts the size of the GPU time slices occupied by HPT jobs at runtime based on the upward trends of HPT accuracy. The dynamic allocation of GPU time slices is formalized as an optimization problem and tackled with an effective heuristic algorithm. Further, DISC achieves temporal and spatial sharing of GPU memory according to the memory usage patterns of HPT jobs. It designs a time-slice early-release mechanism with relaxed PACK scheduling to improve memory utilization while avoiding GPU memory overflow due to time sharing. DISC is implemented upon the Kubeflow and Kubernetes ecosystem. We evaluate it on a subset of the Microsoft Philly Trace with public datasets. Experimental results show that DISC improves the average job completion time by 1.15x compared to a naïve approach and HPT accuracy by 1.58x compared to a state-of-the-art early-stopping approach.
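
    To make the potential-aware scaling idea concrete, the sketch below shows a toy allocator that splits a GPU's time slices among HPT trials in proportion to their recent accuracy improvement. It is only an illustration of the general idea from the abstract, assuming a simple trend-based "potential" metric and proportional splitting; the job names, window size, and weighting are hypothetical and are not DISC's actual formulation or heuristic.

    # Illustrative sketch only: a toy potential-aware allocator of GPU time slices,
    # loosely inspired by the idea described in the abstract. The potential metric,
    # window size, and proportional split are assumptions, not DISC's algorithm.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class HPTJob:
        name: str
        accuracy_history: List[float] = field(default_factory=list)

        def potential(self, window: int = 3) -> float:
            # Estimate the upward trend of accuracy over the last few reports.
            recent = self.accuracy_history[-window:]
            if len(recent) < 2:
                return 1.0  # no trend yet: treat the job as having full potential
            return max(recent[-1] - recent[0], 0.0)

    def allocate_slices(jobs: List[HPTJob], total_slices: int) -> Dict[str, int]:
        # Split the GPU's time slices among jobs in proportion to their potential.
        # (Rounding may leave a slice unassigned in this toy version.)
        potentials = {job.name: job.potential() for job in jobs}
        total = sum(potentials.values())
        if total == 0.0:
            # Every job has plateaued: fall back to an even split.
            return {job.name: total_slices // len(jobs) for job in jobs}
        return {name: round(total_slices * p / total) for name, p in potentials.items()}

    if __name__ == "__main__":
        trials = [
            HPTJob("trial-a", [0.60, 0.72, 0.80]),  # still improving quickly
            HPTJob("trial-b", [0.78, 0.79, 0.79]),  # nearly plateaued
        ]
        print(allocate_slices(trials, total_slices=10))  # e.g. {'trial-a': 10, 'trial-b': 0}

    In DISC itself, the allocation is posed as an optimization problem solved with a heuristic algorithm and is combined with the time-slice early-release mechanism so that concurrent jobs do not overflow GPU memory.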


    Cited By

    • (2024) A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE Transactions on Cloud Computing 12(1), 53–69. DOI: 10.1109/TCC.2023.3336540. Online publication date: Jan 2024.


        Published In

        ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
        August 2022
        976 pages
        ISBN:9781450397339
        DOI:10.1145/3545008

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. GPU sharing
        2. hyperparameter tuning
        3. resource provisioning

        Qualifiers

        • Research-article
        • Research
        • Refereed limited


        Conference

        ICPP '22
        ICPP '22: 51st International Conference on Parallel Processing
        August 29 - September 1, 2022
        Bordeaux, France

        Acceptance Rates

        Overall Acceptance Rate 91 of 313 submissions, 29%

