DOI: 10.1145/3545008.3545027

Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud

Published: 13 January 2023
Abstract

    Hyperparameter tuning (HPT), which chooses a set of optimal hyperparameters for a learning algorithm, is critical to machine learning training. Unfortunately, current resource provisioning approaches for HPT are unable to adjust resources adaptively according to the upward trends of HPT accuracy at runtime, resulting in low GPU utilization or low HPT accuracy. On the other hand, dynamic resource provisioning approaches based on checkpointing are inefficient for HPT because of the high overhead of context switching and job restarting.
    This paper presents DISC, an adaptive and efficient HPT service with GPU time sharing for the cloud, which aims to improve GPU utilization and HPT accuracy. DISC provides potential-aware adaptive GPU scaling that adjusts the size of the GPU time slices occupied by HPT jobs at runtime based on the upward trends of HPT accuracy. The dynamic allocation of GPU time slices is formalized as an optimization problem and tackled with an effective heuristic algorithm. Further, DISC achieves temporal and spatial sharing of GPU memory according to the memory usage patterns of HPT jobs. It designs a time-slice early-release mechanism with relaxed PACK scheduling to improve memory utilization while avoiding GPU memory overflow due to time sharing. DISC is implemented upon the Kubeflow and Kubernetes ecosystem. We evaluate it on a subset of the Microsoft Philly Trace with public datasets. Experimental results show that DISC improves the average job completion time by 1.15x compared to a naïve approach and HPT accuracy by 1.58x compared to a state-of-the-art early-stopping approach.
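
    To make the potential-aware scaling idea concrete, the sketch below shows a toy allocator that splits a GPU's time slices among HPT trials in proportion to their recent accuracy improvement. It is only an illustration of the general idea from the abstract, assuming a simple trend-based "potential" metric and proportional splitting; the job names, window size, and weighting are hypothetical and are not DISC's actual formulation or heuristic.

    # Illustrative sketch only: a toy potential-aware allocator of GPU time slices,
    # loosely inspired by the idea described in the abstract. The potential metric,
    # window size, and proportional split are assumptions, not DISC's algorithm.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class HPTJob:
        name: str
        accuracy_history: List[float] = field(default_factory=list)

        def potential(self, window: int = 3) -> float:
            # Estimate the upward trend of accuracy over the last few reports.
            recent = self.accuracy_history[-window:]
            if len(recent) < 2:
                return 1.0  # no trend yet: treat the job as having full potential
            return max(recent[-1] - recent[0], 0.0)

    def allocate_slices(jobs: List[HPTJob], total_slices: int) -> Dict[str, int]:
        # Split the GPU's time slices among jobs in proportion to their potential.
        # (Rounding may leave a slice unassigned in this toy version.)
        potentials = {job.name: job.potential() for job in jobs}
        total = sum(potentials.values())
        if total == 0.0:
            # Every job has plateaued: fall back to an even split.
            return {job.name: total_slices // len(jobs) for job in jobs}
        return {name: round(total_slices * p / total) for name, p in potentials.items()}

    if __name__ == "__main__":
        trials = [
            HPTJob("trial-a", [0.60, 0.72, 0.80]),  # still improving quickly
            HPTJob("trial-b", [0.78, 0.79, 0.79]),  # nearly plateaued
        ]
        print(allocate_slices(trials, total_slices=10))  # e.g. {'trial-a': 10, 'trial-b': 0}

    In DISC itself, the allocation is posed as an optimization problem solved with a heuristic algorithm and is combined with the time-slice early-release mechanism so that concurrent jobs do not overflow GPU memory.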


    Cited By

    • (2024) A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE Transactions on Cloud Computing 12(1), 53–69. DOI: 10.1109/TCC.2023.3336540. Online publication date: Jan 2024.


        Published In

        ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
        August 2022
        976 pages
        ISBN:9781450397339
        DOI:10.1145/3545008

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. GPU sharing
        2. hyperparameter tuning
        3. resource provisioning

        Qualifiers

        • Research-article
        • Research
        • Refereed limited


        Conference

        ICPP '22
        ICPP '22: 51st International Conference on Parallel Processing
        August 29 - September 1, 2022
        Bordeaux, France

        Acceptance Rates

        Overall Acceptance Rate 91 of 313 submissions, 29%

