DOI: 10.1145/3472883.3486978
research-article

Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs

Published: 01 November 2021

Abstract

Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner. Job scheduling is key to improving training performance, resource utilization, and fairness across users. Different training jobs may have different objectives and demands in terms of completion time, and how to efficiently satisfy all of these requirements has not been extensively studied.
We present Chronus, an end-to-end scheduling system that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs. Chronus is designed around the unique features of DLT jobs. (1) It leverages the intra-job predictability of DLT processes to profile jobs efficiently and estimate their runtime speed under dynamic resource scaling. (2) It takes advantage of DLT preemption to select jobs with a lease-based training scheme. (3) It accounts for the placement sensitivity of DLT jobs, allocating resources with new consolidation and local-search strategies. Large-scale simulations on real-world job traces show that Chronus reduces the deadline miss rate of SLO jobs by up to 14.7x and the completion time of best-effort jobs by up to 19.9x compared to existing schedulers. We also implement a prototype of Chronus atop Kubernetes in a cluster of 120 GPUs to validate its practicality.
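To illustrate the lease-based idea the abstract describes, here is a minimal sketch in Python. This is not the authors' actual algorithm: job and function names, the one-GPU-per-job simplification, and the earliest-deadline-first policy for SLO jobs with shortest-remaining-first backfill of best-effort jobs are all assumptions made for this example. Each lease term, the scheduler re-decides which jobs run, which is what makes preemption-based deadline guarantees possible.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    remaining: int                    # lease terms of work left
    deadline: Optional[int]           # term by which an SLO job must finish; None = best-effort

def schedule_term(jobs, gpus):
    """Pick at most `gpus` jobs to run for the next lease term.
    SLO jobs are admitted first in earliest-deadline order; leftover
    GPUs are backfilled with best-effort jobs (shortest remaining first)."""
    slo = sorted((j for j in jobs if j.deadline is not None and j.remaining > 0),
                 key=lambda j: j.deadline)
    be = sorted((j for j in jobs if j.deadline is None and j.remaining > 0),
                key=lambda j: j.remaining)
    return (slo + be)[:gpus]

def run(jobs, gpus, horizon):
    """Simulate `horizon` lease terms; return (slo_misses, finish_times)."""
    finish = {}
    for t in range(horizon):
        for j in schedule_term(jobs, gpus):
            j.remaining -= 1
            if j.remaining == 0:
                finish[j.name] = t + 1    # job done at end of this term
    misses = sum(1 for j in jobs
                 if j.deadline is not None
                 and finish.get(j.name, horizon + 1) > j.deadline)
    return misses, finish

# Hypothetical workload: two SLO jobs and one best-effort job on a single GPU.
jobs = [Job("slo-a", remaining=2, deadline=3),
        Job("slo-b", remaining=2, deadline=4),
        Job("be-c",  remaining=1, deadline=None)]
misses, finish = run(jobs, gpus=1, horizon=6)
print(misses, finish)  # 0 {'slo-a': 2, 'slo-b': 4, 'be-c': 5}
```

With one GPU, the scheduler runs slo-a to completion first (earlier deadline), then slo-b, and only then the best-effort job, so both deadlines are met. The real Chronus additionally profiles jobs to estimate runtime and solves a placement problem; this sketch captures only the lease-term selection step.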

Supplementary Material

MP4 File (Day4_13_1_WeiGao.mp4)
Presentation video



    Published In

    SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
    November 2021
    685 pages
    ISBN:9781450386388
    DOI:10.1145/3472883


Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. Cluster Management System
    2. Deadline-aware Scheduler
    3. Deep Learning Training
    4. GPU Datacenter

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '21
    Sponsor:
    SoCC '21: ACM Symposium on Cloud Computing
    November 1 - 4, 2021
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%


    Article Metrics

    • Downloads (last 12 months): 225
    • Downloads (last 6 weeks): 27
    Reflects downloads up to 26 Jan 2025

    Cited By

    • (2025) GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads. IEEE Transactions on Parallel and Distributed Systems 36(2):168-184. DOI: 10.1109/TPDS.2024.3470074
    • (2024) When Will My ML Job Finish? Toward Providing Completion Time Estimates through Predictability-Centric Scheduling. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, 487-505. DOI: 10.5555/3691938.3691964
    • (2024) Training Efficiency Optimization Algorithm of Wireless Federated Learning Based on Processor Performance and Network Condition Awareness. EURASIP Journal on Advances in Signal Processing 2024:1. DOI: 10.1186/s13634-024-01192-6
    • (2024) ETS: Deep Learning Training Iteration Time Prediction Based on Execution Trace Sliding Window. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 56-68. DOI: 10.1145/3625549.3658658
    • (2024) Non-Clairvoyant Scheduling of Distributed Machine Learning with Inter-Job and Intra-Job Parallelism on Heterogeneous GPUs. IEEE Transactions on Cloud Computing 12(4):1011-1025. DOI: 10.1109/TCC.2024.3414440
    • (2024) UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands. IEEE Transactions on Computers 73(6):1500-1515. DOI: 10.1109/TC.2024.3371794
    • (2024) vTrain: A Simulation Framework for Evaluating Cost-Effective and Compute-Optimal Large Language Model Training. 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 153-167. DOI: 10.1109/MICRO61859.2024.00021
    • (2024) Online VM Service Selection with Spot Cores for Dynamic Workloads. 2024 IEEE Cloud Summit, 54-60. DOI: 10.1109/Cloud-Summit61220.2024.00016
    • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757
    • (2023) Towards GPU Memory Efficiency for Distributed Training at Scale. Proceedings of the 2023 ACM Symposium on Cloud Computing, 281-297. DOI: 10.1145/3620678.3624661
