DOI: 10.1145/3492323.3495594

Scheduling ML training on unreliable spot instances

Published: 07 February 2022

Abstract

Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable: when demand surges for higher-priced on-demand instances, cloud providers can interrupt spot instances after only a brief alert. Such unreliability makes it challenging to use spot instances for many long-running jobs. With checkpointing and restoration, however, machine-learning (ML) training jobs are good candidates for overcoming this difficulty. In this paper, we formalize the problem of scheduling ML training jobs on transient spot instances, particularly from the perspective of an ML researcher who holds a grant or credit for renting cloud computing services for several training tasks. Such a researcher needs to partition the computational budget wisely to maximize outcome (the total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost, interruptible and high-cost, uninterruptible computation by proposing a polynomial-time algorithm based on linear-programming (LP) rounding. Building on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms and achieve the same utility with 23% to 48% of the budget that would be needed with on-demand instances alone. Moreover, the total utility we obtain is close to the theoretical upper bound under various settings, indicating close-to-optimal performance.
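
To make the budget-partitioning trade-off concrete, the following is a minimal fractional-LP sketch in Python, assuming a deliberately simplified model: each job can buy cheap spot hours (whose effective progress rate is discounted for interruptions and checkpoint overhead) or full-price on-demand hours, and the objective is total progress under a single budget. The prices, progress rate, job sizes, and variable layout are hypothetical illustrations; this is not the paper's actual LP formulation or its rounding scheme.

```python
# Illustrative fractional LP (a sketch, not the paper's model): split a cloud
# budget between interruptible spot hours and reliable on-demand hours so that
# total useful progress across jobs is maximized. All numbers are made up.
import numpy as np
from scipy.optimize import linprog

P_SPOT, P_OD = 0.3, 1.0   # assumed $/hour for spot vs. on-demand
R_SPOT = 0.7              # assumed effective progress per spot hour
                          # (discounted for interruptions/checkpoint overhead)
BUDGET = 100.0
work = np.array([40.0, 60.0, 25.0])  # hours of compute each job still needs
J = len(work)

# Decision variables x = [s_1..s_J, d_1..d_J, u_1..u_J]:
# spot hours, on-demand hours, and realized utility (progress) per job.
# Maximize sum_j u_j  <=>  minimize -sum_j u_j.
c = np.concatenate([np.zeros(2 * J), -np.ones(J)])

A_ub, b_ub = [], []
for j in range(J):
    # progress cap: u_j <= R_SPOT * s_j + d_j
    row = np.zeros(3 * J)
    row[j] = -R_SPOT; row[J + j] = -1.0; row[2 * J + j] = 1.0
    A_ub.append(row); b_ub.append(0.0)
    # a job cannot be "more finished" than its remaining work: u_j <= work_j
    row = np.zeros(3 * J)
    row[2 * J + j] = 1.0
    A_ub.append(row); b_ub.append(work[j])

# single budget constraint: sum_j (P_SPOT * s_j + P_OD * d_j) <= BUDGET
row = np.zeros(3 * J)
row[:J] = P_SPOT; row[J:2 * J] = P_OD
A_ub.append(row); b_ub.append(BUDGET)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * (3 * J), method="highs")
spot, ondemand, util = np.split(res.x, 3)
print("spot hours:", spot.round(1))
print("on-demand hours:", ondemand.round(1))
print("total utility:", util.sum().round(1))
```

With these hypothetical numbers the solver favors spot capacity because its progress per dollar (0.7 / 0.3) exceeds that of on-demand (1.0 / 1.0); an integral schedule would additionally have to round such a fractional solution and decide when to fall back to on-demand instances, which is the part the paper's LP-rounding algorithm addresses.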


Cited By

  • Monte Carlo Simulation-Based Robust Workflow Scheduling for Spot Instances in Cloud Environments. Tsinghua Science and Technology 29, 1 (Feb 2024), 112-126. DOI: 10.26599/TST.2022.9010065
  • How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study. Proceedings of the VLDB Endowment 17, 6 (May 2024), 1214-1226. DOI: 10.14778/3648160.3648165


Published In

UCC '21: Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion
December 2021
256 pages
ISBN: 9781450391634
DOI: 10.1145/3492323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. machine learning training
  2. scheduling
  3. spot instances

Qualifiers

  • Research-article

Conference

UCC '21
Acceptance Rates

Overall Acceptance Rate 38 of 125 submissions, 30%


