DOI: 10.1145/3492323.3495594

Scheduling ML training on unreliable spot instances

Published: 07 February 2022

Abstract

Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable: when demand surges for higher-priced on-demand instances, cloud providers can interrupt spot instances after only a brief alert. Such unreliability makes it challenging to use spot instances for many long-running jobs. With checkpointing and restoration, however, machine-learning (ML) training jobs are good candidates for overcoming this difficulty. In this paper, we formalize the problem of scheduling ML training jobs on transient spot instances, particularly from the perspective of an ML researcher who holds a grant or credit for renting cloud computing services for several training tasks. Such a researcher needs to partition the computational budget wisely to maximize outcome (the total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost, interruptible and high-cost, uninterruptible computation by proposing a polynomial-time algorithm based on linear-programming (LP) rounding. Building on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms and achieve the same utility with 23% to 48% of the budget that would be needed with on-demand instances alone. Moreover, the total utility we obtain is close to the theoretical upper bound under various settings, indicating close-to-optimal performance.
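
To make the budget-partitioning trade-off concrete, the following is a minimal fractional-LP sketch in Python, assuming a deliberately simplified model: each job can buy cheap spot hours (whose effective progress rate is discounted for interruptions and checkpoint overhead) or full-price on-demand hours, and the objective is total progress under a single budget. The prices, progress rate, job sizes, and variable layout are hypothetical illustrations; this is not the paper's actual LP formulation or its rounding scheme.

```python
# Illustrative fractional LP (a sketch, not the paper's model): split a cloud
# budget between interruptible spot hours and reliable on-demand hours so that
# total useful progress across jobs is maximized. All numbers are made up.
import numpy as np
from scipy.optimize import linprog

P_SPOT, P_OD = 0.3, 1.0   # assumed $/hour for spot vs. on-demand
R_SPOT = 0.7              # assumed effective progress per spot hour
                          # (discounted for interruptions/checkpoint overhead)
BUDGET = 100.0
work = np.array([40.0, 60.0, 25.0])  # hours of compute each job still needs
J = len(work)

# Decision variables x = [s_1..s_J, d_1..d_J, u_1..u_J]:
# spot hours, on-demand hours, and realized utility (progress) per job.
# Maximize sum_j u_j  <=>  minimize -sum_j u_j.
c = np.concatenate([np.zeros(2 * J), -np.ones(J)])

A_ub, b_ub = [], []
for j in range(J):
    # progress cap: u_j <= R_SPOT * s_j + d_j
    row = np.zeros(3 * J)
    row[j] = -R_SPOT; row[J + j] = -1.0; row[2 * J + j] = 1.0
    A_ub.append(row); b_ub.append(0.0)
    # a job cannot be "more finished" than its remaining work: u_j <= work_j
    row = np.zeros(3 * J)
    row[2 * J + j] = 1.0
    A_ub.append(row); b_ub.append(work[j])

# single budget constraint: sum_j (P_SPOT * s_j + P_OD * d_j) <= BUDGET
row = np.zeros(3 * J)
row[:J] = P_SPOT; row[J:2 * J] = P_OD
A_ub.append(row); b_ub.append(BUDGET)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * (3 * J), method="highs")
spot, ondemand, util = np.split(res.x, 3)
print("spot hours:", spot.round(1))
print("on-demand hours:", ondemand.round(1))
print("total utility:", util.sum().round(1))
```

With these hypothetical numbers the solver favors spot capacity because its progress per dollar (0.7 / 0.3) exceeds that of on-demand (1.0 / 1.0); an integral schedule would additionally have to round such a fractional solution and decide when to fall back to on-demand instances, which is the part the paper's LP-rounding algorithm addresses.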


Cited By

  • Monte Carlo Simulation-Based Robust Workflow Scheduling for Spot Instances in Cloud Environments. Tsinghua Science and Technology 29, 1 (Feb 2024), 112-126. DOI: 10.26599/TST.2022.9010065
  • How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study. Proceedings of the VLDB Endowment 17, 6 (May 2024), 1214-1226. DOI: 10.14778/3648160.3648165


Published In

UCC '21: Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion
December 2021
256 pages
ISBN: 9781450391634
DOI: 10.1145/3492323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. machine learning training
  2. scheduling
  3. spot instances

Qualifiers

  • Research-article

Conference

UCC '21
Acceptance Rates

Overall Acceptance Rate 38 of 125 submissions, 30%


