research-article

Can't be late: optimizing spot instance savings under deadlines

Authors:

Wei-Lin Chiang,

Ion Stoica

NSDI'24: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation

Article No.: 11, Pages 185 - 203

Published: 16 April 2024

Abstract

Cloud providers offer spot instances alongside on-demand instances to optimize resource utilization. While economically appealing, spot instances' preemptible nature makes them ill-suited for deadline-sensitive jobs. To allow jobs to meet deadlines while leveraging spot instances, we propose a simple idea: use on-demand instances judiciously as a backup resource. However, because spot instance availability is unpredictable, determining when to switch between spot and on-demand to minimize cost requires careful policy design. In this paper, we first provide an in-depth characterization of spot instances (e.g., availability, pricing, duration), and develop a basic theoretical model to examine the worst and average-case behaviors of baseline policies (e.g., greedy). The model serves as a foundation to motivate our design of a simple and effective policy, Uniform Progress, which is parameter-free and requires no assumptions on spot availability. Our empirical study, based on three-month-long real spot availability traces on AWS, demonstrates that it can (1) outperform the greedy policy by closing the gap to the optimal policy by 2× in both average and bad cases, and (2) further reduce the gap when limited future knowledge is given. These results hold in a variety of conditions ranging from loose to tight deadlines, low to high spot availability, and on single or multiple instances. By implementing this policy on top of SkyPilot, an intercloud broker system, we achieve 27%-84% cost savings across a variety of representative real-world workloads and deadlines. The spot availability traces are open-sourced for future research.
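The core idea in the abstract (run on spot while it is available, and fall back to on-demand only when the deadline would otherwise be missed) can be illustrated with a small simulation. The sketch below is not the paper's Uniform Progress policy, which the abstract does not specify, and it is not claimed to match the paper's greedy baseline. It assumes discrete time steps, one unit of work per step on either instance type, no restart or switching overhead, and hypothetical prices. The function name `run_fallback_policy` and all parameters are illustrative.

```python
from typing import Sequence


def run_fallback_policy(
    spot_available: Sequence[bool],
    work_steps: int,
    spot_price: float,
    on_demand_price: float,
) -> float:
    """Simulate a spot-first policy with on-demand as a deadline backup.

    Assumptions (not from the source): each entry of `spot_available` is one
    time step, the deadline is the end of the trace, one step on either
    instance type completes one unit of work, and switching is free.
    Returns the total cost of finishing `work_steps` units by the deadline.
    """
    deadline = len(spot_available)
    if work_steps > deadline:
        raise ValueError("job cannot finish by the deadline even on on-demand")

    remaining = work_steps
    cost = 0.0
    for t, available in enumerate(spot_available):
        if remaining == 0:
            break
        steps_left = deadline - t  # steps still usable, including this one
        if available:
            # Spot is cheaper, so use it whenever it is offered.
            cost += spot_price
            remaining -= 1
        elif remaining >= steps_left:
            # No slack left: every remaining step must make progress.
            cost += on_demand_price
            remaining -= 1
        # Otherwise there is still slack, so idle and wait for spot.
    return cost
```

For example, with the hypothetical trace `[True, False, False, True, False, False]`, 4 units of work, a spot price of 1.0 and an on-demand price of 3.0, the job uses two spot steps, idles through two steps of slack, and finishes with two on-demand steps, for a total cost of 8.0 versus 12.0 on on-demand alone.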

Published In

NSDI'24: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation

April 2024

2062 pages

ISBN:978-1-939133-39-7

Others: Laurent Vanbever (ETH Zürich), Irene Zhang (Microsoft Research)

Copyright © 2024 The USENIX Association.

Sponsors

Meta
FUTUREWEI
NSF
Microsoft
Google Inc.

Publisher

USENIX Association

United States

Qualifiers

Research-article
Research
Refereed limited
