
How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study

Published: 03 May 2024

Abstract

This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative computer vision (CV), natural language processing (NLP), and automatic speech recognition (ASR) models. To expand the current training options further, we compare the scalability potential for hybrid-cloud scenarios by adding cloud resources to on-premise hardware to improve training throughput. Finally, we show how leveraging spot instance pricing enables a new cost-efficient way to train models with multiple cheap VMs, outperforming both more centralized, more powerful hardware and even on-demand cloud offerings at competitive prices.
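To make the cost-efficiency comparison concrete, the following is a minimal back-of-the-envelope sketch (not taken from the paper) that expresses each training setup as dollars per million processed samples, so that offerings with different hourly prices and throughputs become directly comparable. The class name, prices, and throughput figures are placeholder assumptions for illustration only.

    # Illustrative sketch (not from the paper): compare training setups by
    # dollars per million processed samples. All numbers are placeholder
    # assumptions, not measurements from the study.
    from dataclasses import dataclass

    @dataclass
    class Offering:  # hypothetical helper class
        name: str
        price_per_hour: float   # USD/hour for the whole setup (sum over all VMs)
        samples_per_sec: float  # aggregate training throughput

        def cost_per_million_samples(self) -> float:
            # Hours needed to process 1e6 samples, times the hourly price.
            hours = 1e6 / (self.samples_per_sec * 3600)
            return hours * self.price_per_hour

    offerings = [
        Offering("single on-demand GPU VM",        price_per_hour=3.0,      samples_per_sec=400),
        Offering("4x spot GPU VMs (multi-zone)",   price_per_hour=4 * 0.9,  samples_per_sec=1200),
        Offering("on-premise server (amortized)",  price_per_hour=1.5,      samples_per_sec=500),
    ]

    for o in sorted(offerings, key=Offering.cost_per_million_samples):
        print(f"{o.name:35s} ${o.cost_per_million_samples():.2f} per 1M samples")

Plugging in measured throughputs and the spot prices actually observed in each zone or cloud would yield the kind of cost-versus-throughput comparison the paper carries out experimentally.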



Information

Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 6
February 2024
369 pages

Publisher

VLDB Endowment

Publication History

Published: 03 May 2024
Published in PVLDB Volume 17, Issue 6
