Research Article | Open Access
DOI: 10.1145/3651890.3672228

Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs

Published: 04 August 2024

Abstract

Rapid advances in machine learning demand substantial computing power and memory for training, which today are accessible only to large corporations. Small-scale players such as academics often have only consumer-grade GPU clusters locally and can afford cloud GPU instances to a limited extent. However, training performance degrades significantly in this multi-cluster setting. In this paper, we identify unique opportunities to accelerate training and propose StellaTrain, a holistic framework that achieves near-optimal training speeds in multi-cloud environments. StellaTrain dynamically adapts a combination of acceleration techniques to minimize time-to-accuracy in model training. It introduces novel acceleration techniques, such as cache-aware gradient compression and a CPU-based sparse optimizer, to maximize GPU utilization and optimize the training pipeline. On top of the optimized pipeline, StellaTrain holistically determines the training configurations that minimize total training time. We show that StellaTrain achieves up to 104× speedup over PyTorch DDP in inter-cluster settings by adapting training configurations to dynamically fluctuating network bandwidth. StellaTrain demonstrates that scarce network bandwidth can be overcome through systematic optimization, achieving up to 257.3× and 78.1× speedups at network bandwidths of 100 Mbps and 500 Mbps, respectively. Finally, StellaTrain enables efficient co-training across on-premises and cloud clusters, reducing cost by 64.5% while also cutting training time by 28.9%.
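To make the abstract's key ideas concrete, the sketch below illustrates the general family of techniques the paper builds on: top-k gradient sparsification with error feedback, followed by a sparse optimizer step applied on the CPU so the GPU is freed for compute. This is a minimal illustration, not StellaTrain's implementation; all names and parameters here (topk_compress, sparse_sgd_step_cpu, the 1% keep ratio, the learning rate) are hypothetical placeholders.

```python
# Illustrative sketch only (not StellaTrain's code): top-k gradient
# sparsification with error feedback, plus a sparse SGD step on the CPU.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    return idx, flat[idx]

def sparse_sgd_step_cpu(param: torch.Tensor, idx: torch.Tensor,
                        vals: torch.Tensor, lr: float) -> None:
    """Update only the coordinates selected by the compressor (on the CPU)."""
    param.view(-1)[idx] -= lr * vals

param = torch.zeros(10_000)           # CPU-resident model shard (placeholder)
residual = torch.zeros_like(param)    # error-feedback buffer for dropped mass

for step in range(3):
    grad = torch.randn_like(param)    # stand-in for a locally computed gradient
    corrected = grad + residual       # re-add gradient mass dropped earlier
    idx, vals = topk_compress(corrected, ratio=0.01)
    residual = corrected.clone()
    residual.view(-1)[idx] = 0.0      # remember only what was NOT transmitted
    # In a real system the (idx, vals) pair would be sent over the
    # inter-cluster link; here we apply it directly.
    sparse_sgd_step_cpu(param, idx, vals, lr=0.1)
```

Sending only the (index, value) pairs shrinks inter-cluster traffic by roughly the keep ratio, while the error-feedback buffer preserves convergence by reintroducing dropped gradient mass in later steps.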

Information

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
August 2024, 1033 pages
ISBN: 9798400706141
DOI: 10.1145/3651890

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 04 August 2024

      Author Tags

      1. system for machine learning
      2. distributed training
      3. cloud computing
      4. consumer-grade GPU

      Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4-8, 2024
Sydney, NSW, Australia

      Acceptance Rates

      Overall Acceptance Rate 462 of 3,389 submissions, 14%
