Abstract
As the cost of deep learning training increases, heterogeneous GPU clusters offer a practical way to scale cluster resources for distributed deep learning (DDL) tasks. However, the commonly used synchronous stochastic gradient descent (SSGD) algorithm, which is based on the bulk synchronous parallel (BSP) model, suffers from stragglers in heterogeneous clusters, significantly reducing training efficiency. To overcome this challenge, we propose load-balanced batching (LBB) to eliminate stragglers in DDL workloads. LBB first formulates the load balancing problem and builds a performance model for every worker in a DDL workload by analyzing the relationship between DDL iteration time and each worker’s local batch size. LBB then balances the workers’ workloads by coordinating their local batch sizes. In particular, LBB greatly mitigates static stragglers and severe dynamic stragglers by solving the load balancing problem, and it eliminates stragglers by fine-tuning batch sizes during training. LBB is implemented in PyTorch, and extensive experiments are performed with three different models on a heterogeneous server equipped with four GPUs. The experimental results verify the effectiveness of LBB on standard benchmarks, demonstrating that LBB can reduce training time by 64.57%, 59%, and 5.4% compared to SSGD, local SGD, and FlexRR, respectively, without sacrificing accuracy.
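The idea of coordinating local batch sizes so that all workers finish an iteration together can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that each worker's iteration time is linear in its local batch size, t_i(b) = a_i * b + c_i, and the function name `balance_batches` and all parameter values are hypothetical. Under that assumption, the equal-time condition has a closed-form solution, which is then rounded to integers that still sum to the global batch size.

```python
from typing import List, Tuple


def balance_batches(models: List[Tuple[float, float]], global_batch: int) -> List[int]:
    """Split global_batch so every worker's predicted iteration time is equal.

    models[i] = (a_i, c_i) describes an ASSUMED linear performance model
    t_i(b) = a_i * b + c_i (seconds); this form is an illustration only.
    """
    # Equal-time condition: a_i * b_i + c_i = T for all i, and sum(b_i) = B.
    # Solving for T gives T = (B + sum(c_i / a_i)) / sum(1 / a_i).
    inv_sum = sum(1.0 / a for a, _ in models)
    target_time = (global_batch + sum(c / a for a, c in models)) / inv_sum
    raw = [(target_time - c) / a for a, c in models]
    if any(r < 0 for r in raw):
        raise ValueError("a worker's fixed overhead exceeds the balanced iteration time")

    # Round down, then give the leftover samples to the workers whose
    # real-valued share had the largest fractional part.
    sizes = [int(r) for r in raw]
    leftover = global_batch - sum(sizes)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in by_fraction[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # Three hypothetical GPUs: per-sample cost of 1 ms, 2 ms, and 4 ms,
    # each with a 10 ms fixed overhead per iteration.
    workers = [(0.001, 0.01), (0.002, 0.01), (0.004, 0.01)]
    print(balance_batches(workers, 256))
```

In this sketch the fastest worker receives the largest local batch, so the predicted iteration times differ only by rounding. A scheme that also handles dynamic stragglers would need to refresh the model coefficients from measured iteration times during training, which this sketch does not attempt.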
Data availability
The public CIFAR-10 and CIFAR-100 datasets [23] used in this research are available at https://www.cs.toronto.edu/~kriz/cifar.html.
Code availability
The authors will release the LBB implementation for reproducibility once it has been organized. The code will be available at https://github.com/FLYING37520/LBB.
References
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
Jiang P, Ergu D, Liu F, Cai Y, Ma B (2022) A review of Yolo algorithm developments. Proc Comput Sci 199:1066–1073. https://doi.org/10.1016/j.procs.2022.01.135
Saharia C, Chan W, Saxena S, Li L, Whang J, Denton E, Ghasemipour SKS, Ayan BK, Mahdavi SS, Lopes RG, Salimans T, Ho J, Fleet DJ, Norouzi M (2022) Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487 [cs.CV]
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning, vol 139, pp 8821–8831
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, vol 139, pp 8748–8763. https://proceedings.mlr.press/v139/radford21a.html
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386
Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B (2020) Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv:1909.08053 [cs.CV]
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. In: Advances in neural information processing systems, vol 33, pp 1877–1901
Tang Z, Shi S, Chu X, Wang W, Li B (2020) Communication-efficient distributed deep learning: a comprehensive survey. arXiv:2003.06307 [cs.CV]
Gan S, Jiang J, Yuan B, Zhang C, Lian X, Wang R, Chang J, Liu C, Shi H, Zhang S, Li X, Sun T, Yang S, Liu J (2021) Bagua: scaling up distributed learning with system relaxations. Proc VLDB Endow 15(4):804–813. https://doi.org/10.14778/3503585.3503590
Jiang J, Cui B, Zhang C, Yu L (2017) Heterogeneity-aware distributed parameter servers. In: Proceedings of the 2017 ACM International Conference on Management of Data. Association for Computing Machinery, New York, pp 463–478. https://doi.org/10.1145/3035918.3035933
Narayanan D, Santhanam K, Kazhamiaka F, Phanishayee A, Zaharia M (2020) Heterogeneity-aware cluster scheduling policies for deep learning workloads. In: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp 481–498. https://www.usenix.org/conference/osdi20/presentation/narayanan-deepak
Kim H, Song C, Lee H, Yu H (2023) Addressing straggler problem through dynamic partial all-reduce for distributed deep learning in heterogeneous GPU clusters. In: IEEE International Conference on Consumer Electronics (ICCE), pp 1–6. https://doi.org/10.1109/ICCE56470.2023.10043527
Ho Q, Cipar J, Cui H, Lee S, Kim JK, Gibbons PB, Gibson GA, Ganger G, Xing EP (2013) More effective distributed ML via a stale synchronous parallel parameter server. In: Advances in neural information processing systems, vol 26
Kavarakuntla T, Han L, Lloyd H, Latham A, Akintoye SB (2021) Performance analysis of distributed deep learning frameworks in a multi-GPU environment. In: 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), pp 406–413. https://doi.org/10.1109/IUCC-CIT-DSCI-SmartCNS55181.2021.00071
Keuper J, Pfreundt F-J (2015) Asynchronous parallel stochastic gradient descent: a numeric core for scalable distributed machine learning algorithms. In: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. MLHPC ’15. Association for Computing Machinery, New York. https://doi.org/10.1145/2834892.2834893
Harlap A, Cui H, Dai W, Wei J, Ganger GR, Gibbons PB, Gibson GA, Xing EP (2016) Addressing the straggler problem for iterative convergent parallel ML. In: Proceedings of the Seventh ACM Symposium on Cloud Computing. Association for Computing Machinery, New York, pp 98–111. https://doi.org/10.1145/2987550.2987554
Moreno-Alvarez S, Haut JM, Paoletti ME, Rico-Gallego JA, Diaz-Martin JC, Plaza J (2020) Training deep neural networks: a static load balancing approach. J Supercomput 76:9739–9754
Yang E, Kang D-K, Youn C-H (2020) BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster. J Supercomput 76:47–67
Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, Tulloch A, Jia Y, He K (2018) Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv:1706.02677 [cs.CV]
Tao Z, Li Q (2018) eSGD: communication efficient distributed deep learning on the edge. In: USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18). USENIX Association, Boston
Ye Q, Zhou Y, Shi M, Sun Y, Lv J (2022) DLB: a dynamic load balance strategy for distributed training of deep neural networks. IEEE Trans Emerg Top Comput Intell. https://doi.org/10.1109/TETCI.2022.3220224
Krizhevsky A, Hinton G et al (2009) Learning multiple layers of features from tiny images
Li S, Zhao Y, Varma R, Salpekar O, Noordhuis P, Li T, Paszke A, Smith J, Vaughan B, Damania P et al (2020) Pytorch distributed: experiences on accelerating data parallel training. arXiv:2006.15704 [cs.CV]
You Y, Gitman I, Ginsburg B (2017) Scaling SGD batch size to 32k for imagenet training. arXiv:1708.03888 [cs.CV]
Li S, Walls RJ, Xu L, Guo T (2019) Speeding up deep learning with transient servers. In: IEEE International Conference on Autonomic Computing (ICAC), pp 125–135. https://doi.org/10.1109/ICAC.2019.00024
Li S, Walls RJ, Guo T (2020) Characterizing and modeling distributed training with transient cloud GPU servers. In: IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pp 943–953. https://doi.org/10.1109/ICDCS47774.2020.00097
Zheng S, Meng Q, Wang T, Chen W, Yu N, Ma Z-M, Liu T-Y (2017) Asynchronous stochastic gradient descent with delay compensation. In: International Conference on Machine Learning, vol 70, pp 4120–4129. PMLR. https://proceedings.mlr.press/v70/zheng17b.html
Ko Y, Kim S-W (2022) SHAT: a novel asynchronous training algorithm that provides fast model convergence in distributed deep learning. Appl Sci. https://doi.org/10.3390/app12010292
Zhang W, Gupta S, Lian X, Liu J (2016) Staleness-aware async-SGD for distributed deep learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 2350–2356
Li S, Mangoubi O, Xu L, Guo T (2021) Sync-switch: hybrid parameter synchronization for distributed deep learning. In: IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pp 528–538. https://doi.org/10.1109/ICDCS51616.2021.00057
Zhao X, Papagelis M, An A, Chen BX, Liu J, Hu Y (2019) Elastic bulk synchronous parallel model for distributed deep learning. In: IEEE International Conference on Data Mining (ICDM), pp 1504–1509. https://doi.org/10.1109/ICDM.2019.00198
Li S, Ben-Nun T, Girolamo SD, Alistarh D, Hoefler T (2020) Taming unbalanced training workloads in deep learning with partial collective operations. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Association for Computing Machinery, New York, pp 45–61. https://doi.org/10.1145/3332466.3374528
Chen C, Weng Q, Wang W, Li B, Li B (2020) Semi-dynamic load balancing: efficient distributed learning in non-dedicated environments. In: Proceedings of the 11th ACM Symposium on Cloud Computing. Association for Computing Machinery, New York, pp 431–446. https://doi.org/10.1145/3419111.3421299
Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: efficient primitives for deep learning. arXiv:1410.0759 [cs.CV]
Stich SU (2018) Local SGD converges fast and communicates little. arXiv:1805.09767 [cs.CV]
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 [cs.CV]
Ma N, Zhang X, Zheng H-T, Sun J (2018) Shufflenet v2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV)
Tan M, Le Q (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp 6105–6114
Acknowledgments
The authors would like to acknowledge the support of the National Natural Science Foundation of China under grant No. 62376226, Shaanxi’s Key Research and Development Program under grant No. 2023-ZDLNY-63, Xianyang’s Key Research and Development Program under grant No. L2022-ZDYF-NY-019, and the Key Research and Development Program of Shaanxi under grants No. 2019ZDLNY07-06-01 and No. 2020NY-098.
Funding
This work was supported by the National Natural Science Foundation of China under grant No. 62376226, Shaanxi’s Key Research and Development Program under grant No. 2023-ZDLNY-63, Xianyang’s Key Research and Development Program under grant No. L2022-ZDYF-NY-019, and the Key Research and Development Program of Shaanxi under grants No. 2019ZDLNY07-06-01 and No. 2020NY-098.
Author information
Contributions
FY proposed the idea, participated in the protocol design, wrote the main sections of the paper, and produced the figures and tables. BL supervised the research and proofread the manuscript. HG participated in constructing the experimental workflow.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests as defined by Springer or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
About this article
Cite this article
Yao, F., Zhang, Z., Ji, Z. et al. LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. J Supercomput 80, 12247–12272 (2024). https://doi.org/10.1007/s11227-023-05886-w
DOI: https://doi.org/10.1007/s11227-023-05886-w