
UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training

Published: 01 February 2025

Abstract

The increasing need for large-scale deep neural networks (DNN) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch sizes. EDP may hinder training speed because different processes often require varying optimal microbatch sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP) that accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end times for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines such as the local greedy algorithm and the global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed of (13.89, 11.09)% for GPT1-14, (17.11, 7.96)% for VGG16, and ≥(170, 100)% for simulation networks.
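To make the scheduling idea concrete, the sketch below (illustrative only; the function name and cost inputs are not from the paper) shows the standard forward-only pipeline end-time recurrence that UMPIPE's time-cost formula generalizes: a stage can start a microbatch only after it has received that microbatch from the previous stage and has finished its own previous microbatch. UMPIPE's actual recurrence additionally models backward propagation and unequal per-stage microbatch sizes, which are omitted here.

# Minimal sketch (assumed formulation, not the paper's exact recurrence):
# makespan of a forward-only pipeline with S stages and M microbatches.
# end[s][m] is the time at which stage s finishes microbatch m.
def pipeline_makespan(compute, comm):
    """compute[s][m]: compute time of stage s on microbatch m.
    comm[s][m]: time to send microbatch m from stage s to stage s+1."""
    S, M = len(compute), len(compute[0])
    end = [[0.0] * M for _ in range(S)]
    for s in range(S):
        for m in range(M):
            # Ready once the previous stage has delivered this microbatch...
            ready_from_prev_stage = end[s - 1][m] + comm[s - 1][m] if s > 0 else 0.0
            # ...and once this stage has finished its previous microbatch.
            ready_from_prev_batch = end[s][m - 1] if m > 0 else 0.0
            end[s][m] = max(ready_from_prev_stage, ready_from_prev_batch) + compute[s][m]
    return end[-1][-1]

# Example: 3 stages, 4 microbatches, unit compute, 0.1 communication per hop.
compute = [[1.0] * 4 for _ in range(3)]
comm = [[0.1] * 4 for _ in range(3)]
print(pipeline_makespan(compute, comm))  # (3 + 4 - 1) * 1.0 + 2 * 0.1 = 6.2

A genetic search in the spirit of DGAP would treat the per-stage microbatch sizes (one chromosome for forward, one for backward) as the genome and use a recurrence of this kind as the fitness function; TiDGAP speeds up evaluation by computing the end times of a whole population of candidates at once with matrix operations rather than per-individual loops.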

References

[1]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[2]
F. Xue, Q. Wang, and G. Guo, “TransFER: Learning relation-aware facial expression representations with transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Montreal, QC, Canada, 2021, pp. 3581–3590.
[3]
J. E. Zini and M. Awad, “On the explainability of natural language processing deep models,” ACM Comput. Surv., vol. 55, no. 5, pp. 103:1–103:31, 2023.
[4]
H. Fu et al., “HGP4CNN: An efficient parallelization framework for training convolutional neural networks on modern GPUs,” J. Supercomput., vol. 77, no. 11, pp. 12741–12770, 2021.
[5]
D. Narayanan et al., “Efficient large-scale language model training on GPU clusters using Megatron-LM,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., St. Louis, Missouri, USA, 2021, pp. 58:1–58:15.
[6]
Z. Li et al., “TeraPipe: Token-level pipeline parallelism for training large-scale language models,” in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 6543–6552.
[7]
Y. Lee, J. Chung, and M. Rhu, “SmartSAGE: Training large-scale graph neural networks using in-storage processing architectures,” in Proc. 49th Annu. Int. Symp. Comput. Architecture, 2022, pp. 932–945.
[8]
T. Rao, J. Li, X. Wang, Y. Sun, and H. Chen, “Facial expression recognition with multiscale graph convolutional networks,” IEEE Multimedia, vol. 28, no. 2, pp. 11–19, Second Quarter, 2021.
[9]
H. Wang et al., “A comprehensive survey on training acceleration for large machine learning models in IoT,” IEEE Internet of Things J., vol. 9, no. 2, pp. 939–963, Jan. 2022.
[10]
Z. Li et al., “Optimizing makespan and resource utilization for multi-DNN training in GPU cluster,” Future Gener. Comput. Syst., vol. 125, pp. 206–220, 2021.
[11]
X. Ye et al., “Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large DNN training,” in Proc. 50th Int. Conf. Parallel Process., 2021, pp. 71:1–71:10.
[12]
J. Romero et al., “Accelerating collective communication in data parallel training across deep learning frameworks,” in Proc. 19th USENIX Symp. Netw. Syst. Des. Implementation, 2022, pp. 1027–1040.
[13]
Z. Lai et al., “Merak: An efficient distributed DNN training framework with automated 3D parallelism for giant foundation models,” IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 5, pp. 1466–1478, May 2023.
[14]
Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 103–112.
[15]
D. Narayanan et al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Operating Syst. Princ., T. Brecht and C. Williamson, Eds., 2019, pp. 1–15.
[16]
S. Fan et al., “DAPPLE: A pipelined data parallel approach for training large models,” in Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2021, pp. 431–445.
[17]
M. Wang, C. Huang, and J. Li, “Supporting very large models using automatic dataflow graph partitioning,” in Proc. 14th EuroSys Conf., 2019, pp. 26:1–26:17.
[18]
F. Li et al., “Fold3D: Rethinking and parallelizing computational and communicational tasks in the training of large DNN models,” IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 5, pp. 1432–1449, May 2023.
[19]
S. Ouyang et al., “Communication optimization strategies for distributed deep neural network training: A survey,” J. Parallel Distrib. Comput., vol. 149, pp. 52–65, 2021.
[20]
Z. Zhang, J. Chen, and B. Hu, “The optimization of model parallelization strategies for multi-GPU training,” in Proc. IEEE Glob. Commun. Conf., 2021, pp. 1–6.
[21]
V. Elango, “Pase: Parallelization strategies for efficient DNN training,” in Proc. 35th IEEE Int. Parallel Distrib. Process. Symp., 2021, pp. 1025–1034.
[22]
L. Cui et al., “A bidirectional DNN partition mechanism for efficient pipeline parallel training in cloud,” J. Cloud Comput., vol. 12, no. 1, 2023, Art. no.
[23]
J. Xu et al., “Effective scheduler for distributed DNN training based on MapReduce and GPU cluster,” J. Grid Comput., vol. 19, no. 1, 2021, Art. no.
[24]
J. Dong et al., “EFLOPS: Algorithm and system co-design for a high performance distributed training platform,” in Proc. Int. Symp. High Perform. Comput. Architecture, 2020, pp. 610–622.
[25]
G. Zhou et al., “CSIMD: Cross-search algorithm with improved multi-dimensional dichotomy for micro-batch-based pipeline parallel training in DNN,” in Proc. 30th Eur. Conf. Parallel Distrib. Process., Madrid, Spain, 2024, pp. 288–301.
[26]
Z. Zeng, C. Liu, Z. Tang, W. Chang, and K. Li, “Training acceleration for deep neural networks: A hybrid parallelization strategy,” in Proc. 58th ACM/IEEE Des. Automat. Conf., 2021, pp. 1165–1170.
[27]
L. Zheng et al., “Alpa: Automating inter- and intra-operator parallelism for distributed deep learning,” in Proc. 16th USENIX Symp. Operating Syst. Des. Implementation, 2022, pp. 559–578.
[28]
Z. Han et al., “Exploit the data level parallelism and schedule dependent tasks on the multi-core processors,” Inf. Sci., vol. 585, pp. 382–394, 2022.
[29]
Y. Li, Z. Zeng, J. Li, B. Yan, Y. Zhao, and J. Zhang, “Distributed model training based on data parallelism in edge computing-enabled elastic optical networks,” IEEE Commun. Lett., vol. 25, no. 4, pp. 1241–1244, Apr. 2021.
[30]
L. Guan, Z. Yang, D. Li, and X. Lu, “pdlADMM: An ADMM-based framework for parallel deep learning training with efficiency,” Neurocomputing, vol. 435, pp. 264–272, 2021.
[31]
A. N. Kahira et al., “An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks,” in Proc. 30th Int. Symp. High-Perform. Parallel Distrib. Comput., 2021, pp. 161–173.
[32]
J. Zhang et al., “PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters,” Neurocomputing, vol. 555, 2023, Art. no.
[33]
Z. Zhang, Z. Ji, and C. Wang, “Momentum-driven adaptive synchronization model for distributed DNN training on HPC clusters,” J. Parallel Distrib. Comput., vol. 159, pp. 65–84, 2022.
[34]
S. Zheng et al., “NeoFlow: A flexible framework for enabling efficient compilation for high performance DNN training,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 3220–3232, Nov. 2022.
[35]
R. Gu et al., “Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 2808–2820, Nov. 2022.
[36]
S. Zhao et al., “vPipe: A virtualized acceleration system for achieving efficient and scalable pipeline parallel DNN training,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 3, pp. 489–506, Mar. 2022.
[37]
S. K. Pal and P. P. Wang, Genetic Algorithms for Pattern Recognition. Boca Raton, FL, USA: CRC Press, 1996.
[38]
K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[39]
G. Zhou, W. Tian, R. Buyya, and K. Wu, “Growable genetic algorithm with heuristic-based local search for multi-dimensional resources scheduling of cloud computing,” Appl. Soft Comput., vol. 136, 2023, Art. no.

Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 36, Issue 2
Feb. 2025
248 pages

Publisher

IEEE Press

Publication History

Published: 01 February 2025

Qualifiers

  • Research-article
