
UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training

Published: 01 February 2025


The increasing need for large-scale deep neural networks (DNN) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch-sizes. EDP may hinder training speed because different processes often require varying optimal microbatch-sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP) that accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end time for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines, such as the local greedy algorithm and the global greedy-based dynamic programming.
Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed: $(13.89, 11.09)\%$ for GPT1-14, $(17.11, 7.96)\%$ for VGG16 and $\geq (170, 100)\%$ for simulation networks.
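
The abstract mentions a recurrence for the time cost but does not state it. The Python sketch below is only an illustration of what such a recurrence could look like for the forward pass, under assumptions that are not taken from the paper: each stage has its own microbatch size that divides the mini-batch, the time for one microbatch follows a hypothetical linear model (fixed overhead plus per-sample cost), and a microbatch may start once the previous stage has finished every sample it contains. The function name and all parameters are made up for this example.

```python
def forward_end_times(batch, mb_sizes, fixed, per_sample, comm):
    """Forward-pass end times per stage and microbatch (illustrative only).

    batch         -- samples in one mini-batch
    mb_sizes[s]   -- microbatch size of stage s (assumed to divide batch)
    fixed[s]      -- assumed fixed overhead per microbatch on stage s
    per_sample[s] -- assumed per-sample compute time on stage s
    comm[s]       -- assumed transfer time into stage s (comm[0] is unused)
    """
    end = []  # end[s][j] = finish time of microbatch j on stage s
    for s, size in enumerate(mb_sizes):
        if batch % size != 0:
            raise ValueError("microbatch size must divide the batch")
        count = batch // size
        cost = fixed[s] + per_sample[s] * size  # hypothetical linear cost model
        stage_end = []
        for j in range(count):
            ready = 0.0
            if s > 0:
                prev_size = mb_sizes[s - 1]
                # Index of the upstream microbatch holding the last sample
                # this microbatch needs: ceil((j + 1) * size / prev_size) - 1.
                k = ((j + 1) * size + prev_size - 1) // prev_size - 1
                ready = end[s - 1][k] + comm[s]
            # A stage processes its own microbatches one after another.
            free = stage_end[j - 1] if j > 0 else 0.0
            stage_end.append(max(ready, free) + cost)
        end.append(stage_end)
    return end


def makespan(end):
    """Finish time of the last microbatch on the last stage."""
    return end[-1][-1]
```

With equal microbatch sizes on every stage, this reduces to the familiar pipeline recurrence in which a microbatch starts at the later of two events: its input arriving from the previous stage and the stage finishing its previous microbatch. A search procedure such as the genetic algorithm named in the abstract would evaluate candidate values of `mb_sizes` with a cost function of this kind; the backward pass is omitted here.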






Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 36, Issue 2
Feb. 2025
248 pages


IEEE Press


  • Research-article

