
UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training

Published: 01 February 2025

Abstract

The increasing need for large-scale deep neural networks (DNN) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch sizes. EDP may hinder training speed because different processes often require varying optimal microbatch sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP) that accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end times for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines such as the local greedy algorithm and the global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed of (13.89, 11.09)% for GPT1-14, (17.11, 7.96)% for VGG16, and ≥(170, 100)% for simulation networks.
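To make the scheduling idea concrete, the sketch below (illustrative only; the function name and cost inputs are not from the paper) shows the standard forward-only pipeline end-time recurrence that UMPIPE's time-cost formula generalizes: a stage can start a microbatch only after it has received that microbatch from the previous stage and has finished its own previous microbatch. UMPIPE's actual recurrence additionally models backward propagation and unequal per-stage microbatch sizes, which are omitted here.

# Minimal sketch (assumed formulation, not the paper's exact recurrence):
# makespan of a forward-only pipeline with S stages and M microbatches.
# end[s][m] is the time at which stage s finishes microbatch m.
def pipeline_makespan(compute, comm):
    """compute[s][m]: compute time of stage s on microbatch m.
    comm[s][m]: time to send microbatch m from stage s to stage s+1."""
    S, M = len(compute), len(compute[0])
    end = [[0.0] * M for _ in range(S)]
    for s in range(S):
        for m in range(M):
            # Ready once the previous stage has delivered this microbatch...
            ready_from_prev_stage = end[s - 1][m] + comm[s - 1][m] if s > 0 else 0.0
            # ...and once this stage has finished its previous microbatch.
            ready_from_prev_batch = end[s][m - 1] if m > 0 else 0.0
            end[s][m] = max(ready_from_prev_stage, ready_from_prev_batch) + compute[s][m]
    return end[-1][-1]

# Example: 3 stages, 4 microbatches, unit compute, 0.1 communication per hop.
compute = [[1.0] * 4 for _ in range(3)]
comm = [[0.1] * 4 for _ in range(3)]
print(pipeline_makespan(compute, comm))  # (3 + 4 - 1) * 1.0 + 2 * 0.1 = 6.2

A genetic search in the spirit of DGAP would treat the per-stage microbatch sizes (one chromosome for forward, one for backward) as the genome and use a recurrence of this kind as the fitness function; TiDGAP speeds up evaluation by computing the end times of a whole population of candidates at once with matrix operations rather than per-individual loops.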

References

[1]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[2]
F. Xue, Q. Wang, and G. Guo, “TransFER: Learning relation-aware facial expression representations with transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Montreal, QC, Canada, 2021, pp. 3581–3590.
[3]
J. E. Zini and M. Awad, “On the explainability of natural language processing deep models,” ACM Comput. Surv., vol. 55, no. 5, pp. 103:1–103:31, 2023.
[4]
H. Fu et al., “HGP4CNN: An efficient parallelization framework for training convolutional neural networks on modern GPUs,” J. Supercomput., vol. 77, no. 11, pp. 12741–12770, 2021.
[5]
D. Narayanan et al., “Efficient large-scale language model training on GPU clusters using Megatron-LM,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., St. Louis, Missouri, USA, 2021, pp. 58:1–58:15.
[6]
Z. Li et al., “TeraPipe: Token-level pipeline parallelism for training large-scale language models,” in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 6543–6552.
[7]
Y. Lee, J. Chung, and M. Rhu, “SmartSAGE: Training large-scale graph neural networks using in-storage processing architectures,” in Proc. 49th Annu. Int. Symp. Comput. Architecture, 2022, pp. 932–945.
[8]
T. Rao, J. Li, X. Wang, Y. Sun, and H. Chen, “Facial expression recognition with multiscale graph convolutional networks,” IEEE Multimedia, vol. 28, no. 2, pp. 11–19, Second Quarter, 2021.
[9]
H. Wang et al., “A comprehensive survey on training acceleration for large machine learning models in IoT,” IEEE Internet of Things J., vol. 9, no. 2, pp. 939–963, Jan. 2022.
[10]
Z. Li et al., “Optimizing makespan and resource utilization for multi-DNN training in GPU cluster,” Future Gener. Comput. Syst., vol. 125, pp. 206–220, 2021.
[11]
X. Ye et al., “Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large DNN training,” in Proc. 50th Int. Conf. Parallel Process., 2021, pp. 71:1–71:10.
[12]
J. Romero et al., “Accelerating collective communication in data parallel training across deep learning frameworks,” in Proc. 19th USENIX Symp. Netw. Syst. Des. Implementation, 2022, pp. 1027–1040.
[13]
Z. Lai et al., “Merak: An efficient distributed DNN training framework with automated 3D parallelism for giant foundation models,” IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 5, pp. 1466–1478, May 2023.
[14]
Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 103–112.
[15]
D. Narayanan et al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Operating Syst. Princ., T. Brecht and C. Williamson, Eds., 2019, pp. 1–15.
[16]
S. Fan et al., “DAPPLE: A pipelined data parallel approach for training large models,” in Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2021, pp. 431–445.
[17]
M. Wang, C. Huang, and J. Li, “Supporting very large models using automatic dataflow graph partitioning,” in Proc. 14th EuroSys Conf., 2019, pp. 26:1–26:17.
[18]
F. Li et al., “Fold3D: Rethinking and parallelizing computational and communicational tasks in the training of large DNN models,” IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 5, pp. 1432–1449, May 2023.
[19]
S. Ouyang et al., “Communication optimization strategies for distributed deep neural network training: A survey,” J. Parallel Distrib. Comput., vol. 149, pp. 52–65, 2021.
[20]
Z. Zhang, J. Chen, and B. Hu, “The optimization of model parallelization strategies for multi-GPU training,” in Proc. IEEE Glob. Commun. Conf., 2021, pp. 1–6.
[21]
V. Elango, “Pase: Parallelization strategies for efficient DNN training,” in Proc. 35th IEEE Int. Parallel Distrib. Process. Symp., 2021, pp. 1025–1034.
[22]
L. Cui et al., “A bidirectional DNN partition mechanism for efficient pipeline parallel training in cloud,” J. Cloud Comput., vol. 12, no. 1, 2023, Art. no.
[23]
J. Xu et al., “Effective scheduler for distributed DNN training based on MapReduce and GPU cluster,” J. Grid Comput., vol. 19, no. 1, 2021, Art. no.
[24]
J. Dong et al., “EFLOPS: Algorithm and system co-design for a high performance distributed training platform,” in Proc. Int. Symp. High Perform. Comput. Architecture, 2020, pp. 610–622.
[25]
G. Zhou et al., “CSIMD: Cross-search algorithm with improved multi-dimensional dichotomy for micro-batch-based pipeline parallel training in DNN,” in Proc. 30th Eur. Conf. Parallel Distrib. Process., Madrid, Spain, 2024, pp. 288–301.
[26]
Z. Zeng, C. Liu, Z. Tang, W. Chang, and K. Li, “Training acceleration for deep neural networks: A hybrid parallelization strategy,” in Proc. 58th ACM/IEEE Des. Automat. Conf., 2021, pp. 1165–1170.
[27]
L. Zheng et al., “Alpa: Automating inter- and intra-operator parallelism for distributed deep learning,” in Proc. 16th USENIX Symp. Operating Syst. Des. Implementation, 2022, pp. 559–578.
[28]
Z. Han et al., “Exploit the data level parallelism and schedule dependent tasks on the multi-core processors,” Inf. Sci., vol. 585, pp. 382–394, 2022.
[29]
Y. Li, Z. Zeng, J. Li, B. Yan, Y. Zhao, and J. Zhang, “Distributed model training based on data parallelism in edge computing-enabled elastic optical networks,” IEEE Commun. Lett., vol. 25, no. 4, pp. 1241–1244, Apr. 2021.
[30]
L. Guan, Z. Yang, D. Li, and X. Lu, “pdlADMM: An ADMM-based framework for parallel deep learning training with efficiency,” Neurocomputing, vol. 435, pp. 264–272, 2021.
[31]
A. N. Kahira et al., “An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks,” in Proc. 30th Int. Symp. High-Perform. Parallel Distrib. Comput., 2021, pp. 161–173.
[32]
J. Zhang et al., “PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters,” Neurocomputing, vol. 555, 2023, Art. no.
[33]
Z. Zhang, Z. Ji, and C. Wang, “Momentum-driven adaptive synchronization model for distributed DNN training on HPC clusters,” J. Parallel Distrib. Comput., vol. 159, pp. 65–84, 2022.
[34]
S. Zheng et al., “NeoFlow: A flexible framework for enabling efficient compilation for high performance DNN training,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 3220–3232, Nov. 2022.
[35]
R. Gu et al., “Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 2808–2820, Nov. 2022.
[36]
S. Zhao et al., “vPipe: A virtualized acceleration system for achieving efficient and scalable pipeline parallel DNN training,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 3, pp. 489–506, Mar. 2022.
[37]
S. K. Pal and P. P. Wang, Genetic Algorithms for Pattern Recognition. Boca Raton, FL, USA: CRC Press, 1996.
[38]
K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[39]
G. Zhou, W. Tian, R. Buyya, and K. Wu, “Growable genetic algorithm with heuristic-based local search for multi-dimensional resources scheduling of cloud computing,” Appl. Soft Comput., vol. 136, 2023, Art. no.

Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 36, Issue 2
Feb. 2025
248 pages

Publisher

IEEE Press

Publication History

Published: 01 February 2025

Qualifiers

  • Research-article
