DOI: 10.1609/aaai.v37i9.26321

Acceleration of large transformer model training by sensitivity-based layer dropping

Published: 07 February 2023

Abstract

Transformer models are widely used in AI applications such as Natural Language Processing (NLP) and Computer Vision (CV). However, the enormous computation workload is an obstacle to training large transformer models efficiently. Recently, some methods have focused on reducing the computation workload during training by skipping some layers. However, these methods rely on simple probability distributions and coarse-grained probability calculation, which significantly degrade model accuracy. To address this issue, in this paper we propose a novel method to accelerate training: Sensitivity-Based Layer Dropping (SBLD). SBLD uses layer-wise sensitivity data to switch transformer layers on and off in the proper order to maintain high accuracy. In addition, we adjust the probability of skipping transformer layers with a scheduler to accelerate training and obtain faster convergence. Our results show that SBLD resolves the accuracy drop observed with prior layer dropping methods. SBLD decreases end-to-end training time of the GPT-3 Medium model by 19.67% while increasing accuracy by 1.65% with respect to the baseline. Furthermore, for the SwinV2-L model the obtained Top-1 and Top-5 accuracies are also higher than the baseline. Thus, the proposed method is efficient and practical for improving large transformer model training.
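The abstract describes two mechanisms: per-layer sensitivity scores that decide how likely each transformer layer is to be skipped, and a scheduler that adjusts the overall skip probability over the course of training. The PyTorch-style sketch below is only a minimal illustration of that idea under stated assumptions, not the paper's implementation; the wrapper class `SensitivityGatedBlock`, the function `drop_rate_schedule`, and the linear ramp-up schedule are names and choices invented for the example.

```python
import torch
import torch.nn as nn


class SensitivityGatedBlock(nn.Module):
    """Hypothetical wrapper that stochastically skips a transformer layer.

    Assumptions: `sensitivity` is a precomputed score in [0, 1], where 1 means
    the layer is most sensitive and should be skipped least often; the wrapped
    layer contains its own residual connection, so skipping it reduces to
    returning the input unchanged.
    """

    def __init__(self, layer: nn.Module, sensitivity: float):
        super().__init__()
        self.layer = layer
        self.sensitivity = float(sensitivity)

    def forward(self, x: torch.Tensor, global_drop_rate: float) -> torch.Tensor:
        if self.training:
            # Less sensitive layers receive a higher skip probability.
            p_skip = global_drop_rate * (1.0 - self.sensitivity)
            if torch.rand(1).item() < p_skip:
                return x  # layer dropped for this training step
        return self.layer(x)


def drop_rate_schedule(step: int, total_steps: int, max_rate: float = 0.5) -> float:
    """Illustrative scheduler: linearly ramp the global drop rate up to max_rate."""
    return max_rate * min(1.0, step / max(1, total_steps))
```

In such a setup, each training step would call `block(x, drop_rate_schedule(step, total_steps))` for every wrapped layer, while `model.eval()` disables skipping so all layers run at inference time.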



Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN:978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Publication History

Published: 07 February 2023

Qualifiers

  • Research-article
  • Research
  • Refereed limited
