DOI: 10.1609/aaai.v37i9.26321

Acceleration of large transformer model training by sensitivity-based layer dropping

Published: 07 February 2023

Abstract

Transformer models are widely used in AI applications such as Natural Language Processing (NLP) and Computer Vision (CV). However, the enormous computation workload is an obstacle to training large transformer models efficiently. Recently, some methods have focused on reducing the computation workload during training by skipping some layers. However, these methods rely on simple probability distributions and coarse-grained probability calculation, which significantly degrade model accuracy. To address this issue, in this paper we propose a novel method to accelerate training: Sensitivity-Based Layer Dropping (SBLD). SBLD uses layer-wise sensitivity data to switch transformer layers on and off in the proper order to maintain high accuracy. In addition, we adjust the probability of skipping transformer layers with a scheduler to accelerate training and obtain faster convergence. Our results show that SBLD resolves the accuracy drop observed with prior layer dropping methods. SBLD decreases end-to-end training time of the GPT-3 Medium model by 19.67% while increasing accuracy by 1.65% with respect to the baseline. Furthermore, for the SwinV2-L model the obtained Top-1 and Top-5 accuracies are also higher than the baseline. Thus, the proposed method is efficient and practical for improving large transformer model training.
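The abstract describes two mechanisms: per-layer sensitivity scores that decide how likely each transformer layer is to be skipped, and a scheduler that adjusts the overall skip probability over the course of training. The PyTorch-style sketch below is only a minimal illustration of that idea under stated assumptions, not the paper's implementation; the wrapper class `SensitivityGatedBlock`, the function `drop_rate_schedule`, and the linear ramp-up schedule are names and choices invented for the example.

```python
import torch
import torch.nn as nn


class SensitivityGatedBlock(nn.Module):
    """Hypothetical wrapper that stochastically skips a transformer layer.

    Assumptions: `sensitivity` is a precomputed score in [0, 1], where 1 means
    the layer is most sensitive and should be skipped least often; the wrapped
    layer contains its own residual connection, so skipping it reduces to
    returning the input unchanged.
    """

    def __init__(self, layer: nn.Module, sensitivity: float):
        super().__init__()
        self.layer = layer
        self.sensitivity = float(sensitivity)

    def forward(self, x: torch.Tensor, global_drop_rate: float) -> torch.Tensor:
        if self.training:
            # Less sensitive layers receive a higher skip probability.
            p_skip = global_drop_rate * (1.0 - self.sensitivity)
            if torch.rand(1).item() < p_skip:
                return x  # layer dropped for this training step
        return self.layer(x)


def drop_rate_schedule(step: int, total_steps: int, max_rate: float = 0.5) -> float:
    """Illustrative scheduler: linearly ramp the global drop rate up to max_rate."""
    return max_rate * min(1.0, step / max(1, total_steps))
```

In such a setup, each training step would call `block(x, drop_rate_schedule(step, total_steps))` for every wrapped layer, while `model.eval()` disables skipping so all layers run at inference time.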



Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN:978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Publication History

Published: 07 February 2023

Qualifiers

  • Research-article
  • Research
  • Refereed limited
