Abstract
Long short-term memory (LSTM) is a powerful deep learning technique that has been widely used in real-world data-mining applications such as language modeling and machine translation. In this paper, we aim to minimize the latency of LSTM inference on cloud systems without losing accuracy. If an LSTM model does not fit in cache, the latency due to data movement will likely exceed that due to computation; in this case, we reduce the number of model parameters. If, as in most applications we consider, the LSTM model fits in the cache of cloud server processors, we instead focus on reducing the number of floating-point operations, which has a corresponding linear impact on inference latency. Thus, our system dynamically reduces either model parameters or FLOPs, depending on which dominates latency. Our inference system is based on singular value decomposition (SVD) and canonical polyadic (CP) decomposition, and it delivers accurate, low-latency inference. We evaluate it on models from a series of real-world applications, including language modeling, computer vision, question answering, and sentiment analysis. Users can either start from pre-trained models or train from scratch. Our system achieves a 15× average speedup across six real-world applications without losing inference accuracy. We also design and implement a distributed optimization system with dynamic decomposition, which significantly reduces energy cost and accelerates training.
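To make the decomposition idea concrete, the following minimal NumPy sketch illustrates how truncating the SVD of an LSTM layer's stacked gate weight matrix reduces both parameters and FLOPs; the sizes, rank, and random matrix here are hypothetical and this is not the paper's implementation.

```python
import numpy as np

# Hypothetical LSTM layer: the four gate weight matrices stacked into one
# matrix of shape (4*hidden, hidden + inputs).
hidden, inputs, rank = 1024, 1024, 128
W = np.random.randn(4 * hidden, hidden + inputs).astype(np.float32)
x = np.random.randn(hidden + inputs).astype(np.float32)

# Truncated SVD: W ≈ A @ B, with A = U_r * s_r and B = V_r^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]   # shape (4*hidden, rank)
B = Vt[:rank, :]             # shape (rank, hidden + inputs)

# Gate pre-activations: dense vs. low-rank matrix-vector product.
y_full = W @ x               # ~2 * (4*hidden) * (hidden + inputs) FLOPs
y_low = A @ (B @ x)          # ~2 * rank * (4*hidden + hidden + inputs) FLOPs

# For a random Gaussian W the approximation error is large; trained LSTM
# weights are often much closer to low-rank, so the accuracy loss there
# can be far smaller.
print("parameter ratio :", (A.size + B.size) / W.size)
print("relative error  :", np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))
```

CP decomposition generalizes the same idea to higher-order tensors; in both cases the chosen rank controls the trade-off between the parameter/FLOP budget and the approximation error.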
Acknowledgements
We thank CSCS for granting us access to Piz Daint resources. We thank Ronghang Hu at UC Berkeley for useful discussions.
Cite this article
You, Y., He, Y., Rajbhandari, S. et al. Fast LSTM by dynamic decomposition on cloud and distributed systems. Knowl Inf Syst 62, 4169–4197 (2020). https://doi.org/10.1007/s10115-020-01487-8