
A Practical Survey on Faster and Lighter Transformers

Published: 17 July 2023

Abstract

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models’ efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer’s limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods’ strengths, limitations, and underlying assumptions.
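
To make the quadratic cost concrete, the following NumPy sketch (illustrative only and not taken from the survey; the function name and the toy dimensions are assumptions) implements standard scaled dot-product attention. The intermediate scores matrix has shape (n, n), which is the source of the quadratic time and memory complexity in the sequence length n.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, as defined by Vaswani et al. (2017)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) matrix: the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # (n, d) output

# Toy usage with illustrative sizes: sequence length n = 8, key dimension d = 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 4); the scores matrix was (8, 8)

Lower-complexity variants such as the Linformer and Performer avoid materializing this (n, n) matrix by approximating it with low-rank projections or kernel feature maps, reducing the cost to linear in n.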

References

[1]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https://www.tensorflow.org/.
[2]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In NIPS, Vol. 27.
[3]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[4]
Irwan Bello. 2021. LambdaNetworks: Modeling long-range interactions without attention. In ICLR.
[5]
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv e-prints (2020), arXiv:2004.05150.
[6]
Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In SLSP, Vol. 7978, 1–37.
[7]
Yoshua Bengio. 2013. Estimating or propagating gradients through stochastic neurons. CoRR abs/1305.2982 (2013).
[8]
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In EMNLP. 1533–1544.
[9]
Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In SIGMT. 12–58.
[10]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, et al. 2020. Language models are few-shot learners. In NeurIPS, Vol. 33, 1877–1901.
[11]
A. Buluc and J. R. Gilbert. 2008. Challenges and advances in parallel sparse matrix-matrix multiplication. In ICPP. 503–510.
[12]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV. 213–229.
[13]
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP. 4960–4964.
[14]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. CoRR abs/1604.06174 (2016).
[15]
Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In EMNLP. 551–561.
[16]
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019).
[17]
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, et al. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP. 1724–1734.
[18]
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, et al. 2021. Rethinking attention with performers. In ICLR.
[19]
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.
[20]
Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In EMNLP-IJCNLP. 2174–2184.
[21]
Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. In NeurIPS.
[22]
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL. 2978–2988.
[23]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In ICLR.
[24]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171–4186.
[25]
Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: Non-linear independent components estimation. In ICLR.
[26]
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density estimation using real NVP. In ICLR.
[27]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale. In ICLR.
[28]
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. In ICLR.
[29]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. J. Mach. Learn. Res. 20, 55 (2019), 1–21.
[30]
William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv e-prints (2021), arXiv:2101.03961.
[31]
Quentin Fournier, Daniel Aloise, Seyed Vahid Azhari, and François Tetreault. 2021. On improving deep learning trace analysis with system call arguments. In MSR. 120–130.
[32]
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR.
[33]
Andrea Galassi, Marco Lippi, and Paolo Torroni. 2021. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 10 (2021), 4291–4308.
[34]
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Vol. 9, 249–256.
[35]
Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. 2017. The reversible residual network: Backpropagation without storing activations. In NeurIPS, Vol. 30.
[36]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Retrieved from http://www.deeplearningbook.org.
[37]
Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, et al. 2019. Recurrent independent mechanisms. CoRR abs/1909.10893 (2019).
[38]
Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR abs/1603.08983 (2016).
[39]
Scott Gray, Alec Radford, and Diederik P. Kingma. 2017. GPU kernels for block-sparse weights. https://cdn.openai.com/blocksparse/blocksparsepaper.pdf.
[40]
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech. 5036–5040.
[41]
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. In NAACL-HLT. 1315–1325.
[42]
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[43]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv e-prints (2015), arXiv:1503.02531.
[44]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735–1780.
[45]
Sara Hooker. 2020. The hardware lottery. CoRR abs/2009.06489 (2020).
[46]
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, et al. 2019. Music transformer. In ICLR.
[47]
Xiao Shi Huang, Felipe Pérez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In ICML, Vol. 119, 4475–4483.
[48]
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In NeurIPS, Vol. 32.
[49]
IEA. 2018. World gross electricity production, by source, 2018. Retrieved from https://www.iea.org/data-and-statistics/charts/world-gross-electricity-production-by-source-2018.
[50]
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, Vol. 37, 448–456.
[51]
B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, et al. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE/CVF. 2704–2713.
[52]
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computat. 3 (1991), 79–87.
[53]
Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. CoRR abs/1902.10186 (2019).
[54]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, et al. 2020. TinyBERT: Distilling BERT for natural language understanding. In EMNLP. 4163–4174.
[55]
Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, et al. 2019. A comparative study on transformer vs RNN in speech applications. In ASRU. 449–456.
[56]
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, Vol. 119, 5156–5165.
[57]
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A survey. ACM Comput. Surv. 54, 10s (2022).
[58]
Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In ACL. 284–294.
[59]
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In ICLR.
[60]
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, et al. 2020. Big transfer (BiT): General visual representation learning. In ECCV, Vol. 12350, 491–507.
[61]
Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco Ravanelli, et al. 2021. Transformers with competitive ensembles of independent mechanisms. arXiv e-prints (2021), arXiv:2103.00336.
[62]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
[63]
Yann LeCun, John S. Denker, and Sara A. Solla. 1990. Optimal brain damage. In NIPS. 598–605.
[64]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv e-prints (2016), arXiv:1607.06450.
[65]
Mohan Li, Catalin Zorila, and Rama Doddipatla. 2020. Transformer-based online speech recognition with decoder-end adaptive computation steps. arXiv e-prints (2020), arXiv:2011.13834.
[66]
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, et al. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In NeurIPS. 5244–5254.
[67]
Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. 2020. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. In NeurIPS, Vol. 33, 14544–14555.
[68]
Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2021. A survey of transformers. arXiv e-prints (2021), arXiv:2106.04554.
[69]
Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, and Geoffrey Zweig. 2021. Improving RNN transducer based ASR with auxiliary tasks. In SLT. 172–179.
[70]
Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. 2021. Pay attention to MLPs. CoRR abs/2105.08050 (2021).
[71]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In ICLR.
[72]
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, et al. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.
[73]
Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the difficulty of training transformers. In EMNLP. 5747–5763.
[74]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019).
[75]
Alexandra Luccioni, Alexandre Lacoste, and Victor Schmidt. 2020. Estimating carbon emissions of artificial intelligence [opinion]. IEEE Technol. Societ. Mag. 39, 2 (2020), 48–51.
[76]
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR.
[77]
Matt Mahoney. 2011. Large Text Compression Benchmark. Retrieved from http://mattmahoney.net/dc/text.html.
[78]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture models. In ICLR.
[79]
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In NeurIPS, Vol. 32.
[80]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et al. 2018. Mixed precision training. In ICLR.
[81]
Nikita Nangia and Samuel R. Bowman. 2018. ListOps: A diagnostic dataset for latent tree learning. In NAACL.
[82]
Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, et al. 2021. Do transformer modifications transfer across implementations and applications? arXiv e-prints (2021), arXiv:2102.11972.
[83]
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP. 1797–1807.
[84]
OpenAI. 2013. Saving memory using gradient-checkpointing. Retrieved from https://github.com/openai/gradient-checkpointing.
[85]
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In WMT. 1–9.
[86]
Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, et al. 2020. SpecAugment on large scale datasets. In ICASSP. 6879–6883.
[87]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS. 8024–8035.
[88]
Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In ICML, Vol. 80, 4092–4101.
[89]
Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT plays the lottery, all tickets are winning. In EMNLP. 3208–3229.
[90]
Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. 2020. Blockwise self-attention for long document understanding. In EMNLP. 2555–2565.
[91]
Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[92]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[93]
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In ICLR.
[94]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP. 2383–2392.
[95]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In AAAI. 4780–4789.
[96]
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Computat. Ling. 9 (2021), 53–68.
[97]
D. E. Rumelhart, P. Smolensky, J. L. McClelland, and G. E. Hinton. 1986. Schemata and sequential thought processes in PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 7–57.
[98]
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. On the effect of dropping layers of pre-trained transformer models. arXiv e-prints (2020), arXiv:2004.03844.
[99]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019).
[100]
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. How does batch normalization help optimization? In NeurIPS, Vol. 31.
[101]
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In ACL. 2931–2951.
[102]
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In NAACL-HLT. 464–468.
[103]
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, et al. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR.
[104]
Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, et al. 2021. SparseBERT: Rethinking the importance analysis in self-attention. In ICML, Vol. 139, 9547–9557.
[105]
David R. So, Quoc V. Le, and Chen Liang. 2019. The evolved transformer. In ICML, Vol. 97, 5877–5886.
[106]
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958.
[107]
Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Herve Jegou, et al. 2021. Training with quantization noise for extreme model compression. In ICLR.
[108]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. CoRR abs/1906.02243 (2019).
[109]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2020. Energy and policy considerations for modern deep learning research. Proc. AAAI Conf. Artif. Intell. 34, 09 (2020), 13693–13696.
[110]
Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. 2021. Attention is all you need in speech separation. In ICASSP. 21–25.
[111]
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In ACL. 331–335.
[112]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104–3112.
[113]
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models. arXiv e-prints (2020), arXiv:2005.00743.
[114]
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020. Sparse sinkhorn attention. In ICML, Vol. 119, 9438–9447.
[115]
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, et al. 2021. Long range arena: A benchmark for efficient transformers. In ICLR.
[116]
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. CoRR abs/2009.06732 (2020).
[117]
Wilson L. Taylor. 1953. “Cloze Procedure”: A new tool for measuring readability. Journal. Quart. 30, 4 (1953), 415–433.
[118]
Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, et al. 2021. MLP-Mixer: An all-MLP architecture for vision. CoRR abs/2105.01601 (2021).
[119]
Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, and Jason Riesa. 2020. Finding fast transformers: One-shot neural architecture search by component composition. arXiv e-prints (2020), arXiv:2008.06808.
[120]
Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP. 3632–3636.
[121]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017. Attention is all you need. In NIPS. 5998–6008.
[122]
Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. 2020. Fast transformers with clustered attention. In NeurIPS.
[123]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, et al. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, Vol. 32.
[124]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP. 353–355.
[125]
Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a diet. arXiv e-prints (2020), arXiv:2002.06170.
[126]
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv e-prints (2020), arXiv:2006.04768.
[127]
Yuxin Wang, Qiang Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Kaiyong Zhao, et al. 2020. Benchmarking the performance and energy efficiency of AI accelerators for AI training. In CCGRID. 744–751.
[128]
Yu Wang, Gu-Yeon Wei, and David Brooks. 2019. Benchmarking TPU, GPU, and CPU platforms for deep learning. CoRR abs/1907.10701 (2019).
[129]
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Trans. Assoc. Computat. Ling. 7 (2019), 625–641.
[130]
Lilian Weng. 2018. Attention? Attention! Retrieved from http://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html.
[131]
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In EMNLP-IJCNLP. 11–20.
[132]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, et al. 2020. Transformers: State-of-the-art natural language processing. In EMNLP. 38–45.
[133]
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite transformer with long-short range attention. In ICLR.
[134]
Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. 2020. Self-training with noisy student improves ImageNet classification. In CVPR. 10684–10695.
[135]
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, et al. 2020. On layer normalization in the transformer architecture. In ICML, Vol. 119, 10524–10533.
[136]
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, et al. 2021. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv e-prints (2021), arXiv:2102.03902.
[137]
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision. In CVPR. 10819–10829.
[138]
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. In EMC2-NIPS.
[139]
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, et al. 2020. Big bird: Transformers for longer sequences. In NeurIPS, Vol. 33, 17283–17297.
[140]
Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, et al. 2020. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP. 7829–7833.
[141]
Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In ICLR.
[142]
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In CVPR. 8697–8710.


    Published In

    ACM Computing Surveys, Volume 55, Issue 14s
    December 2023
    1355 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/3606253

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 July 2023
    Online AM: 04 March 2023
    Accepted: 23 February 2023
    Revised: 05 October 2022
    Received: 27 January 2022
    Published in CSUR Volume 55, Issue 14s

    Author Tags

    1. Deep learning
    2. efficient transformer
    3. self-attention
    4. survey

    Qualifiers

    • Survey

    Funding Sources

    • Natural Sciences and Engineering Research Council of Canada (NSERC), Prompt, Ericsson, Ciena, and EfficiOS
