
A Practical Survey on Faster and Lighter Transformers

Published: 17 July 2023

Abstract

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models’ efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer’s limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods’ strengths, limitations, and underlying assumptions.
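
To make the quadratic cost concrete, the following NumPy sketch (illustrative only and not taken from the survey; the function name and the toy dimensions are assumptions) implements standard scaled dot-product attention. The intermediate scores matrix has shape (n, n), which is the source of the quadratic time and memory complexity in the sequence length n.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, as defined by Vaswani et al. (2017)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) matrix: the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # (n, d) output

# Toy usage with illustrative sizes: sequence length n = 8, key dimension d = 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 4); the scores matrix was (8, 8)

Lower-complexity variants such as the Linformer and Performer avoid materializing this (n, n) matrix by approximating it with low-rank projections or kernel feature maps, reducing the cost to linear in n.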

References

[1]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https://www.tensorflow.org/.
[2]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In NIPS, Vol. 27.
[3]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[4]
Irwan Bello. 2021. LambdaNetworks: Modeling long-range interactions without attention. In ICLR.
[5]
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv e-prints (2020), arXiv:2004.05150.
[6]
Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In SLSP, Vol. 7978, 1–37.
[7]
Yoshua Bengio. 2013. Estimating or propagating gradients through stochastic neurons. CoRR abs/1305.2982 (2013).
[8]
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In EMNLP. 1533–1544.
[9]
Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In SIGMT. 12–58.
[10]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, et al. 2020. Language models are few-shot learners. In NeurIPS, Vol. 33, 1877–1901.
[11]
A. Buluc and J. R. Gilbert. 2008. Challenges and advances in parallel sparse matrix-matrix multiplication. In ICPP. 503–510.
[12]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV. 213–229.
[13]
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP. 4960–4964.
[14]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. CoRR abs/1604.06174 (2016).
[15]
Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In EMNLP. 551–561.
[16]
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019).
[17]
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, et al. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP. 1724–1734.
[18]
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, et al. 2021. Rethinking attention with performers. In ICLR.
[19]
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.
[20]
Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In EMNLP-IJCNLP. 2174–2184.
[21]
Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. In NeurIPS.
[22]
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL. 2978–2988.
[23]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In ICLR.
[24]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171–4186.
[25]
Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: Non-linear independent components estimation. In ICLR.
[26]
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density estimation using real NVP. In ICLR.
[27]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale. In ICLR.
[28]
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. In ICLR.
[29]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. J. Mach. Learn. Res. 20, 55 (2019), 1–21.
[30]
William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv e-prints (2021), arXiv:2101.03961.
[31]
Quentin Fournier, Daniel Aloise, Seyed Vahid Azhari, and François Tetreault. 2021. On improving deep learning trace analysis with system call arguments. In MSR. 120–130.
[32]
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR.
[33]
Andrea Galassi, Marco Lippi, and Paolo Torroni. 2021. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 10 (2021), 4291–4308.
[34]
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Vol. 9, 249–256.
[35]
Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. 2017. The reversible residual network: Backpropagation without storing activations. In NeurIPS, Vol. 30.
[36]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Retrieved from http://www.deeplearningbook.org.
[37]
Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, et al. 2019. Recurrent independent mechanisms. CoRR abs/1909.10893 (2019).
[38]
Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR abs/1603.08983 (2016).
[39]
Scott Gray, Alec Radford, and Diederik P. Kingma. 2017. GPU kernels for block-sparse weights. https://cdn.openai.com/blocksparse/blocksparsepaper.pdf.
[40]
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech. 5036–5040.
[41]
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. In NAACL-HLT. 1315–1325.
[42]
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[43]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv e-prints (2015), arXiv:1503.02531.
[44]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735–1780.
[45]
Sara Hooker. 2020. The hardware lottery. CoRR abs/2009.06489 (2020).
[46]
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, et al. 2019. Music transformer. In ICLR.
[47]
Xiao Shi Huang, Felipe Pérez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In ICML, Vol. 119, 4475–4483.
[48]
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In NeurIPS, Vol. 32.
[49]
IEA. 2018. World gross electricity production, by source, 2018. Retrieved from https://www.iea.org/data-and-statistics/charts/world-gross-electricity-production-by-source-2018.
[50]
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, Vol. 37, 448–456.
[51]
B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, et al. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE/CVF. 2704–2713.
[52]
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computat. 3 (1991), 79–87.
[53]
Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. CoRR abs/1902.10186 (2019).
[54]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, et al. 2020. TinyBERT: Distilling BERT for natural language understanding. In EMNLP. 4163–4174.
[55]
Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, et al. 2019. A comparative study on transformer vs RNN in speech applications. In ASRU. 449–456.
[56]
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, Vol. 119, 5156–5165.
[57]
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A survey. ACM Comput. Surv. 54, 10s (2022).
[58]
Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In ACL. 284–294.
[59]
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In ICLR.
[60]
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, et al. 2020. Big transfer (BiT): General visual representation learning. In ECCV, Vol. 12350, 491–507.
[61]
Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco Ravanelli, et al. 2021. Transformers with competitive ensembles of independent mechanisms. arXiv e-prints (2021), arXiv:2103.00336.
[62]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
[63]
Yann LeCun, John S. Denker, and Sara A. Solla. 1990. Optimal brain damage. In NIPS. 598–605.
[64]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv e-prints (2016), arXiv:1607.06450.
[65]
Mohan Li, Catalin Zorila, and Rama Doddipatla. 2020. Transformer-based online speech recognition with decoder-end adaptive computation steps. arXiv e-prints (2020), arXiv:2011.13834.
[66]
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, et al. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In NeurIPS. 5244–5254.
[67]
Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. 2020. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. In NeurIPS, Vol. 33, 14544–14555.
[68]
Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2021. A survey of transformers. arXiv e-prints (2021), arXiv:2106.04554.
[69]
Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, and Geoffrey Zweig. 2021. Improving RNN transducer based ASR with auxiliary tasks. In SLT. 172–179.
[70]
Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. 2021. Pay attention to MLPs. CoRR abs/2105.08050 (2021).
[71]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In ICLR.
[72]
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, et al. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.
[73]
Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the difficulty of training transformers. In EMNLP. 5747–5763.
[74]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019).
[75]
Alexandra Luccioni, Alexandre Lacoste, and Victor Schmidt. 2020. Estimating carbon emissions of artificial intelligence [opinion]. IEEE Technol. Societ. Mag. 39, 2 (2020), 48–51.
[76]
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR.
[77]
Matt Mahoney. 2011. Large Text Compression Benchmark. Retrieved from http://mattmahoney.net/dc/text.html.
[78]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture models. In ICLR.
[79]
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In NeurIPS, Vol. 32.
[80]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et al. 2018. Mixed precision training. In ICLR.
[81]
Nikita Nangia and Samuel R. Bowman. 2018. ListOps: A diagnostic dataset for latent tree learning. In NAACL.
[82]
Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, et al. 2021. Do transformer modifications transfer across implementations and applications? arXiv e-prints (2021), arXiv:2102.11972.
[83]
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP. 1797–1807.
[84]
OpenAI. 2013. Saving memory using gradient-checkpointing. Retrieved from https://github.com/openai/gradient-checkpointing.
[85]
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In WMT. 1–9.
[86]
Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, et al. 2020. SpecAugment on large scale datasets. In ICASSP. 6879–6883.
[87]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS. 8024–8035.
[88]
Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In ICML, Vol. 80, 4092–4101.
[89]
Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT plays the lottery, all tickets are winning. In EMNLP. 3208–3229.
[90]
Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. 2020. Blockwise self-attention for long document understanding. In EMNLP. 2555–2565.
[91]
Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[92]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[93]
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In ICLR.
[94]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP. 2383–2392.
[95]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In AAAI. 4780–4789.
[96]
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Computat. Ling. 9 (2021), 53–68.
[97]
D. E. Rumelhart, P. Smolensky, J. L. McClelland, and G. E. Hinton. 1986. Schemata and sequential thought processes in PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 7–57.
[98]
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. On the effect of dropping layers of pre-trained transformer models. arXiv e-prints (2020), arXiv:2004.03844.
[99]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019).
[100]
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. How does batch normalization help optimization? In NeurIPS, Vol. 31.
[101]
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In ACL. 2931–2951.
[102]
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In NAACL-HLT. 464–468.
[103]
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, et al. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR.
[104]
Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, et al. 2021. SparseBERT: Rethinking the importance analysis in self-attention. In ICML, Vol. 139, 9547–9557.
[105]
David R. So, Quoc V. Le, and Chen Liang. 2019. The evolved transformer. In ICML, Vol. 97, 5877–5886.
[106]
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958.
[107]
Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Herve Jegou, et al. 2021. Training with quantization noise for extreme model compression. In ICLR.
[108]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. CoRR abs/1906.02243 (2019).
[109]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2020. Energy and policy considerations for modern deep learning research. Proc. AAAI Conf. Artif. Intell. 34, 09 (2020), 13693–13696.
[110]
Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. 2021. Attention is all you need in speech separation. In ICASSP. 21–25.
[111]
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In ACL. 331–335.
[112]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104–3112.
[113]
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models. arXiv e-prints (2020), arXiv:2005.00743.
[114]
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020. Sparse sinkhorn attention. In ICML, Vol. 119, 9438–9447.
[115]
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, et al. 2021. Long range arena: A benchmark for efficient transformers. In ICLR.
[116]
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. CoRR abs/2009.06732 (2020).
[117]
Wilson L. Taylor. 1953. “Cloze Procedure”: A new tool for measuring readability. Journal. Quart. 30, 4 (1953), 415–433.
[118]
Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, et al. 2021. MLP-Mixer: An all-MLP architecture for vision. CoRR abs/2105.01601 (2021).
[119]
Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, and Jason Riesa. 2020. Finding fast transformers: One-shot neural architecture search by component composition. arXiv e-prints (2020), arXiv:2008.06808.
[120]
Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP. 3632–3636.
[121]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017. Attention is all you need. In NIPS. 5998–6008.
[122]
Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. 2020. Fast transformers with clustered attention. In NeurIPS.
[123]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, et al. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, Vol. 32.
[124]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP. 353–355.
[125]
Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a diet. arXiv e-prints (2020), arXiv:2002.06170.
[126]
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv e-prints (2020), arXiv:2006.04768.
[127]
Yuxin Wang, Qiang Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Kaiyong Zhao, et al. 2020. Benchmarking the performance and energy efficiency of AI accelerators for AI training. In CCGRID. 744–751.
[128]
Yu Wang, Gu-Yeon Wei, and David Brooks. 2019. Benchmarking TPU, GPU, and CPU platforms for deep learning. CoRR abs/1907.10701 (2019).
[129]
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Trans. Assoc. Computat. Ling. 7 (2019), 625–641.
[130]
Lilian Weng. 2018. Attention? Attention! Retrieved from http://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html.
[131]
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In EMNLP-IJCNLP. 11–20.
[132]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, et al. 2020. Transformers: State-of-the-art natural language processing. In EMNLP. 38–45.
[133]
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite transformer with long-short range attention. In ICLR.
[134]
Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. 2020. Self-training with noisy student improves ImageNet classification. In CVPR. 10684–10695.
[135]
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, et al. 2020. On layer normalization in the transformer architecture. In ICML, Vol. 119, 10524–10533.
[136]
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, et al. 2021. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv e-prints (2021), arXiv:2102.03902.
[137]
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision. In CVPR. 10819–10829.
[138]
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. In EMC2-NIPS.
[139]
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, et al. 2020. Big bird: Transformers for longer sequences. In NeurIPS, Vol. 33, 17283–17297.
[140]
Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, et al. 2020. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP. 7829–7833.
[141]
Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In ICLR.
[142]
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In CVPR. 8697–8710.


    Published In

    ACM Computing Surveys, Volume 55, Issue 14s
    December 2023
    1355 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/3606253

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 July 2023
    Online AM: 04 March 2023
    Accepted: 23 February 2023
    Revised: 05 October 2022
    Received: 27 January 2022
    Published in CSUR Volume 55, Issue 14s

    Author Tags

    1. Deep learning
    2. efficient transformer
    3. self-attention
    4. survey

    Qualifiers

    • Survey

    Funding Sources

    • Natural Sciences and Engineering Research Council of Canada (NSERC), Prompt, Ericsson, Ciena, and EfficiOS
