Acknowledgements
This work was supported by the National Natural Science Foundation of China (Project No. 61521002).
Author information
Meng-Hao Guo is a Ph.D. candidate supervised by Prof. Shi-Min Hu in the Department of Computer Science and Technology at Tsinghua University, Beijing, China. His research interests include computer graphics, computer vision, and machine learning.
Zheng-Ning Liu received his bachelor's degree in computer science from Tsinghua University in 2017. He is currently a Ph.D. candidate in computer science at Tsinghua University. His research interests include 3D computer vision, 3D reconstruction, and computer graphics.
Tai-Jiang Mu is currently an assistant researcher at Tsinghua University, where he received his B.S. and Ph.D. degrees in computer science in 2011 and 2016, respectively. His research interests include computer vision, robotics, and computer graphics.
Dun Liang is a Ph.D. candidate in computer science at Tsinghua University, where he received his B.S. degree in 2016. His research interests include computer graphics, visual media learning, and high-performance computing.
Ralph R. Martin received his Ph.D. degree from Cambridge University in 1983. He is an emeritus professor of Cardiff University, having published over 300 papers and 14 books on geometry processing, computer graphics, and computer vision. He is a Fellow of the Learned Society of Wales, the Institute of Mathematics and its Applications, and the British Computer Society. He is currently an Associate Editor-in-Chief of Computational Visual Media.
Shi-Min Hu is currently a professor in computer science at Tsinghua University. He received his Ph.D. degree from Zhejiang University in 1996. His research interests include geometry processing, image and video processing, rendering, computer animation, and CAD. He has published more than 100 papers in journals and refereed conferences. He is Editor-in-Chief of Computational Visual Media, and on the editorial boards of several journals, including Computer-Aided Design and Computers & Graphics.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Guo, MH., Liu, ZN., Mu, TJ. et al. Can attention enable MLPs to catch up with CNNs?. Comp. Visual Media 7, 283–288 (2021). https://doi.org/10.1007/s41095-021-0240-x