Abstract
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and delivers significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
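To make the three designs concrete, below is a minimal PyTorch sketch of each component as we understand it from the paper: linear spatial-reduction attention average-pools the keys and values to a fixed 7×7 grid, so attention costs O(N·49·d) rather than O(N²·d) in the number of tokens N; the overlapping patch embedding is a strided convolution whose kernel is larger than its stride; and the convolutional feed-forward network inserts a 3×3 depth-wise convolution between its two linear layers. Class names, default hyper-parameters, and the simplified block layout (no pre-norm, no dropout) are illustrative assumptions, not the authors' code; see the linked repository for the reference implementation.

```python
# Illustrative sketch of the three PVT v2 designs; NOT the official code.
# See https://github.com/whai362/PVT for the reference implementation.
import torch
import torch.nn as nn


class OverlapPatchEmbed(nn.Module):
    """(ii) Overlapping patch embedding: a strided convolution whose kernel
    is larger than its stride, so neighbouring patches share pixels."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, D, H/4, W/4)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, H*W, D) token sequence
        return self.norm(x), H, W


class LinearSRAttention(nn.Module):
    """(i) Linear-complexity attention: keys/values come from a feature map
    average-pooled to a fixed 7x7 grid, so cost is linear in token count."""
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                 # x: (B, N, D), N = H*W
        B, N, D = x.shape
        h = self.num_heads
        q = self.q(x).reshape(B, N, h, D // h).transpose(1, 2)
        # Pool the spatial map to 7x7 before computing K and V.
        pooled = self.pool(x.transpose(1, 2).reshape(B, D, H, W))
        pooled = self.norm(pooled.flatten(2).transpose(1, 2))   # (B, 49, D)
        kv = self.kv(pooled).reshape(B, -1, 2, h, D // h).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                     # each (B, h, 49, D/h)
        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, h, N, 49)
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


class ConvFFN(nn.Module):
    """(iii) Convolutional feed-forward network: a 3x3 depth-wise convolution
    between the two linear layers injects local positional information."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                 # x: (B, N, D)
        x = self.fc1(x)
        B, N, C = x.shape
        x = self.dwconv(x.transpose(1, 2).reshape(B, C, H, W))
        x = x.flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))


# Quick shape check on a dummy image.
img = torch.randn(1, 3, 224, 224)
tokens, H, W = OverlapPatchEmbed()(img)         # tokens: (1, 3136, 64)
tokens = tokens + LinearSRAttention()(tokens, H, W)
tokens = tokens + ConvFFN()(tokens, H, W)
print(tokens.shape)                             # torch.Size([1, 3136, 64])
```

In the full model these blocks are layer-normalized and stacked into a four-stage pyramid; the sketch omits that scaffolding for brevity.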
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61672273 and 61832008, the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant No. BK20160021, the Postdoctoral Innovative Talent Support Program of China under Grant Nos. BX20200168 and 2020M681608, and the General Research Fund of Hong Kong under Grant No. 27208720.
Ethics declarations
The authors declare that they have no competing interests relevant to the content of this article.
Additional information
Wenhai Wang received his B.S. degree from Nanjing University of Science and Technology, China, in 2016. He is currently a Ph.D. student with the Department of Computer Science, Nanjing University. His main research interests include scene text detection and recognition, deep neural network exploration, object detection, and instance segmentation.
Enze Xie received his B.S. degree from Nanjing University of Aeronautics and Astronautics, China, in 2016, and his M.S. degree from Tongji University, China, in 2019. He is currently a Ph.D. student with the Department of Computer Science, the University of Hong Kong. His main research interests include object detection and instance segmentation.
Xiang Li received his B.S. degree in computer science from Nanjing University of Science and Technology in 2013, where he is currently working towards a Ph.D. degree in pattern recognition and intelligent systems. His research interests include computer vision, pattern recognition, data mining, and deep learning.
Deng-Ping Fan is a postdoctoral researcher at ETH Zurich, Switzerland. He received his Ph.D. degree from Nankai University in 2019 and joined the Inception Institute of Artificial Intelligence (IIAI) in the same year. He has published more than 30 papers in top journals and conferences. His research interests include computer vision, deep learning, and saliency detection.
Kaitao Song received his Ph.D. degree in computer science from Nanjing University of Science and Technology in 2021. His research interests focus on machine learning and deep learning algorithms for natural language processing and speech processing, including pretrained language models, neural machine translation, music generation, text summarization, neural architecture search for NLP, automatic speech recognition, and text-to-speech synthesis.
Ding Liang has been working for SenseTime Ltd. since graduating from Tsinghua University. He is now an associate director and head of the OCR team. His main research interests include OCR, face recognition, and model compression.
Tong Lu received his Ph.D. degree in computer science from Nanjing University in 2005, where he also received his M.Sc. and B.Sc. degrees in 2002 and 1997, respectively. He served as an assistant professor from 2005 and an associate professor from 2007 in the Department of Computer Science and Technology at Nanjing University, where he is now a full professor. He is also a member of the National Key Laboratory of Novel Software Technology in China. He has published over 130 papers, authored 2 books, and holds more than 30 international and Chinese patents. His current interests are in multimedia, computer vision, and pattern recognition algorithms and systems.
Ping Luo is an assistant professor in the Department of Computer Science, The University of Hong Kong. He received his Ph.D. degree in information engineering from the Chinese University of Hong Kong in 2014, and was a postdoctoral fellow there from 2014 to 2016. He was a principal research scientist at SenseTime Research from 2017 to 2018. His research interests are machine learning and computer vision. He has published 100+ peer-reviewed articles in top-tier conferences and journals. He was named a young innovator under 35 by MIT Technology Review (TR35) Asia Pacific.
Ling Shao is the CEO and Chief Scientist of the Inception Institute of AI (IIAI), Abu Dhabi, United Arab Emirates (UAE). He was the initiator, founding Provost, and Executive Vice President of the Mohamed bin Zayed University of Artificial Intelligence (the world's first AI university), UAE. His research interests include computer vision, machine learning, and medical imaging. He is a fellow of the IEEE, the IAPR, the IET, and the BCS.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, W., Xie, E., Li, X. et al. PVT v2: Improved baselines with Pyramid Vision Transformer. Comp. Visual Media 8, 415–424 (2022). https://doi.org/10.1007/s41095-022-0274-8