Abstract
Vision Transformer (ViT) has fully demonstrated the potential of the Transformer in the computer vision domain. However, its computational complexity scales with the input sequence length, which remains constant across all layers of a plain Transformer. Training a vision transformer is therefore extremely memory-intensive, since a large number of intermediate activations and parameters must be stored to compute the gradients during back-propagation. In this paper, we propose Conv-PVT (Convolution blocks + Pyramid Vision Transformer) to improve the overall performance of the vision transformer. Specifically, we deploy simple convolution blocks in the first layer to reduce the memory footprint by down-sampling the input. Extensive experiments (covering image classification, object detection and segmentation) have been carried out on the ImageNet-1k, COCO and ADE20K datasets to evaluate the accuracy, training time, memory occupation and robustness of our model. The results demonstrate that Conv-PVT achieves performance comparable to the original PVT and outperforms ResNet and ResNeXt on several downstream vision tasks, while shortening training time by 60%, reducing GPU (Graphics Processing Unit) memory occupation by 42%, and doubling the inference speed of PVT.
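To make the idea of the abstract concrete, the sketch below shows one way a convolutional down-sampling stem can sit in front of a transformer stage so that attention operates on far fewer tokens than raw pixels. This is a minimal illustration only, not the authors' released implementation; the class name ConvStem, the channel widths, and the stride choices are assumptions made for the example.

```python
# Minimal sketch (assumed layer sizes, not the authors' code): a small stack of
# stride-2 convolutions down-samples the input image before tokenization, so the
# first attention stage sees a 56x56 token grid instead of 224x224 pixels.
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Convolutional stem that reduces spatial resolution by 4x before
    flattening the feature map into a token sequence."""

    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.stem(x)                       # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, (H/4)*(W/4), C) token sequence
        return tokens, (H, W)


if __name__ == "__main__":
    stem = ConvStem()
    img = torch.randn(2, 3, 224, 224)
    tokens, (h, w) = stem(img)
    print(tokens.shape)  # torch.Size([2, 3136, 64]): 56x56 tokens fed to attention
```

Because activation memory and attention cost both grow with the number of tokens, shrinking the token grid at the stem is what drives the reported savings in training time and GPU memory.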
Data Availability Statement
All data included in this study are available upon request from the corresponding author.
Ethics declarations
Declaration of Interest Statement
I declare that I have no financial support or personal relationships with other people or organizations that could inappropriately influence this work, and no professional or other personal interest of any nature or kind in any product, service, or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Conv-PVT: a fusion architecture of convolution and pyramid vision transformer".
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, X., Zhang, Y. Conv-PVT: a fusion architecture of convolution and pyramid vision transformer. Int. J. Mach. Learn. & Cyber. 14, 2127–2136 (2023). https://doi.org/10.1007/s13042-022-01750-0