Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review
Abstract
1. Introduction
- RQ1—Can the ViT architecture outperform the CNN architecture regardless of the characteristics of the dataset?
- RQ2—What prevents CNNs from performing as well as ViTs?
- RQ3—How does the Multi-Head Attention mechanism, a key component of ViTs, influence the performance of these models in image classification?
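To make RQ3 concrete, the sketch below illustrates the scaled dot-product Multi-Head Attention used in the ViT encoder [1,28]: the image is split into fixed-size patches, each patch is linearly projected into a token, and every token attends to every other token in parallel heads. This is a didactic NumPy sketch with randomly initialised weights rather than code from any of the reviewed papers; the shapes correspond to the ViT-B/16 configuration described in [1] (196 patches of a 224 × 224 image plus one class token, 768-dimensional embeddings, 12 heads), and all variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(tokens, num_heads, rng):
    """Scaled dot-product self-attention over a sequence of patch tokens.

    tokens: (seq_len, embed_dim) array of patch embeddings (+ class token).
    Returns an array of the same shape in which every token is a weighted
    mixture of all tokens, i.e., a global receptive field in a single layer.
    """
    seq_len, embed_dim = tokens.shape
    head_dim = embed_dim // num_heads
    # Random projection matrices stand in for the learned Q, K, V and output weights.
    w_q, w_k, w_v, w_o = (rng.standard_normal((embed_dim, embed_dim)) / np.sqrt(embed_dim)
                          for _ in range(4))
    # Project and split into heads: (num_heads, seq_len, head_dim).
    q = (tokens @ w_q).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    v = (tokens @ w_v).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    # (num_heads, seq_len, seq_len): each head computes its own attention pattern.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, embed_dim)
    return out @ w_o

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((197, 768))  # 196 patches of a 224x224 image + class token (ViT-B/16)
mixed = multi_head_attention(patch_tokens, num_heads=12, rng=rng)
print(mixed.shape)  # (197, 768)
```

Each token attends to every other token within a single layer, giving a global receptive field from the first block onwards, whereas a convolution aggregates information only within its local kernel; this architectural difference underlies much of the comparison addressed by the research questions above.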
2. Research Methodology
2.1. Data Sources
2.2. Search String
2.3. Inclusion Criteria
2.4. Exclusion Criteria
2.5. Results
3. Findings
Ref. | Datasets | Image Size | Number of Classes | Hardware | Evaluated Architectures | Best Architecture | Best Results |
---|---|---|---|---|---|---|---|
[5] | ImageNet-1K (more than 1.431 M images) for training and ImageNet-C for validation | 224 × 224 | 2 | N/A | ViT-B/16, ViT-L/16, Mixer-B/16, RN18 (SWSL), RN50 (SWSL), RN18 and RN50 | ViT-L/16 | 82.89% Acc |
[6] | ImageNet-A; ImageNet-C and Stylized ImageNet | 224 × 224 | N/A | N/A | ResNet50 and DeiT-S | N/A | N/A |
[7] | 5856 images collected by X-ray | 250 × 250 for ViT; 224 × 224 for CNN | 2 | Intel Core i5-8300H 2.30 GHz | ViT, CNN and VGG16 | ViT | 96.45% Acc, 86.38% val. Acc, 10.92% loss and 18.25% val. loss |
[8] | ImageNet ILSVRC 2012 (1.78 M images) | 224 × 224 | 1000 | N/A | ViT-B/32, ViT-L/16, ViT-H/14, ResNet50 and ResNet152 | N/A | N/A |
[9] | Public dataset 1 [25] with 780 images; Public dataset 2 [26] with 163 images | 224 × 224 | 3 | N/A | ViT-S/32, ViT-B/32, ViT-Ti/16, R26 + S/16, R + Ti/16, VGG, Inception and NASNET | ViT-B/32 | 86.7% Acc and 95% AUC |
[10] | Flower 102 (4080 to 11,016 images); CUB 200 (11,788 images); Indoor 67 (15,620 images); NY Depth V2 (1449 images); WikiArt; COVID-19 Image Data Collection (700 images); Caltech101 (9146 images); FG-NET (1002 images) | 384 × 384; 224 × 224; 300 × 300 | 40 to 102 | N/A | R-101 × 3, R-152 × 4, ViT-B/16, ViT-L/16, and Swin-B | N/A | N/A |
[11] | 3192 images collected by CT and 160 images of a public dataset with CT biomarkers | 61 × 61 | 4 | Intel Core i7-9700 3.0 GHz, 32 GB RAM; NVIDIA GeForce RTX 2080 Ti (11 GB GDDR6) | AlexNet, VGG-16, InceptionV3, MobileNetV2, ResNet34, ResNet50 and ViT | ViT | 95.95% Acc |
[12] | ImageNet-C benchmark | 224 × 224 | 2 | NVIDIA Quadro A6000 | ViT-L/16, CNN and hybrid model (BiT-M + ResNet152 × 4) | Hybrid model | 99.20% Acc |
[13] | Dataset provided in DFUC 2021 challenge (15,683 images) | 224 × 224 | 4 | NVIDIA GeForce RTX 3080, 10 GB memory | EfficientNetB3, BiT-ResNeXt50, ViT-B/16 and DeiT-S/16 | BiT-ResNeXt50 | 88.49% AUC, 61.53% F1-Score, 65.59% recall and 60.53% precision |
[14] | 3400 images collected by holographic camera | 512 × 512 | 10 | NVIDIA V100 | EfficientNetB7, DenseNet169 and ViT-B/16 | ViT-B/16 | 99% Acc |
[15] | ForgeryNet with 2.9 M images | N/A | 2 | N/A | EfficientNetV2 and ViT-B | N/A | N/A |
[16] | German dataset (51,830 images); Indian dataset (1976 images); Chinese dataset (18,168 images) | 128 × 128 | 15, 43 and 103 | AMD Ryzen 7 5800H; NVIDIA GeForce RTX 3070 | DenseNet161, ViT, DeepViT, MLP-Mixer, CvT, PiT, CaiT, CCT, CrossViT and Twins-SVT | CCT | 99.04% Acc |
[17] | HAM10000 dataset (10,015 images); 1016 images collected by dermoscopy | 224 × 224 | 3 | Intel i7; 2x NVIDIA RTX 3060, 12 GB | MobileNetV2, ResNet50, InceptionV2, ViT and Proposed ViT model | Proposed ViT model | 94.10% Acc, 94.10% precision and 94.10% F1-Score |
[18] | IP102 dataset (75,222 images); D0 dataset (4508 images); Li’s dataset (5629 images) | 224 × 224; 480 × 480 | 10 to 102 | Train: Intel Xeon; 8x NVIDIA Tesla V100, 256 GB. Test: Intel Core; NVIDIA GTX 1060 Ti, 16 GB | ResNet, EfficientNetB0, EfficientNetB1, RepVGG, VGG-16, ViT-L/16 and proposed hybrid model | Proposed hybrid model | 99.47% Acc on the D0 dataset and 97.94% Acc on Li’s dataset |
[19] | CIFAR-10 and CIFAR-100 (60,000 images each); Speech Commands (100,503 1-second audio clips); GTZAN (1000 30-second audio clips); DISCO (1935 images) | 224 × 224 px; 1024 × 576 px. Spectrograms: 229 × 229 samples; 512 × 256 samples | 10 to 100 | 4x NVIDIA 2080 Ti | SL-ViT, ResNet152, DenseNet201 and InceptionV3 | SL-ViT | 71.89% Acc |
[20] | CrackTree260 (260 images); Özgenel (458 images); Lab’s own dataset (80,000 images) | 256 × 256; 448 × 448 | 2 | N/A | TransUNet, U-Net, DeepLabv3+ and CNN + ViT | CNN + ViT | 99.55% Acc and 99.57% precision |
[21] | 10,265 images collected by a Pilgrim Technologies UAV with a 36-megapixel Sony ILCE-7R camera | 64 × 64 | 5 | Intel Xeon E5-1620 V4 3.50 GHz with 8 cores, 16 GB RAM; NVIDIA Quadro M2000 | ViT-B/16, ViT-B/32, EfficientNetB0, EfficientNetB1 and ResNet50 | ViT-B/16 | 99.8% Acc |
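A note on naming in the table above: the suffix in identifiers such as ViT-B/16 or ViT-B/32 is the patch size in pixels, so the input resolution and patch size together determine how many tokens the Transformer encoder attends over. The following helper is purely illustrative (it is not taken from any of the reviewed papers, and the name `vit_sequence_length` is a placeholder); it shows the calculation for resolutions and patch sizes that appear in the reviewed studies.

```python
def vit_sequence_length(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens a ViT processes for a square input image.

    The image is cut into non-overlapping patch_size x patch_size patches;
    each patch becomes one token, plus an optional [class] token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (image_size // patch_size) ** 2
    return num_patches + (1 if class_token else 0)

# Resolutions and patch sizes seen in the reviewed studies.
for image_size, patch_size in [(224, 16), (224, 32), (384, 16)]:
    print(f"{image_size}x{image_size}, /{patch_size}: "
          f"{vit_sequence_length(image_size, patch_size)} tokens")
# 224x224, /16: 197 tokens
# 224x224, /32: 50 tokens
# 384x384, /16: 577 tokens
```

Because the cost of self-attention grows quadratically with the number of tokens, patch size largely determines the compute difference between the /16 and /32 variants compared above, which is also why most of the reviewed studies resize their inputs to 224 × 224.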
4. Discussion
- RQ1—Can the ViT architecture outperform the CNN architecture regardless of the characteristics of the dataset?
- RQ2—What prevents CNNs from performing as well as ViTs?
- RQ3—How does the Multi-Head Attention mechanism, a key component of ViTs, influence the performance of these models in image classification?
5. Threats to Validity
6. Strengths, Limitations, and Future Research Directions
6.1. Strengths
6.1.1. Dataset Considerations
6.1.2. Robustness
6.1.3. Performance Optimization
6.1.4. Evaluation
6.1.5. Explainability and Interpretability
6.1.6. Architecture
6.2. Limitations
6.2.1. Dataset Considerations
6.2.2. Robustness
6.2.3. Performance Optimization
6.2.4. Evaluation
6.2.5. Explainability and Interpretability
6.2.6. Architecture
6.3. Future Research Directions
6.3.1. Dataset Considerations
6.3.2. Robustness
6.3.3. Performance Optimization
6.3.4. Evaluation
6.3.5. Explainability and Interpretability
6.3.6. Architecture
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way. Available online: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 (accessed on 8 January 2023).
- Snyder, H. Literature Review as a Research Methodology: An Overview and Guidelines. J. Bus. Res. 2019, 104, 333–339.
- Matloob, F.; Ghazal, T.M.; Taleb, N.; Aftab, S.; Ahmad, M.; Khan, M.A.; Abbas, S.; Soomro, T.R. Software Defect Prediction Using Ensemble Learning: A Systematic Literature Review. IEEE Access 2021, 9, 98754–98771.
- Benz, P.; Ham, S.; Zhang, C.; Karjauv, A.; Kweon, I.S. Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs. arXiv 2021, arXiv:2110.02797.
- Bai, Y.; Mei, J.; Yuille, A.; Xie, C. Are Transformers More Robust Than CNNs? arXiv 2021, arXiv:2111.05464.
- Tyagi, K.; Pathak, G.; Nijhawan, R.; Mittal, A. Detecting Pneumonia Using Vision Transformer and Comparing with Other Techniques. In Proceedings of the 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA), IEEE, Coimbatore, India, 2 December 2021; pp. 12–16.
- Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural Networks? arXiv 2021, arXiv:2108.08810.
- Gheflati, B.; Rivaz, H. Vision Transformer for Classification of Breast Ultrasound Images. arXiv 2021, arXiv:2110.14731.
- Zhou, H.-Y.; Lu, C.; Yang, S.; Yu, Y. ConvNets vs. Transformers: Whose Visual Representations Are More Transferable? In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), IEEE, Montreal, QC, Canada, 17 October 2021; pp. 2230–2238.
- Wu, Y.; Qi, S.; Sun, Y.; Xia, S.; Yao, Y.; Qian, W. A Vision Transformer for Emphysema Classification Using CT Images. Phys. Med. Biol. 2021, 66, 245016.
- Filipiuk, M.; Singh, V. Comparing Vision Transformers and Convolutional Nets for Safety Critical Systems. AAAI Workshop Artif. Intell. Saf. 2022, 3087, 1–5.
- Galdran, A.; Carneiro, G.; Ballester, M.A.G. Convolutional Nets Versus Vision Transformers for Diabetic Foot Ulcer Classification. arXiv 2022, arXiv:2111.06894.
- Cuenat, S.; Couturier, R. Convolutional Neural Network (CNN) vs Vision Transformer (ViT) for Digital Holography. In Proceedings of the 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), IEEE, Shanghai, China, 18 March 2022; pp. 235–240.
- Coccomini, D.A.; Caldelli, R.; Falchi, F.; Gennaro, C.; Amato, G. Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection. In Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, Newark, NJ, USA, 27–30 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 52–58.
- Wang, H. Traffic Sign Recognition with Vision Transformers. In Proceedings of the 6th International Conference on Information System and Data Mining, Silicon Valley, CA, USA, 27–29 May 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 55–61.
- Xin, C.; Liu, Z.; Zhao, K.; Miao, L.; Ma, Y.; Zhu, X.; Zhou, Q.; Wang, S.; Li, L.; Yang, F.; et al. An Improved Transformer Network for Skin Cancer Classification. Comput. Biol. Med. 2022, 149, 105939.
- Peng, Y.; Wang, Y. CNN and Transformer Framework for Insect Pest Classification. Ecol. Inform. 2022, 72, 101846.
- Bakhtiarnia, A.; Zhang, Q.; Iosifidis, A. Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead. Neural Netw. 2022, 153, 461–473.
- Asadi Shamsabadi, E.; Xu, C.; Rao, A.S.; Nguyen, T.; Ngo, T.; Dias-da-Costa, D. Vision Transformer-Based Autonomous Crack Detection on Asphalt and Concrete Surfaces. Autom. Constr. 2022, 140, 104316.
- Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Vision Transformers for Weeds and Crops Classification of High Resolution UAV Images. Remote Sens. 2022, 14, 592.
- Bottou, L.; Bousquet, O. The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems; Platt, J., Koller, D., Singer, Y., Roweis, S., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2007; Volume 20.
- Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv 2020, arXiv:2010.01412.
- Korpelevich, G.M. The Extragradient Method for Finding Saddle Points and Other Problems. Ekon. Mat. Metod. 1976, 12, 747–756.
- Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Fahmy, A. Dataset of Breast Ultrasound Images. Data Brief 2020, 28, 104863.
- Yap, M.H.; Pons, G.; Marti, J.; Ganau, S.; Sentis, M.; Zwiggelaar, R.; Davison, A.K.; Marti, R. Automated Breast Ultrasound Lesions Detection Using Convolutional Neural Networks. IEEE J. Biomed. Health Inform. 2018, 22, 1218–1226.
- Zhang, R. Making Convolutional Networks Shift-Invariant Again. arXiv 2019, arXiv:1904.11486.
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. DeepViT: Towards Deeper Vision Transformer. arXiv 2021, arXiv:2103.11886.
- Amorim, J.P.; Domingues, I.; Abreu, P.H.; Santos, J.A.M. Interpreting Deep Learning Models for Ordinal Problems. In Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 25–27 April 2018.
Data Source | Number of Results | Number of Selected Papers |
---|---|---|
ACM Digital Library | 19,159 | 1 |
Google Scholar | 10,700 | 10 |
Science Direct | 1437 | 3 |
Scopus | 55 | 2 |
Web of Science | 90 | 1 |
Data Source | Search String |
---|---|
ACM Digital Library | ((Vision Transformers) AND (convolutional neural networks) AND (images classification) AND (comparing)) |
Google Scholar | ((ViT) AND (CNN) AND (Images Classification) OR (Comparing) OR (Vision Transformers) OR (convolutional neural networks) OR (differences)) |
Science Direct | ((Vision Transformers) AND (convolutional neural networks) AND (images classification) AND (comparing)) |
Scopus | ((ViT) AND (CNN) AND (comparing)) |
Web of Science | ((ViT) AND (CNN) AND (comparing)) |
Ref. | Title | Year | Type |
---|---|---|---|
[5] | Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs | 2021 | Conference |
[6] | Are Transformers More Robust Than CNNs? | 2021 | Conference |
[7] | Detecting Pneumonia using Vision Transformer and comparing with other techniques | 2021 | Conference |
[8] | Do Vision Transformers See Like Convolutional Neural Networks? | 2021 | Conference |
[9] | Vision Transformer for Classification of Breast Ultrasound Images | 2021 | Conference |
[10] | ConvNets vs. Transformers: Whose Visual Representations are More Transferable? | 2021 | Conference |
[11] | A vision transformer for emphysema classification using CT images | 2021 | Journal |
[12] | Comparing Vision Transformers and Convolutional Nets for Safety Critical Systems | 2022 | Conference |
[13] | Convolutional Nets Versus Vision Transformers for Diabetic Foot Ulcer Classification | 2022 | Conference |
[14] | Convolutional Neural Network (CNN) vs Vision Transformer (ViT) for Digital Holography | 2022 | Conference |
[15] | Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection | 2022 | Conference |
[16] | Traffic Sign Recognition with Vision Transformers | 2022 | Conference |
[17] | An improved transformer network for skin cancer classification | 2022 | Journal |
[18] | CNN and transformer framework for insect pest classification | 2022 | Journal |
[19] | Single-layer Vision Transformers for more accurate early exits with less overhead | 2022 | Journal |
[20] | Vision transformer-based autonomous crack detection on asphalt and concrete surfaces | 2022 | Journal |
[21] | Vision Transformers for Weeds and Crops Classification of High-Resolution UAV Images | 2022 | Journal |