research-article
DOI: 10.1145/3651671.3651752

Convolutionally Enhanced Feature Fusion Visual Transformer for Fine-Grained Visual Classification

Published: 07 June 2024
    Abstract

    Fine-grained image classification is an active research topic in computer vision and pattern recognition, where the goal is to recognize subclasses of objects in images at a fine-grained level. In recent years, the Transformer's self-attention mechanism has increasingly been applied to fine-grained image classification because it naturally focuses on the most discriminative regions of an object. This paper proposes a new Convolutionally Enhanced Feature Fusion Visual Transformer, which extends the Feature Fusion Visual Transformer with convolutional operations. First, patch tokens are not taken directly from the original input image but are extracted from convolutionally generated low-level feature maps. Second, spatial-reduction attention lowers the computational complexity and memory consumption of the multi-head attention layer. Finally, an inverted residual feed-forward network is applied in each encoder to strengthen the network's representational ability. Comparative experiments on four datasets show that the method improves the accuracy of fine-grained feature extraction, while the improved self-attention layer reduces computation and memory consumption, improving both the efficiency and the performance of the model.
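    To make the three mechanisms summarized above more concrete, the following PyTorch sketch shows how a convolutional patch embedding, spatial-reduction attention, and an inverted residual feed-forward block are commonly assembled. It is an illustrative approximation based only on the abstract: the module names, dimensions, and stem design are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConvPatchEmbed(nn.Module):
    """Tokenize from convolutional low-level feature maps rather than raw pixels
    (hypothetical stem; the paper's exact design may differ)."""
    def __init__(self, in_chans=3, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 2),
            nn.GELU(),
        )
        self.proj = nn.Conv2d(dim // 2, dim, kernel_size=2, stride=2)

    def forward(self, x):
        x = self.proj(self.stem(x))                  # (B, dim, H/4, W/4)
        B, C, H, W = x.shape
        return x.flatten(2).transpose(1, 2), H, W    # (B, N, dim) patch tokens


class SpatialReductionAttention(nn.Module):
    """Multi-head self-attention whose keys/values are spatially downsampled."""
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Strided convolution shrinks the key/value token map by sr_ratio^2,
        # which is where the computation and memory savings come from.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                            # N = H * W tokens
        hd = C // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, hd).transpose(1, 2)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)   # tokens -> feature map
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, hd).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class InvertedResidualFFN(nn.Module):
    """Feed-forward block: expand, depthwise 3x3 conv, project, plus a skip."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        h = self.act(self.fc1(x))
        h = h.transpose(1, 2).reshape(B, -1, H, W)   # tokens -> feature map
        h = self.act(self.dwconv(h))
        h = h.flatten(2).transpose(1, 2)             # feature map -> tokens
        return x + self.fc2(h)                       # inverted residual skip


if __name__ == "__main__":
    tokens, H, W = ConvPatchEmbed()(torch.randn(1, 3, 224, 224))
    tokens = SpatialReductionAttention(dim=256)(tokens, H, W)
    tokens = InvertedResidualFFN(dim=256)(tokens, H, W)
    print(tokens.shape)                              # torch.Size([1, 3136, 256])
```

    With a 224x224 input, the stem above produces a 56x56 token map (3,136 tokens); with sr_ratio=2, keys and values are pooled to 28x28 (784 tokens), so each attention map shrinks by roughly a factor of four. This is the kind of computation and memory saving the abstract attributes to spatial-reduction attention.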

    Published In

    ICMLC '24: Proceedings of the 2024 16th International Conference on Machine Learning and Computing
    February 2024
    757 pages
    ISBN: 9798400709234
    DOI: 10.1145/3651671
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Convolutional neural network
    2. Fine-grained images
    3. Spatial-reduction attention
    4. Visual transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICMLC 2024
