Abstract
Compared with traditional image recognition, Fine-Grained Image Recognition (FGIR) is challenging because differences between categories are subtle while variation within a category is large. Complex backgrounds, and the fact that discriminative features are confined to small local regions, make the task harder still. Recent studies have demonstrated the effectiveness of the Vision Transformer (ViT) in FGIR. However, these studies have frequently overlooked the information carried by class tokens at different layers and the subtle local details hidden in patch tokens. To address these issues and improve FGIR performance, we introduce MIFBF, a novel ViT-based network architecture. The proposed model extends ViT with three modules: the Complementary Class Tokens Combination (CCTC) module, the Patches Information Integration (PII) module, and the Attention Cropping Module (ACM). The CCTC module integrates class tokens from multiple layers to capture complementary information, thereby enhancing the model’s representational capacity. The PII module exploits the rich local details encoded in patch tokens to improve classification accuracy. The ACM generates regions of interest from ViT’s self-attention weights and filters background noise, directing the model’s attention to the most relevant image areas. Experiments on three datasets validate the effectiveness of the proposed model, which achieves state-of-the-art results on FGIR tasks.
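To make the ideas behind CCTC and ACM more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the class token sits at index 0 of each layer's token sequence, that class tokens are combined by plain concatenation, and that a region of interest is obtained by thresholding the head-averaged class-to-patch attention at its mean. The function names `combine_class_tokens` and `attention_crop_box` and the parameters `last_k` and `ratio` are hypothetical; the abstract does not specify any of these details.

```python
import numpy as np


def combine_class_tokens(layer_tokens, last_k=3):
    """Concatenate the class tokens of the last `last_k` layers.

    layer_tokens: list of arrays of shape (num_tokens, dim), one per layer.
    Assumption: the class token is stored at index 0 of each array.
    """
    class_tokens = [tokens[0] for tokens in layer_tokens[-last_k:]]
    return np.concatenate(class_tokens, axis=0)


def attention_crop_box(cls_attention, grid_size, patch_size, ratio=1.0):
    """Derive a pixel bounding box from class-to-patch attention.

    cls_attention: array of shape (heads, grid_size * grid_size) holding the
    attention from the class token to every patch (assumed layout).
    Returns (top, left, bottom, right) in pixel coordinates.
    """
    # Average over heads and restore the 2-D patch grid.
    attn = cls_attention.mean(axis=0).reshape(grid_size, grid_size)
    # Keep patches whose attention exceeds ratio * mean (assumed threshold).
    mask = attn > ratio * attn.mean()
    if not mask.any():
        # Fall back to the full image when no patch stands out.
        full = grid_size * patch_size
        return 0, 0, full, full
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    return (int(top * patch_size), int(left * patch_size),
            int(bottom * patch_size), int(right * patch_size))
```

In such a setup, the returned box could be used to crop the input image and feed the crop back through the network, so that background patches outside the box no longer contribute; whether MIFBF does this in the same way is not stated in the abstract.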
Data availability
The datasets generated and/or analyzed during the current study will be made available on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 62163016, 62066014), the Natural Science Foundation of Jiangxi Province (No. 20212ACB202001), the Postgraduate Innovation Fund of the Education Department of Jiangxi Province (No. YC2022-s552), the Foreign Expert Project of the Ministry of Science and Technology (No. G2023022005L), and the Open Project of the State Key Laboratory of Performance Monitoring and Protecting of Rail Transit Infrastructure (Grant No. HJGZ2023203).
Author information
Contributions
Ying Yu: Methodology, proposer of the major academic ideas. Jinghui Wang: Writing – original draft. Jin Qian: Supervision. Witold Pedrycz: Writing – review & editing. Duoqian Miao: Writing – review & editing.
Ethics declarations
Ethical and informed consent for data used
The relevant datasets are publicly available, and the authors confirm that the data used in this article do not raise ethical issues.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, Y., Wang, J., Pedrycz, W. et al. Multi-level information fusion Transformer with background filter for fine-grained image recognition. Appl Intell 54, 8108–8119 (2024). https://doi.org/10.1007/s10489-024-05584-x
DOI: https://doi.org/10.1007/s10489-024-05584-x