Multi-level information fusion Transformer with background filter for fine-grained image recognition

Yu, Ying; Wang, Jinghui; Pedrycz, Witold; Miao, Duoqian; Qian, Jin

doi:10.1007/s10489-024-05584-x

Published: 20 June 2024

Applied Intelligence, Volume 54, pages 8108–8119 (2024)

Ying Yu¹,² (ORCID: orcid.org/0000-0002-3480-4571),
Jinghui Wang²,
Witold Pedrycz³,
Duoqian Miao⁴ &
Jin Qian²

Abstract

Compared with traditional image recognition, Fine-Grained Image Recognition (FGIR) is challenging because differences between categories are subtle while variation within a category is large. Complex backgrounds, and the fact that discriminative features are confined to small local regions, make the task harder still. Several recent studies have demonstrated the effectiveness of the Vision Transformer (ViT) in FGIR. However, these studies have frequently overlooked the critical information embedded in class tokens across different layers and neglected the subtle local details hidden in patch tokens. To address these issues and improve FGIR performance, we introduce MIFBF, a novel ViT-based network architecture. The model builds upon ViT by incorporating three modules: the Complementary Class Tokens Combination module (CCTC), the Patches Information Integration module (PII), and the Attention Cropping Module (ACM). CCTC integrates multi-layer class tokens to capture complementary information, thereby enhancing the model’s representational capacity. PII exploits the rich local details encoded in patch tokens to improve classification accuracy. ACM generates regions of interest from ViT’s self-attention weights and filters out background noise, directing the model’s attention to the most relevant image areas. Experiments on three datasets validate the effectiveness of the proposed model, which achieves state-of-the-art results on FGIR tasks.
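
The abstract describes two ideas that a small example may make more concrete: combining class tokens taken from several layers, and deriving a region of interest from the class token's attention over patches. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The function names `combine_class_tokens` and `attention_bbox` are hypothetical, concatenation is assumed as the combination rule, and thresholding at the mean attention value is assumed as the cropping rule; the preview text specifies neither.

```python
from typing import List, Tuple


def combine_class_tokens(tokens_per_layer: List[List[float]]) -> List[float]:
    """Concatenate class tokens from several layers into one feature vector.

    Assumption: plain concatenation. The abstract only says that multi-layer
    class tokens are integrated, not how.
    """
    return [value for token in tokens_per_layer for value in token]


def attention_bbox(cls_attention: List[List[float]]) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) patch indices, inclusive, of the
    smallest box that covers every patch scoring above the grid mean.

    Assumption: a mean threshold. The abstract only says that regions of
    interest are generated from self-attention weights.
    """
    rows = len(cls_attention)
    cols = len(cls_attention[0])
    mean = sum(sum(row) for row in cls_attention) / (rows * cols)
    hits = [
        (i, j)
        for i in range(rows)
        for j in range(cols)
        if cls_attention[i][j] > mean
    ]
    if not hits:
        # Uniform attention map: nothing stands out, so keep the whole image.
        return 0, 0, rows - 1, cols - 1
    ys = [i for i, _ in hits]
    xs = [j for _, j in hits]
    return min(ys), min(xs), max(ys), max(xs)


if __name__ == "__main__":
    # Toy 4x4 patch grid: the class token attends mostly to the centre patches.
    grid = [
        [0.01, 0.01, 0.01, 0.01],
        [0.01, 0.20, 0.30, 0.01],
        [0.01, 0.25, 0.15, 0.01],
        [0.01, 0.01, 0.01, 0.01],
    ]
    print(attention_bbox(grid))  # (1, 1, 2, 2)
    print(combine_class_tokens([[1.0, 2.0], [3.0, 4.0]]))  # [1.0, 2.0, 3.0, 4.0]
```

In a full pipeline, the patch-index box would be scaled by the patch size to obtain pixel coordinates, and the cropped region would be resized and passed through the network again. Those steps are likewise assumptions about how such a background filter is typically used.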

Data availability

The datasets generated and/or analyzed during the current study will be made available on reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 62163016, 62066014), the Natural Science Foundation of Jiangxi Province (20212ACB202001), the Postgraduate Innovation Fund of the Education Department of Jiangxi Province (YC2022-s552), the Foreign Expert Project of the Ministry of Science and Technology (No. G2023022005L), and the Open Project of the State Key Laboratory of Performance Monitoring and Protecting of Rail Transit Infrastructure (Grant No. HJGZ2023203).

Author information

Authors and Affiliations

State Key Laboratory of Performance Monitoring and Protecting of Rail Transit Infrastructure, East China Jiaotong University, Nanchang, 330013, Jiangxi, China
Ying Yu
School of Software, East China Jiaotong University, Nanchang, 330013, Jiangxi, China
Ying Yu, Jinghui Wang & Jin Qian
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, T6G 2G7, Canada
Witold Pedrycz
School of Electronic and Information Engineering, Tongji University, Shanghai, China
Duoqian Miao

Contributions

Ying Yu: Methodology, originator of the main academic ideas. Jinghui Wang: Writing – original draft. Jin Qian: Supervision. Witold Pedrycz: Writing – review & editing. Duoqian Miao: Writing – review & editing.

Corresponding author

Correspondence to Ying Yu.

Ethics declarations

Ethical and informed consent for data used

The datasets used are publicly available, and the authors are aware of no ethical issues arising from the data used in this article.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yu, Y., Wang, J., Pedrycz, W. et al. Multi-level information fusion Transformer with background filter for fine-grained image recognition. Appl Intell 54, 8108–8119 (2024). https://doi.org/10.1007/s10489-024-05584-x

Accepted: 31 May 2024
Published: 20 June 2024
Issue Date: September 2024
DOI: https://doi.org/10.1007/s10489-024-05584-x
