Multi-level information fusion Transformer with background filter for fine-grained image recognition

Applied Intelligence

Abstract

Compared to traditional image recognition, Fine-Grained Image Recognition (FGIR) faces significant challenges due to the subtle distinctions among different categories and the notable variances within the same category. Furthermore, complex backgrounds and the fact that discriminative features are confined to small local regions add to the difficulty. Recently, several studies have demonstrated the effectiveness of the Vision Transformer (ViT) in FGIR. However, these studies have frequently overlooked critical information embedded within class tokens across different layers, while also neglecting the subtle local details hidden within patch tokens. To address these issues and enhance FGIR performance, we introduce a novel ViT-based network architecture, MIFBF. The proposed model builds upon ViT by incorporating three modules: the Complementary Class Tokens Combination module (CCTC), the Patches Information Integration module (PII), and the Attention Cropping Module (ACM). The CCTC module integrates multi-layer class tokens to capture complementary information, thereby enhancing the model’s representational capacity. The PII module exploits the rich local details encoded in patch tokens to improve classification accuracy. The ACM generates regions of interest based on ViT’s self-attention weights and filters background noise, thereby directing the model’s attention to the most relevant image areas. Experiments conducted on three datasets validate the effectiveness of the proposed model, which yields state-of-the-art results on FGIR tasks.
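
The abstract does not specify how the ACM turns self-attention weights into a region of interest. As a rough illustration of the general idea only, the sketch below assumes a 2D grid of class-token attention scores (one per patch, already averaged over heads), keeps the patches whose score exceeds the grid mean, and returns their bounding box in pixels. The function name `attention_crop_box`, the mean threshold, and the 16-pixel patch size are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple


def attention_crop_box(
    cls_attention: List[List[float]], patch_size: int = 16
) -> Tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) pixel box around high-attention patches.

    Hypothetical helper: the thresholding rule (grid mean) is an assumption,
    not the authors' published procedure.
    """
    values = [v for row in cls_attention for v in row]
    threshold = sum(values) / len(values)

    # Row and column indices of patches whose attention exceeds the mean.
    rows = [r for r, row in enumerate(cls_attention) if any(v > threshold for v in row)]
    cols = [c for row in cls_attention for c, v in enumerate(row) if v > threshold]

    if not rows:
        # Uniform attention: nothing stands out, so keep the whole image.
        height, width = len(cls_attention), len(cls_attention[0])
        return (0, 0, width * patch_size, height * patch_size)

    # Convert patch indices to pixel coordinates (right and bottom are exclusive).
    return (
        min(cols) * patch_size,
        min(rows) * patch_size,
        (max(cols) + 1) * patch_size,
        (max(rows) + 1) * patch_size,
    )


# Toy 4x4 attention grid with a bright 2x2 centre region.
grid = [
    [0.01, 0.01, 0.01, 0.01],
    [0.01, 0.20, 0.30, 0.01],
    [0.01, 0.25, 0.15, 0.01],
    [0.01, 0.01, 0.01, 0.01],
]
box = attention_crop_box(grid)
```

In a pipeline of this kind, the returned box would be used to crop the input image, and the crop would be resized and fed back to the network so that background patches contribute less to the final prediction.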

Data availability

The datasets generated and/or analyzed during the current study will be made available on reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 62163016, 62066014), the Natural Science Foundation of Jiangxi Province (No. 20212ACB202001), the Postgraduate Innovation Fund of the Education Department of Jiangxi Province (No. YC2022-s552), the Foreign Expert Project of the Ministry of Science and Technology (No. G2023022005L), and the Open Project of the State Key Laboratory of Performance Monitoring and Protecting of Rail Transit Infrastructure (Grant No. HJGZ2023203).

Author information

Contributions

Ying Yu: Methodology, proposal of the main academic ideas. Jinghui Wang: Writing – original draft. Jin Qian: Supervision. Witold Pedrycz: Writing – review & editing. Duoqian Miao: Writing – review & editing.

Corresponding author

Correspondence to Ying Yu.

Ethics declarations

Ethical and informed consent for data used

The relevant datasets are publicly available, and the authors of the manuscript are aware that the data used in this article does not involve ethical issues.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yu, Y., Wang, J., Pedrycz, W. et al. Multi-level information fusion Transformer with background filter for fine-grained image recognition. Appl Intell 54, 8108–8119 (2024). https://doi.org/10.1007/s10489-024-05584-x

  • DOI: https://doi.org/10.1007/s10489-024-05584-x

Keywords