MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Hang, Qi; Yan, Xuefeng; Gong, Lina

doi:10.1007/978-981-97-2387-4_15

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14333)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

In fine-grained visual classification, fusing local and global information is crucial. However, current methods based on the vision transformer (ViT) tend to focus only on selecting discriminative patch tokens, ignoring how the rich global and semantic information in the classification tokens varies across layers. To address this limitation, we propose a novel framework dubbed MFF-Trans that considers the mutual relationships between all tokens. Specifically, we put forward the important token election module (ITEM), which uses the multi-head self-attention mechanism of the vision transformer to evaluate the importance of all tokens. This module guides the model to select tokens that contain discriminative local information and global information with different semantics at each ViT layer. Meanwhile, to enhance the model's perception of the semantic connections between the selected patch tokens, we further introduce the semantic connection enhancing module (SCEM), which uses a graph convolutional network to mine the structural information between them in the deep layers of the vision transformer. Extensive experimental results on three benchmark datasets indicate that MFF-Trans performs competitively with other methods, reaching 92.1% on CUB, 95.4% on Stanford Cars, and 92.3% on Stanford Dogs.
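The abstract does not spell out how ITEM scores tokens, so the following is only a minimal sketch of one common way to rank tokens with multi-head self-attention. It assumes that a token's importance is the average attention it receives from all tokens across all heads; that scoring rule, the function names `token_importance` and `select_top_k`, and the toy attention values are illustrative assumptions, not the authors' method.

```python
from typing import List

# Attention weights for one layer, indexed as [head][query][key].
# Each row (fixed head and query) is assumed to be a softmax output summing to 1.
Attention = List[List[List[float]]]


def token_importance(attn: Attention) -> List[float]:
    """Average attention each token receives, over all heads and all query tokens.

    Assumption: this scoring rule is a stand-in; the paper's ITEM may differ.
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    scores = [0.0] * num_tokens
    for head in attn:
        for row in head:
            for key, weight in enumerate(row):
                scores[key] += weight
    return [s / (num_heads * num_tokens) for s in scores]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy example: 2 heads, 3 tokens (hypothetical values).
    attn = [
        [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.6, 0.1]],
        [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1], [0.4, 0.5, 0.1]],
    ]
    scores = token_importance(attn)
    print(select_top_k(scores, 2))  # prints [0, 1]
```

In a setup like the one the abstract describes, such a selection would be repeated at each ViT layer, and the tokens kept from the different layers would then be fused for classification.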

This work is supported by the Basic Research for National Defense under Grant No. JCKY2020605C003.

Author information

Authors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, China
Qi Hang, Xuefeng Yan & Lina Gong

Corresponding author

Correspondence to Qi Hang.

Editor information

Editors and Affiliations

Peng Cheng Laboratory, Shenzhen, China
Xiangyu Song
China University of Geosciences, Wuhan, China
Ruyi Feng
China University of Geosciences, Wuhan, China
Yunliang Chen
Deakin University, Burwood, VIC, Australia
Jianxin Li
University of Exeter, Exeter, UK
Geyong Min

About this paper

Cite this paper

Hang, Q., Yan, X., Gong, L. (2024). MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14333. Springer, Singapore. https://doi.org/10.1007/978-981-97-2387-4_15

DOI: https://doi.org/10.1007/978-981-97-2387-4_15
Published: 28 April 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2386-7
Online ISBN: 978-981-97-2387-4
eBook Packages: Computer Science, Computer Science (R0)
