Hybrid Granularities Transformer for Fine-Grained Image Recognition
Abstract
1. Introduction
- (1) We propose a single-stage model that can be trained end-to-end with only text labels.
- (2) We propose the Patches Hidden Integrator (PHI) module, which efficiently forces the model to focus on other regions that are still discriminative.
- (3) We propose the Consistency Feature Learning (CFL) module, which aids decision-making by discovering detailed information in the patch tokens and introduces an inconsistency loss as a constraint.
- (4) Our proposed HGTrans matches or outperforms the compared models and achieves state-of-the-art results on several mainstream datasets.
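
To make contributions (2) and (3) more concrete, the following is a minimal sketch of the two general ideas they name: hiding the most-attended patch tokens, and penalizing disagreement between two predictions. It is not the authors' implementation. The function names `hide_top_patches` and `inconsistency`, the use of zeroing to hide a token, and the symmetric KL divergence are all assumptions chosen for illustration.

```python
import math


def hide_top_patches(patch_tokens, attention, k):
    """Zero out the k patch tokens with the highest attention score.

    Assumption: "hiding" is modeled as replacing a token with zeros, so a
    downstream classifier must rely on the remaining regions.
    """
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    hidden = set(ranked[:k])
    masked = [
        [0.0] * len(token) if i in hidden else list(token)
        for i, token in enumerate(patch_tokens)
    ]
    return masked, sorted(hidden)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inconsistency(logits_a, logits_b):
    """Symmetric KL divergence between two predicted class distributions.

    Assumption: one possible way to measure disagreement between, say, a
    class-token prediction and a patch-token prediction.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Toy usage: three 2-dimensional patch tokens; the second is most attended.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked, hidden_idx = hide_top_patches(tokens, [0.1, 0.7, 0.2], k=1)
```

In this toy setup the penalty is zero when both predictions agree exactly and grows as they diverge, which is the behavior one would want from any inconsistency-style constraint.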
2. Related Work
3. Method
3.1. Datasets
3.2. Patches Hidden Integrator (PHI)
3.3. Consistency Feature Learning (CFL)
4. Experiments
4.1. Implementation Details
4.2. Evaluation Indicators
4.3. Comparison with the State-of-the-Art
4.4. Ablation Studies
4.5. Visualization
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Wei, X.-S.; Song, Y.-Z.; Mac Aodha, O.; Wu, J.; Peng, Y.; Tang, J.; Yang, J.; Belongie, S. Fine-grained image analysis with deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8927–8948.
- Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for fine-grained category detection. In Computer Vision–ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 834–849.
- Wei, X.S.; Xie, C.W.; Wu, J. Mask-CNN: Localizing parts and selecting descriptors for fine-grained image recognition. arXiv 2016, arXiv:1605.06878.
- Branson, S.; Van Horn, G.; Belongie, S.; Perona, P. Bird species categorization using pose normalized deep convolutional nets. arXiv 2014, arXiv:1406.2952.
- Lin, D.; Shen, X.; Lu, C.; Jia, J. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1666–1674.
- Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457.
- Fu, J.; Zheng, H.; Mei, T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4438–4446.
- Zhang, F.; Li, M.; Zhai, G.; Liu, Y. Multi-branch and multi-scale attention learning for fine-grained visual categorization. In MultiMedia Modeling; Springer International Publishing: Cham, Switzerland, 2021; pp. 136–147.
- Du, R.; Chang, D.; Bhunia, A.K.; Xie, J.; Ma, Z.; Song, Y.-Z.; Guo, J. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 153–168.
- Hu, T.; Qi, H.; Huang, Q.; Lu, Y. See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv 2019, arXiv:1901.09891.
- Rao, Y.; Chen, G.; Lu, J.; Zhou, J. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1025–1034.
- Gao, Y.; Han, X.; Wang, X.; Huang, W.; Scott, M.R. Channel interaction networks for fine-grained image categorization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10818–10825.
- Zhuang, P.; Wang, Y.; Qiao, Y. Learning attentive pairwise interaction for fine-grained classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13130–13137.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- He, J.; Chen, J.N.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C. TransFG: A transformer architecture for fine-grained recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2022; pp. 852–860.
- Wang, J.; Yu, X.; Gao, Y. Feature fusion vision transformer for fine-grained visual categorization. arXiv 2021, arXiv:2107.02341.
- Hu, Y.; Jin, X.; Zhang, Y.; Hong, H.; Zhang, J.; He, Y.; Xue, H. RAMS-Trans: Recurrent attention multi-scale transformer for fine-grained image recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4239–4248.
- Liu, X.; Wang, L.; Han, X. Transformer with peak suppression and knowledge guidance for fine-grained image recognition. Neurocomputing 2022, 492, 137–149.
- Du, R.; Xie, J.; Ma, Z.; Chang, D.; Song, Y.-Z.; Guo, J. Progressive learning of category-consistent multi-granularity features for fine-grained visual classification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9521–9535.
- Peng, J.; Wang, Y.; Zhou, Z. Progressive Erasing Network with consistency loss for fine-grained visual classification. J. Vis. Commun. Image Represent. 2022, 87, 103570.
- Chen, Y.; Bai, Y.; Zhang, W.; Mei, T. Destruction and construction learning for fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5157–5166.
- Li, H.; Zhang, X.; Tian, Q.; Xiong, H. Attribute mix: Semantic data augmentation for fine grained recognition. In Proceedings of the 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), Macau, China, 1–4 December 2020; pp. 243–246.
- Zhang, Z.C.; Chen, Z.D.; Wang, Y.; Luo, X.; Xu, X.-S. ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator. arXiv 2022, arXiv:2203.12816.
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. Comput. Neural Syst. Tech. Rep. 2011, 2010, 27452.
- Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.-F. Novel dataset for fine-grained image categorization: Stanford dogs. In Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Colorado Springs, CO, USA, 20–25 June 2011.
- Nilsback, M.E.; Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16–19 December 2008; pp. 722–729.
- Sun, M.; Yuan, Y.; Zhou, F.; Ding, E. Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 805–821.
- Luo, W.; Zhang, H.; Li, J.; Wei, X.-S. Learning semantically enhanced feature for fine-grained image classification. IEEE Signal Process. Lett. 2020, 27, 1545–1549.
- Luo, W.; Yang, X.; Mo, X.; Lu, Y.; Davis, L.S.; Li, J.; Yang, J.; Lim, S.-N. Cross-X learning for fine-grained visual categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 8242–8251.
- Liu, C.; Xie, H.; Zha, Z.-J.; Ma, L.; Yu, L.; Zhang, Y. Filtration and distillation: Enhancing region attention for fine-grained visual categorization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11555–11562.
- Song, J.; Yang, R. Feature boosting, suppression, and diversification for fine-grained visual classification. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
- Huang, C.; Li, H.; Xie, Y.; Wu, Q.; Luo, B. PBC: Polygon-based classifier for fine-grained categorization. IEEE Trans. Multimed. 2016, 19, 673–684.
- Dubey, A.; Gupta, O.; Guo, P.; Raskar, R.; Farrell, R.; Naik, N. Pairwise confusion for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 70–86.
- Song, K.; Wei, X.S.; Shu, X.; Song, R.-J.; Lu, J. Bi-modal progressive mask attention for fine-grained recognition. IEEE Trans. Image Process. 2020, 29, 7006–7018.
- Touvron, H.; Sablayrolles, A.; Douze, M.; Cord, M.; Jégou, H. Grafit: Learning fine-grained image representations with coarse labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 874–884.
- Kolesnikov, A.; Beyer, L.; Zhai, X.; Puigcerver, J.; Yung, J.; Gelly, S.; Houlsby, N. Big transfer (BiT): General visual representation learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 491–507.
Method | CUB (Accuracy) | Dog (Accuracy) |
---|---|---|
RA-CNN [7] | 85.3 | 87.3 |
MAMC [27] | 86.5 | 85.2 |
SEF [28] | 87.3 | 88.8 |
Cross-X [29] | 87.7 | 88.9 |
FDL [30] | 89.1 | 84.9 |
FBSD [31] | 89.8 | 89.4 |
API-NET [13] | 90.0 | 90.3 |
PMG-V2 [19] | 90.0 | 90.7 |
ViT [14] | 90.7 | 92.0 |
RAMS [17] | 91.3 | 92.4 |
TPSKG [18] | 91.3 | 92.5 |
HGTrans | 91.6 | 92.7 |
Method | Flower (Accuracy) |
---|---|
PBC [32] | 96.1 |
PC-CNN [33] | 93.6 |
BiM-PMA [34] | 97.4 |
Grafit [35] | 99.1 |
BiT m [36] | 99.3 |
ViT [14] | 99.3 |
TPSKG [18] | 99.5 |
HGTrans | 99.5 |
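
As a quick reading aid for the two comparison tables above, the sketch below computes how far HGTrans sits above the strongest competing entry on each dataset. The accuracies are copied from the tables; the dictionary layout and the helper name `margin_over_best` are illustrative assumptions.

```python
# Accuracies (%) of the compared methods, copied from the comparison tables.
others = {
    "CUB": {"RA-CNN": 85.3, "MAMC": 86.5, "SEF": 87.3, "Cross-X": 87.7,
            "FDL": 89.1, "FBSD": 89.8, "API-NET": 90.0, "PMG-V2": 90.0,
            "ViT": 90.7, "RAMS": 91.3, "TPSKG": 91.3},
    "Dog": {"RA-CNN": 87.3, "MAMC": 85.2, "SEF": 88.8, "Cross-X": 88.9,
            "FDL": 84.9, "FBSD": 89.4, "API-NET": 90.3, "PMG-V2": 90.7,
            "ViT": 92.0, "RAMS": 92.4, "TPSKG": 92.5},
    "Flower": {"PBC": 96.1, "PC-CNN": 93.6, "BiM-PMA": 97.4, "Grafit": 99.1,
               "BiT m": 99.3, "ViT": 99.3, "TPSKG": 99.5},
}
hgtrans = {"CUB": 91.6, "Dog": 92.7, "Flower": 99.5}


def margin_over_best(dataset):
    """Return (best competing method, HGTrans margin in percentage points)."""
    scores = others[dataset]
    best = max(scores, key=scores.get)  # first entry wins on ties
    return best, round(hgtrans[dataset] - scores[best], 1)


margins = {name: margin_over_best(name)[1] for name in hgtrans}
for name in hgtrans:
    best, gap = margin_over_best(name)
    print(f"{name}: +{gap} points over {best}")
```

On these numbers the margin is a few tenths of a point on CUB and Dog, and HGTrans ties the best listed entry on Flower.
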
ViT_B_16 | PHI | CFL | Accuracy (%) |
---|---|---|---|
✓ | | | 90.7 |
✓ | ✓ | | 91.2 |
✓ | | ✓ | 91.3 |
✓ | ✓ | ✓ | 91.6 |
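
The ablation numbers can be read as gains over the ViT_B_16 baseline. The short sketch below does that arithmetic with the accuracies from the table; the row labels are neutral placeholders chosen here, not names from the paper.

```python
baseline = 90.7  # ViT_B_16 alone

# Accuracy (%) of each ablation variant, in the order the rows appear.
variants = {
    "one module (row 2)": 91.2,
    "one module (row 3)": 91.3,
    "both modules (row 4)": 91.6,
}

# Gain of each variant over the baseline, in percentage points.
gains = {name: round(acc - baseline, 1) for name, acc in variants.items()}

# Sum of the two single-module gains, for comparison with the combined gain.
single_sum = round(gains["one module (row 2)"] + gains["one module (row 3)"], 1)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"sum of single-module gains: +{single_sum}")
```

The combined gain is smaller than the sum of the two single-module gains, so on this benchmark the two modules appear to overlap partly in what they contribute.
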
Method | | Accuracy (%) |
---|---|---|
HGTrans | 1 | 90.9 |
HGTrans | 2 | 91.3 |
HGTrans | 3 | 91.6 |
HGTrans | 4 | 91.4 |
Method | Time (min:s) |
---|---|
ViT | 6:14 |
HGTrans | 6:25 |
RAMS | 16:05 |
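
To compare the reported times on a common scale, the sketch below converts them to seconds and expresses each relative to ViT. It assumes the entries are in minutes:seconds format; the helper name `to_seconds` is illustrative.

```python
def to_seconds(mm_ss):
    """Convert a 'minutes:seconds' string such as '6:14' to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


# Times copied from the table (assumed to be minutes:seconds).
times = {"ViT": "6:14", "HGTrans": "6:25", "RAMS": "16:05"}

seconds = {method: to_seconds(t) for method, t in times.items()}
relative = {method: round(s / seconds["ViT"], 2) for method, s in seconds.items()}

for method in times:
    print(f"{method}: {seconds[method]} s, {relative[method]}x ViT")
```

Under that assumption, HGTrans adds 11 seconds to the ViT time (about 3%), while RAMS takes roughly 2.6 times as long.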
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yu, Y.; Wang, J. Hybrid Granularities Transformer for Fine-Grained Image Recognition. Entropy 2023, 25, 601. https://doi.org/10.3390/e25040601