research-article

Learning Hierarchal Channel Attention for Fine-grained Visual Classification

Authors:

Yi Bin

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 5011 - 5019

https://doi.org/10.1145/3474085.3475184

Published: 17 October 2021

Abstract

Learning delicate feature representations of object parts plays a critical role in fine-grained visual classification. However, advanced deep convolutional neural networks trained for general visual classification tend to focus on coarse-grained information and ignore the fine-grained information that is essential for learning discriminative representations. In this work, we explore the value of multi-modal data for introducing semantic knowledge, and of sequential analysis techniques for learning hierarchical feature representations, in order to generate discriminative fine-grained features. To this end, we propose a novel approach, termed Channel Cusum Attention ResNet (CCA-ResNet), for multi-modal joint learning of fine-grained representations. Specifically, we use feature-level multi-modal alignment to connect image and text classification models for joint multi-modal training. Through joint training, image classification models trained with semantic-level labels tend to focus on the most discriminative parts, which enhances the cognitive ability of the model. We then propose a Channel Cusum Attention (CCA) mechanism that equips feature maps with hierarchical properties through unsupervised reconstruction of local and global features. The benefits of CCA are twofold: (a) it allows fine-grained features from early layers to be preserved during forward propagation through deep networks; and (b) it leverages the hierarchical properties to facilitate multi-modal feature alignment. Extensive experiments verify that the proposed model can achieve state-of-the-art performance on a series of fine-grained visual classification benchmarks.
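The abstract does not give the equations behind Channel Cusum Attention, so the following is only a rough sketch of one plausible reading of "cusum": a gate built from the cumulative sum of a softmax over channel descriptors, which imposes an ordering on channels. Every name here (`cumulative_gate`, `channel_cusum_attention`) and the choice of a plain global average as the channel descriptor are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cumulative_gate(logits):
    """Cumulative sum of a softmax: a non-decreasing gate with values in (0, 1]."""
    gate, running = [], 0.0
    for p in softmax(logits):
        running += p
        gate.append(running)
    return gate


def channel_cusum_attention(feature_map):
    """Reweight channels with a monotone, cumulative-sum gate.

    feature_map: list of C channels, each a flat list of spatial activations.
    Assumption: the channel descriptor is the global average of each channel,
    with no learned projection (a real module would learn one).
    """
    descriptors = [sum(ch) / len(ch) for ch in feature_map]
    gate = cumulative_gate(descriptors)
    attended = [[g * v for v in ch] for g, ch in zip(gate, feature_map)]
    return attended, gate


if __name__ == "__main__":
    # Toy feature map: 4 channels, 2 spatial positions each.
    fmap = [[0.2, 0.4], [1.0, 1.2], [0.1, 0.3], [0.5, 0.7]]
    attended, gate = channel_cusum_attention(fmap)
    print([round(g, 3) for g in gate])
```

Because the gate never decreases along the channel axis, earlier channels are attenuated more strongly than later ones, which is one simple way a channel dimension could be given an ordered, hierarchical structure under these assumptions.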

Index Terms

Learning Hierarchal Channel Attention for Fine-grained Visual Classification
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches
      1. Neural networks

Index terms have been assigned to the content through auto-classification.

Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN: 9781450386517

DOI: 10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, Facebook, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA
Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States



Funding Sources

The Fundamental Research Funds for the Central Universities
Sichuan Science and Technology Program
National Natural Science Foundation of China

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Cited By

Wei J, Yang Y, Guan X, Xu X, Wang G, Shen H (2024). Runge-Kutta Guided Feature Augmentation for Few-Sample Learning. IEEE Transactions on Multimedia 26 (7349-7358). Online publication date: 15-Feb-2024.
https://dl.acm.org/doi/10.1109/TMM.2024.3366404

Zhang Q, Yuan J (2023). Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers. Applied Sciences 13:9 (5649). Online publication date: 4-May-2023.
https://doi.org/10.3390/app13095649

Zou C, Wang R, Jin C, Zhang S, Wang X (2023). SCL-Leaf Net: Recognizing Leaf Images Like Human Botanists. ACM Transactions on Multimedia Computing, Communications, and Applications 20:1 (1-20). Online publication date: 18-Sep-2023.
https://dl.acm.org/doi/10.1145/3615659

Wang R, Zou C, Zhang W, Zhu Z, Jing L, El Saddik A, Mei T, Cucchiara R, Bertini M, Tobon Vallejo D, Atrey P, Hossain M (2023). Consistency-aware Feature Learning for Hierarchical Fine-grained Visual Classification. Proceedings of the 31st ACM International Conference on Multimedia (2326-2334). Online publication date: 27-Oct-2023.
https://doi.org/10.1145/3581783.3612234

Li Z, Yang R, Cai W, Xue Y, Hu Y, Li L (2022). LLAM-MDCNet for Detecting Remote Sensing Images of Dead Tree Clusters. Remote Sensing 14:15 (3684). Online publication date: 1-Aug-2022.
https://doi.org/10.3390/rs14153684

Ma Z, Yang Y, Wang G, Xu X, Shen H, Zhang M, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L (2022). Rethinking Open-World Object Detection in Autonomous Driving Scenarios. Proceedings of the 30th ACM International Conference on Multimedia (1279-1288). Online publication date: 10-Oct-2022.
https://dl.acm.org/doi/10.1145/3503161.3548165
