research-article

Learning Hierarchal Channel Attention for Fine-grained Visual Classification

Published: 17 October 2021

Abstract

Learning delicate feature representations of object parts plays a critical role in fine-grained visual classification. However, deep convolutional neural networks trained for general visual classification tend to focus on coarse-grained information and ignore the fine-grained information that is essential for learning discriminative representations. In this work, we explore how multi-modal data, which introduces semantic knowledge, and sequential analysis techniques can be used to learn hierarchical feature representations that yield discriminative fine-grained features. To this end, we propose a novel approach, termed Channel Cusum Attention ResNet (CCA-ResNet), for multi-modal joint learning of fine-grained representations. Specifically, we use feature-level multi-modal alignment to connect image and text classification models for joint multi-modal training. Through joint training, image classification models trained with semantic-level labels tend to focus on the most discriminative parts, which enhances the cognitive ability of the model. We then propose a Channel Cusum Attention (CCA) mechanism that equips feature maps with hierarchical properties through unsupervised reconstruction of local and global features. The benefits of CCA are twofold: (a) it allows fine-grained features from early layers to be preserved during forward propagation through deep networks; (b) it leverages the hierarchical properties to facilitate multi-modal feature alignment. Extensive experiments verify that the proposed model achieves state-of-the-art performance on a series of fine-grained visual classification benchmarks.
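
The abstract does not say how the channel attention is computed, so the following is only a rough sketch of one way a cumulative-sum gate over channels could impose an ordering on a feature map. It assumes a squeeze step (global average pooling), a linear projection to per-channel scores, and a gate formed by the cumulative sum of a softmax. The function names, the `weights` matrix, and the gate form are illustrative assumptions, not the paper's CCA implementation.

```python
import math


def cumulative_softmax(scores):
    """Cumulative sum of a softmax: a non-decreasing gate vector ending near 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    gates, running = [], 0.0
    for e in exps:
        running += e / total
        gates.append(running)
    return gates


def channel_cumsum_attention(feature_map, weights):
    """Gate each channel of a C x H x W feature map given as nested lists.

    `weights` is a hypothetical C x C projection (learned in a real model).
    Returns the gated feature map and the per-channel gates.
    """
    # Squeeze: one average-pooled descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Project the descriptors to one score per channel.
    scores = [sum(w * p for w, p in zip(w_row, pooled)) for w_row in weights]
    # Monotone gates: earlier channels are attenuated more than later ones.
    gates = cumulative_softmax(scores)
    gated = [
        [[g * value for value in row] for row in channel]
        for g, channel in zip(gates, feature_map)
    ]
    return gated, gates


if __name__ == "__main__":
    fmap = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 2.0], [2.0, 2.0]],
    ]
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    _, gates = channel_cumsum_attention(fmap, identity)
    print(gates)
```

Because the gates are non-decreasing along the channel axis, the channel index acts as a rough level in a hierarchy. This is one reading of "hierarchical properties", offered here as an assumption rather than a description of the published method.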

        Information

        Published In

        MM '21: Proceedings of the 29th ACM International Conference on Multimedia
        October 2021
        5796 pages
        ISBN:9781450386517
        DOI:10.1145/3474085
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. fine-grained visual classification
        2. multi-modal learning
        3. sequential analysis

        Qualifiers

        • Research-article

        Funding Sources

        • The Fundamental Research Funds for the Central Universities
        • Sichuan Science and Technology Program
        • National Natural Science Foundation of China

        Conference

        MM '21: ACM Multimedia Conference
        October 20 - 24, 2021
        Virtual Event, China

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%

        Bibliometrics

        Article Metrics

        • Downloads (last 12 months): 64
        • Downloads (last 6 weeks): 12
        Reflects downloads up to 21 Sep 2024

        Citations

        Cited By

        • (2024) Runge-Kutta Guided Feature Augmentation for Few-Sample Learning. IEEE Transactions on Multimedia 26, 7349-7358. DOI: 10.1109/TMM.2024.3366404. Online publication date: 15-Feb-2024
        • (2023) Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers. Applied Sciences 13:9, 5649. DOI: 10.3390/app13095649. Online publication date: 4-May-2023
        • (2023) SCL-Leaf Net: Recognizing Leaf Images Like Human Botanists. ACM Transactions on Multimedia Computing, Communications, and Applications 20:1, 1-20. DOI: 10.1145/3615659. Online publication date: 18-Sep-2023
        • (2023) Consistency-aware Feature Learning for Hierarchical Fine-grained Visual Classification. Proceedings of the 31st ACM International Conference on Multimedia, 2326-2334. DOI: 10.1145/3581783.3612234. Online publication date: 27-Oct-2023
        • (2022) LLAM-MDCNet for Detecting Remote Sensing Images of Dead Tree Clusters. Remote Sensing 14:15, 3684. DOI: 10.3390/rs14153684. Online publication date: 1-Aug-2022
        • (2022) Rethinking Open-World Object Detection in Autonomous Driving Scenarios. Proceedings of the 30th ACM International Conference on Multimedia, 1279-1288. DOI: 10.1145/3503161.3548165. Online publication date: 10-Oct-2022
