DOI: 10.1145/3503161.3548383

MAVT-FG: Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition

Published: 10 October 2022

Abstract

Weakly-supervised fine-grained recognition aims to detect subtle differences between subcategories at a finer scale without using any manual annotations. While most recent work focuses on classical image-based fine-grained recognition, which distinguishes subcategories at the image level, video-based fine-grained recognition is considerably more challenging and more practically needed. In this paper, we propose MAVT-FG, a Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition that incorporates both the audio and visual modalities. Specifically, MAVT-FG consists of an Audio-Visual Dual-Encoder for feature extraction, a Cross-Decoder for Audio-Visual Fusion (DAVF) that exploits the inherent cues and correspondences between the two modalities, and a Search-and-Select Fine-grained Branch (SSFG) that captures the most discriminative regions. Furthermore, we construct a new benchmark, Fine-grained Birds of Audio-Visual (FGB-AV), for audio-visual weakly-supervised fine-grained recognition at the video level. Experimental results show that our method achieves superior performance and outperforms other state-of-the-art methods.
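
The three components above map naturally onto standard Transformer building blocks. What follows is a minimal PyTorch sketch of such a pipeline, written only to make the abstract's data flow concrete: every module name (DualEncoder, CrossDecoderFusion, SearchAndSelect), layer size, and design detail (bidirectional cross-attention, top-k token selection) is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn


class DualEncoder(nn.Module):
    """Separate Transformer encoders for visual and audio token sequences."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)
        self.visual_enc = encoder()
        self.audio_enc = encoder()

    def forward(self, v_tokens, a_tokens):
        return self.visual_enc(v_tokens), self.audio_enc(a_tokens)


class CrossDecoderFusion(nn.Module):
    """Stand-in for DAVF: cross-attention in both directions, then pooling."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v, a):
        v_att, _ = self.v2a(v, a, a)  # visual queries attend to audio keys/values
        a_att, _ = self.a2v(a, v, v)  # audio queries attend to visual keys/values
        return torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)


class SearchAndSelect(nn.Module):
    """Stand-in for SSFG: score visual tokens and keep the top-k most
    discriminative ones, a crude proxy for region search-and-select."""
    def __init__(self, dim=256, k=8):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.k = k

    def forward(self, v):
        scores = self.score(v).squeeze(-1)            # (B, T) per-token scores
        idx = scores.topk(self.k, dim=1).indices      # indices of top-k tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, v.size(-1))
        return torch.gather(v, 1, idx).mean(dim=1)    # pooled top-k feature


class MAVTFG(nn.Module):
    def __init__(self, dim=256, num_classes=200):
        super().__init__()
        self.encoder = DualEncoder(dim)
        self.fusion = CrossDecoderFusion(dim)
        self.ssfg = SearchAndSelect(dim)
        self.head = nn.Linear(3 * dim, num_classes)   # fused (2*dim) + fine (dim)

    def forward(self, v_tokens, a_tokens):
        v, a = self.encoder(v_tokens, a_tokens)
        fused = self.fusion(v, a)                     # joint audio-visual feature
        fine = self.ssfg(v)                           # discriminative-region feature
        return self.head(torch.cat([fused, fine], dim=-1))


if __name__ == "__main__":
    model = MAVTFG()
    video = torch.randn(2, 32, 256)   # (batch, frame tokens, dim)
    audio = torch.randn(2, 64, 256)   # (batch, spectrogram tokens, dim)
    print(model(video, audio).shape)  # torch.Size([2, 200])

Bidirectional cross-attention is only one common way to model audio-visual correspondence; the actual DAVF decoder and the SSFG search strategy may differ substantially from this sketch.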

Supplementary Material

MP4 file (MM22-fp2926.mp4): presentation video.



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161


    Publisher

Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. audio-visual multimodal
    2. video-level fine-grained recognition
    3. weakly-supervised joint learning

    Qualifiers

    • Research-article


    Conference

    MM '22

    Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)



    Article Metrics

• Total citations: 0
• Total downloads: 222
• Downloads (last 12 months): 61
• Downloads (last 6 weeks): 7

Reflects downloads up to 12 Nov 2024
