DOI: 10.1145/3503161.3548383

MAVT-FG: Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition

Published: 10 October 2022

Abstract

Weakly-supervised fine-grained recognition aims to detect subtle differences between subcategories at a finer scale without using any manual annotations. While most recent work focuses on classical image-based fine-grained recognition, which distinguishes subcategories at the image level, video-based fine-grained recognition is considerably more challenging and more practically needed. In this paper, we propose MAVT-FG, a Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition that incorporates both the audio and visual modalities. Specifically, MAVT-FG consists of an Audio-Visual Dual-Encoder for feature extraction, a Cross-Decoder for Audio-Visual Fusion (DAVF) that exploits the inherent cues and correspondences between the two modalities, and a Search-and-Select Fine-grained Branch (SSFG) that captures the most discriminative regions. Furthermore, we construct a new benchmark, Fine-grained Birds of Audio-Visual (FGB-AV), for audio-visual weakly-supervised fine-grained recognition at the video level. Experimental results show that our method achieves superior performance and outperforms other state-of-the-art methods.
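
The three components above map naturally onto standard Transformer building blocks. What follows is a minimal PyTorch sketch of such a pipeline, written only to make the abstract's data flow concrete: every module name (DualEncoder, CrossDecoderFusion, SearchAndSelect), layer size, and design detail (bidirectional cross-attention, top-k token selection) is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn


class DualEncoder(nn.Module):
    """Separate Transformer encoders for visual and audio token sequences."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)
        self.visual_enc = encoder()
        self.audio_enc = encoder()

    def forward(self, v_tokens, a_tokens):
        return self.visual_enc(v_tokens), self.audio_enc(a_tokens)


class CrossDecoderFusion(nn.Module):
    """Stand-in for DAVF: cross-attention in both directions, then pooling."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v, a):
        v_att, _ = self.v2a(v, a, a)  # visual queries attend to audio keys/values
        a_att, _ = self.a2v(a, v, v)  # audio queries attend to visual keys/values
        return torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)


class SearchAndSelect(nn.Module):
    """Stand-in for SSFG: score visual tokens and keep the top-k most
    discriminative ones, a crude proxy for region search-and-select."""
    def __init__(self, dim=256, k=8):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.k = k

    def forward(self, v):
        scores = self.score(v).squeeze(-1)            # (B, T) per-token scores
        idx = scores.topk(self.k, dim=1).indices      # indices of top-k tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, v.size(-1))
        return torch.gather(v, 1, idx).mean(dim=1)    # pooled top-k feature


class MAVTFG(nn.Module):
    def __init__(self, dim=256, num_classes=200):
        super().__init__()
        self.encoder = DualEncoder(dim)
        self.fusion = CrossDecoderFusion(dim)
        self.ssfg = SearchAndSelect(dim)
        self.head = nn.Linear(3 * dim, num_classes)   # fused (2*dim) + fine (dim)

    def forward(self, v_tokens, a_tokens):
        v, a = self.encoder(v_tokens, a_tokens)
        fused = self.fusion(v, a)                     # joint audio-visual feature
        fine = self.ssfg(v)                           # discriminative-region feature
        return self.head(torch.cat([fused, fine], dim=-1))


if __name__ == "__main__":
    model = MAVTFG()
    video = torch.randn(2, 32, 256)   # (batch, frame tokens, dim)
    audio = torch.randn(2, 64, 256)   # (batch, spectrogram tokens, dim)
    print(model(video, audio).shape)  # torch.Size([2, 200])

Bidirectional cross-attention is only one common way to model audio-visual correspondence; the actual DAVF decoder and the SSFG search strategy may differ substantially from this sketch.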

Supplementary Material

MP4 file (MM22-fp2926.mp4): presentation video.



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161


    Publisher

Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. audio-visual multimodal
    2. video-level fine-grained recognition
    3. weakly-supervised joint learning

    Qualifiers

    • Research-article


    Conference

    MM '22

    Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)



    Article Metrics

• Total citations: 0
• Total downloads: 222
• Downloads (last 12 months): 61
• Downloads (last 6 weeks): 7

Reflects downloads up to 12 Nov 2024
