Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Chen, Chen; Hu, Yuchen; Zhang, Qiang; Zou, Heqing; Zhu, Beier; Chng, Eng Siong

arXiv:2212.05301 (cs)

[Submitted on 10 Dec 2022 (v1), last revised 2 Feb 2023 (this version, v2)]

Abstract: Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to the task-specific metric (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noise.
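
For illustration only: the abstract names word error rate (WER) as the basis of the reward and describes an agent that balances two kinds of representations during decoding, but it gives neither the reward formula nor the fusion rule. The sketch below therefore rests on two loudly stated assumptions: the reward is simply the negated WER, and the balancing is a convex combination of two next-token distributions. The function names `word_error_rate`, `wer_reward`, and `fuse_token_probs` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Assumed reward shape (not from the paper): lower WER gives higher reward."""
    return -word_error_rate(reference, hypothesis)


def fuse_token_probs(
    invariant: Sequence[float],
    specific: Sequence[float],
    weight: float,
) -> List[float]:
    """Assumed fusion rule (not from the paper): convex combination of two
    next-token distributions, where `weight` is the share given to the
    modality-specific stream."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(invariant) != len(specific):
        raise ValueError("distributions must cover the same vocabulary")
    return [(1.0 - weight) * p + weight * q for p, q in zip(invariant, specific)]
```

In such a setup, an agent would choose `weight` at each decoding step and receive `wer_reward` once the full hypothesis is available; how MSRL actually parameterizes its actions and reward is described in the paper itself, not here.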

Comments:	Accepted by AAAI2023
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2212.05301 [cs.SD]
	(or arXiv:2212.05301v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2212.05301

Submission history

From: Chen Chen
[v1] Sat, 10 Dec 2022 14:01:54 UTC (2,661 KB)
[v2] Thu, 2 Feb 2023 09:30:00 UTC (2,697 KB)
