The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

Wang, He; Guo, Pengcheng; Chen, Wei; Zhou, Pan; Xie, Lei

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2401.06788v2 (eess)

[Submitted on 7 Jan 2024 (v1), last revised 29 Feb 2024 (this version, v2)]

Title: The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

Authors: He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie

Abstract: This paper describes the visual speech recognition (VSR) system submitted by NPU-ASLP-LiAuto (Team 237) to the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, covering the fixed and open tracks of the Single-Speaker VSR Task and the open track of the Multi-Speaker VSR Task. For data processing, we use the lip motion extractor from the baseline to produce multi-scale video data. In addition, several augmentation techniques are applied during training: speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture trained with a joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that, after multi-system fusion, our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task, ranking first in all three tracks in which we participated.
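
The results in the abstract are reported as character error rate (CER). As a reading aid, the following is a minimal sketch of how CER is conventionally computed, assuming the standard definition (character-level edit distance divided by the number of reference characters, aggregated over the corpus). The function names `edit_distance` and `cer` are illustrative, and the challenge's official scoring script may differ in details such as text normalization.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total edits over total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference transcripts are empty")
    return total_edits / total_chars
```

For example, `cer(["今天天气很好"], ["今天天气真好"])` counts one substitution against six reference characters. Under this definition, a CER of 34.76% corresponds to roughly one character edit for every three reference characters.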

Comments:	Included in CNVSRC Workshop 2023, NCMMSC 2023
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as:	arXiv:2401.06788 [eess.AS]
	(or arXiv:2401.06788v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2401.06788

Submission history

From: He Wang
[v1] Sun, 7 Jan 2024 14:20:52 UTC (81 KB)
[v2] Thu, 29 Feb 2024 18:09:40 UTC (81 KB)