AVA-AVD: Audio-visual Speaker Diarization in the Wild

Xu, Eric Zhongcong; Song, Zeyang; Feng, Chao; Ye, Mang; Shou, Mike Zheng

Computer Science > Computer Vision and Pattern Recognition

arXiv:2111.14448v3 (cs)

[Submitted on 29 Nov 2021 (v1), revised 6 Dec 2021 (this version, v3), latest version 16 Jul 2022 (v5)]

Abstract: Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which differ considerably from in-the-wild videos such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. Handling off-screen and on-screen speakers together remains a critical challenge. To address it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2111.14448 [cs.CV]
	(or arXiv:2111.14448v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.14448

Submission history

From: Eric Zhongcong Xu
[v1] Mon, 29 Nov 2021 11:02:41 UTC (2,559 KB)
[v2] Wed, 1 Dec 2021 11:17:30 UTC (2,559 KB)
[v3] Mon, 6 Dec 2021 09:38:10 UTC (2,559 KB)
[v4] Wed, 13 Jul 2022 02:55:35 UTC (2,264 KB)
[v5] Sat, 16 Jul 2022 14:40:40 UTC (2,264 KB)
