Bio-Inspired Modality Fusion for Active Speaker Detection

Assunção, Gustavo; Gonçalves, Nuno; Menezes, Paulo

doi:10.3390/app11083397

arXiv:2003.00063 (cs)

[Submitted on 28 Feb 2020 (v1), last revised 13 Apr 2021 (this version, v2)]

Abstract: Human beings have developed fantastic abilities to integrate information from various sensory sources, exploiting their inherent complementarity. Perceptual capabilities are therefore heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has identified the superior colliculus region of the brain as the one responsible for this modality fusion, and a handful of biological models have been proposed to describe its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with the results confirming and greatly surpassing initial expectations.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Cite as:	arXiv:2003.00063 [cs.CV]
	(or arXiv:2003.00063v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2003.00063
Journal reference:	Appl. Sci. 2021, 11(8), 3397
Related DOI:	https://doi.org/10.3390/app11083397

Submission history

From: Gustavo Assunção [view email]
[v1] Fri, 28 Feb 2020 20:56:24 UTC (2,493 KB)
[v2] Tue, 13 Apr 2021 11:05:06 UTC (6,036 KB)
