Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Singh, Nikhil; Wu, Chih-Wei; Orife, Iroro; Kalayeh, Mahdi

arXiv:2304.05600 (cs)

[Submitted on 12 Apr 2023 (v1), last revised 8 Jun 2024 (this version, v2)]

Abstract: Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show that this general approach improves performance on a range of downstream auditory and audiovisual tasks, without substantially affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
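
The idea of treating dubbed audio tracks as alternate positives for the same video can be illustrated with a small contrastive-loss sketch. This is not the authors' implementation: the abstract does not specify the loss or training strategy, so the InfoNCE-style objective, the equal averaging over the two tracks, the temperature value, and all names (`cosine`, `info_nce`, `dubbed_contrastive_loss`, the toy embeddings) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, audio_embs, temperature=0.1):
    """Video-to-audio InfoNCE: the audio at index i is the positive for video i,
    and every other audio clip in the batch acts as a negative."""
    losses = []
    for i, v in enumerate(video_embs):
        logits = [cosine(v, a) / temperature for a in audio_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def dubbed_contrastive_loss(video_embs, audio_original, audio_dubbed, temperature=0.1):
    """Assumed combination: average the loss over the original and dubbed tracks,
    so both are pulled toward the same video embedding."""
    return 0.5 * (
        info_nce(video_embs, audio_original, temperature)
        + info_nce(video_embs, audio_dubbed, temperature)
    )


# Toy 2-D embeddings (hypothetical): two clips, each with two language tracks.
video = [[1.0, 0.0], [0.0, 1.0]]
audio_en = [[0.9, 0.1], [0.1, 0.9]]
audio_fr = [[0.8, 0.2], [0.2, 0.8]]

aligned = dubbed_contrastive_loss(video, audio_en, audio_fr)
shuffled = dubbed_contrastive_loss(video, audio_en[::-1], audio_fr[::-1])
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

In this toy setup the loss is lower when each video is paired with its own audio tracks than when the pairing is reversed, which is the behavior a contrastive objective of this kind is meant to reward.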

Comments:	Accepted to CVPR 2024
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2304.05600 [cs.SD]
	(or arXiv:2304.05600v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2304.05600

Submission history

From: Nikhil Singh
[v1] Wed, 12 Apr 2023 04:17:45 UTC (11,646 KB)
[v2] Sat, 8 Jun 2024 04:19:06 UTC (10,811 KB)
