Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization

Yuxin Guo${}^{1,2,3}$, Shijie Ma${}^{1,2}$, Hu Su${}^{1,2}$, Zhiqing Wang${}^{1,2}$, Yuhao Zhao${}^{1,2}$, Wei Zou${}^{1,2}$, Siyang Sun${}^{3}$, Yun Zheng${}^{3}$

${}^{1}$ School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
${}^{2}$ State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences, Beijing, China
${}^{3}$ DAMO Academy, Alibaba Group
{guoyuxin2021, wei.zou}@ia.ac.cn Corresponding author.

Abstract

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, the naive semi-supervised method fails to fully exploit the abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality pseudo-labels by intersecting their confidence maps. The sufficient utilization of both labeled and unlabeled data and the proposed unbiased framework enable DMT to outperform current state-of-the-art methods by a large margin, with CIoU of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source, obtaining improvements of 8.9%, 9.6% over self-supervised methods and 4.6%, 6.4% over semi-supervised methods, respectively, given only $<3\%$ positional annotations. We also extend our framework to some existing AVSL methods and consistently boost their performance. Our code is available at https://github.com/gyx-gloria/DMT.

1 Introduction

Visual and auditory perception is crucial for observing the world. When we hear a sound, our brain will extract semantic information and locate the sounding source. In this work, we focus on Audio-Visual Source Localization (AVSL) Arandjelovic and Zisserman (2017); Zhu et al. (2021), with the purpose of accurately locating sounding objects in frames based on their paired audio clips. Beyond this scope, AVSL also plays a crucial role in many downstream tasks including environmental perception Ramaswamy (2020), navigation Chen et al. (2020a, 2021a), sound separation Tzinis et al. (2020, 2022) and event localization Xuan et al. (2021). Therefore, accurate localization is of utmost importance.

In the AVSL literature Mo and Morgado (2022a, b); Mo and Tian (2023), the conventional paradigm is to employ self-supervised contrastive learning based on audio-visual correspondence. However, most such methods face serious challenges. In terms of performance, they suffer from blurry boundaries, an inability to converge to specific objects, and predicted sounding regions that are too large to accurately locate objects, especially small ones. In terms of the learning stage, a single model alone is unable to recognize and filter out false positives, i.e., noisy samples with no visible sounding sources, which can affect the entire learning process of the model.

In essence, AVSL is a dense prediction task, which cannot be directly accomplished from a shared global image representation Vandenhende et al. (2021); it requires models to capture fine local features in order to accurately predict object locations. In other words, precise pixel-level localization is not feasible without positional annotations. Unfortunately, the number of samples with location labels is extremely limited. As a result, we resort to Semi-Supervised Learning (SSL) Yang et al. (2022) to fully leverage the labeled data.

Considering that self-supervised AVSL is not fully learnable, Attention10k Senocak et al. (2018, 2019) extended the self-supervised model to an SSL model by directly appending a supervised loss on labeled data, which is the first semi-supervised attempt in the field. Nevertheless, simply leveraging labeled data might lead to overfitting and neglect to fully harness the underlying unlabeled data. Given these issues, we resort to pseudo-labeling Lee et al. (2013). However, directly introducing pseudo-labeling could lead to confirmation bias Arazo et al. (2020) which cannot be adequately rectified by a single model.

To tackle these challenges, we break away from traditional self-supervised learning and propose a more sophisticated Semi-Supervised Audio-Visual Source Localization (SS-AVSL) framework, called Dual Mean-Teacher (DMT), which adopts a double teacher-student structure trained in two stages. We consider previous AVSL methods as a single student unit. To fully leverage positional annotations and training data, we extend it to the classic semi-supervised Mean-Teacher framework Tarvainen and Valpola (2017). To address the issue of confirmation bias, we expand it into a dual independent teacher-student structure with two designed modules, Noise Filtering and Intersection of Pseudo-Labels (IPL), as shown in Figure 2. Specifically, the teachers are pre-trained on a limited amount of labeled data in the Warm-Up Stage, establishing a solid foundation. In the subsequent Unbiased-Learning Stage, the dual teachers filter out noisy samples and rectify pseudo-labels. In more detail, the Noise Filtering module rejects noisy samples by leveraging consensus, i.e., agreement, between the dual teachers, ensuring high-quality training data; the IPL module then generates precise pseudo-labels by intersecting the predictions from both teachers. DMT eliminates the influence of confirmation bias by rejecting noisy samples and improving the quality of pseudo-labels, which effectively tackles the issue of false positives and greatly improves localization performance.

Figure 1: Comparison of existing Audio-Visual Source Localization (AVSL) methods and the proposed Dual Mean-Teacher (DMT). Left: DMT largely addresses severe issues including inaccurate small object localization, blurry boundaries, and instability. Right: DMT outperforms previous methods by a large margin on the Flickr and VGG-ss datasets.

In summary, our work makes the following three contributions. Firstly, we introduce a novel unbiased framework based on a pseudo-labeling mechanism for semi-supervised AVSL, which maximizes the utilization of both labeled and unlabeled data, effectively addresses the challenge of limited annotated data, and mitigates the issue of confirmation bias. Moreover, compared to existing approaches, DMT achieves remarkable localization performance, with better small object localization and stronger generalization capability, which significantly improves on the performance of current methods. Finally, DMT can be summarized as a semi-supervised learning paradigm and can be combined with existing (weakly-supervised) AVSL methods to consistently boost their performance.

2 Related Works

Semi-Supervised Learning.

Semi-Supervised Learning (SSL) Yang et al. (2022); Van Engelen and Hoos (2020) leverages a small amount of labeled data to unlock the potential of unlabeled data for better model learning. One line of work relies on consistency regularization Tarvainen and Valpola (2017); Sajjadi et al. (2016); Laine and Aila (2017) to encourage models to behave similarly under different perturbations. An orthogonal idea is to generate high-quality pseudo-labels Lee et al. (2013); Xie et al. (2020) on unlabeled data to retrain models for better performance. The quality of pseudo-labels is crucial. Current methods Berthelot et al. (2019); Sohn et al. (2020a); Zhang et al. (2021) combine the above two paradigms to achieve remarkable results.

Audio-Visual Source Localization.

Audio-Visual Source Localization (AVSL) endeavors to establish the correspondence between visual objects and their corresponding sounds by contrastive learning Chen et al. (2020b, c). Most existing methods are self-supervised or weakly-supervised (since all of them employ pre-trained backbones). Some classical works, such as Attention10k Senocak et al. (2018, 2019), DMC Zhu et al. (2021), LVS Chen et al. (2021b), EZVSL Mo and Morgado (2022a), SSPL Song et al. (2022), and SSL-TIE Liu et al. (2022a), have progressively improved performance over time. Other methods like DSOL Hu et al. (2020), CoarsetoFine Qian et al. (2020), mix and localize Hu et al. (2022), and AVGN Mo and Tian (2023) pay attention to multi-source localization. In addition, some studies also address the issue of false positives and false negatives in self-supervised learning. For example, SLAVC Mo and Morgado (2022b) focuses on false positives and effectively overcomes the overfitting issue. IER Liu et al. (2022b) proposes a label-free method that targets the suppression of non-sounding objects, while Robust Morgado et al. (2021a) considers both false positive and false negative issues. AVID Morgado et al. (2021b) detects false negatives by defining sets of positive and negative samples via cross-modal agreement. Source separation Zhao et al. (2018, 2019) and generative models Sanguineti et al. (2021) also achieve good results. However, most AVSL methods exhibit subpar performance in the absence of annotation.

Semi-Supervised Learning in Localization.

Semi-Supervised Object Detection (SSOD) is one of the few applications of SSL in the localization field. The majority of SSOD methods, such as Sohn et al. (2020b); Xu et al. (2021); Li et al. (2022), utilize pseudo-labeling to enhance the localization performance. Moreover, some works like Zhou et al. (2021); Liu et al. (2021) focus on the confirmation bias in SSOD. Similar to object detection, AVSL is a pixel-wise dense prediction task, heavily reliant on high-quality pseudo-labels. Attention10k Senocak et al. (2018, 2019) is the first SS-AVSL work. It extends a self-supervised model to a semi-supervised framework by simply adding a supervised loss, aiming at fixing the false conclusions generated by weakly-supervised methods. However, this naive method may lead to overfitting and neglects the full utilization of unlabeled data. In contrast, we introduce a novel SS-AVSL framework based on pseudo-label mechanism, which can address confirmation bias and maximize the utilization of both labeled and unlabeled data, to achieve stronger localization performance.

3 Background

Problem Definition.

Audio-Visual Source Localization (AVSL) aims to accurately locate the sound source within a given visual scene. We denote audio-visual pairs as $(a_{i},v_{i})$, where $a_{i}$ and $v_{i}$ represent the audio and visual modality, respectively. The objective is to generate a pixel-wise confidence map $\mathcal{P}$ indicating the location of the sound source.

Contrastive Learning in AVSL.

Self-supervised AVSL methods commonly leverage audio-visual correspondence to maximize the similarity between frames and their corresponding audio clips (positive pairs) while minimizing the similarity among unpaired ones (negative pairs):

$$\mathcal{L}_{\text{unsup}}=-\mathbb{E}_{(a_{i},v_{i})\sim\mathcal{D}_{u}}\left[\log\frac{\exp\left(s(g(a_{i}),f(v_{i}))/\tau_{t}\right)}{\sum_{j=1}^{n}\exp\left(s(g(a_{i}),f(v_{j}))/\tau_{t}\right)}+\log\frac{\exp\left(s(f(v_{i}),g(a_{i}))/\tau_{t}\right)}{\sum_{j=1}^{n}\exp\left(s(f(v_{i}),g(a_{j}))/\tau_{t}\right)}\right], \tag{1}$$

where $\mathcal{D}_{u}$ is the unlabeled dataset, $g(\cdot)$ and $f(\cdot)$ are the audio and visual feature extractors, $\tau_{t}$ is the temperature coefficient, and $s(\cdot)$ is the consistency matching criterion. The predicted map $\mathcal{P}_{i}$ is typically calculated with cosine similarity $\mathrm{sim}(\cdot)$ to represent the confidence of the presence of sounding objects:

$$\mathcal{P}_{i}=\mathrm{sim}(g(a_{i}),f(v_{i}))=\frac{\left\langle g(a_{i}),f(v_{i})\right\rangle}{\left\|g(a_{i})\right\|\cdot\left\|f(v_{i})\right\|}. \tag{2}$$
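As a rough illustration of Eqs. (1) and (2), the sketch below computes a cosine-similarity confidence map and a symmetric contrastive loss in plain Python. It is not the authors' implementation: the function names, the list-based feature layout (an $H\times W$ grid of $C$-dimensional vectors), and the assumption that the matching criterion $s(\cdot)$ is symmetric and already collected in a square matrix are all choices made here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def confidence_map(audio_vec, visual_feats):
    """Eq. (2) applied at every spatial location.

    visual_feats is assumed to be an H x W grid of C-dimensional vectors.
    """
    return [[cosine(audio_vec, cell) for cell in row] for row in visual_feats]

def info_nce(sim_matrix, tau):
    """One direction of Eq. (1).

    sim_matrix[i][j] is assumed to hold s(g(a_i), f(v_j)) for a batch of n pairs.
    """
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [sim_matrix[i][j] / tau for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[i] - log_denom
    return -total / n

def symmetric_info_nce(sim_matrix, tau):
    """Audio-to-visual plus visual-to-audio terms, assuming s is symmetric."""
    transposed = [list(col) for col in zip(*sim_matrix)]
    return info_nce(sim_matrix, tau) + info_nce(transposed, tau)
```

With a batch whose matching pairs score highest on the diagonal, `symmetric_info_nce` returns a smaller value than for a batch whose diagonal scores are lowest, which is the behavior the contrastive objective rewards.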

Learning with (Pseudo) Labels in AVSL.

When labeled data are available, one could apply supervised loss directly to learn to localize:

$$\mathcal{L}_{\text{sup}}=\mathbb{E}_{i\sim\mathcal{D}}H(\mathcal{G}_{i},\mathcal{P}_{i}), \tag{3}$$

where $\mathcal{G}_{i}$ can be either the ground truth or a generated pseudo-label; both $\mathcal{G}_{i}$ and $\mathcal{P}_{i}$ take the form of binary confidence maps. $H(\cdot,\cdot)$ is the cross-entropy function computed across the two spatial axes.
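A minimal sketch of the spatial cross-entropy $H(\mathcal{G}_{i},\mathcal{P}_{i})$ in Eq. (3) follows. It assumes a per-pixel binary cross-entropy averaged over the map, and that the predicted values have already been mapped into $(0,1)$; the paper does not spell out either detail, and the name `pixel_bce` and the clipping constant `eps` are hypothetical.

```python
import math

def pixel_bce(target, pred, eps=1e-7):
    """Binary cross-entropy averaged over all pixels of an H x W map.

    target: binary map (ground truth or pseudo-label).
    pred:   predicted confidences, assumed to lie in (0, 1).
    eps:    clipping constant (an assumption) that avoids log(0).
    """
    total, count = 0.0, 0
    for t_row, p_row in zip(target, pred):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

The same function can stand in for the supervision used in the warm-up and student losses, since both compare a binary target map with a predicted map.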

4 Dual Mean-Teacher

Overview.

In this section, we describe Dual Mean-Teacher (DMT) in the order of the learning process. Specifically, in the Warm-Up Stage (Section 4.1), two teachers are pre-trained on bounding-box annotated data to obtain a stable initialization. In the subsequent Unbiased-Learning Stage (Section 4.2), the Noise Filtering module and the Intersection of Pseudo-Labels (IPL) module use the two teachers to jointly filter out noisy samples and generate high-quality pseudo-labels that guide the students’ training; the teachers are in turn updated by the exponential moving average (EMA) of the students.

From a unified perspective, existing AVSL methods can be viewed as a single student, which is later expanded into the semi-supervised classical framework Mean-Teacher Tarvainen and Valpola (2017), in order to fully utilize limited labeled data. To effectively address the confirmation bias issue, we further extend it to a double teacher-student framework, as shown in Figure 2. DMT adopts a teacher-student architecture, where each teacher and student contains two independent AVSL pipelines for audio-visual contrastive learning and generating localization results. Teachers provide students with stable unlabeled samples after Noise Filtering and their generated IPL. Students pass back the parameters to the teachers.

General Notations.

The cross-entropy function $H(\cdot,\cdot)$ and the two feature extractors $f(\cdot)$ and $g(\cdot)$ are already discussed in Section 3. By default, subscript $i$ indicates the $i$-th sample, while superscripts $A$ and $B$ denote the two teacher-student pairs, with $t$ and $s$ indicating teacher and student, respectively. $\mathcal{D}_{l}$ and $\mathcal{D}_{u}$ are the labeled and unlabeled datasets. $\mathcal{G}_{i}$ and $\mathcal{P}_{i}$ are the ground-truth and predicted confidence maps, both of size $H\times W$. We apply strong $\mathcal{A}(\cdot)$ or weak $\alpha(\cdot)$ augmentation to visual inputs.

4.1 Warm-Up Stage

The quality of pseudo-labels is crucial to SSL, especially for localization tasks. Before generating pseudo-labels, we first pre-train the dual teachers with bounding-box annotated data to achieve preliminary localization performance. To avoid overfitting, we apply strong augmentation and obtain the augmented labeled dataset $\mathcal{D}_{l}=\{(a_{i},\mathcal{A}(v_{i})),\mathcal{G}_{i}\}$. After extracting visual features $f^{t}(\mathcal{A}(v_{i}))$ and auditory features $g^{t}(a_{i})$, the predicted map $\mathcal{P}_{i}^{t}$ can be obtained by Eq. (2). Then we utilize bounding-box annotations $\mathcal{G}_{i}$ as supervision:

$$\mathcal{L}_{\text{Warm-Up}}=\mathbb{E}_{(a_{i},v_{i})\sim\mathcal{D}_{l}}H(\mathcal{G}_{i},\mathcal{P}_{i}^{t}). \tag{4}$$

4.2 Unbiased-Learning Stage

Noise Filtering.

To mitigate confirmation bias, it is crucial to filter out noisy samples that are more likely to be false positives. As depicted in Figure 2, the dual teachers generate two predicted maps for the same sample. A sample is considered more reliable when the two predicted maps are more similar, i.e., when there is more agreement and consensus between the dual teachers; such a sample is retained for pseudo-labeling. Conversely, when there is a significant discrepancy between the two maps, the sample is considered a false positive, e.g., a frame without distinguishable sounding objects or a sound that cannot be accurately represented by a bounding box (such as wind); such samples are rejected and discarded.

Intersection of Pseudo-Labels (IPL).

By intersecting the foreground regions of the two predicted maps on the filtered samples, one can generate positional pseudo-labels, named Intersection of Pseudo-Labels (IPL), to guide students’ learning. With a pre-defined foreground threshold $\delta$, the two predicted maps $\mathcal{P}_{i}^{t,A}$ and $\mathcal{P}_{i}^{t,B}$ are converted to binary maps $\mathcal{M}_{i}^{t,A}$ and $\mathcal{M}_{i}^{t,B}$. Weak augmentation $\alpha(\cdot)$ is employed for teachers to generate high-quality pseudo-labels:

$$\mathcal{P}_{i}^{t}=\mathrm{sim}(g^{t}(a_{i}),f^{t}(\alpha(v_{i}))), \tag{5}$$
$$\mathcal{M}_{i}^{t}=\mathbb{1}(\mathcal{P}_{i}^{t}\geq\delta). \tag{6}$$

We adopt the Intersection over Union (IoU) metric to quantify the similarity between the two maps $\mathcal{M}_{i}^{t,A}$ and $\mathcal{M}_{i}^{t,B}$. If the IoU score exceeds the threshold $\tau$, the sample will be accepted, and the intersection of those two maps will be generated as its pseudo-label (IPL). Otherwise, it will be filtered out as a noisy sample.

$$\mathcal{D}_{u}^{\prime}=\Big\{(a_{i},v_{i})\ \big|\ \mathrm{IoU}(\mathcal{M}_{i}^{t,A},\mathcal{M}_{i}^{t,B})\geq\tau,\ \forall(a_{i},v_{i})\in\mathcal{D}_{u}\Big\}. \tag{7}$$

$$\mathcal{IPL}(a_{i},v_{i})=\mathcal{M}_{i}^{t,A}\cdot\mathcal{M}_{i}^{t,B}. \tag{8}$$

The newly selected unlabeled dataset is applied to the student model along with the corresponding high-quality IPL.
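The Noise Filtering and IPL steps of Eqs. (6) to (8) can be sketched for a single sample as follows. This is an illustrative reading rather than the released code: the helper names are hypothetical, maps are plain nested lists, and a sample whose two binary maps are both empty is treated as rejected (an assumption, since the IoU is undefined in that case). The default thresholds follow the values of $\delta$ and $\tau$ reported in the implementation details.

```python
def binarize(pred_map, delta):
    """Eq. (6): foreground mask from a confidence map."""
    return [[1 if p >= delta else 0 for p in row] for row in pred_map]

def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of equal size."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    # Assumption: two empty masks count as no agreement.
    return inter / union if union else 0.0

def intersect(mask_a, mask_b):
    """Eq. (8): element-wise product of the two teachers' masks."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]

def filter_and_label(pred_a, pred_b, delta=0.6, tau=0.7):
    """Return the IPL mask if the teachers agree (Eq. (7)), else None."""
    mask_a, mask_b = binarize(pred_a, delta), binarize(pred_b, delta)
    if iou(mask_a, mask_b) >= tau:
        return intersect(mask_a, mask_b)
    return None
```

Returning `None` for rejected samples keeps the filtering decision and the pseudo-label in one place, so a caller can drop rejected samples with a single check.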

Students Learning without bias.

To suppress confirmation bias more effectively, we mix labeled and new unlabeled datasets. Both ground-truth annotations and high-quality IPL are employed to train the student models:

$$\mathcal{D}_{mix}=\mathcal{D}_{l}\cup\mathcal{D}_{u}^{\prime}=\{(a_{i},v_{i}),\widehat{\mathcal{G}}_{i}\},\quad\text{where}\ \widehat{\mathcal{G}}_{i}=\begin{cases}\mathcal{G}_{i} & \text{if}\ (a_{i},v_{i})\in\mathcal{D}_{l},\\ \mathcal{IPL}(a_{i},v_{i}) & \text{if}\ (a_{i},v_{i})\in\mathcal{D}_{u}^{\prime}.\end{cases} \tag{11}$$
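The construction of $\mathcal{D}_{mix}$ in Eq. (11) amounts to concatenating the labeled samples with the unlabeled samples that survived filtering. The sketch below assumes each dataset is a list of `(sample, target)` tuples and that rejected unlabeled samples carry `None` as their target; the function name and the extra `"gt"`/`"ipl"` tag are hypothetical conveniences, not part of the paper.

```python
def build_mixed_dataset(labeled, pseudo_labeled):
    """Merge labeled data with filtered, pseudo-labeled data (Eq. (11)).

    labeled:        list of (sample, ground_truth_map).
    pseudo_labeled: list of (sample, ipl_map_or_None); None marks a
                    sample rejected by Noise Filtering.
    Returns a list of (sample, target_map, source) triples.
    """
    mixed = [(sample, gt, "gt") for sample, gt in labeled]
    mixed += [(sample, ipl, "ipl")
              for sample, ipl in pseudo_labeled if ipl is not None]
    return mixed
```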

In addition, we incorporate consistency regularization Laine and Aila (2016) in the semi-supervised learning process. Specifically, for a given sample, we obtain IPL from the teachers on weakly augmented images while strong augmentations are applied for samples of students. By enforcing consistency between IPL and students’ predictions, DMT could be more stable with better generalization ability.

$$\mathcal{P}_{i}^{s}=\mathrm{sim}(g^{s}(a_{i}),f^{s}(\mathcal{A}(v_{i}))), \tag{12}$$
$$\mathcal{L}_{\text{sup}}=\mathbb{E}_{i\sim\mathcal{D}_{mix}}H(\widehat{\mathcal{G}}_{i},\mathcal{P}_{i}^{s}). \tag{13}$$

Similar to the AVSL methods described in Section 3, students are also trained with the contrastive loss based on audio-visual correspondence. Here, we introduce an attention module that weights the visual features toward the sounding region in the frame:

$$f_{att}(v_{i})=\frac{\exp\left(\mathcal{P}_{i}(x,y)\right)}{\sum_{x,y}\exp\left(\mathcal{P}_{i}(x,y)\right)}\cdot f(v_{i}). \tag{14}$$
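Eq. (14) is a softmax over spatial positions of the predicted map, used as a weight on the visual features. A minimal sketch is given below; it assumes the features are stored as an $H\times W$ grid of vectors and returns the weighted feature map without pooling. Whether the weighted features are subsequently summed into a single vector is not stated in this section, so that step is left out, and the function names are hypothetical.

```python
import math

def spatial_softmax(pred_map):
    """Softmax over all (x, y) positions of an H x W map."""
    m = max(max(row) for row in pred_map)  # subtract the max for stability
    exps = [[math.exp(p - m) for p in row] for row in pred_map]
    z = sum(sum(row) for row in exps)
    return [[e / z for e in row] for row in exps]

def attend(pred_map, visual_feats):
    """Eq. (14): scale each location's feature vector by its attention weight."""
    weights = spatial_softmax(pred_map)
    return [[[weights[y][x] * c for c in visual_feats[y][x]]
             for x in range(len(weights[y]))]
            for y in range(len(weights))]
```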

Then, the full semi-supervised loss is composed of $\mathcal{L}_{\text{sup}}$ (see Eq. (13)) on $\mathcal{D}_{mix}$ and $\mathcal{L}_{\text{unsup}}$ (see Eq. (1)) on $\mathcal{D}_{u}$:

$$\mathcal{L}_{\text{full}}=\left(\mathcal{L}_{\text{sup}}^{A}+\mathcal{L}_{\text{sup}}^{B}\right)+\lambda_{u}\left(\mathcal{L}_{\text{unsup}}^{A}+\mathcal{L}_{\text{unsup}}^{B}\right). \tag{15}$$

Update of Students and Teachers.

Students are updated via gradient descent on $\mathcal{L}_{\text{full}}$, while dual teachers are updated through the exponential moving average (EMA) of the corresponding students:

$$\theta_{m}^{s}\leftarrow\theta_{m-1}^{s}-\gamma\frac{\partial\mathcal{L}_{\text{full}}}{\partial\theta_{m-1}^{s}},\quad\theta_{m}^{t}\leftarrow\beta\theta_{m-1}^{t}+(1-\beta)\theta_{m}^{s}. \tag{16}$$

The slowly progressing teachers can be regarded as the ensemble of students in recent training iterations, which enables stable progress in training.
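The two updates in Eq. (16) can be written out on flat parameter lists as a small sketch. The function names, the flat-list representation, and the idea of passing gradients in explicitly are assumptions for illustration; in practice the gradients would come from an automatic differentiation framework.

```python
def sgd_step(student, grads, lr):
    """Student update: theta_s <- theta_s - gamma * dL/dtheta_s."""
    return [w - lr * g for w, g in zip(student, grads)]

def ema_update(teacher, student, beta):
    """Teacher update: theta_t <- beta * theta_t + (1 - beta) * theta_s."""
    return [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]

# One training iteration for a single teacher-student pair (toy values).
student = [1.0, 2.0]
teacher = [1.0, 2.0]
student = sgd_step(student, grads=[0.5, -1.0], lr=0.1)
teacher = ema_update(teacher, student, beta=0.9)
```

With $\beta$ close to 1 the teacher moves only a small fraction of the way toward the student at each step, which is what makes it behave like an average of recent students.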

4.3 Unbiased Superiority of Dual Mean-Teacher

For dense prediction tasks such as AVSL, employing pseudo-labels for model training can easily accumulate errors and lead to sub-optimal results. In our DMT framework, the unbiased characteristics can be attributed to the following three factors: (i) Noise Filtering ensures that only stable samples are used for training; (ii) IPL generates high-quality pseudo-labels; (iii) the dual teachers are pre-trained on bounding-box annotated data with strong augmentation in the Warm-Up Stage. This conclusion is validated by the ablation studies in Section 5.4.

5 Experiments

With limited annotated data, DMT significantly raises the performance of AVSL and addresses severe issues, e.g., false positives and poor localization of small objects. We then focus on answering the following questions with the ablation experiments in Section 5.4:

• What is the individual contribution of each module to the performance gains?
• How does annotation enhance localization performance?
• Why can DMT outperform the existing semi-supervised AVSL method?
• Is it necessary to warm up dual teachers?
• How to effectively mitigate confirmation bias in AVSL?

5.1 Experimental Settings

Datasets.

We conduct experiments on two large-scale audio-visual datasets: Flickr-SoundNet Senocak et al. (2018, 2019) and VGG Sound Source Chen et al. (2020d), which contain 5,000 and 5,158 bounding-box annotated samples, respectively. For labeled data, we randomly select 4,250 samples for training and 500 for validation, and keep the same 250-sample test sets as previous works Mo and Morgado (2022a, b); Chen et al. (2021b); Liu et al. (2022a). Moreover, we select subsets of 10k and 144k unlabeled samples for training, as in previous works. Details can be found in Appendix B.1.

Audio and Visual Backbones.

For visual backbones, we followed prior work and used ResNet-18 He et al. (2016) pre-trained on ImageNet Deng et al. (2009). For audio backbones, we select the pre-trained VGGish Hershey et al. (2017) and SoundNet Aytar et al. (2016) with semantic audio information. Further details can be found in Appendix B.2.

Metrics.

We report the Consensus Intersection over Union (CIoU) and Area Under Curve (AUC), following previous settings Senocak et al. (2018, 2019). CIoU represents the localization accuracy: samples with IoU above the threshold $\delta=0.5$ are considered to be accurately located. To account for small objects, we additionally introduce Mean Square Error (MSE), which measures the average pixel-wise difference between two maps without binarization, making it more suitable for evaluating dense prediction tasks on small objects. More details are shown in Appendix B.3.
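The sketch below shows how such metrics could be computed once a per-sample IoU is available. It is a simplification under stated assumptions: the consensus weighting of multiple annotators that gives CIoU its name is omitted, the AUC is taken as the trapezoidal area under the success-rate curve over 21 evenly spaced thresholds (a convention assumed here, not taken from this section), and all function names are hypothetical.

```python
def success_rate(ious, threshold=0.5):
    """Fraction of samples whose IoU reaches the threshold."""
    return sum(1 for v in ious if v >= threshold) / len(ious)

def auc(ious, steps=20):
    """Area under the success-rate curve for thresholds in [0, 1].

    Uses the trapezoidal rule over steps + 1 evenly spaced thresholds
    (the spacing is an assumption).
    """
    thresholds = [k / steps for k in range(steps + 1)]
    rates = [success_rate(ious, t) for t in thresholds]
    return sum((rates[k] + rates[k + 1]) / 2 * (thresholds[k + 1] - thresholds[k])
               for k in range(steps))

def mse(map_a, map_b):
    """Mean squared pixel-wise difference between two maps (no binarization)."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(map_a, map_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)
```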

Table 1: Comparison results on Flickr-SoundNet. Models are trained on Flickr 10k and 144k. $\dagger$ indicates our reproduced results; others are borrowed from the original papers. Attention10k-SSL is supervised with 2k labeled samples. We report the proposed DMT results from both stages as stage-2 (stage-1). $|\mathcal{D}_{l}|$ denotes the number of labeled samples.

Methods	Flickr 10k CIoU	Flickr 10k AUC	Flickr 144k CIoU	Flickr 144k AUC
Attention10k Senocak et al. (2018, 2019)	43.60	44.90	66.00	55.80
CoarsetoFine Qian et al. (2020)	52.20	49.60	–	–
DMC Zhu et al. (2021)	–	–	67.10	56.80
LVS Chen et al. (2021b)	58.20	52.50	69.90	57.30
EZVSL Mo and Morgado (2022a)	62.65	54.89	72.69	58.70
SLAVC${}^{\dagger}$ Mo and Morgado (2022b)	66.80	56.30	73.84	58.98
SSPL Song et al. (2022)	74.30	58.70	75.90	61.00
SSL-TIE${}^{\dagger}$ Liu et al. (2022a)	75.50	58.80	81.50	61.10
Attention10k-SSL Senocak et al. (2018, 2019)	82.40	61.40	83.80	61.72
Ours ($|\mathcal{D}_{l}|=256$)	87.20 (84.40)	65.77 (59.60)	87.60 (84.40)	66.28 (59.60)
Ours ($|\mathcal{D}_{l}|=2k$)	87.80 (85.60)	66.20 (63.18)	88.20 (85.60)	66.63 (63.18)
Ours ($|\mathcal{D}_{l}|=4k$)	88.80 (86.20)	67.81 (65.56)	90.40 (86.20)	69.36 (65.56)

Table 2: Comparison results on VGG-ss. Models are trained on VGG-Sound 10k and 144k.

Methods	VGG-Sound 10k CIoU	VGG-Sound 10k AUC	VGG-Sound 144k CIoU	VGG-Sound 144k AUC
Attention10k Senocak et al. (2018, 2019)	16.00	28.30	18.50	30.20
LVS Chen et al. (2021b)	27.70	34.90	34.40	38.20
EZVSL Mo and Morgado (2022a)	32.30	33.68	34.38	37.70
SLAVC${}^{\dagger}$ Mo and Morgado (2022b)	37.80	39.48	39.20	39.46
SSPL Song et al. (2022)	31.40	36.90	33.90	38.00
SSL-TIE${}^{\dagger}$ Liu et al. (2022a)	36.80	37.21	38.60	39.60
Attention10k-SSL${}^{\dagger}$ Senocak et al. (2018, 2019)	38.60	38.26	39.20	38.52
Ours ($|\mathcal{D}_{l}|=256$)	41.20 (39.40)	40.68 (38.70)	43.60 (39.40)	41.88 (38.70)
Ours ($|\mathcal{D}_{l}|=2k$)	43.20 (42.60)	40.82 (40.75)	45.60 (42.60)	43.24 (40.75)
Ours ($|\mathcal{D}_{l}|=4k$)	46.80 (43.80)	43.18 (41.63)	48.80 (43.80)	45.76 (41.63)

Implementation details.

For audio clips, we pass $96\times 64$ log-mel spectrograms to VGGish, whose output is a 512D vector, while the raw waveform of the original 3s audio clip is sent to SoundNet. For frames, we use an input image of size $256\times 256\times 3$, with $224\times 224\times 512$ as output. We choose RandAug Cubuk et al. (2020) as the strong augmentation, and random cropping, resizing, and random horizontal flipping as the weak augmentation. We set $\delta$ to 0.6 and $\tau$ to 0.7. More hyperparameter experiments are shown in Appendix C.5.

5.2 Comparison with the State-of-the-art Methods

Comprehensive experiments show that DMT achieves the state-of-the-art performance among all existing methods on both datasets, and showcases several advantages.

Effective Utilization of Finite Annotations and Remarkable Performance.

We tested DMT’s localization performance with varying amounts of labeled data and found that it consistently outperforms state-of-the-art methods with 256, 2k, and 4k labeled samples. Notably, even with just 256 labeled samples, DMT achieved an accuracy of $87.2\%$ to $87.6\%$, showing a significant improvement in CIoU of around 10 absolute points compared to preceding models. Additionally, our model shows a $3\%$ absolute improvement in CIoU compared to a supervised-only model. Furthermore, DMT maintains superior performance in complex and open environments Mo and Morgado (2022b); Ma et al. (2023); Zhu et al. (2024), as demonstrated in Table 2 and Table 3(c), indicating strong generalization capabilities. These results highlight DMT’s ability to improve localization performance by utilizing more unlabeled data.

Table 3: Performance comparisons in existing issues (small objects localization and false positives) and complex scenarios (open set). The results of small objects and open set are tested on the VGG-ss dataset, while false positives are reported on the Flickr dataset.

Methods	Small Testset MSE $\downarrow$	Small Testset IoU $\uparrow$	Medium Testset MSE $\downarrow$	Medium Testset IoU $\uparrow$
LVS Chen et al. (2021b)	0.515	0.021	0.441	0.265
EZVSL Mo and Morgado (2022a)	0.566	0.023	0.467	0.268
SLAVC Mo and Morgado (2022b)	0.705	0.021	0.568	0.220
Ours	0.160	0.025	0.174	0.335

(a) Small objects.

Methods	AP $\uparrow$	max-F1 $\uparrow$	Acc $\uparrow$
LVS Chen et al. (2021b)	9.80	17.90	19.60
DMC Zhu et al. (2021)	25.56	41.80	52.80
EZVSL Mo and Morgado (2022a)	46.30	54.60	66.40
SLAVC Mo and Morgado (2022b)	51.63	59.10	83.60
Ours	53.56	62.80	91.60

(b) False positives.

Methods	CIoU $\uparrow$	AUC $\uparrow$
LVS Chen et al. (2021b)	26.30	34.70
EZVSL Mo and Morgado (2022a)	39.57	39.60
SLAVC Mo and Morgado (2022b)	38.92	41.17
Ours	43.12	42.81

(c) Open set.

Significant Advancement in Small Subset Localization.

We categorize objects based on their bounding box pixel area into small ($1\sim 32^{2}$), medium ($32^{2}\sim 96^{2}$), large ($96^{2}\sim 144^{2}$) and huge ($144^{2}\sim 224^{2}$). We tested different methods on small and medium objects in the VGG-Sound dataset, focusing on the challenges of detecting small objects and reducing false positives mentioned earlier. The results in Table 3(a) show that DMT significantly improves performance, especially in terms of the MSE metric. Despite some errors in the IoU metric, DMT still outperforms previous methods. The results in Figure 1 show that DMT accurately locates sounding objects with clear boundaries and precisely converges to object contours, unlike previous methods that often have excessive or insufficient foreground regions, especially for small objects. These results demonstrate the effective and precise localization of small objects achieved by DMT. More experiments on different object sizes are in Appendix C.7.
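The size buckets above can be expressed as a small helper. The text does not say which bucket a boundary value such as exactly $32^{2}$ pixels falls into, so the sketch assumes upper bounds are inclusive; the function name is hypothetical.

```python
def size_category(box_area):
    """Map a bounding-box pixel area to the size bucket used in the text.

    Assumption: each upper bound is inclusive (e.g. an area of exactly
    32**2 counts as small).
    """
    if box_area < 1 or box_area > 224 ** 2:
        raise ValueError("area outside the 1 to 224^2 range")
    if box_area <= 32 ** 2:
        return "small"
    if box_area <= 96 ** 2:
        return "medium"
    if box_area <= 144 ** 2:
        return "large"
    return "huge"
```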

Capability of Learning Rich Semantic Information.

We present visualized predictions for testsets of varying sizes in Figure 1 (Left). It is evident that our approach achieves remarkable precision in localizing sounding sources. It accurately locates the position of sounding objects and precisely converges to their boundaries, while prior methods usually have excessive or insufficient foreground regions, particularly in the case of small objects.

It is worth noting that our method can even find sounding objects overlooked in the manual annotations. For instance, in Figure 1 (Left, 4th row), the heatmap reveals the presence of a piano, which is omitted in the manual annotation process. Furthermore, we assessed the model’s capability to identify false positives, i.e., instances where sounding objects are not visually observable within the image (off-screen), as shown in Table 3(b). This reflects the ability of Dual Mean-Teacher to extract audio semantic information and effectively localize multiple sounding objects within a scene, a feat that eludes other methods. We attribute this capability to the semantic alignment of audio-visual features achieved through the pre-trained VGGish and SoundNet backbones.

Capacity for Cross-Domain Generalization and Multi-Source Localization.

We tested DMT’s generalization across different domains and its ability to localize multiple objects. Models trained using VGG-ss 144k were directly evaluated on the MUSIC-solo Zhao et al. (2018), MUSIC-duet Zhao et al. (2018), and MUSIC-synthetic Hu et al. (2020, 2021) datasets. Figure 3 demonstrates DMT’s strong generalization performance in the music domain, outperforming other methods. As shown in Figure 3, the previous method struggles to accurately localize multiple sounding objects, either missing them or including all sounding objects within a large foreground area. In contrast, DMT localizes each instrument accurately and separately. However, without category information for fine-grained training, DMT achieves sub-optimal performance in differentiating between multiple active and silent instruments. There is still significant room for improvement with multiple sounding objects, and we plan to address this issue in future work.

5.3 Extensions of Dual Mean-Teacher

Table 4: Extension results of DMT with various audio backbones, with ‘R’, ‘V’ and ‘S’ denoting ResNet, VGGish and SoundNet.

Methods	Backbones	CIoU $\uparrow$	AUC $\uparrow$	MSE $\downarrow$
EZVSL w/o DMT	R	62.65	54.89	0.428
EZVSL w/ DMT	R+V	85.30	65.80	0.312
EZVSL w/ DMT	R+S	85.95	66.12	0.298
EZVSL w/ DMT	V+S	87.20	67.74	0.256
SLAVC w/o DMT	R	66.80	56.30	0.386
SLAVC w/ DMT	R+V	86.10	66.24	0.288
SLAVC w/ DMT	R+S	86.30	66.58	0.283
SLAVC w/ DMT	V+S	88.80	68.69	0.247

We replicated several existing methods and integrated them into our framework. Notably, integrating Dual Mean-Teacher significantly enhances the performance of these methods. In Table 4, one can observe a noteworthy improvement in the CIoU of EZVSL from 62.65% to 87.20%, and of SLAVC from 66.80% to 88.80%, which further reinforces the efficacy of our framework and highlights its flexible extensibility.

5.4 Ablation Studies

Table 5: Main ablation study results. Models are trained on Flickr 144k and tested on Flickr-SoundNet testset, where ‘S’ and ‘V’ denote SoundNet and VGGish respectively.

Modules					Performance
# Teachers	Backbone	Filter	IPL	EMA	CIoU $\uparrow$	AUC $\uparrow$	MSE $\downarrow$
One teacher	(a). S	✗	✗	✗	80.20	53.57	0.458
One teacher	(b). V	✗	✗	✗	81.80	55.92	0.379
Dual teachers	(c). S+V	✗	✗	✗	82.20	55.16	0.382
	(d). S+V	✗	✗	✓	82.80	59.38	0.375
	(e). S+V	✗	✓	✗	83.60	62.83	0.359
	(f). S+V	✓	✗	✗	84.80	65.58	0.259
	(g). S+V	✓	✗	✓	86.20	66.56	0.260
	(h). S+V	✗	✓	✓	86.60	66.35	0.274
	(i). S+V	✓	✓	✗	88.60	66.68	0.260
	(j). S+V	✓	✓	✓	90.40	69.36	0.237

What is the individual contribution of each module to the performance gains?

In this section, we progressively analyze the performance gain from each module in detail. We choose a self-supervised approach as our baseline. Table 5 presents the results on the Flickr 144k training set with $10\%$ annotated samples.

(i) Initially, we apply the baseline under the semi-supervised pseudo-labeling mechanism, using only one backbone, as shown in Table 5(a) and Table 5(b). The localization performance improves with annotated data supervision, but the gain is limited due to poor-quality pseudo-labels.

(ii) Next, we extend it to a two-backbone architecture and sequentially introduce the filter, IPL, and EMA modules. The results demonstrate that all three modules contribute to performance improvement ( $3\%$ , $1.8\%$ , $1\%$ ). Notably, the filter has the most significant impact, because it effectively rejects noisy samples and thereby stabilizes training.

(iii) Finally, we can observe that optimal performance is achieved through the joint integration of the three modules by effectively suppressing confirmation bias.

Table 6: Performance on various labeled ratios $\%$ and multiple $\times$ on Flickr 144k.

Labeled ratio $\%$	CIoU	AUC
$0.5\%$ (200/40k)	84.80	63.58
$1\%$ (400/40k)	86.20	65.16
$2\%$ (800/40k)	87.20	65.94
$5\%$ (2k/40k)	87.60	67.44
$10\%$ (4k/40k)	88.40	68.12
Multiple $\times$	CIoU	AUC
$2.5\times$ (4k/10k)	88.00	67.80
$5\times$ (4k/20k)	88.20	67.91
$10\times$ (4k/40k)	88.40	68.12
$20\times$ (4k/80k)	89.20	68.44
$40\times$ (4k/200k)	91.20	71.36

How does annotation help localization?

We aim to demonstrate that even with extremely limited labeled data, significant performance can still be achieved. To this end, we investigated the performance of our model with labeled ratios from $0.5\%$ to $10\%$ , and report the results in Table 6. Our model consistently outperforms state-of-the-art approaches across all ratios. Furthermore, with a constant amount of unlabeled data, our model’s performance continues to improve as the proportion of labeled data increases, highlighting the significant impact of labeled data.

We also investigated the impact of varying amounts of unlabeled data while keeping labeled data constant. Experimental results in Table 6 show that increasing the amount of unlabeled data improves localization performance. This may seem to contradict the previous conclusion about the proportion of labeled data, but it actually demonstrates that labeled data can effectively leverage the unlabeled data. Based on our analysis, labeled data not only provides annotation information but also enhances the power of unlabeled data, resulting in significant performance improvements.

Why can DMT outperform the existing semi-supervised AVSL method?

Both naive SSL and DMT have utilized labeled and unlabeled data. However, a key distinction is that naive SSL employs unlabeled data only for contrastive loss, whereas DMT leverages pseudo-labels to incorporate unlabeled data into both contrastive loss and supervised loss, which amplifies the utilization of unlabeled data, thus enhancing generalization capability.

Data Utilization. We supplement the comparison experiments with fixed labeled data and an increase in unlabeled data from 10k to 200k, as shown in the left part of Table 7. As the amount of unlabeled data increases, naive SSL exhibits only marginal improvement, whereas DMT shows larger performance gains, indicating that DMT makes better use of unlabeled data.

Generalization Ability. The right part of Table 7 highlights the limitations of naive SSL on the open set and in-the-wild datasets, suggesting that adding a supervised loss alone may lead to overfitting and weaken generalization. In contrast, DMT effectively leverages pseudo-labels for improved generalization capability.

Table 7: Comparison of two SS-AVSL methods. * denotes the results from the original paper. ‘sim-avsl’ denotes the simple self-supervised AVSL model we use. We report the CIoU below.

	2.5k/10k	2.5k/144k	2.5k/200k	open set	cross-datasets
attention10k + naive SSL	84.00*/83.68	84.40*/84.08	84.24	19.60	62.20
attention10k + DMT (ours)	88.00	89.52	90.40	42.64	87.26
sim-avsl + naive SSL	83.84	84.24	84.40	20.80	60.60
sim-avsl + DMT (ours)	88.24	89.76	91.12	43.10	89.80

Is it necessary to warm up dual teachers?

We believe that the initialization of teachers and students is crucial, because the quality of pseudo-labels has a significant impact on model performance. To validate this idea, we conducted experiments on the impact of the Warm-Up Stage. We find that without the Warm-Up Stage, the model improves very slowly and its performance eventually deteriorates. This indicates that without a good initialization, the model can accumulate errors, leading to confirmation bias. We therefore conclude that Warm-Up is essential, as it effectively suppresses confirmation bias in the early stages of training.

How to effectively mitigate confirmation bias in AVSL?

In Section 4.3, we present the origins and mitigation strategies for confirmation bias in the localization task. In this section, we demonstrate them empirically. Figure 4(a) depicts the quality of pseudo labels before and after applying the filter module. It is evident that the quality of pseudo labels is significantly improved after filtering. This observation highlights the effectiveness of the filter in eliminating noisy samples, which tend to be false-positive instances. Figure 4(b) compares using the direct outputs of each teacher as pseudo labels with using their intersection, known as IPL, where the purple dashed line represents the initialization value. The results clearly indicate that employing the direct outputs alone leads to the accumulation of bias, causing a deterioration in the quality of pseudo labels throughout the training process. Conversely, IPL consistently preserves high-quality pseudo labels, thus mitigating the impact of bias.

Furthermore, Figure 4(c) visually presents the trend of model performance, revealing that the absence of any of these three modules results in a decline in model performance. EMA only affects the final performance of the model, whereas without the Filter module, the model’s performance is significantly affected by noise. Without the IPL module, the model experiences a continuous decline in performance due to erroneous estimation of pseudo-labels. Therefore, we find that the Noise Filtering and IPL modules play a significant role in addressing the confirmation bias problem. Moreover, Figure 4(d) reflects that under the joint action of the three modules, DMT generates more accurate pseudo-labels and its performance continues to improve steadily.

6 Conclusion

In this paper, we advance the naive SS-AVSL work and propose a novel Semi-Supervised Audio-Visual Source Localization (SS-AVSL) framework, namely Dual Mean-Teacher (DMT), considering the importance of both limited annotated and abundant unlabeled data. From a unified perspective, existing self-supervised (weakly-supervised) AVSL methods could be referred to as a single student structure, while DMT employs dual teacher-student pairs to filter out noisy samples via the agreement of two teachers and generate high-quality pseudo labels to avoid confirmation bias. DMT has greatly enhanced AVSL performance and addressed intractable issues like false positives and inaccurate localization of tiny objects. Moreover, DMT is a learning paradigm and could be seamlessly incorporated into existing AVSL methods and consistently boost their performance.

We hope this work will bring more attention to SS-AVSL, provoke a reconsideration of pseudo-labeling, bias avoidance, and better utilization of the underlying unlabeled data, and thus stimulate more semi-supervised learning research in this dense prediction task.

Acknowledgements

We would like to thank the National Natural Science Foundation of China under Grant 61773374 and the Major Basic Research Projects of Natural Science Foundation of Shandong Province under Grant ZR2019ZD07. We also appreciate Shuailei Ma, Kecheng Zheng, and Ziyi Wang for their valuable and insightful discussions.

References

Arandjelovic and Zisserman [2017] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
Zhu et al. [2021] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18:351–376, 2021.
Ramaswamy [2020] Janani Ramaswamy. What makes the sound?: A dual-modality interacting network for audio-visual event localization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4372–4376, 2020. doi: 10.1109/ICASSP40776.2020.9053895.
Chen et al. [2020a] Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. arXiv preprint arXiv:2008.09622, 2020a.
Chen et al. [2021a] Changan Chen, Ziad Al-Halah, and Kristen Grauman. Semantic audio-visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15516–15525, June 2021a.
Tzinis et al. [2020] Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel PW Ellis, and John R Hershey. Into the wild with audioscope: Unsupervised audio-visual separation of on-screen sounds. arXiv preprint arXiv:2011.01143, 2020.
Tzinis et al. [2022] Efthymios Tzinis, Scott Wisdom, Tal Remez, and John R Hershey. Audioscopev2: Audio-visual attention architectures for calibrated open-domain on-screen sound separation. In European Conference on Computer Vision, pages 368–385. Springer, 2022.
Xuan et al. [2021] Hanyu Xuan, Lei Luo, Zhenyu Zhang, Jian Yang, and Yan Yan. Discriminative cross-modality attention network for temporal inconsistent audio-visual event localization. IEEE Transactions on Image Processing, 30:7878–7888, 2021. doi: 10.1109/TIP.2021.3106814.
Mo and Morgado [2022a] Shentong Mo and Pedro Morgado. Localizing visual sounds the easy way. arXiv preprint arXiv:2203.09324, 2022a.
Mo and Morgado [2022b] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. In Advances in Neural Information Processing Systems, 2022b.
Mo and Tian [2023] Shentong Mo and Yapeng Tian. Audio-visual grouping network for sound localization from mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
Vandenhende et al. [2021] Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(7):3614–3633, 2021.
Yang et al. [2022] Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. A survey on deep semi-supervised learning. IEEE Transactions on Knowledge and Data Engineering, pages 1–20, 2022. doi: 10.1109/TKDE.2022.3220219.
Senocak et al. [2018] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4358–4366, 2018.
Senocak et al. [2019] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound sources in visual scenes: Analysis and applications. TPAMI, 43(5):1605–1619, 2019.
Lee et al. [2013] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896. Atlanta, 2013.
Arazo et al. [2020] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
Tarvainen and Valpola [2017] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
Van Engelen and Hoos [2020] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine learning, 109(2):373–440, 2020.
Sajjadi et al. [2016] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/30ef30b64204a3088a26bc2e6ecf7602-Paper.pdf.
Laine and Aila [2017] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJ6oOfqge.
Xie et al. [2020] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Berthelot et al. [2019] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1cd138d0499a68f4bb72bee04bbec2d7-Paper.pdf.
Sohn et al. [2020a] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 596–608. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/06964dce9addb1c5cb5d6e3d9838f733-Paper.pdf.
Zhang et al. [2021] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408–18419, 2021.
Chen et al. [2020b] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020b.
Chen et al. [2020c] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020c.
Chen et al. [2021b] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021b.
Song et al. [2022] Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, and Zhaoxiang Zhang. Self-supervised predictive learning: A negative-free method for sound source localization in visual scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3222–3231, 2022.
Liu et al. [2022a] Jinxiang Liu, Chen Ju, Weidi Xie, and Ya Zhang. Exploiting transformation invariance and equivariance for self-supervised sound localisation. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3742–3753, 2022a.
Hu et al. [2020] Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, and Dejing Dou. Discriminative sounding objects localization via self-supervised audiovisual matching. Advances in Neural Information Processing Systems, 33:10077–10087, 2020.
Qian et al. [2020] Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin. Multiple sound sources localization from coarse to fine. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 292–308. Springer, 2020.
Hu et al. [2022] Xixi Hu, Ziyang Chen, and Andrew Owens. Mix and localize: Localizing sound sources in mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10483–10492, 2022.
Liu et al. [2022b] Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, and Xiaowei Zhou. Visual sound localization in the wild by cross-modal interference erasing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1801–1809, 2022b.
Morgado et al. [2021a] Pedro Morgado, Ishan Misra, and Nuno Vasconcelos. Robust audio-visual instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12934–12945, 2021a.
Morgado et al. [2021b] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12475–12486, 2021b.
Zhao et al. [2018] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European conference on computer vision (ECCV), pages 570–586, 2018.
Zhao et al. [2019] Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba. The sound of motions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1735–1744, 2019.
Sanguineti et al. [2021] Valentina Sanguineti, Pietro Morerio, Alessio Del Bue, and Vittorio Murino. Audio-visual localization by synthetic acoustic image generation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2523–2531, 2021.
Sohn et al. [2020b] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020b.
Xu et al. [2021] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021.
Li et al. [2022] Gang Li, Xiang Li, Yujie Wang, Wu Yichao, Ding Liang, and Shanshan Zhang. Dtg-ssod: Dense teacher guidance for semi-supervised object detection. Advances in Neural Information Processing Systems, 35:8840–8852, 2022.
Zhou et al. [2021] Qiang Zhou, Chaohui Yu, Zhibin Wang, Qi Qian, and Hao Li. Instant-teaching: An end-to-end semi-supervised object detection framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4081–4090, 2021.
Liu et al. [2021] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480, 2021.
Laine and Aila [2016] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
Chen et al. [2020d] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020d.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
Hershey et al. [2017] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
Aytar et al. [2016] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
Cubuk et al. [2020] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
Ma et al. [2023] Shijie Ma, Fei Zhu, Zhen Cheng, and Xu-Yao Zhang. Towards trustworthy dataset distillation. arXiv preprint arXiv:2307.09165, 2023.
Zhu et al. [2024] Fei Zhu, Shijie Ma, Zhen Cheng, Xu-Yao Zhang, Zhaoxiang Zhang, and Cheng-Lin Liu. Open-world machine learning: A review and new outlooks. arXiv preprint arXiv:2403.01759, 2024.
Hu et al. [2021] Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, and Ji-Rong Wen. Class-aware sounding objects localization via audiovisual correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9844–9859, 2021.

Appendix

Brief Introduction.

The appendix is structured into four main sections: Algorithm, Experimental Settings, Supplementary Experiments, and Further Analysis. The main contents are as follows:

• A. Algorithm: Pseudo-codes and algorithm details.
• B. Experimental Settings: More detailed description of the datasets, backbones, metrics formula, implementation details and baselines.
• C. Supplementary Experiments: Results of cross-dataset evaluation, comparison of different predicted maps, exploration of Warm-Up speed and its influence on the final results, false-positive rejection capability of Noise Filtering, investigation of hyperparameters for Filter, IPL, and EMA, effect of data augmentation and quality analysis (visualization).
• D. Further Analysis: Theoretical elaboration on the challenges faced by existing contrastive learning methods, and explanation of why contrastive learning alone cannot achieve precise localization.

Appendix A Algorithm

For clarity, the full Dual Mean-Teacher procedure is given in Algorithm 1.

Algorithm 1 Dual Mean-Teacher algorithm.

1: Input: $\mathcal{D}_{l}=\{(v_{i},a_{i}),\mathcal{G}_{i}\}$ , $\mathcal{D}_{u}=\{(a_{i},v_{i})\}$ {labeled data and unlabeled data.}
2: while not reach the maximum iteration do
3:   for $(a_{i},v_{i})\in\mathcal{D}_{u}$ do
4:     while not reach the convergency of Warm-Up do
5:       $\mathcal{L}_{\textrm{Warm-Up}}=\mathbb{E}_{(a_{i},v_{i})\sim\mathcal{D}_{l}}H(\mathcal{G}_{i},\mathcal{P}_{i}^{t})$ {Supervised learning on labeled data.}
6:     end while
7:     Get the pseudo-labels $\mathcal{M}_{i}^{t,A},\mathcal{M}_{i}^{t,B}$ from dual teachers
8:     if $\mathrm{IoU}(\mathcal{M}_{i}^{t,A},\mathcal{M}_{i}^{t,B})\geq\tau$ then
9:       $\mathcal{IPL}(a_{i},v_{i})=\mathcal{M}_{i}^{t,A}\cdot\mathcal{M}_{i}^{t,B}$ {Compute Intersection of Pseudo-Labels (IPL).}
10:       $\hat{\mathcal{G}}_{i}=\mathcal{IPL}(a_{i},v_{i})$ {Update the pseudo-label $\hat{\mathcal{G}}_{i}$ of unlabeled data.}
11:       Add $(a_{i},v_{i})$ to new dataset $\mathcal{D}_{u}^{\prime}$
12:     end if
13:   end for
14:   $\mathcal{D}_{mix}=\mathcal{D}_{l}\cup\mathcal{D}_{u}^{\prime}$ {Mix the filtered unlabeled data and labeled data.}
15:   $\mathcal{L}_{\text{full}}=\left(\mathcal{L}_{\text{sup}}^{A}+\mathcal{L}_{\text{sup}}^{B}\right)+\lambda_{u}\left(\mathcal{L}_{\text{unsup}}^{A}+\mathcal{L}_{\text{unsup}}^{B}\right)$ . {Students learning.}
16:   $\theta_{m}^{t}\leftarrow\beta\theta_{m-1}^{t}+(1-\beta)\theta_{m}^{s}$ {Students update teachers via EMA.}
17: end while
18: Return: Dual teachers and students model parameters.
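Steps 7 to 11 of Algorithm 1 (noise filtering by teacher agreement, then the Intersection of Pseudo-Labels) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the binarization step, and the default values of `tau` and `delta` are assumptions made for illustration only.

```python
import numpy as np

def binarize(conf_map, delta=0.5):
    """Threshold a confidence map in [0, 1] into a binary mask (delta is assumed)."""
    return (np.asarray(conf_map) > delta).astype(np.float32)

def mask_iou(mask_a, mask_b):
    """IoU between two binary masks; returns 0.0 when both masks are empty."""
    inter = np.logical_and(mask_a > 0, mask_b > 0).sum()
    union = np.logical_or(mask_a > 0, mask_b > 0).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def filter_and_ipl(map_a, map_b, tau=0.5, delta=0.5):
    """Return the intersected pseudo-label, or None if the teachers disagree.

    map_a, map_b: confidence maps from teacher A and teacher B.
    tau: agreement threshold on the IoU of the two masks (hypothetical default).
    """
    m_a, m_b = binarize(map_a, delta), binarize(map_b, delta)
    if mask_iou(m_a, m_b) < tau:
        return None  # sample rejected by the noise filter (step 8 fails)
    return m_a * m_b  # element-wise product of binary masks = intersection (step 9)
```

A sample is kept only when the two teachers agree; the returned mask then plays the role of $\hat{\mathcal{G}}_{i}$ for the students.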

NOTING TIPS:

Train.

The Warm-Up Stage is essentially supervised learning. The performance gain of the subsequent Unbiased-Learning Stage over the Warm-Up Stage is therefore the gain of our semi-supervised framework over vanilla supervised training on the same labelled dataset $\mathcal{D}_{l}$ , which supports the validity of the proposed Dual Mean-Teacher, as shown in the main results in Table 1 and Table 2.

Inference.

For the localization result of the $i$-th audio-visual pair, we merge the outputs of the dual teachers to create a predicted map as below. A comparison of different predicted maps is given in C.2.

$$\mathcal{P}_{i}=\frac{1}{2}\left(\mathcal{P}_{i}^{t,A}+\mathcal{P}_{i}^{t,B}\right). \tag{17}$$
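As a small illustration of Eq. (17), the fused prediction is simply the element-wise mean of the two teachers' maps. The function name below is hypothetical, and the sketch assumes both maps have already been resized to the same spatial resolution.

```python
import numpy as np

def fuse_teacher_maps(pred_a, pred_b):
    """Average the two teachers' predicted maps, following Eq. (17)."""
    pred_a = np.asarray(pred_a, dtype=np.float64)
    pred_b = np.asarray(pred_b, dtype=np.float64)
    if pred_a.shape != pred_b.shape:
        raise ValueError("teacher maps must share the same spatial size")
    return 0.5 * (pred_a + pred_b)
```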

Appendix B Experimental Settings

B.1 Datasets

We train and evaluate Dual Mean-Teacher on two large-scale audio-visual datasets: Flickr-SoundNet and VGG-Sound, which consist of millions of unconstrained videos and contain $5,000$ and $5,158$ annotated samples, respectively. Each audio-visual pair comprises a single image frame from a video clip and an audio segment centered around it. The annotations are provided in the form of bounding boxes. The relevant information is presented in Table 8.

Table 8: Datasets overview.

	All Labeled Data					Test Set					Labeled Split
	small	medium	large	huge	total	small	medium	large	huge	total	train	val	test	total
Flickr-SoundNet	3	254	687	4056	5000	0	9	83	158	250	4250	500	250	5000
VGG-SoundSource	134	1796	1726	1502	5158	8	86	83	73	250	4250	500	250	5000

Furthermore, to assess the generalizability of our model, we evaluate DMT in the music domain (a different distribution), using MUSIC-solo, MUSIC-duet, and MUSIC-Synthetic. The MUSIC dataset Zhao et al. [2018] comprises 685 untrimmed videos, of which 536 are solo performances and 149 are duets, spanning 11 categories of musical instruments. In MUSIC-Synthetic Hu et al. [2020, 2021], four solo audio-visual pairs from different categories are randomly mixed, and only two of the four audio segments are retained. This design makes it well suited to evaluating discriminative sounding-object localization.

B.2 Backbones: VGGish and SoundNet

For audio backbones, we employ pre-trained VGGish and SoundNet. VGGish is pre-trained on AudioSet and serves as an audio feature extractor. The raw 3s audio signal is resampled at 16kHz and further transformed into 96 × 64 log-mel spectrograms as the audio input. The output is a 128D vector. SoundNet takes the raw waveform of the 3s audio clip as input and produces a 1401D vector as output, which concatenates the 1000D object-level feature and the 401D scene-level feature, each obtained from a different conv8 layer. Our main focus is to train the nonlinear audio feature transformation function, g(·), which is instantiated with two fully connected layers and a ReLU layer, to transform the backbone output feature into a 512D representation.
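The transformation g(·) described above can be sketched in NumPy as two fully connected layers with a ReLU in between. This is only an illustration under stated assumptions: the class name, the hidden width of 512, and the random initialization are hypothetical, and in practice the weights would be learned during training.

```python
import numpy as np

class AudioProjection:
    """Hypothetical g(.): FC -> ReLU -> FC, mapping backbone features to 512D."""

    def __init__(self, in_dim, hidden_dim=512, out_dim=512, seed=0):
        # hidden_dim is an assumption; the text only fixes the 512D output.
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU non-linearity
        return hidden @ self.w2 + self.b2

# One head per audio backbone: 128D VGGish output and 1401D SoundNet output.
vggish_head = AudioProjection(in_dim=128)
soundnet_head = AudioProjection(in_dim=1401)
```

Both heads produce features of the same dimensionality, so either backbone's audio embedding can be compared against the visual features in a shared space.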

B.3 Metrics: CIoU, MSE, F1 Score, Precision

We consider a set of audio-visual pairs $\mathcal{D}=\{(v_{i},a_{i}),\mathcal{G}_{i}\}$ , where $\mathcal{G}_{i}$ is the ground truth. We define $\mathcal{P}_{i}(\delta)=\{(x,y)|\mathcal{P}_{i}(x,y)>\delta\}$ as the foreground region of the predicted map, and $\mathcal{G}_{i}=\{(x,y)|\mathcal{G}_{i}(x,y)>0\}$ as the foreground region of the ground truth.

CIoU.

The IoU of predicted map and ground truth can be calculated by:

$$\mathrm{IoU}_{i}(\delta)=\frac{\sum_{x,y\in\mathcal{P}_{i}(\delta)}\mathcal{G}_{i}(x,y)}{\sum_{x,y\in\mathcal{P}_{i}(\delta)}\mathcal{G}_{i}(x,y)+\sum_{x,y\in\{\mathcal{P}_{i}(\delta)-\mathcal{G}_{i}\}}1}. \tag{18}$$

In previous works, CIoU quantifies the proportion of samples with IoU value exceeding a predetermined threshold, typically set at 0.5.
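The thresholding step that turns per-sample IoU into CIoU can be sketched as follows. For simplicity this sketch uses the plain binary-mask IoU (intersection over union of the two foreground regions) rather than the weighted form of Eq. (18); that substitution, the function names, and the default `delta` are assumptions for illustration.

```python
import numpy as np

def binary_iou(pred_map, gt_mask, delta=0.5):
    """Plain IoU between the thresholded prediction and the ground-truth mask."""
    pred_fg = np.asarray(pred_map) > delta
    gt_fg = np.asarray(gt_mask) > 0
    union = np.logical_or(pred_fg, gt_fg).sum()
    if union == 0:
        return 0.0  # both regions empty: treated as no overlap
    return float(np.logical_and(pred_fg, gt_fg).sum()) / float(union)

def ciou(pred_maps, gt_masks, delta=0.5, iou_threshold=0.5):
    """Fraction of samples whose IoU exceeds iou_threshold (0.5 as in the text)."""
    ious = [binary_iou(p, g, delta) for p, g in zip(pred_maps, gt_masks)]
    return float(np.mean([iou > iou_threshold for iou in ious]))
```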

MSE.

MSE measures the difference between two maps on a pixel-wise basis, making it more suitable for evaluating dense prediction tasks than IoU. The other two metrics, Max-F1 and AP, are used for small-object localization.

$$\mathrm{MSE}_{i}=\frac{1}{HW}\sum_{x=1}^{W}\sum_{y=1}^{H}\left(\mathcal{P}_{i}(x,y)-\mathcal{G}_{i}(x,y)\right)^{2}. \tag{19}$$
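Eq. (19) amounts to the mean of squared pixel-wise differences over the $H\times W$ map. A minimal sketch, assuming the predicted map and ground truth are arrays of identical shape with values in a comparable range (the function name is hypothetical):

```python
import numpy as np

def map_mse(pred_map, gt_map):
    """Pixel-wise mean squared error between a predicted map and the ground truth."""
    pred = np.asarray(pred_map, dtype=np.float64)
    gt = np.asarray(gt_map, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    return float(np.mean((pred - gt) ** 2))
```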

Max-F1 and AP.

To compute true positives, false positives and false negatives, we closely follow SLAVC Mo and Morgado [2022b]. Then we can compute the precision and recall:

$$\text{Precision}=\frac{|\mathcal{TP}|}{|\mathcal{TP}|+|\mathcal{FP}|},\qquad\text{Recall}=\frac{|\mathcal{TP}|}{|\mathcal{TP}|+|\mathcal{FN}|}. \tag{20}$$

Then we compute F1 for all values of $\delta$ and report the Max-F1 score:

$$\text{F1}=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}},\qquad\text{max-F1}=\max(\text{F1}). \tag{21}$$

Average Precision (AP) is the area under the precision-recall curve above. For a detailed calculation of max-F1 and AP, please refer to SLAVC Mo and Morgado [2022b].
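Given the true-positive, false-positive and false-negative counts at each threshold $\delta$, Eqs. (20) and (21) can be evaluated as in the sketch below. How the counts themselves are obtained follows SLAVC and is not reproduced here; the function names and the dictionary input format are assumptions for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; empty denominators give 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

def max_f1(counts_per_delta):
    """counts_per_delta maps each threshold delta to a (tp, fp, fn) tuple."""
    return max(precision_recall_f1(*counts)[2]
               for counts in counts_per_delta.values())
```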

B.4 Implementation details

In addition to the experimental settings mentioned in the main text, we used a batch size of 128. The Warm-Up stage is trained for 6 epochs to reach convergence, while the Unbiased-Learning stage is trained for 20 epochs. The learning rate for the image is set to 1e-4, and the weight for the contrastive loss $\lambda_{u}$ is set to 1. An Exponential Moving Average (EMA) decay of 0.999 is applied. We use the Adam optimizer and train on two GPUs. Our supplementary experiments were conducted on the Flickr-10k or Flickr-144k dataset, which contains 4k annotations. The trained models were evaluated on the Flickr-SoundNet testset.
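The EMA update with decay 0.999 corresponds to step 16 of Algorithm 1, $\theta^{t}\leftarrow\beta\theta^{t}+(1-\beta)\theta^{s}$. A framework-agnostic sketch follows, assuming parameters are stored in dictionaries keyed by name; the function name and this storage format are hypothetical, and the values may be floats or arrays.

```python
def ema_update(teacher, student, beta=0.999):
    """Update teacher parameters in place as an EMA of the student parameters."""
    for name, student_value in student.items():
        # teacher <- beta * teacher + (1 - beta) * student
        teacher[name] = beta * teacher[name] + (1.0 - beta) * student_value
    return teacher
```

With a decay this close to 1, the teacher changes slowly, so a single noisy student step has little effect on the pseudo-labels the teacher produces.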

Table 9: Cross dataset performance. We train our model using the VGG-Sound 10k and 144k datasets and evaluate its performance on the Flickr-SoundNet dataset.

		Flickr testset	
Trainset	Methods	CIoU	AUC
VGG-Sound 10k	attention10k	52.20	50.20
	LVS	61.80	53.60
	EZVSL	65.46	54.57
	SLAVC	74.00	57.74
	SSPL	76.30	59.10
	SSL-TIE	77.04	60.36
	Ours( $\|\mathcal{D}_{l}\|=256$ )	85.04 (80.08)	65.06 (60.14)
	Ours( $\|\mathcal{D}_{l}\|=2k$ )	87.36 (81.60)	67.38 (61.26)
	Ours( $\|\mathcal{D}_{l}\|=4k$ )	88.20 (82.88)	67.56 (62.06)
VGG-Sound 144k	attention10k	66.00	55.80
	LVS	71.90	58.20
	EZVSL	79.51	61.17
	SLAVC	80.00	61.68
	SSPL	76.70	60.50
	SSL-TIE	79.50	61.20
	Ours( $\|\mathcal{D}_{l}\|=256$ )	87.04 (80.08)	64.72 (60.14)
	Ours( $\|\mathcal{D}_{l}\|=2k$ )	88.32 (81.60)	67.78 (61.26)
	Ours( $\|\mathcal{D}_{l}\|=4k$ )	89.84 (82.88)	68.64 (62.06)

B.5 Baselines

• Attention 10k Senocak et al. [2018, 2019] ( $\text{CVPR}_{2018}$ ): introduces a dual-stream network and leverages an attention mechanism to capture the salient regions in semi-supervised or self-supervised settings.
• DMC Zhu et al. [2021] ( $\text{CVPR}_{2019}$ ): establishes audio-visual clustering to associate sound centers with their corresponding visual sources.
• CoarsetoFine Qian et al. [2020] ( $\text{ECCV}_{2020}$ ): leverages a two-stage framework to capture cross-modal feature alignment between sound and vision.
• LVS Chen et al. [2021b] ( $\text{CVPR}_{2021}$ ): proposes to mine hard negatives within an image-audio pair.
• EZVSL Mo and Morgado [2022a] ( $\text{ECCV}_{2022}$ ): introduces a multi-instance contrastive learning framework that utilizes Global Max Pooling (GMP) to focus only on the most aligned regions when matching audio and visual inputs.
• SLAVC Mo and Morgado [2022b] ( $\text{NeurIPS}_{2022}$ ): adopts momentum encoders and dropout to address overfitting and silence issues in single-source sound localization.
• SSPL Song et al. [2022] ( $\text{CVPR}_{2022}$ ): proposes a negative-free method to extend a self-supervised learning framework to the audio-visual data domain for sound localization.
• SSL-TIE Liu et al. [2022a] ( $\text{ACM-MM}_{2022}$ ): introduces a self-supervised framework based on a Siamese network with contrastive learning and geometrical consistency.

Appendix C Comprehensive Experimental Results

C.1 Cross-dataset Evaluation

To further validate the generalization ability of DMT, we conducted cross-dataset validation experiments. The results in Table 9 show that DMT still stays ahead, confirming the high generalization ability of our model.

C.2 Different Predicted Map

Table 10: Results of different inference strategies.

	CIoU	AUC
Student A	86.20	66.16
Student B	86.80	66.84
Fused Students	88.60	68.56
Teacher A	87.20	67.57
Teacher B	87.60	67.98
Fused Teachers	90.40	69.36

In this section, we compare the accuracy of different predicted maps for sound localization. We evaluate each individual predicted map, as well as the fused map defined by Eq. 17, as the final localization map. Training is performed on the Flickr144k dataset, and the results are shown in Table 10. We find that the fused predicted map from dual teachers with different backbones achieves better localization performance than either individual map, which can be attributed to the fact that combining both localization results helps mitigate the biases inherent in a single model.

Additionally, we assess the performance of teachers and students by comparing their fused predicted maps obtained during the same training session. The results, as shown in Table 10, indicate that teachers outperform students, which aligns with our expectations and further validates the effectiveness of our model.

C.3 Effect of Warm-Up Stage

This section focuses on the analysis of convergence speed and the influence of the Warm-Up performance on the final results.

Convergence Speed

Initially, we investigate the convergence speed of the Warm-Up stage with varying amounts of labeled data, as depicted in Figure 6. Notably, all supervised models converge rapidly, within a few epochs. Furthermore, as the quantity of labeled data increases, convergence becomes slower while the model reaches a higher level of performance.

Table 11: Effect of Warm-Up Performance.

Warm-Up		Final
CIoU	AUC	CIoU	AUC
0	0	84.32	64.52
51.20	48.62	87.28	67.18
71.60	56.08	89.04	68.26
86.20	65.56	90.40	69.36

Effect of Warm-Up Performance.

Subsequently, we investigate how Warm-Up performance affects the final results by starting from models that reached different levels of convergence on the same amount of data. Training is performed on the Flickr144k dataset, and the dual-teacher results are presented in Table 11. The results indicate that better Warm-Up performance leads to better final performance, which can be attributed to higher-quality pseudo-labels and improved noise filtering, both of which reduce confirmation bias. Conversely, the model performs worst when the Warm-Up stage is omitted.

Overall, supervised audio-visual source localization demonstrates ease of convergence without requiring excessive training resources. Moreover, our proposed semi-supervised model consistently outperforms the supervised model by approximately $3\%$ in terms of absolute performance, validating its effectiveness.

C.4 False-Positive Rejection Capability of Noise Filtering

After analyzing the filtered-out samples, we observe that the two independent teachers disagree when localizing non-sounding objects. In such cases, the IoU falls significantly below the threshold, enabling the Dual Teachers to identify and reject non-sounding samples, which can be considered false positives, as illustrated in Figure 7. Additionally, different filter thresholds represent different levels of filtering strictness, as detailed in Section C.5.

Furthermore, we analyzed the visual results of some noisy samples, as depicted in Figure 10. One can observe that frames without distinguishable sound objects or sounds that cannot be accurately represented by a bounding box (e.g., wind sounds) can be easily identified through the inconsistency between the predictions of the two teachers.

C.5 Hyper-parameters for Filter, IPL, and EMA

Effect of Pseudo-Labeling Threshold.

The threshold $\delta$ is used to convert the predicted map into a binary map, as described in Eq.(6). In this section, we analyze the impact of different thresholds on the pseudo-labels and the model. Training is conducted on the Flickr10k dataset, and Figure 8 shows the results. A small value of $\delta$ (e.g., $\delta=0.5$ ) creates a large foreground area, introducing excessive noise and causing performance to degrade as training progresses. On the other hand, a large value of $\delta$ (e.g., $\delta=0.9$ ) yields a small foreground area, so the intersection between the Dual Teachers becomes minimal and samples are falsely rejected as noise, which disturbs the model. Therefore, we choose $\delta=0.6$ as the final threshold.

Effect of Filtering Threshold.

In Section 4.2, we employ a confidence threshold, denoted as $\tau$ , to filter out noisy samples, which are more likely to be false-positive instances. We evaluate the effect of different values of $\tau$ . As shown in Figure 9, the number of accepted samples decreases as $\tau$ increases from 0 to 0.9. Setting a very high threshold (e.g., $\tau=0.9$ ) leads to unsatisfactory results because too few samples are accepted, which reduces the information available from unlabeled data. Conversely, a low threshold (e.g., $\tau=0.6$ ) introduces confirmation bias from noisy samples and also hurts performance. The performance shows little variation between $\tau=0.7$ and $\tau=0.8$ , indicating a balance between unlabeled information and bias within the 0.7-0.8 range. As a result, we choose $\tau=0.7$ as the final threshold.
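The interplay of the two thresholds can be illustrated with a minimal NumPy sketch. It assumes that each teacher's map is binarized at $\delta$, that consensus is measured as the IoU of the two binary maps and compared with $\tau$, and that the pseudo-label is the intersection of the binary maps. The function name `filter_and_intersect`, the `>=` comparisons, and the handling of an empty union are illustrative assumptions, not details taken from Eq.(6) or Section 4.2.

```python
import numpy as np

def filter_and_intersect(map_a, map_b, delta=0.6, tau=0.7):
    """Hypothetical helper: consensus filtering plus intersected pseudo-label.

    map_a, map_b: confidence maps of the two teachers, values in [0, 1].
    delta: binarization threshold; tau: IoU threshold for accepting a sample.
    Returns (accepted, pseudo_label, iou).
    """
    bin_a = map_a >= delta
    bin_b = map_b >= delta
    inter = np.logical_and(bin_a, bin_b)
    union = np.logical_or(bin_a, bin_b)
    # Assumption: if neither teacher predicts any foreground, agreement is 0.
    iou = float(inter.sum() / union.sum()) if union.sum() > 0 else 0.0
    accepted = iou >= tau
    return accepted, inter.astype(np.float32), iou

# Toy 2x2 maps: the teachers agree on two cells and disagree on one.
map_a = np.array([[0.90, 0.80], [0.20, 0.10]])
map_b = np.array([[0.95, 0.70], [0.65, 0.10]])
accepted, pseudo_label, iou = filter_and_intersect(map_a, map_b)
print(accepted, iou)
```

In this toy case the IoU is 2/3, so the sample is rejected at $\tau=0.7$ but would be accepted at $\tau=0.6$, which mirrors the strictness trade-off discussed above.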

Effect of EMA Rates

Table 12: Results on various EMA $\beta$.

	$\beta$	CIoU	AUC
Flickr 10k	0.9	86.48	65.16
	0.99	88.64	66.94
	0.999	88.80	67.81
Flickr 144k	0.9	87.84	85.82
	0.99	89.92	68.86
	0.999	90.40	69.36

We also examine the model performance with various exponential moving average (EMA) decay values, denoted as $\beta$ , ranging from 0.9 to 0.999, and present the results of the teachers in Table 12. We observe that a smaller EMA decay leads to a faster update rate, lower CIoU, and higher variance. Conversely, a larger EMA decay value results in slower learning for the teachers. Therefore, we select an appropriate EMA decay value of $\beta=0.999$ to strike a balance between the update rate and the stability of the learning process.
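The role of $\beta$ can be made concrete with the standard mean-teacher update, $\theta_{t}\leftarrow\beta\theta_{t}+(1-\beta)\theta_{s}$. The sketch below is a generic illustration of that rule on plain NumPy arrays; the dictionary-of-parameters layout and the name `ema_update` are assumptions for the example and are not taken from the released code.

```python
import numpy as np

def ema_update(teacher, student, beta=0.999):
    """Update teacher parameters as beta * teacher + (1 - beta) * student.

    teacher, student: dicts mapping parameter names to NumPy arrays.
    """
    for name, s_param in student.items():
        teacher[name] = beta * teacher[name] + (1.0 - beta) * s_param
    return teacher

# Toy comparison: how far the teacher moves toward a fixed student in 10 steps.
student = {"w": np.ones(2)}
for beta in (0.9, 0.99, 0.999):
    teacher = {"w": np.zeros(2)}
    for _ in range(10):
        ema_update(teacher, student, beta=beta)
    print(beta, teacher["w"][0])
```

With a fixed student, the teacher reaches $1-\beta^{n}$ of the student value after $n$ steps, so a smaller $\beta$ follows the student much faster and inherits more of its step-to-step noise.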

C.6 Effect of Data Augmentation

We evaluate the effect of RandAug Cubuk et al. [2020] on a supervised model on 4k labeled data, as shown in Table 13. Without data augmentation, the model exhibits significant over-fitting. With RandAug, this issue is mitigated, which indicates that RandAug serves not only as a means of consistency regularization but also as a method to enhance the model’s generalization performance.

Table 13: Results of data augmentation (i.e., RandAug.).

	Trainset		Testset
	CIoU	AUC	CIoU	AUC
w/o RandAugment	88.20	67.82	84.80	60.44
w/ RandAugment	87.68	67.54	86.20	65.56

C.7 IPL on Different Object Size

We assess the adaptability of IPL to various object sizes by comparing DMT with an existing method (SLAVC) and with each of its two teachers. The results in Table 14 show that the performance of the prior method diminishes on smaller objects, while DMT consistently excels across all size subsets. This improvement is attributed to the synergy between Filtering and IPL. Under the filtering mechanism, only highly similar pseudo-labels can contribute to model training, which keeps the intersection of the pseudo-labels consistently aligned with the object size. If a pseudo-label shrinks significantly, the IoU declines and the noisy sample is excluded from training. Moreover, in the second-stage training, we use labeled data to prevent size bias and ensure that objects of all sizes are treated without bias.

Table 14: Performance across various sizes of sounding objects.

Size	SLAVC		teacher1		teacher2		DMT
Size	MSE $\downarrow$	IoU $\uparrow$	MSE $\downarrow$	IoU $\uparrow$	MSE $\downarrow$	IoU $\uparrow$	MSE $\downarrow$	IoU $\uparrow$
small	0.705	2.10	0.213	2.58	0.183	2.26	0.205	2.65
medium	0.235	22.00	0.156	12.47	0.176	12.28	0.164	33.50
large	0.427	48.11	0.202	55.32	0.221	54.68	0.212	55.50
huge	0.358	61.64	0.212	66.84	0.217	66.26	0.215	67.70

C.8 How to avoid model collapse?

The two teachers retain diversity and individuality, which helps prevent them from converging to a single model. The noise filtering module of DMT selects ‘stable samples’ via consensus and assigns high-quality pseudo-labels with IPL; prior work has validated that such ‘stable samples’ help avoid model collapse. The two teachers are first trained in the Warm-Up stage for better initialization. Moreover, stage 2 also includes supervised training on labeled data and contrastive learning on unlabeled data; these two objectives ensure that the model retains robust localization capability throughout stage 2. The results in Table 15 validate the contribution of each component to avoiding model collapse.

Table 15: Model collapse results. $\mathcal{A}$ and $\mathcal{B}$ denote augmentation and backbone, respectively.

method	DMT	same $\mathcal{A}$	same $\mathcal{B}$	w/o annotation in stage-2	same $\mathcal{A}$ & $\mathcal{B}$ w/o annotation
CIoU	90.4	87.2	85.4	81.6	7.2

C.9 Quality Analysis

We present the visual localization results of DMT in Figure 10. DMT effectively locates objects of different sizes, separates them from the background with clear boundaries, and demonstrates some multi-object localization capability. Notably, DMT learns semantic information and can precisely localize the specific sound-producing region instead of the entire object. For example, in the third row on the right of Figure 10, it accurately locates the mouth of a person rather than the entire person.

Appendix D Further Analysis: Limitations in Existing AVSL and DMT

Based on the formula of contrastive loss, we can observe that the core idea of existing contrastive learning methods is to match the visual frames and corresponding audio clips within the same video as a whole. The audio-visual pairs from the same video are considered positive pairs, while the frames and audio clips from different videos are considered negative pairs. The contrastive loss aims to maximize the similarity between positive samples and minimize the similarity between negative samples. The differences among existing self-supervised methods lie in the selection of the similarity function $s(\cdot)$ and the positive-negative sample pairs.

$\displaystyle\mathcal{L}_{\text{unsup}}=-\mathbb{E}_{(a_{i},v_{i})\sim\mathcal{D}_{u}}\left[\log\frac{\exp(s(g(a_{i}),f(v_{i}))/\tau_{t})}{\sum_{j=1}^{n}\exp\left(s\left(g(a_{i}),f(v_{j})\right)/\tau_{t}\right)}+\log\frac{\exp(s(f(v_{i}),g(a_{i}))/\tau_{t})}{\sum_{j=1}^{n}\exp\left(s\left(f(v_{i}),g(a_{j})\right)/\tau_{t}\right)}\right].$
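As an illustration of the loss above, the following NumPy sketch evaluates both the audio-to-visual and visual-to-audio terms on a batch of $n$ paired features. It assumes that $s(\cdot)$ is the cosine similarity between global feature vectors and that $\tau_{t}=0.07$; both are placeholder choices for the example, since each method plugs in its own $s(\cdot)$. The function names are hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio, visual, temperature=0.07):
    """audio, visual: (n, d) arrays; row i of each comes from the same video."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T / temperature              # sim[i, j] = s(g(a_i), f(v_j)) / tau_t
    a2v = np.diag(log_softmax(sim, axis=1))  # audio i against all frames j
    v2a = np.diag(log_softmax(sim, axis=0))  # frame i against all audio clips j
    return float(-(a2v + v2a).mean())

# Toy check: matched pairs give a lower loss than mismatched pairs.
aligned = symmetric_contrastive_loss(np.eye(4), np.eye(4))
shuffled = symmetric_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
print(aligned, shuffled)
```

Only the diagonal entries of `sim` act as positives; every other pair in the batch is pushed apart, regardless of whether it shares the same sounding category.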

D.1 Global and Local Information

In the given formula, different methods employ different match functions $s(\cdot)$ to compute the distance or similarity between positive samples. For instance, Attention10k Senocak et al. [2018, 2019] uses the Euclidean distance, LVS Chen et al. [2021b] utilizes the Frobenius inner product, and EZVSL Mo and Morgado [2022a] applies Global Max Pooling:

$\displaystyle\text{Attention10k:}\quad s(\cdot)$	$\displaystyle=$	$\displaystyle\left\|f_{att}(v_{i})-g(a_{i})\right\|_{2},$
$\displaystyle\text{LVS:}\quad s(\cdot)$	$\displaystyle=$	$\displaystyle\frac{1}{\left\|\hat{m}_{ip}\right\|}\left\langle\hat{m}_{ip},\operatorname{sim}\left(f(v_{i}),g(a_{i})\right)\right\rangle,$
$\displaystyle\text{EZVSL:}\quad s(\cdot)$	$\displaystyle=$	$\displaystyle\max\operatorname{sim}\left(f(v_{i}),g(a_{i})\right),$
$\displaystyle\text{SLAVC:}\quad s(\cdot)$	$\displaystyle=$	$\displaystyle\sum_{x,y}\rho\left(\frac{1}{\tau}\mathrm{sim}\left(g^{\mathrm{loc}}\left(a_{i}\right),f^{\mathrm{loc}}\left(v_{i}\right)\right)\right)\cdot\rho\left(\frac{1}{\tau}\mathrm{sim}\left(g^{\mathrm{avc}}\left(a_{i}\right),f^{\mathrm{avc}}\left(v_{i}\right)\right)\right).$

All of these functions capture the overall matching degree between audio and global visual representations. However, after the computation of $s(\cdot)$ , the model loses the positional information of the two-dimensional visual representation. This positional information is crucial for fine-grained localization tasks.
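The loss of positional information can be seen in a small sketch of a Global-Max-Pooling style match function. It assumes cosine similarity between one audio vector and every cell of a $d\times h\times w$ visual feature map; the helper names `similarity_map` and `gmp_score` are hypothetical and the feature values are toy data.

```python
import numpy as np

def similarity_map(audio_vec, visual_map):
    """Cosine similarity between a (d,) audio vector and each cell of a (d, h, w) map."""
    a = audio_vec / np.linalg.norm(audio_vec)
    v = visual_map / np.linalg.norm(visual_map, axis=0, keepdims=True)
    return np.einsum("d,dhw->hw", a, v)

def gmp_score(audio_vec, visual_map):
    """Scalar match score: the maximum of the similarity map."""
    return float(similarity_map(audio_vec, visual_map).max())

audio = np.array([1.0, 0.0])

# Two feature maps whose single matching cell sits at different positions.
background = np.tile(np.array([0.0, 1.0])[:, None, None], (1, 2, 2))
map_top_left = background.copy()
map_top_left[:, 0, 0] = [1.0, 0.0]
map_bottom_right = background.copy()
map_bottom_right[:, 1, 1] = [1.0, 0.0]

score_1 = gmp_score(audio, map_top_left)
score_2 = gmp_score(audio, map_bottom_right)
peak_1 = int(similarity_map(audio, map_top_left).argmax())
peak_2 = int(similarity_map(audio, map_bottom_right).argmax())
print(score_1, score_2, peak_1, peak_2)
```

Both maps produce the same scalar score even though the matching cell sits in different places, so a loss that only sees $s(\cdot)$ receives no direct signal about where the object is.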

D.2 Position-Aware Contrastive Loss

We refer to the methods that incorporate position information as ‘position-aware’. In the above formulas, we can observe that the distances or similarities between samples are calculated in a position-aware manner. For example, in the Attention10k Senocak et al. [2018, 2019] method, the attention mechanism $f_{att}$ takes into account the positional information. Similarly, in LVS Chen et al. [2021b], the foreground mask $\hat{m}_{ip}$ distinguishes the background as hard negatives, incorporating the positional context. EZVSL Mo and Morgado [2022a] uses the maximum value to capture the positional information, while SLAVC Mo and Morgado [2022b] incorporates localization information. Taking LVS Chen et al. [2021b] as an example, it specifically treats the background of the image as hard negatives, effectively leveraging the positional cues for discrimination and learning.

	$\displaystyle P_{i}$	$\displaystyle=\frac{1}{\left\|\hat{m}_{ip}\right\|}\left\langle\hat{m}_{ip},\mathrm{sim}(g(a_{i}),f(v_{i}))\right\rangle,$
	$\displaystyle N_{i}$	$\displaystyle=\frac{1}{\left\|\mathbf{1}-\hat{m}_{in}\right\|}\left\langle\mathbf{1}-\hat{m}_{in},\mathrm{sim}(g(a_{i}),f(v_{i}))\right\rangle+\frac{1}{hw}\sum_{j\neq i}\left\langle\mathbf{1},\mathrm{sim}(g(a_{i}),f(v_{j}))\right\rangle,$
	$\displaystyle\mathcal{L}_{\text{unsup}}$	$\displaystyle=-\frac{1}{k}\sum_{i=1}^{k}\left[\log\frac{\exp\left(P_{i}\right)}{\exp\left(P_{i}\right)+\exp\left(N_{i}\right)}\right].$

where $\hat{m}_{ip}$ is the foreground mask, which strongly relies on the initialization of the model. According to the formula, both the positive term ( $P_{i}$ ) and the negative term ( $N_{i}$ ) used in training are influenced by the initial values of the foreground mask $\hat{m}_{ip}$ . This implies that the model’s localization results depend heavily on the initialization.
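To make the dependence on the masks explicit, the sketch below evaluates $P_{i}$, $N_{i}$ and the resulting loss for given masks. It is a simplified illustration under stated assumptions: the masks are passed in as fixed arrays rather than derived from the similarity map, the norm $\|\cdot\|$ is taken to be the sum of the mask entries, and a small `eps` guards against empty masks. The name `lvs_style_loss` is hypothetical.

```python
import numpy as np

def lvs_style_loss(sim, m_pos, m_neg, eps=1e-8):
    """sim: (k, k, h, w) with sim[i, j] = sim(g(a_i), f(v_j)) as an h x w map.
    m_pos, m_neg: (k, h, w) masks with values in [0, 1].
    """
    k, _, h, w = sim.shape
    idx = np.arange(k)
    own = sim[idx, idx]                               # each audio with its own frame
    pos = (m_pos * own).sum(axis=(1, 2)) / (m_pos.sum(axis=(1, 2)) + eps)
    bg = 1.0 - m_neg                                  # background region of the own frame
    hard_neg = (bg * own).sum(axis=(1, 2)) / (bg.sum(axis=(1, 2)) + eps)
    cross = sim.sum(axis=(2, 3))                      # <1, sim_ij> for every pair (i, j)
    easy_neg = (cross.sum(axis=1) - cross[idx, idx]) / (h * w)
    neg = hard_neg + easy_neg
    # -log(exp(P) / (exp(P) + exp(N))) equals log(1 + exp(N - P)).
    return float(np.mean(np.logaddexp(0.0, neg - pos)))

# Toy batch: each audio matches exactly one cell of its own frame.
sim = np.zeros((2, 2, 2, 2))
sim[0, 0, 0, 0] = 1.0
sim[1, 1, 1, 1] = 1.0

good_mask = np.zeros((2, 2, 2))
good_mask[0, 0, 0] = 1.0
good_mask[1, 1, 1] = 1.0
bad_mask = good_mask[::-1].copy()                     # masks point at the wrong cells

loss_good = lvs_style_loss(sim, good_mask, good_mask)
loss_bad = lvs_style_loss(sim, bad_mask, bad_mask)
print(loss_good, loss_bad)
```

With identical similarity maps, the loss value changes only because the masks change, which is the sense in which the training signal inherits whatever the initial masks happen to highlight.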

D.3 Initialization

The different matching mechanisms, represented by the function $s(\cdot)$ , rely on the initialization of the entire visual model, specifically the pre-trained ResNet-18 He et al. [2016], Deng et al. [2009], where the average of the pixel-wise features is taken as the initial result at epoch 0. This initialization result serves as the basis for the computation of position-aware components, such as the attention mechanism or Global Max Pooling (GMP). Subsequently, during the model’s training, these initial localization results are reinforced and refined. However, if the initial localization results are inaccurate (which is often the case), subsequent training may have difficulty detecting and correcting these inaccuracies. As a result, the errors may accumulate over time without being effectively addressed, leading to degraded performance.

D.4 False Positives, False Negatives and Multi-Source

From the contrastive learning formula, it is apparent that contrastive learning assumes the presence of sound-producing objects in the visual input and enforces alignment between highly confident visual regions and their corresponding audio features. However, pure contrastive learning, without the incorporation of additional modules, cannot directly reject non-sounding samples. Recently, some works have recognized this limitation and started to investigate the presence of sound-producing objects in images and tackle the task of multi-source sound localization. Examples of such works include DSOL Hu et al. [2020], IER Liu et al. [2022b], and AVGN Mo and Tian [2023].

Furthermore, due to the absence of class labels during the selection of positive and negative samples, visual-audio pairs belonging to the same sound-producing object class but originating from different videos are still treated as negative samples, resulting in a false negatives issue. Several methods have emerged to address this problem, as highlighted in Morgado et al. [2021a, b].

In addition, the commonly used matching mechanism, Global Max Pooling, is suitable only for single-source localization since it focuses solely on the region with the highest confidence, neglecting other potential sound-producing objects.

These three aforementioned challenges cannot be effectively resolved solely through simple models or algorithms without positional annotations. Therefore, they have become prominent research areas that are currently receiving considerable attention.

D.5 Limitations of DMT

DMT does not use class information, so it struggles to localize fine-grained objects because of its limited discriminative ability. Incorporating category signals could enable finer localization. In addition, DMT cannot handle multi-object localization well. We will devise specialized components to address this issue.